[00:15:50] RESOLVED: TfInfraTestDestroyFailed: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [00:16:28] FIRING: InstanceDown: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [00:21:28] RESOLVED: InstanceDown: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [00:25:55] FIRING: MaxConntrack: Max conntrack at 80.37% on cloudvirt1040:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [00:35:55] RESOLVED: MaxConntrack: Max conntrack at 80.44% on cloudvirt1040:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [00:59:24] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS: Migrate eqiad1 hypervisors to Neutron OVS agent - https://phabricator.wikimedia.org/T364457#9912335 (Andrew) [01:04:14] (CR) Andrew Bogott: "note that requires the trove service user to resize the VM, which is currently prevented by policy thanks to https://gerrit.wikimedia.org/" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1048085 (owner: Andrew Bogott) [01:12:11] !log andrew@cloudcumin1001 quarry START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [01:16:30] !log andrew@cloudcumin1001 quarry END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd [01:18:05] !log andrew@cloudcumin1001 quarry START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [01:18:57] FIRING: 
CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [01:25:15] !log andrew@cloudcumin1001 quarry END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd [01:26:43] !log andrew@cloudcumin1001 quarry START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [01:33:52] !log andrew@cloudcumin1001 quarry END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd [01:42:31] !log andrew@cloudcumin1001 quarry START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [01:46:50] !log andrew@cloudcumin1001 quarry END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd [01:47:29] !log andrew@cloudcumin1001 quarry START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [01:51:48] !log andrew@cloudcumin1001 quarry END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd [02:02:27] !log andrew@cloudcumin1001 quarry START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [02:07:39] !log andrew@cloudcumin1001 quarry END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd [02:10:06] !log andrew@cloudcumin1001 quarry START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [02:16:23] !log andrew@cloudcumin1001 quarry END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd [02:20:10] (PS6) Andrew Bogott: Add cookbook to migrate a database instance [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1048085 [02:21:15] !log andrew@cloudcumin1001 quarry START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [02:21:20] !log 
andrew@cloudcumin1001 quarry END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd [02:22:11] !log andrew@cloudcumin1001 quarry START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [02:28:34] !log andrew@cloudcumin1001 quarry END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd [02:36:38] !log andrew@cloudcumin1001 quarry START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [02:43:34] !log andrew@cloudcumin1001 quarry END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd [02:46:23] !log andrew@cloudcumin1001 maps START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [02:50:06] !log andrew@cloudcumin1001 copypatrol START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [02:51:30] !log andrew@cloudcumin1001 maps END (ERROR) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=97) for server tbd [02:51:44] !log andrew@cloudcumin1001 copypatrol START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [02:58:54] !log andrew@cloudcumin1001 copypatrol END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd [02:59:28] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [03:03:47] !log andrew@cloudcumin1001 copypatrol END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd [03:07:09] !log andrew@cloudcumin1001 hoiscript START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [03:11:36] !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd [03:12:11] !log andrew@cloudcumin1001 pm20database START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [03:13:30] !log andrew@cloudcumin1001 hoiscript END (PASS) - Cookbook 
wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd [03:14:46] !log andrew@cloudcumin1001 osmit START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [03:18:33] !log andrew@cloudcumin1001 pm20database END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd [03:23:46] !log andrew@cloudcumin1001 library-upgrader START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [03:25:32] wikitech.wikimedia.org, Parsoid, Parsoid-Read-Views: Parsoid rendering error for the incidents template on Wikitech - https://phabricator.wikimedia.org/T366842#9912458 (ABreault-WMF) From the source of `{{Last incident}}` `
{{... [03:26:18] wikitech.wikimedia.org, Parsoid, Parsoid-Read-Views: Parsoid rendering error for the incidents template on Wikitech - https://phabricator.wikimedia.org/T366842#9912464 (ABreault-WMF) →Duplicate dup:T356718 [03:32:48] !log andrew@cloudcumin1001 osmit END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd [03:33:29] !log andrew@cloudcumin1001 library-upgrader END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd [03:34:41] !log andrew@cloudcumin1001 reading-web-staging START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [03:36:10] !log andrew@cloudcumin1001 checkuser-beta-wiki START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [03:43:11] !log andrew@cloudcumin1001 checkuser-beta-wiki END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd [03:43:28] !log andrew@cloudcumin1001 checkuser-beta-wiki START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [03:48:49] !log andrew@cloudcumin1001 reading-web-staging END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd [03:54:02] !log andrew@cloudcumin1001 checkuser-beta-wiki END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd [03:54:59] FIRING: InterfaceSpeedError: brq7425e328-56 on cloudvirt1053:9100 has the wrong speed: 1.25e+06. 
- https://wikitech.wikimedia.org/wiki/Monitoring/check_eth - https://grafana.wikimedia.org/d/000000562 - https://alerts.wikimedia.org/?q=alertname%3DInterfaceSpeedError [03:58:02] !log andrew@cloudcumin1001 reading-web-staging START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [03:58:07] !log andrew@cloudcumin1001 checkuser-beta-wiki START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [04:08:41] !log andrew@cloudcumin1001 checkuser-beta-wiki END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd [04:14:31] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.migrate_server_to_ovs for server 4e612eb8-04e1-4541-941d-a05519eed60a [04:15:52] !log andrew@cloudcumin1001 reading-web-staging END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd [04:21:31] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.migrate_server_to_ovs (exit_code=99) for server 4e612eb8-04e1-4541-941d-a05519eed60a [04:56:07] cloud-services-team, Data-Services, Infrastructure Security: wikireplicas root access - https://phabricator.wikimedia.org/T344599#9912511 (Marostegui) >>! In T344599#9910036, @fnegri wrote: > I think that members of `wmcs-roots` can now circumvent this by using the `cloudcumin` hosts, and run a comma... [05:07:20] Data-Services, Data-Persistence, Data-Platform-SRE (2024.06.17 - 2024.07.07): Bring an-redacteddb1001 into service to replace clouddb1021 - https://phabricator.wikimedia.org/T365453#9912515 (Marostegui) @BTullis can you double check why an-redacteddb1001 isn't having check_private_data runs every day... 
[05:18:57] FIRING: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [07:38:23] FIRING: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM [07:43:23] RESOLVED: OOM: OOM killer active on cloudcontrol2006-dev:9100 - TODO - https://grafana.wikimedia.org/d/-OcleDKIz/oom-kill - https://alerts.wikimedia.org/?q=alertname%3DOOM [07:54:59] FIRING: InterfaceSpeedError: brq7425e328-56 on cloudvirt1053:9100 has the wrong speed: 1.25e+06. - https://wikitech.wikimedia.org/wiki/Monitoring/check_eth - https://grafana.wikimedia.org/d/000000562 - https://alerts.wikimedia.org/?q=alertname%3DInterfaceSpeedError [08:11:45] cloud-services-team: cloludvirt1035: InterfaceSpeedError: brq7425e328-56 - https://phabricator.wikimedia.org/T368129 (aborrero) NEW [08:28:23] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1053.eqiad.wmnet' (T368129) [08:28:29] T368129: cloludvirt1035: InterfaceSpeedError: brq7425e328-56 - https://phabricator.wikimedia.org/T368129 [08:31:44] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1053.eqiad.wmnet' (T368129) [08:34:58] cloud-services-team: InterfaceSpeedError brq7425e328-56 on cloudvirt1053:9100 has the wrong speed: 1.25e+06. - https://phabricator.wikimedia.org/T368105#9912725 (aborrero) →Duplicate dup:T368129 [08:35:09] cloud-services-team: cloludvirt1035: InterfaceSpeedError: brq7425e328-56 - https://phabricator.wikimedia.org/T368129#9912729 (aborrero) I drained the HV because the canary VM did not have network connectivity. 
[08:35:31] cloud-services-team: cloludvirt1035: InterfaceSpeedError: brq7425e328-56 - https://phabricator.wikimedia.org/T368129#9912727 (aborrero) [08:40:20] cloud-services-team: cloludvirt1035: InterfaceSpeedError: brq7425e328-56 - https://phabricator.wikimedia.org/T368129#9912739 (ops-monitoring-bot) Host rebooted by aborrero@cumin1002 with reason: network interface speed [08:50:50] FIRING: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1053 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [09:16:36] (update) sstefanova: Draft: consolidate prefixes [repos/cloud/toolforge/jobs-api] (slavina/remove-unprefixed-endpoints) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/95 [09:18:57] FIRING: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [09:26:54] (update) sstefanova: Draft: consolidate prefixes [repos/cloud/toolforge/jobs-api] (slavina/remove-unprefixed-endpoints) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/95 [09:29:50] (open) aborrero: shell: don't hardcode path to kubectl [repos/cloud/toolforge/tools-webservice] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/41 [09:31:09] cloud-services-team, Toolforge: [lima-kilo,jobs-api,infra] tool jobs run as root user in lima-kilo environment - https://phabricator.wikimedia.org/T346738#9912822 (aborrero) Open→Resolved a:aborrero This has been fixed by properly introducing PSP and later Kyverno policies: `lang=shell-s... 
[09:37:12] Data-Services, Data-Persistence, Data-Platform-SRE (2024.06.17 - 2024.07.07): Bring an-redacteddb1001 into service to replace clouddb1021 - https://phabricator.wikimedia.org/T365453#9912832 (BTullis) @Marostegui - yes, I will look into it. I can see that the timer is firing and the service reports su... [09:39:54] !log aborrero@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate True, for hosts list: ['cloudvirt1053'] [09:40:17] !log aborrero@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate True, for hosts list: ['cloudvirt1053'] [09:41:35] (approved) sstefanova: shell: don't hardcode path to kubectl [repos/cloud/toolforge/tools-webservice] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/41 (owner: aborrero) [09:42:05] cloud-services-team: cloludvirt1035: InterfaceSpeedError: brq7425e328-56 - https://phabricator.wikimedia.org/T368129#9912837 (aborrero) I assume it was some kind of misconfiguration. The server is now up and running after the reimage. [09:42:19] (update) sstefanova: Draft: consolidate prefixes [repos/cloud/toolforge/jobs-api] (slavina/remove-unprefixed-endpoints) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/95 [09:43:05] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance [09:43:15] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) [09:48:31] cloud-services-team: cloludvirt1035: InterfaceSpeedError: brq7425e328-56 - https://phabricator.wikimedia.org/T368129#9912850 (aborrero) Open→Resolved a:aborrero In the reimage cookbook I pasted the wrong ticket ID, see: * https://phabricator.wikimedia.org/T353323#9912769 * https://phabricator... 
[09:49:50] (merge) aborrero: shell: don't hardcode path to kubectl [repos/cloud/toolforge/tools-webservice] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/41 [09:50:46] Data-Services, DBA: Prepare and check storage layer for btmwiki - https://phabricator.wikimedia.org/T368066#9912859 (ABran-WMF) Open→Resolved a:ABran-WMF private data has been sanitized view database has been created with the proper accounting [09:54:10] Data-Services, DBA: Prepare and check storage layer for btmwiki - https://phabricator.wikimedia.org/T368066#9912866 (Zabe) Resolved→Open I think `sre.wikireplicas.add-wiki` needs to be executed by WMCS, see https://wikitech.wikimedia.org/wiki/Add_a_wiki#Maintain_views. [09:56:48] Data-Services, DBA: Prepare and check storage layer for btmwiki - https://phabricator.wikimedia.org/T368066#9912874 (ABran-WMF) a:ABran-WMF→None ah indeed, mybad [10:24:30] cloud-services-team, Toolforge: kyverno: explore change from per-namespace policy resource to a single ClusterPolicy resource - https://phabricator.wikimedia.org/T368135 (aborrero) NEW [10:26:21] cloud-services-team, Data-Services, SRE: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136 (fnegri) NEW [10:26:51] cloud-services-team, Toolforge: kyverno: explore change from per-namespace policy resource to a single ClusterPolicy resource - https://phabricator.wikimedia.org/T368135#9912958 (aborrero) [10:27:50] cloud-services-team, Data-Services, SRE: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9912971 (fnegri) [10:28:26] (update) sstefanova: Draft: consolidate prefixes [repos/cloud/toolforge/jobs-api] (slavina/remove-unprefixed-endpoints) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/95 [10:31:39] cloud-services-team, Toolforge: 
kyverno: explore change from per-namespace policy resource to a single ClusterPolicy resource - https://phabricator.wikimedia.org/T368135#9912987 (aborrero) [10:35:37] !log arturo@nostromo admin START - Cookbook wmcs.openstack.cloudvirt.vm_console [10:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [10:35:43] !log arturo@nostromo admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=99) [10:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [10:35:56] !log arturo@nostromo admin START - Cookbook wmcs.openstack.cloudvirt.vm_console [10:35:56] !log arturo@nostromo admin END (ERROR) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=97) [10:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [10:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [10:36:11] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.vm_console [10:36:11] !log aborrero@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=99) [10:40:45] (update) sstefanova: consolidate prefixes [repos/cloud/toolforge/jobs-api] (slavina/remove-unprefixed-endpoints) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/95 [10:43:57] !log aborrero@cloudcumin1001 wikidumpparse START - Cookbook wmcs.openstack.migrate_project_to_ovs [10:44:08] !log aborrero@cloudcumin1001 wikidumpparse END (FAIL) - Cookbook wmcs.openstack.migrate_project_to_ovs (exit_code=1) [10:46:57] cloud-services-team, Data-Services, Infrastructure Security: wikireplicas root access - https://phabricator.wikimedia.org/T344599#9913002 (fnegri) >> replication password is shared between clouddb and production hosts > This is not a super big deal, you cannot really do much with it. This concern wa... 
[10:47:00] (update) sstefanova: consolidate prefixes [repos/cloud/toolforge/jobs-api] (slavina/remove-unprefixed-endpoints) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/95 [11:37:25] cloud-services-team, Toolforge: Toolforge: redeploy kyverno after the outage - https://phabricator.wikimedia.org/T368044#9913085 (aborrero) Open→Resolved checked a few things, both kyverno and the cluster seems happy. [11:37:28] cloud-services-team: haproxy: install some command line interface - https://phabricator.wikimedia.org/T367956#9913088 (aborrero) [11:37:33] cloud-services-team, Toolforge, Sustainability (Incident Followup): [k8s,infra,alerting] improve HAproxy and k8s apiserver interaction - https://phabricator.wikimedia.org/T367389#9913089 (aborrero) [11:37:39] cloud-services-team, Toolforge, Sustainability (Incident Followup): [k8s,infra] scale up coredns replicas - https://phabricator.wikimedia.org/T333934#9913090 (aborrero) [11:37:45] cloud-services-team, Toolforge, Sustainability (Incident Followup): Incident: 2024-06-12 toolforge k8s control plane - https://phabricator.wikimedia.org/T367348#9913091 (aborrero) [11:38:15] cloud-services-team, Toolforge: [k8s,infra] track PSP migration plan - https://phabricator.wikimedia.org/T364297#9913096 (aborrero) [11:39:34] cloud-services-team, Toolforge, Sustainability (Incident Followup): Incident: 2024-06-12 toolforge k8s control plane - https://phabricator.wikimedia.org/T367348#9913093 (aborrero) Open→Resolved everything done here. 
[11:39:35] cloud-services-team, Toolforge: toolforge: kyverno: change policies to Enforce - https://phabricator.wikimedia.org/T368141 (aborrero) NEW [11:40:38] cloud-services-team, Toolforge: Toolforge: drop PodSecurityPolicy - https://phabricator.wikimedia.org/T368142 (aborrero) NEW [11:42:08] cloud-services-team, Toolforge: toolforge: kyverno: change policies to Enforce - https://phabricator.wikimedia.org/T368141#9913129 (aborrero) Open→In progress p:Triage→High [11:42:54] cloud-services-team, Toolforge: [k8s,infra] track PSP migration plan - https://phabricator.wikimedia.org/T364297#9913123 (aborrero) [11:43:49] (open) aborrero: kyverno_pod_policy: set validation to Enforce [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/46 (https://phabricator.wikimedia.org/T368141) [11:49:27] (open) aborrero: kyverno_pod_policy: use patch operation in do_update() [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/47 (https://phabricator.wikimedia.org/T368141) [12:04:50] (update) aborrero: kyverno_pod_policy: use patch operation in do_update() [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/47 (https://phabricator.wikimedia.org/T368141) [12:07:53] (close) aborrero: resources: delete kyverno_pod_policy [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/44 (https://phabricator.wikimedia.org/T367952) [12:13:35] (open) aborrero: resources: drop PSP [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/48 [12:13:47] (update) aborrero: resources: drop PSP [repos/cloud/toolforge/maintain-kubeusers] - 
https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/48 (https://phabricator.wikimedia.org/T368142) [12:21:02] (open) aborrero: PSP: delete them [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/49 (https://phabricator.wikimedia.org/T368142) [12:22:45] cloud-services-team, Toolforge, Patch-For-Review: Toolforge: drop PodSecurityPolicy - https://phabricator.wikimedia.org/T368142#9913224 (aborrero) [12:23:15] cloud-services-team, Toolforge, Patch-For-Review: Toolforge: drop PodSecurityPolicy - https://phabricator.wikimedia.org/T368142#9913225 (aborrero) [12:23:24] cloud-services-team, Toolforge, Patch-For-Review: Toolforge: drop PodSecurityPolicy - https://phabricator.wikimedia.org/T368142#9913226 (aborrero) p:Triage→Medium [12:27:41] cloud-services-team, Toolforge: kyverno: explore change from per-namespace policy resource to a single ClusterPolicy resource - https://phabricator.wikimedia.org/T368135#9913228 (aborrero) [12:28:26] cloud-services-team, Toolforge: kyverno: explore change from per-namespace policy resource to a single ClusterPolicy resource - https://phabricator.wikimedia.org/T368135#9913229 (aborrero) [13:11:36] cloud-services-team, Data-Services, Infrastructure Security: wikireplicas root access - https://phabricator.wikimedia.org/T344599#9913357 (Ladsgroup) > Neither wikiadmin nor wikiuser are replicated to clouddb* hosts - not sure which users do you have in mind? There were grants and users (wikiuser an... 
[13:18:57] FIRING: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [13:31:51] FIRING: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [13:36:51] RESOLVED: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [14:33:30] !log andrew@cloudcumin1001 reading-web-staging START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [14:40:27] !log andrew@cloudcumin1001 reading-web-staging END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd [14:43:51] !log andrew@cloudcumin1001 maps START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [14:50:47] !log andrew@cloudcumin1001 maps END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd [15:52:30] !log andrew@cloudcumin1001 huma START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [15:58:53] !log andrew@cloudcumin1001 huma END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd [16:00:04] !log andrew@cloudcumin1001 wikisp START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [16:08:57] !log 
andrew@cloudcumin1001 wikisp END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd [16:09:59] !log andrew@cloudcumin1001 mwoffliner START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [16:22:09] !log andrew@cloudcumin1001 mwoffliner END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd [16:23:05] !log andrew@cloudcumin1001 discordbots START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [16:23:09] !log andrew@cloudcumin1001 discordbots END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd [16:24:13] !log andrew@cloudcumin1001 discordbots START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [16:30:36] !log andrew@cloudcumin1001 discordbots END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd [16:31:28] !log andrew@cloudcumin1001 mwstake START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [16:31:32] !log andrew@cloudcumin1001 mwstake END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd [16:32:23] !log andrew@cloudcumin1001 mwstake START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [16:44:24] !log andrew@cloudcumin1001 mwstake END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd [16:44:50] !log andrew@cloudcumin1001 adiutor START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [16:56:54] !log andrew@cloudcumin1001 adiutor END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd [16:57:51] !log andrew@cloudcumin1001 spacemedia START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [17:06:39] !log andrew@cloudcumin1001 spacemedia END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd [17:07:03] !log andrew@cloudcumin1001 DBapp START 
- Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [17:07:04] andrew@cloudcumin1001: Unknown project "DBapp" [17:07:06] !log andrew@cloudcumin1001 DBapp END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd [17:07:06] andrew@cloudcumin1001: Unknown project "DBapp" [17:07:34] !log andrew@cloudcumin1001 DBapp START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [17:07:34] andrew@cloudcumin1001: Unknown project "DBapp" [17:07:36] !log andrew@cloudcumin1001 DBapp END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd [17:07:36] andrew@cloudcumin1001: Unknown project "DBapp" [17:07:47] !log andrew@cloudcumin1001 glamwikidashboard START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [17:14:11] !log andrew@cloudcumin1001 glamwikidashboard END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd [17:14:58] !log andrew@cloudcumin1001 citefix START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [17:18:57] FIRING: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [17:23:22] !log andrew@cloudcumin1001 citefix END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd [17:48:13] !log andrew@cloudcumin1001 citefix START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [17:48:19] !log andrew@cloudcumin1001 citefix END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd [17:51:15] !log andrew@cloudcumin1001 citefix START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [17:52:48] FIRING: PuppetConstantChange: Puppet performing a change on every puppet 
run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [17:57:36] !log andrew@cloudcumin1001 citefix END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd [17:59:05] !log andrew@cloudcumin1001 xtools START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [18:13:37] !log andrew@cloudcumin1001 xtools END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd [18:14:19] !log andrew@cloudcumin1001 linkwatcher START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [18:24:03] !log andrew@cloudcumin1001 linkwatcher END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd [18:24:30] !log andrew@cloudcumin1001 checkuser-beta-wiki START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [18:29:31] FIRING: ToolsToolsDBReplicationLagIsTooHigh: ToolsDB replication on tools-db-3 is lagging behind the primary, the current lag is 3687 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh [18:33:29] !log andrew@cloudcumin1001 checkuser-beta-wiki END (ERROR) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=97) for server tbd [18:40:28] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, DC-Ops, ops-eqiad, SRE: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9914204 (CDanis) Apologies @dcaro but I had less time for this than I expected this week, was only able to do some... 
[18:55:55] !log andrew@cloudcumin1001 checkuser-beta-wiki START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [19:04:35] !log andrew@cloudcumin1001 checkuser-beta-wiki END (ERROR) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=97) for server tbd [19:05:33] !log andrew@cloudcumin1001 checkuser-beta-wiki START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [19:09:10] !log andrew@cloudcumin1001 checkuser-beta-wiki END (ERROR) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=97) for server tbd [19:10:07] !log andrew@cloudcumin1001 checkuser-beta-wiki START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [19:17:39] !log andrew@cloudcumin1001 checkuser-beta-wiki END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd [19:19:34] !log andrew@cloudcumin1001 wmcz-stats START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [19:19:37] !log andrew@cloudcumin1001 wmcz-stats END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd [19:19:56] !log andrew@cloudcumin1001 wmcz-stats START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [19:26:21] !log andrew@cloudcumin1001 wmcz-stats END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd [19:28:46] !log andrew@cloudcumin1001 reading-web-staging START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [19:35:47] !log andrew@cloudcumin1001 reading-web-staging END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd [19:46:50] !log andrew@cloudcumin1001 reading-web-staging START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [19:53:19] !log andrew@cloudcumin1001 reading-web-staging END (ERROR) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=97) for server tbd [19:53:24] !log andrew@cloudcumin1001 reading-web-staging 
START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [19:57:22] !log andrew@cloudcumin1001 reading-web-staging END (ERROR) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=97) for server tbd [20:00:43] !log andrew@cloudcumin1001 reading-web-staging START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [20:11:20] !log andrew@cloudcumin1001 reading-web-staging END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd [20:12:31] !log andrew@cloudcumin1001 reading-web-staging START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [20:23:14] !log andrew@cloudcumin1001 reading-web-staging END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd [20:33:02] !log andrew@cloudcumin1001 reading-web-staging START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [20:39:58] !log andrew@cloudcumin1001 reading-web-staging END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd [21:18:57] FIRING: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [21:27:22] !log andrew@cloudcumin1001 reading-web-staging START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd [21:42:03] !log andrew@cloudcumin1001 reading-web-staging END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd