[01:18:04] !log andrew@cloudcumin1001 trove START - Cookbook wmcs.openstack.migrate_server_to_ovs for server maps-test-2
[01:20:08] !log andrew@cloudcumin1001 trove END (PASS) - Cookbook wmcs.openstack.migrate_server_to_ovs (exit_code=0) for server maps-test-2
[02:04:42] RESOLVED: CloudVPSDesignateLeaks: Detected 4 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[03:17:03] cloud-services-team (FY2023/2024-Q3-Q4), Data-Services, DBA: Prepare and check storage layer for btmwiki - https://phabricator.wikimedia.org/T368066#9929171 (Zabe) not that I know of
[04:54:51] Cloud-VPS (Debian Buster Deprecation), VideoCutTool: Cloud VPS "videocuttool" project Buster deprecation - https://phabricator.wikimedia.org/T367558#9929232 (Gopavasanth) →Duplicate dup:T368593
[04:55:37] Cloud-VPS (Debian Buster Deprecation), VideoCutTool: Cloud VPS "videocuttool" project Buster deprecation - https://phabricator.wikimedia.org/T367558#9929233 (Gopavasanth) Duplicate→Open
[04:57:17] Cloud-VPS (Debian Buster Deprecation), VideoCutTool: Cloud VPS "videocuttool" project Buster deprecation - https://phabricator.wikimedia.org/T367558#9929235 (Gopavasanth)
[05:49:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[05:59:42] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack -
https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[06:48:30] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#9929313 (dcaro)
[06:49:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[06:49:57] PROBLEM - Host cloudcephosd1006 is DOWN: PING CRITICAL - Packet loss = 100%
[06:50:31] RECOVERY - Host cloudcephosd1006 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[06:59:42] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[07:55:52] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.set_cluster_in_maintenance
[07:55:54] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.set_cluster_in_maintenance (exit_code=99)
[07:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[07:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[08:33:56] (merge) aborrero: deployment: use Recreate pod replacement strategy [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/52
[08:35:15] (update) aborrero: PSP: delete them [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/49 (https://phabricator.wikimedia.org/T368142)
[08:36:08] !log dcaro@urcuchillay
admin START - Cookbook wmcs.ceph.set_cluster_in_maintenance
[08:36:10] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.set_cluster_in_maintenance (exit_code=99)
[08:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[08:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[08:36:44] (open) project_1317_bot_df3177307bed93c3f34e421e26c86e38: maintain-kubeusers: bump to 0.0.157-20240627083410-f024d8fa [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/355
[08:40:53] (open) aborrero: deployment: fold strategy under the right spec [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/53
[08:45:48] (merge) aborrero: deployment: fold strategy under the right spec [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/53
[08:46:18] (update) aborrero: PSP: delete them [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/49 (https://phabricator.wikimedia.org/T368142)
[08:47:58] (update) project_1317_bot_df3177307bed93c3f34e421e26c86e38: maintain-kubeusers: bump to 0.0.157-20240627083410-f024d8fa [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/355
[08:51:38] !log aborrero@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers
[08:51:48] !log aborrero@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers
[08:52:13] !log aborrero@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers
[08:52:24] !log
aborrero@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers
[08:52:37] (merge) aborrero: maintain-kubeusers: bump to 0.0.157-20240627083410-f024d8fa [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/355 (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[08:54:25] (approved) dcaro: PSP: delete them [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/49 (https://phabricator.wikimedia.org/T368142) (owner: aborrero)
[08:54:42] (merge) aborrero: PSP: delete them [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/49 (https://phabricator.wikimedia.org/T368142)
[08:56:54] (open) project_1317_bot_df3177307bed93c3f34e421e26c86e38: maintain-kubeusers: bump to 0.0.159-20240627085452-0ae1a288 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/356 (https://phabricator.wikimedia.org/T368142)
[08:57:38] cloud-services-team (FY2023/2024-Q3-Q4), Data-Services, DBA: Prepare and check storage layer for btmwiki - https://phabricator.wikimedia.org/T368066#9929579 (fnegri) In progress→Resolved
[08:57:44] !log aborrero@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers
[08:57:54] !log aborrero@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers
[09:04:18] Data-Services, Data-Persistence, Data-Platform-SRE (2024.06.17 - 2024.07.07), Patch-For-Review: Bring an-redacteddb1001 into service to replace clouddb1021 - https://phabricator.wikimedia.org/T365453#9929586 (BTullis) Open→Resolved
[09:06:52]
cloud-services-team, Toolforge, Patch-For-Review: Toolforge: drop PodSecurityPolicy - https://phabricator.wikimedia.org/T368142#9929592 (aborrero)
[09:08:24] cloud-services-team, Bitu, Infrastructure-Foundations, LDAP: Allocate more available UNIX UIDs for human users - https://phabricator.wikimedia.org/T355663#9929595 (MoritzMuehlenhoff) a:MoritzMuehlenhoff I'll take care of this when I'm back from sabbatical
[09:14:29] (open) aborrero: k8s: drop PodSecurityPolicies [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/153 (https://phabricator.wikimedia.org/T368142)
[09:35:43] (merge) aborrero: k8s: drop PodSecurityPolicies [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/153 (https://phabricator.wikimedia.org/T368142)
[09:36:15] (update) aborrero: functional-tests: add pod-policy smoke test [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/325
[09:36:44] (merge) aborrero: functional-tests: add pod-policy smoke test [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/325
[09:39:40] Toolforge, Epic: [KR] WE6.3 Introduce a sustainability scoring system for the Toolforge platform - https://phabricator.wikimedia.org/T368600 (dcaro) NEW
[09:46:40] Toolforge, Epic: [Hypothesis] WE6.3.1 Consulting Toolforge roots/maintainers - https://phabricator.wikimedia.org/T368601 (dcaro) NEW
[09:46:45] Toolforge, Epic: [Hypothesis] WE6.3.1 Consulting Toolforge roots/maintainers - https://phabricator.wikimedia.org/T368601#9929726 (dcaro) p:Triage→High
[09:46:58] Toolforge, Epic: [KR] WE6.3 Introduce a sustainability scoring system for the Toolforge platform - https://phabricator.wikimedia.org/T368600#9929727 (dcaro) p:Triage→High
[09:48:17] Toolforge, Epic: [Hypothesis] WE6.3.2 Create "standard" tool to measure the number of steps for a deployment - https://phabricator.wikimedia.org/T368602 (dcaro) NEW
[09:49:34] Toolforge, Epic: [Hypothesis] WE6.3.2 Create "standard" tool to measure the number of steps for a deployment - https://phabricator.wikimedia.org/T368602#9929740 (dcaro) p:Triage→High
[09:49:56] !log aborrero@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers
[09:50:07] !log aborrero@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers
[09:51:02] (update) aborrero: maintain-kubeusers: bump to 0.0.159-20240627085452-0ae1a288 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/356 (https://phabricator.wikimedia.org/T368142) (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[09:51:30] (merge) aborrero: maintain-kubeusers: bump to 0.0.159-20240627085452-0ae1a288 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/356 (https://phabricator.wikimedia.org/T368142) (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[09:59:05] cloud-services-team, Toolforge, Patch-For-Review: Toolforge: drop PodSecurityPolicy - https://phabricator.wikimedia.org/T368142#9929776 (aborrero)
[10:06:33] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned: [ceph,osd,puppet] getting error from facter for `ceph_disks` fact - https://phabricator.wikimedia.org/T345227#9929830 (dcaro) Open→Resolved
[10:06:44] (open) aborrero: components: drop PSP references [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/357
(https://phabricator.wikimedia.org/T368142)
[10:06:57] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance: [cloudceph] Slow operations - tracking task - https://phabricator.wikimedia.org/T334240#9929827 (dcaro) Open→In progress
[10:08:55] (open) aborrero: deployment: drop PSP rolebinding [repos/cloud/toolforge/builds-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/98 (https://phabricator.wikimedia.org/T368142)
[10:09:57] Toolforge, Epic: [KR] WE6.3 Introduce a sustainability scoring system for the Toolforge platform - https://phabricator.wikimedia.org/T368600#9929838 (dcaro)
[10:10:35] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge, Epic: [KR] WE6.3 Introduce a sustainability scoring system for the Toolforge platform - https://phabricator.wikimedia.org/T368600#9929839 (dcaro)
[10:10:52] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge, Epic: [Hypothesis] WE6.3.1 Consulting Toolforge roots/maintainers - https://phabricator.wikimedia.org/T368601#9929844 (dcaro)
[10:10:58] (open) aborrero: deployment: drop PSP [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/24 (https://phabricator.wikimedia.org/T368142)
[10:11:08] (update) aborrero: resources: drop PSP [repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/48 (https://phabricator.wikimedia.org/T368142)
[10:11:16] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge, Epic: [Hypothesis] WE6.3.2 Create "standard" tool to measure the number of steps for a deployment - https://phabricator.wikimedia.org/T368602#9929849 (dcaro)
[10:11:21] Data-Services, Data-Persistence, Data-Platform-SRE (2024.06.17 - 2024.07.07), Patch-For-Review: Bring an-redacteddb1001 into service to replace clouddb1021 -
https://phabricator.wikimedia.org/T365453#9929847 (Marostegui) @btullis please address T368354 when you can, otherwise we are sort of b...
[10:13:27] (open) aborrero: deployment: drop PSP [repos/cloud/toolforge/builds-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-admission/-/merge_requests/8 (https://phabricator.wikimedia.org/T368142)
[10:15:05] (open) aborrero: deployment: drop PSP [repos/cloud/toolforge/builds-builder] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/47 (https://phabricator.wikimedia.org/T368142)
[10:15:29] FIRING: [10x] PuppetAgentNoResources: No Puppet resources found on instance runner-1021 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[10:16:04] (open) aborrero: deployment: drop PSP reference [repos/cloud/toolforge/envvars-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/34 (https://phabricator.wikimedia.org/T368142)
[10:16:47] (open) aborrero: helmchart: drop PSP [repos/cloud/toolforge/foxtrot-ldap] - https://gitlab.wikimedia.org/repos/cloud/toolforge/foxtrot-ldap/-/merge_requests/8 (https://phabricator.wikimedia.org/T368142)
[10:17:39] (open) aborrero: deployment: drop PSP reference [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/98 (https://phabricator.wikimedia.org/T368142)
[10:18:28] (open) aborrero: deployment: drop PSP [repos/cloud/toolforge/volume-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/volume-admission/-/merge_requests/10 (https://phabricator.wikimedia.org/T368142)
[10:20:18] (open) aborrero: tests/fixtures: drop PSP reference [repos/cloud/toolforge/toolforge-weld] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/47 (https://phabricator.wikimedia.org/T368142)
[10:20:53] (merge) aborrero: resources: drop PSP
[repos/cloud/toolforge/maintain-kubeusers] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/48 (https://phabricator.wikimedia.org/T368142)
[10:22:51] (open) project_1317_bot_df3177307bed93c3f34e421e26c86e38: maintain-kubeusers: bump to 0.0.160-20240627102103-cfd4ebd5 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/358 (https://phabricator.wikimedia.org/T368142)
[10:28:28] FIRING: PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-control-7 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[10:28:57] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.set_cluster_in_maintenance
[10:29:00] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.set_cluster_in_maintenance (exit_code=99)
[10:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:29:27] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.set_cluster_in_maintenance
[10:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:29:49] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.set_cluster_in_maintenance (exit_code=99)
[10:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:30:09] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.set_cluster_in_maintenance
[10:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:30:12] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.set_cluster_in_maintenance (exit_code=99)
[10:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:30:24] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.set_cluster_in_maintenance
[10:30:27] !log dcaro@urcuchillay admin
END (FAIL) - Cookbook wmcs.ceph.set_cluster_in_maintenance (exit_code=99)
[10:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:30:57] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.set_cluster_in_maintenance
[10:30:59] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.set_cluster_in_maintenance (exit_code=99)
[10:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:31:24] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.set_cluster_in_maintenance
[10:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:31:38] !log dcaro@urcuchillay admin END (PASS) - Cookbook wmcs.ceph.set_cluster_in_maintenance (exit_code=0)
[10:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:32:27] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.unset_cluster_maintenance
[10:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:32:39] !log dcaro@urcuchillay admin END (PASS) - Cookbook wmcs.ceph.unset_cluster_maintenance (exit_code=0)
[10:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:33:28] FIRING: [3x] PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-control-7 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[10:34:28] FIRING: PuppetAgentFailure: Puppet agent failure detected on instance toolsbeta-test-k8s-control-8 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[10:44:28] FIRING: [2x] PuppetAgentFailure: Puppet agent failure detected on instance toolsbeta-test-k8s-control-7 in project toolsbeta -
https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[10:49:28] FIRING: [3x] PuppetAgentFailure: Puppet agent failure detected on instance toolsbeta-test-k8s-control-7 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[10:59:41] !log aborrero@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers
[10:59:55] !log aborrero@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers
[11:02:44] !log aborrero@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers
[11:02:55] !log aborrero@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers
[11:03:22] (merge) aborrero: maintain-kubeusers: bump to 0.0.160-20240627102103-cfd4ebd5 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/358 (https://phabricator.wikimedia.org/T368142) (owner: project_1317_bot_df3177307bed93c3f34e421e26c86e38)
[11:13:58] PAWS: jupyterlab to 4.2.3 - https://phabricator.wikimedia.org/T368609 (rook) NEW
[11:15:07] PAWS: jupyterlab to 4.2.3 - https://phabricator.wikimedia.org/T368609#9930070 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/436
[11:15:20] vivian-rook opened https://github.com/toolforge/paws/pull/436
[11:23:58] RESOLVED: [3x] PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-control-7 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[11:38:37] PAWS: jupyterlab to 4.2.3 - https://phabricator.wikimedia.org/T368609#9930116 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/paws/pull/436
[11:38:46] PAWS: jupyterlab to 4.2.3 -
https://phabricator.wikimedia.org/T368609#9930117 (rook) Open→Resolved
[11:38:47] vivian-rook closed https://github.com/toolforge/paws/pull/436
[11:40:37] cloud-services-team, Toolforge, Patch-For-Review: Toolforge: drop PodSecurityPolicy - https://phabricator.wikimedia.org/T368142#9930118 (aborrero)
[11:41:48] (merge) aborrero: helmchart: drop PSP [repos/cloud/toolforge/foxtrot-ldap] - https://gitlab.wikimedia.org/repos/cloud/toolforge/foxtrot-ldap/-/merge_requests/8 (https://phabricator.wikimedia.org/T368142)
[11:42:52] (update) aborrero: helpers: rework many resources creation script [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/151
[11:43:45] (update) aborrero: components: drop PSP references [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/357 (https://phabricator.wikimedia.org/T368142)
[11:44:18] (merge) aborrero: helpers: rework many resources creation script [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/151
[11:44:58] RESOLVED: PuppetAgentFailure: Puppet agent failure detected on instance toolsbeta-test-k8s-control-9 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[11:45:29] cloud-services-team, Toolforge, Patch-For-Review: Toolforge: drop PodSecurityPolicy - https://phabricator.wikimedia.org/T368142#9930125 (aborrero) Open→Resolved
[11:45:39] cloud-services-team, Toolforge: [k8s,infra] track PSP migration plan - https://phabricator.wikimedia.org/T364297#9930127 (aborrero)
[11:45:55] cloud-services-team, Toolforge: [k8s,infra] track PSP migration plan - https://phabricator.wikimedia.org/T364297#9930128 (aborrero) this has been completed.
[12:04:48] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.set_cluster_in_maintenance
[12:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:05:01] !log dcaro@urcuchillay admin END (PASS) - Cookbook wmcs.ceph.set_cluster_in_maintenance (exit_code=0)
[12:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:05:59] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.unset_cluster_maintenance
[12:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:06:12] !log dcaro@urcuchillay admin END (PASS) - Cookbook wmcs.ceph.unset_cluster_maintenance (exit_code=0)
[12:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:06:19] (PS14) David Caro: ceph: drain and undrain in chunks [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1013369 (https://phabricator.wikimedia.org/T329709)
[12:06:19] (PS1) David Caro: alerts: use spicerack provided code [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1050335
[12:07:16] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.depool_and_destroy (T309789)
[12:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:07:21] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[12:10:56] (CR) CI reject: [V:-1] alerts: use spicerack provided code [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1050335 (owner: David Caro)
[12:11:07] (CR) CI reject: [V:-1] ceph: drain and undrain in chunks [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1013369 (https://phabricator.wikimedia.org/T329709) (owner: David Caro)
[12:43:56] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirtlocal1001.eqiad.wmnet'
[12:44:13] cloud-services-team, Data-Services, SRE: [wikireplicas] Make sure there is no sensitive data
in clouddb hosts - https://phabricator.wikimedia.org/T368136#9930357 (Ladsgroup) I personally have no issue with giving root rights to people who have restricted or deployment rights in production (where they...
[12:44:38] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirtlocal1001.eqiad.wmnet'
[12:52:55] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, Patch-For-Review: Migrate eqiad1 hypervisors to Neutron OVS agent - https://phabricator.wikimedia.org/T364457#9930385 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudvirtlocal1001.eqiad.wmnet w...
[12:56:51] Cloud-VPS (Debian Buster Deprecation), collaboration-services: Cloud VPS "packaging" project Buster deprecation - https://phabricator.wikimedia.org/T367544#9930389 (Jelto)
[13:01:18] Cloud-VPS (Debian Buster Deprecation), collaboration-services: Cloud VPS "packaging" project Buster deprecation - https://phabricator.wikimedia.org/T367544#9930391 (Jelto) Open→Resolved I deleted `packager02.packaging.eqiad1.wikimedia.cloud`, which was the last buster instance. So I'll resolv...
[13:17:44] Data-Services: [toolsdb] Replica is frequently lagging behind the primary - https://phabricator.wikimedia.org/T357624#9930441 (fnegri) We are currently using `binlog_format=ROW`. Setting `binlog_format=MIXED` (the same that is used in production) is likely to help, because it will hopefully choose the "state...
[13:17:55] (PS15) David Caro: ceph: drain and undrain in chunks [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1013369 (https://phabricator.wikimedia.org/T329709)
[13:17:55] (PS2) David Caro: alerts: use spicerack provided code [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1050335
[13:20:16] (update) dcaro: functional-tests: show the installed versions [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/349
[13:20:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:20:46] (merge) dcaro: functional-tests: show the installed versions [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/349
[13:21:17] (CR) CI reject: [V:-1] alerts: use spicerack provided code [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1050335 (owner: David Caro)
[13:21:20] (update) dcaro: dev: add docs on how to setup the dev env [toolforge-repos/admin-web] - https://gitlab.wikimedia.org/toolforge-repos/admin-web/-/merge_requests/1
[13:23:15] (approved) dcaro: deployment: drop PSP rolebinding [repos/cloud/toolforge/builds-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/98 (https://phabricator.wikimedia.org/T368142) (owner: aborrero)
[13:25:33] (update) dcaro: deployment: drop PSP rolebinding [repos/cloud/toolforge/builds-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/98 (https://phabricator.wikimedia.org/T368142) (owner: aborrero)
[13:30:52] (PS3) David Caro: alerts: use spicerack provided code [cloud/wmcs-cookbooks] -
https://gerrit.wikimedia.org/r/1050335
[13:33:56] (CR) CI reject: [V:-1] alerts: use spicerack provided code [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1050335 (owner: David Caro)
[13:35:36] (approved) dcaro: deployment: drop PSP [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/24 (https://phabricator.wikimedia.org/T368142) (owner: aborrero)
[13:35:40] (update) dcaro: deployment: drop PSP [repos/cloud/toolforge/api-gateway] - https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/24 (https://phabricator.wikimedia.org/T368142) (owner: aborrero)
[13:40:47] (PS4) David Caro: alerts: use spicerack provided code [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1050335
[13:42:16] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate True, for hosts list: ['cloudvirtlocal1001']
[13:43:03] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, Patch-For-Review: Migrate eqiad1 hypervisors to Neutron OVS agent - https://phabricator.wikimedia.org/T364457#9930543 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudvirtlocal1001.eqiad.wmnet with...
[13:43:05] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate True, for hosts list: ['cloudvirtlocal1001']
[13:44:00] !log dcaro@urcuchillay admin END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0) (T309789)
[13:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[13:44:06] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789
[13:44:17] (CR) CI reject: [V:-1] alerts: use spicerack provided code [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1050335 (owner: David Caro)
[13:47:05] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.openstack.migrate_server_to_ovs for server toolsbeta-test-k8s-etcd-23
[13:47:09] !log andrew@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.openstack.migrate_server_to_ovs (exit_code=99) for server toolsbeta-test-k8s-etcd-23
[13:59:19] (PS1) Andrew Bogott: migrate_server_to_ovs.py: support ovs migration for localdisk flavors [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1050368
[14:00:13] (open) andrew: Add .localdisk flavors for etcd [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/10
[14:00:18] (update) andrew: Add .localdisk flavors for etcd [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/10
[14:12:40] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#9930644 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1...
[14:12:59] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Data-Services: [wikireplicas] frequent replag spikes in clouddb hosts - https://phabricator.wikimedia.org/T367778#9930645 (10fnegri) I will try to depool clouddb1015@s4 to remove any load from it and see if replication catches up. All s4 wikireplica traffic will go... [14:13:11] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Data-Services: [wikireplicas] frequent replag spikes in clouddb hosts - https://phabricator.wikimedia.org/T367778#9930642 (10fnegri) > it's basically reading from disk, which degrades performance on the host. That makes sense, but I'm still not understanding what i... [14:24:56] FIRING: SystemdUnitDown: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:29:56] FIRING: [2x] SystemdUnitDown: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:31:27] (03merge) 10andrew: Add .localdisk flavors for etcd [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/10 [14:34:19] (03CR) 10Andrew Bogott: [C:03+2] migrate_server_to_ovs.py: support ovs migration for localdisk flavors [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1050368 (owner: 10Andrew Bogott) [14:36:38] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.openstack.migrate_server_to_ovs for server toolsbeta-test-k8s-etcd-23 [14:39:10] (03merge) 10aborrero: deployment: drop PSP rolebinding [repos/cloud/toolforge/builds-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/98 (https://phabricator.wikimedia.org/T368142) [14:42:02] (03open) 10ebomani: Draft: Testing error generation for envvars-api [repos/cloud/toolforge/envvars-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/35 (https://phabricator.wikimedia.org/T366697) [14:42:48] (03open) 10aborrero: ingress-nginx: drop PSP [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/359 (https://phabricator.wikimedia.org/T368142) [14:43:38] !log aborrero@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component ingress-nginx [14:43:49] (03update) 10project_1317_bot_df3177307bed93c3f34e421e26c86e38: builds-api: bump to 0.0.156-20240625082108-71537e14 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/346 [14:43:50] !log aborrero@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component ingress-nginx [14:43:53] !log andrew@cloudcumin1001 toolsbeta END (FAIL) - Cookbook 
wmcs.openstack.migrate_server_to_ovs (exit_code=99) for server toolsbeta-test-k8s-etcd-23 [14:43:54] (03update) 10project_1317_bot_df3177307bed93c3f34e421e26c86e38: builds-api: bump to 0.0.156-20240625082108-71537e14 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/346 (https://phabricator.wikimedia.org/T368142) [14:44:33] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.openstack.migrate_server_to_ovs for server toolsbeta-test-k8s-etcd-23 [14:44:56] RESOLVED: [2x] SystemdUnitDown: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:45:26] FIRING: SystemdUnitDown: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:47:27] (03update) 10aborrero: ingress-nginx: drop PSP [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/359 (https://phabricator.wikimedia.org/T368142) [14:47:28] !log aborrero@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component ingress-nginx [14:47:39] !log aborrero@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component ingress-nginx [14:49:21] !log andrew@cloudcumin1001 toolsbeta END (ERROR) - Cookbook wmcs.openstack.migrate_server_to_ovs (exit_code=97) for server toolsbeta-test-k8s-etcd-23 [14:49:24] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.openstack.migrate_server_to_ovs for server toolsbeta-test-k8s-etcd-23 [14:50:49] 
!log aborrero@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component ingress-nginx [14:51:01] !log aborrero@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component ingress-nginx [14:53:31] !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.openstack.migrate_server_to_ovs (exit_code=0) for server toolsbeta-test-k8s-etcd-23 [14:54:11] (03merge) 10aborrero: ingress-nginx: drop PSP [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/359 (https://phabricator.wikimedia.org/T368142) [14:55:11] FIRING: [2x] SystemdUnitDown: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:57:11] 06cloud-services-team, 10Toolforge, 13Patch-For-Review: Toolforge: drop PodSecurityPolicy - https://phabricator.wikimedia.org/T368142#9930821 (10aborrero) [14:57:42] (03close) 10aborrero: components: drop PSP references [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/357 (https://phabricator.wikimedia.org/T368142) [14:59:05] (03open) 10aborrero: cert-manager: drop PSP [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/360 (https://phabricator.wikimedia.org/T368142) [14:59:56] !log aborrero@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component cert-manager [15:00:08] !log aborrero@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component cert-manager [15:00:11] FIRING: [2x] SystemdUnitDown: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:00:18] 06cloud-services-team, 10Toolforge, 13Patch-For-Review: Toolforge: drop PodSecurityPolicy - https://phabricator.wikimedia.org/T368142#9930817 (10aborrero) 05Resolved→03In progress Reopening while we merge the cleanup patches. [15:00:26] RESOLVED: [2x] SystemdUnitDown: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:03:10] !log aborrero@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component cert-manager [15:03:20] !log aborrero@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component cert-manager [15:03:27] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, and 2 others: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#9930855 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1... 
[15:04:09] (03merge) 10aborrero: cert-manager: drop PSP [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/360 (https://phabricator.wikimedia.org/T368142) [15:07:55] 10cloud-services-team (Hardware), 05Goal: eqiad1: procure 1 additional cloudlb server - https://phabricator.wikimedia.org/T341062#9930891 (10aborrero) a:03Andrew [15:12:50] FIRING: NeutronAgentDown: Neutron neutron-linuxbridge-agent on cloudvirtlocal1001 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [15:16:50] FIRING: NeutronAgentDownForLong: Neutron neutron-linuxbridge-agent on cloudvirtlocal1001 has been down for more than 2h - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDownForLong [15:17:10] 06cloud-services-team: NeutronAgentDownForLong A Neutron agent has been down for more than 2h, VMs will have connectivity issues - https://phabricator.wikimedia.org/T365461#9930947 (10phaultfinder) [15:17:51] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, and 2 others: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#9930954 (10ops-monitoring-bot) Host rebooted by dcaro@cumin1002 with reason: upgraded packa... 
[15:21:26] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T309789) [15:21:27] !log dcaro@urcuchillay admin END (ERROR) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=97) (T309789) [15:21:30] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T309789) [15:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [15:21:32] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789 [15:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [15:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [15:22:28] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T309789) [15:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [15:28:10] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [15:33:20] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.openstack.migrate_server_to_ovs for server tools-k8s-etcd-22 [15:34:43] 06cloud-services-team, 10Toolforge, 13Patch-For-Review: Toolforge: drop PodSecurityPolicy - https://phabricator.wikimedia.org/T368142#9931106 (10aborrero) ` aborrero@toolsbeta-test-k8s-control-7:~$ sudo helm list -n cert-manager NAME NAMESPACE REVISION UPDATED STATU... 
[15:35:54] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T309789) [15:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [15:35:59] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789 [15:36:04] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.openstack.migrate_server_to_ovs (exit_code=0) for server tools-k8s-etcd-22 [15:36:47] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T309789) [15:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [15:37:57] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.openstack.migrate_server_to_ovs for server tools-k8s-etcd-24 [15:39:07] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T309789) [15:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [15:40:41] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.openstack.migrate_server_to_ovs (exit_code=0) for server tools-k8s-etcd-24 [15:41:44] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-VPS, 13Patch-For-Review: Migrate eqiad1 hypervisors to Neutron OVS agent - https://phabricator.wikimedia.org/T364457#9931143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudvirtlocal1002.eqiad.wmnet w... 
[15:45:47] 06cloud-services-team, 10Toolforge: toolforge: make sure we cache in our repos/registries all helm charts and container images used in k8s - https://phabricator.wikimedia.org/T368630 (10aborrero) 03NEW [15:46:30] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.openstack.migrate_server_to_ovs for server tools-db-3 [15:48:27] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.openstack.migrate_server_to_ovs (exit_code=0) for server tools-db-3 [15:49:21] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.openstack.migrate_server_to_ovs for server tools-db-1 [15:49:24] !log andrew@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.openstack.migrate_server_to_ovs (exit_code=99) for server tools-db-1 [15:52:31] FIRING: ToolsToolsDBReplicationMissing: ToolsDB replication is not running on tools-db-1 (errno 0) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationMissing [16:06:09] (03open) 10andrew: Add more oddball flavors [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/11 [16:06:40] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [16:07:31] RESOLVED: ToolsToolsDBReplicationMissing: ToolsDB replication is not running on tools-db-1 (errno 0) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationMissing [16:08:11] (03PS1) 10Andrew Bogott: migrate_server_to_ovs.py: Support more flavor mappings. 
[cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1050412 [16:09:05] (03merge) 10andrew: Add more oddball flavors [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/11 [16:11:04] 10Cloud-VPS (Quota-requests): Request to increase catalyst project: cores and memory - https://phabricator.wikimedia.org/T368634 (10SDunlap) 03NEW [16:12:37] (03CR) 10Andrew Bogott: [C:03+2] migrate_server_to_ovs.py: Support more flavor mappings. [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1050412 (owner: 10Andrew Bogott) [16:19:01] (03Merged) 10jenkins-bot: migrate_server_to_ovs.py: Support more flavor mappings. [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1050412 (owner: 10Andrew Bogott) [16:21:23] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.openstack.migrate_server_to_ovs for server tools-db-1 [16:22:43] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.openstack.migrate_server_to_ovs (exit_code=0) for server tools-db-1 [16:23:52] (03PS5) 10David Caro: alerts: use spicerack provided code [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1050335 [16:24:26] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate True, for hosts list: ['cloudvirtlocal1002'] [16:24:44] (03PS6) 10David Caro: alerts: use spicerack provided code [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1050335 [16:25:29] FIRING: [10x] PuppetAgentNoResources: No Puppet resources found on instance runner-1021 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [16:25:54] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate True, for hosts list: ['cloudvirtlocal1002'] [16:26:34] !log andrew@cloudcumin1001 wmf-research-tools START - Cookbook 
wmcs.openstack.migrate_server_to_ovs for server covid-data [16:26:56] !log andrew@cloudcumin1001 wmf-research-tools END (FAIL) - Cookbook wmcs.openstack.migrate_server_to_ovs (exit_code=99) for server covid-data [16:27:34] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-VPS, 13Patch-For-Review: Migrate eqiad1 hypervisors to Neutron OVS agent - https://phabricator.wikimedia.org/T364457#9931317 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudvirtlocal1002.eqiad.wmnet with... [16:27:51] (03PS1) 10Andrew Bogott: migrate_server_to_ovs: fix typo in flavor map [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1050415 [16:28:06] (03CR) 10CI reject: [V:04-1] alerts: use spicerack provided code [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1050335 (owner: 10David Caro) [16:28:29] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-VPS, 06DC-Ops, 10ops-eqiad, 06SRE: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9931318 (10dcaro) @CDanis I'm reimaging another osd node, so some more load is being applied, I'm not seeing any iss... 
[16:30:29] FIRING: [10x] PuppetAgentNoResources: No Puppet resources found on instance runner-1021 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [16:34:09] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.openstack.migrate_server_to_ovs for server toolsbeta-test-k8s-etcd-25 [16:35:03] (03CR) 10Andrew Bogott: [C:03+2] migrate_server_to_ovs: fix typo in flavor map [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1050415 (owner: 10Andrew Bogott) [16:35:29] FIRING: [10x] PuppetAgentNoResources: No Puppet resources found on instance runner-1021 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [16:36:53] !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.openstack.migrate_server_to_ovs (exit_code=0) for server toolsbeta-test-k8s-etcd-25 [16:37:35] !log andrew@cloudcumin1001 toolsbeta START - Cookbook wmcs.openstack.migrate_server_to_ovs for server toolsbeta-test-k8s-etcd-26 [16:37:42] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Data-Services: [wikireplicas] frequent replag spikes in clouddb hosts - https://phabricator.wikimedia.org/T367778#9931382 (10fnegri) clouddb1015@s4 is still struggling to catch up, even after being depooled. {F55921832} It looks CPU-bound to me, as the replication... 
[16:38:22] (03Merged) 10jenkins-bot: migrate_server_to_ovs: fix typo in flavor map [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1050415 (owner: 10Andrew Bogott) [16:40:19] !log andrew@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.openstack.migrate_server_to_ovs (exit_code=0) for server toolsbeta-test-k8s-etcd-26 [16:40:29] FIRING: [10x] PuppetAgentNoResources: No Puppet resources found on instance runner-1021 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [16:44:51] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.openstack.migrate_server_to_ovs for server tools-k8s-etcd-23 [16:45:29] FIRING: [9x] PuppetAgentNoResources: No Puppet resources found on instance runner-1021 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [16:49:05] 06cloud-services-team, 10Cloud-VPS, 06Data-Platform-SRE: Decom cloudvirt-wdqs servers - https://phabricator.wikimedia.org/T367770#9931426 (10bking) 05Open→03Resolved Looks like the subtask that contains the actual work is resolved, so I'm going to close this one out... [16:49:16] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.openstack.migrate_server_to_ovs (exit_code=0) for server tools-k8s-etcd-23 [16:50:28] !log andrew@cloudcumin1001 wmf-research-tools START - Cookbook wmcs.openstack.migrate_server_to_ovs for server covid-data [16:50:30] RESOLVED: [5x] PuppetAgentNoResources: No Puppet resources found on instance runner-1022 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [16:50:32] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-VPS, 13Patch-For-Review: Migrate eqiad1 hypervisors to Neutron OVS agent - https://phabricator.wikimedia.org/T364457#9931440 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudvirtlocal1003.eqiad.wmnet w... 
[16:50:38] !log andrew@cloudcumin1001 wmf-research-tools END (FAIL) - Cookbook wmcs.openstack.migrate_server_to_ovs (exit_code=99) for server covid-data [16:51:03] 06cloud-services-team, 10Cloud-VPS (Quota-requests): Request to increase catalyst project: cores and memory - https://phabricator.wikimedia.org/T368634#9931443 (10bd808) +1 [16:51:25] (03CR) 10Andrew Bogott: [C:03+2] Add cookbook to migrate a database instance [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1048085 (owner: 10Andrew Bogott) [16:52:44] 06cloud-services-team, 10Toolforge: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.25 - https://phabricator.wikimedia.org/T316107#9931463 (10dcaro) a:03aborrero [16:53:32] !log andrew@cloudcumin1001 wmf-research-tools START - Cookbook wmcs.openstack.migrate_server_to_ovs for server covid-data [16:53:59] 10Toolforge: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 - https://phabricator.wikimedia.org/T359641#9931472 (10dcaro) a:03dcaro [16:54:09] !log taavi@cloudcumin1001 catalyst START - Cookbook wmcs.openstack.quota_increase (T368634) [16:54:12] T368634: Request to increase catalyst project: cores and memory - https://phabricator.wikimedia.org/T368634 [16:54:13] 06cloud-services-team, 10Toolforge: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.26 - https://phabricator.wikimedia.org/T327025#9931465 (10dcaro) a:03Slst2020 [16:54:17] !log taavi@cloudcumin1001 catalyst END (PASS) - Cookbook wmcs.openstack.quota_increase (exit_code=0) (T368634) [16:54:22] !log andrew@cloudcumin1001 wmf-research-tools END (PASS) - Cookbook wmcs.openstack.migrate_server_to_ovs (exit_code=0) for server covid-data [16:54:56] 06cloud-services-team, 10Cloud-VPS (Quota-requests): Request to increase catalyst project: cores and memory - https://phabricator.wikimedia.org/T368634#9931480 (10taavi) 05Open→03Resolved a:03taavi [16:55:18] !log andrew@cloudcumin1001 wikiwho START - Cookbook wmcs.openstack.migrate_server_to_ovs for server wikiwho01 [16:55:39] 
10Toolforge: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.28 - https://phabricator.wikimedia.org/T362867#9931474 (10dcaro) a:03Raymond_Ndibe [16:55:49] 10Toolforge: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.29 - https://phabricator.wikimedia.org/T362868#9931477 (10dcaro) a:03fnegri [16:56:27] 10Toolforge: [k8s,infra] Upgrade Toolforge to Uwubernetes (1.30) - https://phabricator.wikimedia.org/T362869#9931482 (10dcaro) a:03aborrero [16:57:15] !log andrew@cloudcumin1001 wikiwho END (PASS) - Cookbook wmcs.openstack.migrate_server_to_ovs (exit_code=0) for server wikiwho01 [16:58:47] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1057.eqiad.wmnet' [17:00:42] RESOLVED: CloudVPSDesignateLeaks: Detected 4 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [17:01:57] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1057.eqiad.wmnet' [17:02:13] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1065.eqiad.wmnet' [17:04:41] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1065.eqiad.wmnet' [17:05:09] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1066.eqiad.wmnet' [17:05:49] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1066.eqiad.wmnet' [17:06:00] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1058.eqiad.wmnet' [17:10:38] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 
'cloudvirt1058.eqiad.wmnet' [17:10:43] (03PS3) 10Andrew Bogott: openstack api: clarify that server_show takes a name or an ID [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1048084 [17:10:43] (03PS7) 10Andrew Bogott: Add cookbook to migrate a database instance [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1048085 [17:15:44] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1067.eqiad.wmnet' [17:16:25] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1067.eqiad.wmnet' [17:17:14] (03PS10) 10David Caro: ceph.osd.drain_node: force passing the cluster name [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/990977 [17:17:14] (03PS10) 10David Caro: ceph.osd.undrain_node: fix help and default batch param [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/990978 [17:17:14] (03PS11) 10David Caro: ceph: add missing cumin params [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/990979 [17:17:15] (03PS16) 10David Caro: ceph: drain and undrain in chunks [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1013369 (https://phabricator.wikimedia.org/T329709) [17:17:16] (03PS7) 10David Caro: alerts: use spicerack provided code [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1050335 [17:20:23] (03CR) 10CI reject: [V:04-1] alerts: use spicerack provided code [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1050335 (owner: 10David Caro) [17:20:35] (03CR) 10CI reject: [V:04-1] ceph: drain and undrain in chunks [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1013369 (https://phabricator.wikimedia.org/T329709) (owner: 10David Caro) [17:21:20] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-VPS, 13Patch-For-Review: Migrate eqiad1 hypervisors to Neutron OVS agent - https://phabricator.wikimedia.org/T364457#9931604 (10Andrew) [17:24:55] (03CR) 10Andrew Bogott: [C:03+2] openstack api: clarify 
that server_show takes a name or an ID [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1048084 (owner: 10Andrew Bogott) [17:28:41] (03Merged) 10jenkins-bot: openstack api: clarify that server_show takes a name or an ID [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1048084 (owner: 10Andrew Bogott) [17:28:42] (03Merged) 10jenkins-bot: Add cookbook to migrate a database instance [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1048085 (owner: 10Andrew Bogott) [17:31:59] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate True, for hosts list: ['cloudvirtlocal1003'] [17:32:47] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate True, for hosts list: ['cloudvirtlocal1003'] [17:35:13] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-VPS, 13Patch-For-Review: Migrate eqiad1 hypervisors to Neutron OVS agent - https://phabricator.wikimedia.org/T364457#9931673 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudvirtlocal1003.eqiad.wmnet with... 
[17:49:20] RESOLVED: NeutronAgentDown: Neutron neutron-linuxbridge-agent on cloudvirtlocal1001 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [17:49:20] RESOLVED: NeutronAgentDownForLong: Neutron neutron-linuxbridge-agent on cloudvirtlocal1001 has been down for more than 2h - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDownForLong [17:49:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [17:58:39] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate True, for hosts list: ['cloudvirt1057'] [17:59:01] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate True, for hosts list: ['cloudvirt1057'] [17:59:42] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [18:03:50] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate True, for hosts list: ['cloudvirt1058'] [18:04:12] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary 
(exit_code=0) on eqiad1, with recreate True, for hosts list: ['cloudvirt1058'] [18:07:01] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate True, for hosts list: ['cloudvirt1066'] [18:07:23] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate True, for hosts list: ['cloudvirt1066'] [18:10:46] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate True, for hosts list: ['cloudvirt1065'] [18:11:08] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate True, for hosts list: ['cloudvirt1065'] [18:11:25] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate True, for hosts list: ['cloudvirt1067'] [18:11:47] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate True, for hosts list: ['cloudvirt1067'] [18:11:58] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate True, for hosts list: ['cloudvirt1065'] [18:12:15] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate True, for hosts list: ['cloudvirt1065'] [18:32:21] !log andrew@cloudcumin1001 procbot START - Cookbook wmcs.openstack.migrate_project_to_ovs [18:36:29] !log andrew@cloudcumin1001 procbot END (PASS) - Cookbook wmcs.openstack.migrate_project_to_ovs (exit_code=0) [18:37:40] !log andrew@cloudcumin1001 superset START - Cookbook wmcs.openstack.migrate_server_to_ovs for server superset-126-2-bd7gsnmske5d-master-0 [18:38:49] !log 
andrew@cloudcumin1001 superset END (PASS) - Cookbook wmcs.openstack.migrate_server_to_ovs (exit_code=0) for server superset-126-2-bd7gsnmske5d-master-0 [18:39:29] !log andrew@cloudcumin1001 superset START - Cookbook wmcs.openstack.migrate_project_to_ovs [18:42:01] !log andrew@cloudcumin1001 superset END (PASS) - Cookbook wmcs.openstack.migrate_project_to_ovs (exit_code=0) [18:43:10] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [18:46:06] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1064.eqiad.wmnet' [18:48:50] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1064.eqiad.wmnet' [18:49:33] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1059.eqiad.wmnet' [18:49:42] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [18:59:42] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [19:01:33] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1059.eqiad.wmnet' [19:05:13] !log andrew@cloudcumin1001 admin START - Cookbook 
wmcs.openstack.cloudvirt.drain on host 'cloudvirt1059.eqiad.wmnet' [19:05:58] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1059.eqiad.wmnet' [19:19:58] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, Patch-For-Review: Migrate eqiad1 hypervisors to Neutron OVS agent - https://phabricator.wikimedia.org/T364457#9932269 (Andrew) [19:21:57] FIRING: SystemdUnitDown: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:26:56] RESOLVED: SystemdUnitDown: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:31:04] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate True, for hosts list: ['cloudvirt1064'] [19:31:26] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate True, for hosts list: ['cloudvirt1064'] [19:45:26] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS: Migrate WMCS managed projects to g4 flavors - https://phabricator.wikimedia.org/T367723#9932406 (Andrew) [19:46:19] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS: Migrate WMCS managed projects to g4 flavors - https://phabricator.wikimedia.org/T367723#9932419 (Andrew) [19:48:12] Cloud-VPS (Quota-requests): puppet-diffs quota request - https://phabricator.wikimedia.org/T368669 (jhathaway) NEW
[19:48:30] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate True, for hosts list: ['cloudvirt1059'] [19:48:49] !log andrew@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99) on eqiad1, with recreate True, for hosts list: ['cloudvirt1059'] [19:49:10] Cloud-VPS (Debian Buster Deprecation), Infrastructure-Foundations, Puppet CI: Cloud VPS "puppet-diffs" project Buster deprecation - https://phabricator.wikimedia.org/T367547#9932434 (jhathaway) a:jhathaway [19:49:31] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate True, for hosts list: ['cloudvirt1059'] [19:49:53] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate True, for hosts list: ['cloudvirt1059'] [19:50:12] Cloud-VPS (Quota-requests): puppet-diffs quota request - https://phabricator.wikimedia.org/T368669#9932435 (jhathaway) [19:50:13] Cloud-VPS (Debian Buster Deprecation), Infrastructure-Foundations, Puppet CI: Cloud VPS "puppet-diffs" project Buster deprecation - https://phabricator.wikimedia.org/T367547#9932436 (jhathaway) [19:55:43] Cloud-VPS (Quota-requests): puppet-diffs quota request for buster migration - https://phabricator.wikimedia.org/T368669#9932451 (jhathaway) [19:59:01] cloud-services-team, Toolforge: Upgrade Toolforge (Elastic|Open)Search cluster to Debian Bookworm - https://phabricator.wikimedia.org/T311905#9932456 (Andrew) [20:00:13] cloud-services-team, Toolforge: Upgrade Toolforge (Elastic|Open)Search cluster to Debian Bookworm - https://phabricator.wikimedia.org/T311905#9932457 (Andrew) a:Andrew [20:23:56] FIRING: SystemdUnitDown: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [20:28:56] RESOLVED: SystemdUnitDown: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [20:49:42] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [20:54:29] FIRING: PuppetAgentNoResources: No Puppet resources found on instance tools-opensearch-1 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [21:04:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [21:09:02] (update) ebomani: Draft: Testing error generation for envvars-api [repos/cloud/toolforge/envvars-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/35 (https://phabricator.wikimedia.org/T366697) [21:15:13] cloud-services-team, Toolforge, Kubernetes: Migrate Toolforge Kubernetes hosts to Debian Bullseye or later - https://phabricator.wikimedia.org/T311908#9932655 (taavi) Open→Resolved a:taavi [21:39:13] (update) ebomani: Draft: Testing error generation for envvars-api
[repos/cloud/toolforge/envvars-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/35 (https://phabricator.wikimedia.org/T366697) [22:49:40] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [22:50:09] !log dcaro@urcuchillay admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T309789) [22:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [22:50:14] T309789: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789