[00:04:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[00:44:15] (PS1) BryanDavis: Remove Twitter support [labs/tools/stashbot] - https://gerrit.wikimedia.org/r/973273 (https://phabricator.wikimedia.org/T343157)
[00:44:17] (PS1) BryanDavis: dev: Bump ib3 dependency to 0.3.0 [labs/tools/stashbot] - https://gerrit.wikimedia.org/r/973274
[00:57:41] (CR) BryanDavis: [C: +2] Remove Twitter support [labs/tools/stashbot] - https://gerrit.wikimedia.org/r/973273 (https://phabricator.wikimedia.org/T343157) (owner: BryanDavis)
[00:57:45] (CR) BryanDavis: [C: +2] dev: Bump ib3 dependency to 0.3.0 [labs/tools/stashbot] - https://gerrit.wikimedia.org/r/973274 (owner: BryanDavis)
[00:58:14] (Merged) jenkins-bot: Remove Twitter support [labs/tools/stashbot] - https://gerrit.wikimedia.org/r/973273 (https://phabricator.wikimedia.org/T343157) (owner: BryanDavis)
[00:58:17] (Merged) jenkins-bot: dev: Bump ib3 dependency to 0.3.0 [labs/tools/stashbot] - https://gerrit.wikimedia.org/r/973274 (owner: BryanDavis)
[01:28:53] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[03:04:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[03:22:55] Toolforge (Toolforge iteration 02), User-Raymond_Ndibe: [apis] nginx fails to reload on config change - https://phabricator.wikimedia.org/T350928 (Raymond_Ndibe)
[03:23:21] Toolforge (Toolforge iteration 02), User-Raymond_Ndibe: [apis] nginx fails to reload on config change - https://phabricator.wikimedia.org/T350928 (Raymond_Ndibe) a:Raymond_Ndibe
[03:57:26] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[03:58:56] (ToolsToolsDBWritableState) firing: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[04:08:56] (ToolsToolsDBWritableState) resolved: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[05:24:37] (CephSlowOps) firing: Ceph cluster in eqiad has 46 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[05:24:42] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (phaultfinder)
[05:28:53] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[05:29:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 3 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[06:04:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[06:16:34] Grid-Engine-to-K8s-Migration: Migrate pearbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319961 (Trialpears) Open→Resolved It should now (finally) be resolved.
[07:00:15] RECOVERY - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2002 is OK: OK: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:12:05] PROBLEM - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2002 is CRITICAL: CRITICAL: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:54:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[07:57:27] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[07:59:59] (PuppetConstantChange) resolved: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[08:52:27] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[09:04:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[09:28:53] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[10:09:42] Toolforge, cloud-services-team: tools-nfs-2 almost out of disk space (October 2023 edition) - https://phabricator.wikimedia.org/T349895 (kostajh)
[10:28:54] Cloud-VPS, cloud-services-team: WMCS public range diffscan - https://phabricator.wikimedia.org/T206653 (taavi) Open→Resolved
[10:31:58] Cloud-VPS, cloud-services-team: SPF record for wmflabs.org defaults to ?all - https://phabricator.wikimedia.org/T309813 (taavi) Open→Resolved a:taavi Done.
[11:04:14] Grid-Engine-to-K8s-Migration: Migrate steve-adder from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320062 (Aklapper) @komla Who is "we" and who are you addressing?
[11:45:53] Data-Services, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri) Another OOM crash last night, this one woke up @taavi who was on call: ` Nov 10 03:53:05 tools-db-1...
[11:56:07] Data-Services, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri) Possibly another red herring, but looking at the logs it looks like every time after an OOM crash, som...
[12:04:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[12:39:07] Data-Services, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri) The grafana chart for `node_memory_MemAvailable_bytes` makes it look more like a memory leak, because...
[12:52:27] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[13:04:56] (ToolsToolsDBReplicationLagIsTooHigh) firing: ToolsDB replication on tools-db-2 is lagging behind the primary, the current lag is 1.127875e+06 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBReplicationLagIsTooHigh - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh
[13:26:53] Data-Services, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri) Looking at the graphs, when I [lowered innodb_buffer_pool_size](https://phabricator.wikimedia.org/T34...
[13:28:53] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[13:44:56] Data-Services, cloud-services-team (FY2023/2024-Q1): [toolsdb] no alert if replication stops - https://phabricator.wikimedia.org/T350943 (fnegri)
[13:49:48] Data-Services, cloud-services-team (FY2023/2024-Q1): [toolsdb] no alert if replication stops - https://phabricator.wikimedia.org/T350943 (fnegri) All the values in prometheus map to values in [SHOW SLAVE STATUS](https://mariadb.com/kb/en/show-replica-status/). `Last_Errno` that we're currently monitoring...
[13:50:04] Data-Services, cloud-services-team (FY2023/2024-Q1): [toolsdb] no alert if replication stops because of IO error - https://phabricator.wikimedia.org/T350943 (fnegri)
[14:02:23] Cloud-VPS, cloud-services-team: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (taavi) a:taavi
[14:03:26] Cloud-VPS, cloud-services-team, SRE: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (taavi) I think this is fixed now, right?
[14:03:35] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): Do not require overriding ::openstack_controllers per context - https://phabricator.wikimedia.org/T347554 (taavi) Open→Resolved
[14:03:37] cloud-services-team (FY2023/2024-Q1), Epic, Goal: openstack eqiad1: introduce cloud-private and cloudlb - https://phabricator.wikimedia.org/T341060 (taavi)
[14:17:56] Data-Services, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (Marostegui) The tables crashed are a consequence and not a cause. That's normal after a crash. As I mentioned...
[14:18:15] Data-Services, cloud-services-team (FY2023/2024-Q1): [toolsdb] no alert if replication stops because of IO error - https://phabricator.wikimedia.org/T350943 (ABran-WMF) I've added a [[ https://gerrit.wikimedia.org/r/c/operations/alerts/+/963980 | few suggestions ]] on alerts that could be helpful. I can...
[14:35:18] Data-Services, cloud-services-team (FY2023/2024-Q1), DBA: [toolsdb] no alert if replication stops because of IO error - https://phabricator.wikimedia.org/T350943 (ABran-WMF)
[14:57:51] Data-Services, cloud-services-team (FY2023/2024-Q1), Data-Persistence: [toolsdb] no alert if replication stops because of IO error - https://phabricator.wikimedia.org/T350943 (Marostegui)
[15:01:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:04:03] (TfInfraTestApplyFailed) resolved: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[15:09:48] (PS1) EoghanGaffney: [apt-staging] Add dummy key for apt-staging host [labs/private] - https://gerrit.wikimedia.org/r/973352
[15:11:03] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:13:51] (CR) Jelto: [C: +1] "lgtm" [labs/private] - https://gerrit.wikimedia.org/r/973352 (owner: EoghanGaffney)
[15:18:29] (CR) EoghanGaffney: [V: +2 C: +2] [apt-staging] Add dummy key for apt-staging host [labs/private] - https://gerrit.wikimedia.org/r/973352 (owner: EoghanGaffney)
[15:23:19] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1062.eqiad.wmnet with OS bookworm
[15:30:58] (PS1) EoghanGaffney: [apt-staging] Rename apt-staging.d.w cert to fix error [labs/private] - https://gerrit.wikimedia.org/r/973353
[15:31:10] (CR) EoghanGaffney: [V: +2 C: +2] [apt-staging] Rename apt-staging.d.w cert to fix error [labs/private] - https://gerrit.wikimedia.org/r/973353 (owner: EoghanGaffney)
[15:39:06] Data-Services, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri) > The tables crashed are a consequence and not a cause. I was just hoping they might point to some d...
[15:48:57] Data-Services, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (Marostegui) >>! In T349695#9323191, @fnegri wrote: >> The tables crashed are a consequence and not a cause. >...
[15:49:51] Cloud-VPS, cloud-services-team, SRE: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (fnegri) I think this has been edited to indicate a different problem, that still exists: > the systems are using different source addresses...
[15:54:33] Cloud-VPS, cloud-services-team, SRE: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (fnegri) Or maybe it's easier to create a new task and resolve this one :)
[15:54:41] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1062.eqiad.wmnet with OS bookworm completed: - cloud...
[16:04:56] (ToolsToolsDBReplicationLagIsTooHigh) firing: ToolsDB replication on tools-db-2 is lagging behind the primary, the current lag is 1.038778e+06 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBReplicationLagIsTooHigh - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh
[16:06:30] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS b...
[16:06:51] Data-Services, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (Marostegui) Reminder, to be able to upgrade to 10.6, you'd need to: - Stop mariadb - Remove 10.4 package - In...
[16:10:24] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS b...
[16:10:32] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1065.eqiad.wmnet with OS b...
[16:10:53] Data-Services, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri) Thanks @Marostegui, I will try that on a replica next week. I will probably combine this with the al...
[16:12:02] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1066.eqiad.wmnet with OS b...
[16:12:05] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS b...
[16:14:57] Toolforge-standards-committee (Maintainer needed): Steinsplitter's tools need a new maintainer - https://phabricator.wikimedia.org/T350953 (Legoktm)
[16:17:54] Toolforge-standards-committee (Maintainer needed): Steinsplitter's tools need a new maintainer - https://phabricator.wikimedia.org/T350953 (Legoktm) Following the discussion at , I've swit...
[16:18:20] Toolforge-standards-committee (Maintainer needed), Commons: Steinsplitter's tools need a new maintainer - https://phabricator.wikimedia.org/T350953 (Legoktm)
[16:18:40] Toolforge-standards-committee (Maintainer needed), Commons: Steinsplitter's tools need a new maintainer - https://phabricator.wikimedia.org/T350953 (Legoktm)
[16:21:22] Toolforge-standards-committee (Maintainer needed), Commons: Steinsplitter's tools need a new maintainer - https://phabricator.wikimedia.org/T350953 (Legoktm)
[16:38:58] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bookworm
[16:39:03] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bookworm
[16:39:12] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1065.eqiad.wmnet with OS bookworm
[16:39:25] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1066.eqiad.wmnet with OS bookworm
[16:39:47] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bookworm
[16:52:27] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[16:57:40] Cloud-VPS, cloud-services-team: [openstack] cloudservices are using different source addresses for local vs. remote updates - https://phabricator.wikimedia.org/T350995 (fnegri)
[16:58:14] Cloud-VPS: Cloud VPS Designate setup improvements - https://phabricator.wikimedia.org/T340446 (fnegri)
[16:58:16] cloud-services-team (FY2023/2024-Q1), SRE, ops-eqiad: cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (fnegri)
[16:59:02] Cloud-VPS, cloud-services-team, SRE: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (fnegri) Open→Resolved a:fnegri I have created {T350995} for the problem that still exists, and I'm marking this task as resolved.
[17:00:15] Cloud-VPS, cloud-services-team, SRE: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (fnegri)
[17:03:42] PROBLEM - ensure kvm processes are running on cloudvirt1062 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:10:41] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[17:10:44] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[17:11:37] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1062 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 Andrew Bogott reimage in progress https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:18:09] Data-Services, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri)
[17:18:21] Data-Services, cloud-services-team (FY2023/2024-Q1), Data-Persistence: [toolsdb] no alert if replication stops because of IO error - https://phabricator.wikimedia.org/T350943 (fnegri) Open→In progress
[17:23:25] (NodeDown) firing: The node cloudvirt1064 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1064 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[17:23:25] (NodeDown) firing: #page The cloudvirt node cloudvirt1064 is unreachable. This is a
[17:23:31] cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T350998 (phaultfinder)
[17:28:53] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[17:33:25] (NodeDown) resolved: The node cloudvirt1064 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1064 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[17:33:25] (NodeDown) resolved: #page The cloudvirt node cloudvirt1064 is unreachable. This is a
[17:33:25] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bookworm executed with erro...
[17:33:28] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bookworm executed with erro...
[17:33:46] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bookworm
[17:33:50] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bookworm
[17:51:34] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1065.eqiad.wmnet with OS bookworm completed: - cloud...
[17:52:49] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bookworm completed: - cloud...
[17:54:19] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[17:54:25] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[17:54:59] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1066.eqiad.wmnet with OS bookworm completed: - cloud...
[17:56:06] Data-Services, cloud-services-team (FY2023/2024-Q1), Data-Persistence: [toolsdb] no alert if replication stops because of IO error - https://phabricator.wikimedia.org/T350943 (fnegri) Thanks @ABran-WMF, I copied one of your suggestions to create a new "ReplicationMissing" alert. I also tweaked our ex...
[17:57:07] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[17:57:10] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[17:59:01] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[17:59:22] !log andrew@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99)
[18:03:14] Wikibugs: Better message than "This change is ready for review" when patch stops being WIP - https://phabricator.wikimedia.org/T350778 (valhallasw) Hej, The last time I looked at it, the json provided by `stream-events` looked like this: ` { "author": { "name": "Merlijn van Deen", "email": "valha...
[18:46:20] (ProbeDown) firing: Service toolserver-proxy-01:443 has failed probes (http_toolserver_org_redirects_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#toolserver-proxy-01:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[18:47:36] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bookworm completed: - cloud...
[18:51:20] (ProbeDown) resolved: Service toolserver-proxy-01:443 has failed probes (http_toolserver_org_redirects_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#toolserver-proxy-01:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[18:53:29] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[18:53:56] !log andrew@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99)
[18:58:33] (SystemdUnitDown) firing: The service unit wmf_auto_restart_virtlogd.service is in failed status on host cloudvirt1065. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1065 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[19:08:30] RECOVERY - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2001 is OK: OK: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:20:20] PROBLEM - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2001 is CRITICAL: CRITICAL: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:53:16] PROBLEM - ensure kvm processes are running on cloudvirt1064 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[19:54:33] (SystemdUnitDown) firing: The service unit networking.service is in failed status on host cloudvirt1064. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1064 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[19:54:59] (PuppetFailure) firing: Puppet has failed on cloudvirt1064:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:55:04] cloud-services-team: PuppetFailure cloudvirt1064:9100 Puppet failure on cloudvirt1064:9100 - https://phabricator.wikimedia.org/T351004 (phaultfinder)
[20:04:14] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bookworm
[20:13:33] (SystemdUnitDownForLong) firing: The systemd unit wmf_auto_restart_virtlogd.service on node cloudvirt1065 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1065 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[20:13:38] cloud-services-team: SystemdUnitDownForLong cloudvirt1065:9100 Unit wmf_auto_restart_virtlogd.service on node cloudvirt1065 has been down for long. - https://phabricator.wikimedia.org/T351005 (phaultfinder)
[20:15:25] (NodeDown) firing: The node cloudvirt1062 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1062 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[20:15:25] (NodeDown) firing: #page The cloudvirt node cloudvirt1063 is unreachable. This is a
[20:15:25] (NodeDown) firing: #page The cloudvirt node cloudvirt1062 is unreachable. This is a
[20:15:30] (NodeDown) firing: The node cloudvirt1063 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1063 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[20:15:33] cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T350998 (phaultfinder)
[20:15:35] cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T350998 (phaultfinder)
[20:18:33] (SystemdUnitDownForLong) resolved: The systemd unit wmf_auto_restart_virtlogd.service on node cloudvirt1065 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1065 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[20:20:25] (NodeDown) firing: The node cloudvirt1065 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1065 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[20:20:25] (NodeDown) firing: #page The cloudvirt node cloudvirt1065 is unreachable. This is a
[20:20:25] (NodeDown) resolved: The node cloudvirt1063 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1063 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[20:20:25] (NodeDown) resolved: #page The cloudvirt node cloudvirt1063 is unreachable. This is a
[20:20:29] cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T350998 (phaultfinder)
[20:21:25] (NodeDown) firing: The node cloudvirt1066 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1066 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[20:21:25] (NodeDown) firing: #page The cloudvirt node cloudvirt1066 is unreachable. This is a
[20:21:29] cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T350998 (phaultfinder)
[20:22:25] (NodeDown) firing: The node cloudvirt1067 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1067 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[20:22:25] (NodeDown) firing: #page The cloudvirt node cloudvirt1067 is unreachable. This is a
[20:22:29] cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T350998 (phaultfinder)
[20:24:04] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[20:24:17] !log andrew@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99)
[20:25:13] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[20:25:25] (NodeDown) resolved: The node cloudvirt1065 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1065 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[20:25:25] (NodeDown) resolved: #page The cloudvirt node cloudvirt1065 is unreachable. This is a
[20:25:25] (NodeDown) resolved: The node cloudvirt1062 is unreachable.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1062 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown [20:25:25] (NodeDown) resolved: #page The cloudvirt node cloudvirt1062 is unreachable. This is a [20:25:34] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) [20:25:43] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary [20:26:03] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) [20:26:17] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary [20:26:25] (NodeDown) resolved: The node cloudvirt1066 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1066 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown [20:26:25] (NodeDown) resolved: #page The cloudvirt node cloudvirt1066 is unreachable. This is a [20:26:59] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) [20:27:14] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary [20:27:25] (NodeDown) resolved: The node cloudvirt1067 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1067 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown [20:27:25] (NodeDown) resolved: #page The cloudvirt node cloudvirt1067 is unreachable. This is a [20:27:37] (InterfaceSpeedError) firing: brq7425e328-56 on cloudvirt1066:9100 has the wrong speed: 1.25e+06. 
- https://wikitech.wikimedia.org/wiki/Monitoring/check_eth - https://grafana.wikimedia.org/d/000000562 - https://alerts.wikimedia.org/?q=alertname%3DInterfaceSpeedError [20:27:38] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) [20:27:42] 10cloud-services-team: InterfaceSpeedError cloudvirt1066:9100 brq7425e328-56 on cloudvirt1066:9100 has the wrong speed: 1.25e+06. - https://phabricator.wikimedia.org/T351006 (10phaultfinder) [20:28:20] RECOVERY - ensure kvm processes are running on cloudvirt1062 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:30:03] (SystemdUnitDown) resolved: The service unit wmf_auto_restart_virtlogd.service is in failed status on host cloudvirt1065. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1065 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [20:32:37] (InterfaceSpeedError) resolved: brq7425e328-56 on cloudvirt1066:9100 has the wrong speed: 1.25e+06. - https://wikitech.wikimedia.org/wiki/Monitoring/check_eth - https://grafana.wikimedia.org/d/000000562 - https://alerts.wikimedia.org/?q=alertname%3DInterfaceSpeedError [20:51:59] 10cloud-services-team (Hardware), 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bookworm completed: - cloud... [20:52:28] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [21:00:40] 10cloud-services-team (Hardware), 10DC-Ops, 10SRE, 10ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bookworm completed: - cloud... [21:28:53] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [21:40:23] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary [21:40:53] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) [21:42:06] RECOVERY - ensure kvm processes are running on cloudvirt1064 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [23:35:05] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1061.eqiad.wmnet with OS bookworm [23:40:31] 10cloud-services-team, 10decommission-hardware: decommission cloudvirt1025-cloudvirt1030.eqiad.wmnet - https://phabricator.wikimedia.org/T351010 (10Andrew) [23:44:29] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10Andrew) fyi @fnegri, I predicted that draining cloudvirt1061 would be difficult due to mwoffliner4 but 
since my last attempt i did some tuning of the live-migration se...