[00:04:03] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[00:44:15] <wikibugs>	 (PS1) BryanDavis: Remove Twitter support [labs/tools/stashbot] - https://gerrit.wikimedia.org/r/973273 (https://phabricator.wikimedia.org/T343157)
[00:44:17] <wikibugs>	 (PS1) BryanDavis: dev: Bump ib3 dependency to 0.3.0 [labs/tools/stashbot] - https://gerrit.wikimedia.org/r/973274
[00:57:41] <wikibugs>	 (CR) BryanDavis: [C: +2] Remove Twitter support [labs/tools/stashbot] - https://gerrit.wikimedia.org/r/973273 (https://phabricator.wikimedia.org/T343157) (owner: BryanDavis)
[00:57:45] <wikibugs>	 (CR) BryanDavis: [C: +2] dev: Bump ib3 dependency to 0.3.0 [labs/tools/stashbot] - https://gerrit.wikimedia.org/r/973274 (owner: BryanDavis)
[00:58:14] <wikibugs>	 (Merged) jenkins-bot: Remove Twitter support [labs/tools/stashbot] - https://gerrit.wikimedia.org/r/973273 (https://phabricator.wikimedia.org/T343157) (owner: BryanDavis)
[00:58:17] <wikibugs>	 (Merged) jenkins-bot: dev: Bump ib3 dependency to 0.3.0 [labs/tools/stashbot] - https://gerrit.wikimedia.org/r/973274 (owner: BryanDavis)
[01:28:53] <jinxer-wm>	 (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[03:04:03] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[03:22:55] <wikibugs>	 Toolforge (Toolforge iteration 02), User-Raymond_Ndibe: [apis] nginx fails to reload on config change - https://phabricator.wikimedia.org/T350928 (Raymond_Ndibe)
[03:23:21] <wikibugs>	 Toolforge (Toolforge iteration 02), User-Raymond_Ndibe: [apis] nginx fails to reload on config change - https://phabricator.wikimedia.org/T350928 (Raymond_Ndibe) a:Raymond_Ndibe
[03:57:26] <jinxer-wm>	 (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[03:58:56] <wmcs-alerts>	 (ToolsToolsDBWritableState) firing: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[04:08:56] <wmcs-alerts>	 (ToolsToolsDBWritableState) resolved: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[05:24:37] <jinxer-wm>	 (CephSlowOps) firing: Ceph cluster in eqiad has 46 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[05:24:42] <wikibugs>	 cloud-services-team: CephSlowOps  Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (phaultfinder)
[05:28:53] <jinxer-wm>	 (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[05:29:37] <jinxer-wm>	 (CephSlowOps) resolved: Ceph cluster in eqiad has 3 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[06:04:03] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[06:16:34] <wikibugs>	 Grid-Engine-to-K8s-Migration: Migrate pearbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319961 (Trialpears) Open→Resolved It should now (finally) be resolved.
[07:00:15] <icinga-wm>	 RECOVERY - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2002 is OK: OK: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:12:05] <icinga-wm>	 PROBLEM - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2002 is CRITICAL: CRITICAL: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:54:59] <jinxer-wm>	 (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[07:57:27] <jinxer-wm>	 (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[07:59:59] <jinxer-wm>	 (PuppetConstantChange) resolved: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[08:52:27] <jinxer-wm>	 (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[09:04:03] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[09:28:53] <jinxer-wm>	 (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[10:09:42] <wikibugs>	 Toolforge, cloud-services-team: tools-nfs-2 almost out of disk space (October 2023 edition) - https://phabricator.wikimedia.org/T349895 (kostajh)
[10:28:54] <wikibugs>	 Cloud-VPS, cloud-services-team: WMCS public range diffscan - https://phabricator.wikimedia.org/T206653 (taavi) Open→Resolved
[10:31:58] <wikibugs>	 Cloud-VPS, cloud-services-team: SPF record for wmflabs.org defaults to ?all - https://phabricator.wikimedia.org/T309813 (taavi) Open→Resolved a:taavi Done.
[11:04:14] <wikibugs>	 Grid-Engine-to-K8s-Migration: Migrate steve-adder from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320062 (Aklapper) @komla Who is "we" and who are you addressing?
[11:45:53] <wikibugs>	 Data-Services, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri) Another OOM crash last night, this one woke up @taavi who was on call:  ` Nov 10 03:53:05 tools-db-1...
[11:56:07] <wikibugs>	 Data-Services, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri) Possibly another red herring, but looking at the logs it looks like every time after a OOM crash, som...
[12:04:03] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[12:39:07] <wikibugs>	 Data-Services, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri) The grafana chart for `node_memory_MemAvailable_bytes` makes it look more like a memory leak, because...
[12:52:27] <jinxer-wm>	 (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[13:04:56] <wmcs-alerts>	 (ToolsToolsDBReplicationLagIsTooHigh) firing: ToolsDB replication on tools-db-2 is lagging behind the primary, the current lag is 1.127875e+06 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBReplicationLagIsTooHigh  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh
[13:26:53] <wikibugs>	 Data-Services, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri) Looking at the graphs, when I [lowered innodb_buffer_pool_size](https://phabricator.wikimedia.org/T34...
[13:28:53] <jinxer-wm>	 (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[13:44:56] <wikibugs>	 Data-Services, cloud-services-team (FY2023/2024-Q1): [toolsdb] no alert if replication stops - https://phabricator.wikimedia.org/T350943 (fnegri)
[13:49:48] <wikibugs>	 Data-Services, cloud-services-team (FY2023/2024-Q1): [toolsdb] no alert if replication stops - https://phabricator.wikimedia.org/T350943 (fnegri) All the values in prometheus map to values in [SHOW SLAVE STATUS](https://mariadb.com/kb/en/show-replica-status/). `Last_Errno` that we're currently monitoring...
[13:50:04] <wikibugs>	 Data-Services, cloud-services-team (FY2023/2024-Q1): [toolsdb] no alert if replication stops because of IO error - https://phabricator.wikimedia.org/T350943 (fnegri)
[14:02:23] <wikibugs>	 Cloud-VPS, cloud-services-team: cloudrabbit: connect them via cloudsw and cloud-private - https://phabricator.wikimedia.org/T345610 (taavi) a:taavi
[14:03:26] <wikibugs>	 Cloud-VPS, cloud-services-team, SRE: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (taavi) I think this is fixed now, right?
[14:03:35] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1): Do not require overriding ::openstack_controllers per context - https://phabricator.wikimedia.org/T347554 (taavi) Open→Resolved
[14:03:37] <wikibugs>	 cloud-services-team (FY2023/2024-Q1), Epic, Goal: openstack eqiad1: introduce cloud-private and cloudlb - https://phabricator.wikimedia.org/T341060 (taavi)
[14:17:56] <wikibugs>	 Data-Services, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (Marostegui) The tables crashed are a consequence and not a cause. That's normal after a crash. As I mentioned...
[14:18:15] <wikibugs>	 Data-Services, cloud-services-team (FY2023/2024-Q1): [toolsdb] no alert if replication stops because of IO error - https://phabricator.wikimedia.org/T350943 (ABran-WMF) I've added a [[ https://gerrit.wikimedia.org/r/c/operations/alerts/+/963980 | few suggestions ]] on alerts that could be helpful. I can...
[14:35:18] <wikibugs>	 Data-Services, cloud-services-team (FY2023/2024-Q1), DBA: [toolsdb] no alert if replication stops because of IO error - https://phabricator.wikimedia.org/T350943 (ABran-WMF)
[14:57:51] <wikibugs>	 Data-Services, cloud-services-team (FY2023/2024-Q1), Data-Persistence: [toolsdb] no alert if replication stops because of IO error - https://phabricator.wikimedia.org/T350943 (Marostegui)
[15:01:03] <wmcs-alerts>	 (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:04:03] <wmcs-alerts>	 (TfInfraTestApplyFailed) resolved: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[15:09:48] <wikibugs>	 (PS1) EoghanGaffney: [apt-staging] Add dummy key for apt-staging host [labs/private] - https://gerrit.wikimedia.org/r/973352
[15:11:03] <wmcs-alerts>	 (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:13:51] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm" [labs/private] - https://gerrit.wikimedia.org/r/973352 (owner: EoghanGaffney)
[15:18:29] <wikibugs>	 (CR) EoghanGaffney: [V: +2 C: +2] [apt-staging] Add dummy key for apt-staging host [labs/private] - https://gerrit.wikimedia.org/r/973352 (owner: EoghanGaffney)
[15:23:19] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1062.eqiad.wmnet with OS bookworm
[15:30:58] <wikibugs>	 (PS1) EoghanGaffney: [apt-staging] Rename apt-staging.d.w cert to fix error [labs/private] - https://gerrit.wikimedia.org/r/973353
[15:31:10] <wikibugs>	 (CR) EoghanGaffney: [V: +2 C: +2] [apt-staging] Rename apt-staging.d.w cert to fix error [labs/private] - https://gerrit.wikimedia.org/r/973353 (owner: EoghanGaffney)
[15:39:06] <wikibugs>	 Data-Services, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri) > The tables crashed are a consequence and not a cause.  I was just hoping they might point to some d...
[15:48:57] <wikibugs>	 Data-Services, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (Marostegui) >>! In T349695#9323191, @fnegri wrote: >> The tables crashed are a consequence and not a cause. >...
[15:49:51] <wikibugs>	 Cloud-VPS, cloud-services-team, SRE: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (fnegri) I think this has been edited to indicate a different problem, that still exists:  >  the systems are using different source addresses...
[15:54:33] <wikibugs>	 Cloud-VPS, cloud-services-team, SRE: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (fnegri) Or maybe it's easier to create a new task and resolve this one :)
[15:54:41] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1062.eqiad.wmnet with OS bookworm completed: - cloud...
[16:04:56] <wmcs-alerts>	 (ToolsToolsDBReplicationLagIsTooHigh) firing: ToolsDB replication on tools-db-2 is lagging behind the primary, the current lag is 1.038778e+06 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBReplicationLagIsTooHigh  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh
[16:06:30] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS b...
[16:06:51] <wikibugs>	 Data-Services, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (Marostegui) Reminder, to be able to upgrade to 10.6, you'd need to: - Stop mariadb - Remove 10.4 package - In...
[16:10:24] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS b...
[16:10:32] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1065.eqiad.wmnet with OS b...
[16:10:53] <wikibugs>	 Data-Services, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri) Thanks @Marostegui, I will try that on a replica next week.  I will probably combine this with the al...
[16:12:02] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1066.eqiad.wmnet with OS b...
[16:12:05] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS b...
[16:14:57] <wikibugs>	 Toolforge-standards-committee (Maintainer needed): Steinsplitter's tools need a new maintainer - https://phabricator.wikimedia.org/T350953 (Legoktm)
[16:17:54] <wikibugs>	 Toolforge-standards-committee (Maintainer needed): Steinsplitter's tools need a new maintainer - https://phabricator.wikimedia.org/T350953 (Legoktm) Following the discussion at <https://commons.wikimedia.org/w/index.php?title=Commons:Village_pump&oldid=820424623#User_talk:CommonsDelinker/commands>, I've swit...
[16:18:20] <wikibugs>	 Toolforge-standards-committee (Maintainer needed), Commons: Steinsplitter's tools need a new maintainer - https://phabricator.wikimedia.org/T350953 (Legoktm)
[16:18:40] <wikibugs>	 Toolforge-standards-committee (Maintainer needed), Commons: Steinsplitter's tools need a new maintainer - https://phabricator.wikimedia.org/T350953 (Legoktm)
[16:21:22] <wikibugs>	 Toolforge-standards-committee (Maintainer needed), Commons: Steinsplitter's tools need a new maintainer - https://phabricator.wikimedia.org/T350953 (Legoktm)
[16:38:58] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bookworm
[16:39:03] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bookworm
[16:39:12] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1065.eqiad.wmnet with OS bookworm
[16:39:25] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1066.eqiad.wmnet with OS bookworm
[16:39:47] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bookworm
[16:52:27] <jinxer-wm>	 (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[16:57:40] <wikibugs>	 Cloud-VPS, cloud-services-team: [openstack] cloudservices are using different source addresses for local vs. remote updates - https://phabricator.wikimedia.org/T350995 (fnegri)
[16:58:14] <wikibugs>	 Cloud-VPS: Cloud VPS Designate setup improvements - https://phabricator.wikimedia.org/T340446 (fnegri)
[16:58:16] <wikibugs>	 cloud-services-team (FY2023/2024-Q1), SRE, ops-eqiad: cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (fnegri)
[16:59:02] <wikibugs>	 Cloud-VPS, cloud-services-team, SRE: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (fnegri) Open→Resolved a:fnegri I have created {T350995} for the problem that still exists, and I'm marking this task as resolved.
[17:00:15] <wikibugs>	 Cloud-VPS, cloud-services-team, SRE: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (fnegri)
[17:03:42] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt1062 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:10:41] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[17:10:44] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[17:11:37] <icinga-wm>	 ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1062 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 Andrew Bogott reimage in progress https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:18:09] <wikibugs>	 Data-Services, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri)
[17:18:21] <wikibugs>	 Data-Services, cloud-services-team (FY2023/2024-Q1), Data-Persistence: [toolsdb] no alert if replication stops because of IO error - https://phabricator.wikimedia.org/T350943 (fnegri) Open→In progress
[17:23:25] <jinxer-wm>	 (NodeDown) firing: The node cloudvirt1064 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1064 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[17:23:25] <jinxer-wm>	 (NodeDown) firing: #page The cloudvirt node cloudvirt1064 is unreachable. This is a
[17:23:31] <wikibugs>	 cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T350998 (phaultfinder)
[17:28:53] <jinxer-wm>	 (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[17:33:25] <jinxer-wm>	 (NodeDown) resolved: The node cloudvirt1064 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1064 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[17:33:25] <jinxer-wm>	 (NodeDown) resolved: #page The cloudvirt node cloudvirt1064 is unreachable. This is a
[17:33:25] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bookworm executed with erro...
[17:33:28] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bookworm executed with erro...
[17:33:46] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bookworm
[17:33:50] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bookworm
[17:51:34] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1065.eqiad.wmnet with OS bookworm completed: - cloud...
[17:52:49] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bookworm completed: - cloud...
[17:54:19] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[17:54:25] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[17:54:59] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1066.eqiad.wmnet with OS bookworm completed: - cloud...
[17:56:06] <wikibugs>	 Data-Services, cloud-services-team (FY2023/2024-Q1), Data-Persistence: [toolsdb] no alert if replication stops because of IO error - https://phabricator.wikimedia.org/T350943 (fnegri) Thanks @ABran-WMF, I copied one of your suggestions to create a new "ReplicationMissing" alert. I also tweaked our ex...
[17:57:07] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[17:57:10] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[17:59:01] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[17:59:22] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99)
[18:03:14] <wikibugs>	 Wikibugs: Better message than "This change is ready for review" when patch stops being WIP - https://phabricator.wikimedia.org/T350778 (valhallasw) Hej,  The last time I looked at it, the json provided by `stream-events` looked like this:  ` {   "author": {     "name": "Merlijn van Deen",     "email": "valha...
[18:46:20] <wmcs-alerts>	 (ProbeDown) firing: Service toolserver-proxy-01:443 has failed probes (http_toolserver_org_redirects_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#toolserver-proxy-01:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[18:47:36] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bookworm completed: - cloud...
[18:51:20] <wmcs-alerts>	 (ProbeDown) resolved: Service toolserver-proxy-01:443 has failed probes (http_toolserver_org_redirects_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#toolserver-proxy-01:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[18:53:29] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[18:53:56] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99)
[18:58:33] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit wmf_auto_restart_virtlogd.service is in failed status on host cloudvirt1065. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1065 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[19:08:30] <icinga-wm>	 RECOVERY - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2001 is OK: OK: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:20:20] <icinga-wm>	 PROBLEM - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2001 is CRITICAL: CRITICAL: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:53:16] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt1064 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[19:54:33] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit networking.service is in failed status on host cloudvirt1064. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1064 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[19:54:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on cloudvirt1064:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:55:04] <wikibugs>	 cloud-services-team: PuppetFailure cloudvirt1064:9100 Puppet failure on cloudvirt1064:9100 - https://phabricator.wikimedia.org/T351004 (phaultfinder)
[20:04:14] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bookworm
[20:13:33] <jinxer-wm>	 (SystemdUnitDownForLong) firing: The systemd unit wmf_auto_restart_virtlogd.service on node cloudvirt1065 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1065 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[20:13:38] <wikibugs>	 cloud-services-team: SystemdUnitDownForLong cloudvirt1065:9100 Unit wmf_auto_restart_virtlogd.service on node cloudvirt1065 has been down for long. - https://phabricator.wikimedia.org/T351005 (phaultfinder)
[20:15:25] <jinxer-wm>	 (NodeDown) firing: The node cloudvirt1062 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1062 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[20:15:25] <jinxer-wm>	 (NodeDown) firing: #page The cloudvirt node cloudvirt1063 is unreachable. This is a
[20:15:25] <jinxer-wm>	 (NodeDown) firing: #page The cloudvirt node cloudvirt1062 is unreachable. This is a
[20:15:30] <jinxer-wm>	 (NodeDown) firing: The node cloudvirt1063 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1063 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[20:15:33] <wikibugs>	 cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T350998 (phaultfinder)
[20:15:35] <wikibugs>	 cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T350998 (phaultfinder)
[20:18:33] <jinxer-wm>	 (SystemdUnitDownForLong) resolved: The systemd unit wmf_auto_restart_virtlogd.service on node cloudvirt1065 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1065 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[20:20:25] <jinxer-wm>	 (NodeDown) firing: The node cloudvirt1065 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1065 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[20:20:25] <jinxer-wm>	 (NodeDown) firing: #page The cloudvirt node cloudvirt1065 is unreachable. This is a
[20:20:25] <jinxer-wm>	 (NodeDown) resolved: The node cloudvirt1063 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1063 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[20:20:25] <jinxer-wm>	 (NodeDown) resolved: #page The cloudvirt node cloudvirt1063 is unreachable. This is a
[20:20:29] <wikibugs>	 cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T350998 (phaultfinder)
[20:21:25] <jinxer-wm>	 (NodeDown) firing: The node cloudvirt1066 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1066 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[20:21:25] <jinxer-wm>	 (NodeDown) firing: #page The cloudvirt node cloudvirt1066 is unreachable. This is a
[20:21:29] <wikibugs>	 cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T350998 (phaultfinder)
[20:22:25] <jinxer-wm>	 (NodeDown) firing: The node cloudvirt1067 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1067 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[20:22:25] <jinxer-wm>	 (NodeDown) firing: #page The cloudvirt node cloudvirt1067 is unreachable. This is a
[20:22:29] <wikibugs>	 cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T350998 (phaultfinder)
[20:24:04] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[20:24:17] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99)
[20:25:13] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[20:25:25] <jinxer-wm>	 (NodeDown) resolved: The node cloudvirt1065 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1065 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[20:25:25] <jinxer-wm>	 (NodeDown) resolved: #page The cloudvirt node cloudvirt1065 is unreachable. This is a
[20:25:25] <jinxer-wm>	 (NodeDown) resolved: The node cloudvirt1062 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1062 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[20:25:25] <jinxer-wm>	 (NodeDown) resolved: #page The cloudvirt node cloudvirt1062 is unreachable. This is a
[20:25:34] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[20:25:43] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[20:26:03] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[20:26:17] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[20:26:25] <jinxer-wm>	 (NodeDown) resolved: The node cloudvirt1066 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1066 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[20:26:25] <jinxer-wm>	 (NodeDown) resolved: #page The cloudvirt node cloudvirt1066 is unreachable. This is a
[20:26:59] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[20:27:14] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[20:27:25] <jinxer-wm>	 (NodeDown) resolved: The node cloudvirt1067 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1067 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[20:27:25] <jinxer-wm>	 (NodeDown) resolved: #page The cloudvirt node cloudvirt1067 is unreachable. This is a
[20:27:37] <jinxer-wm>	 (InterfaceSpeedError) firing: brq7425e328-56 on cloudvirt1066:9100 has the wrong speed: 1.25e+06. - https://wikitech.wikimedia.org/wiki/Monitoring/check_eth - https://grafana.wikimedia.org/d/000000562 - https://alerts.wikimedia.org/?q=alertname%3DInterfaceSpeedError
[20:27:38] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[20:27:42] <wikibugs>	 cloud-services-team: InterfaceSpeedError cloudvirt1066:9100 brq7425e328-56 on cloudvirt1066:9100 has the wrong speed: 1.25e+06. - https://phabricator.wikimedia.org/T351006 (phaultfinder)
[20:28:20] <icinga-wm>	 RECOVERY - ensure kvm processes are running on cloudvirt1062 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:30:03] <jinxer-wm>	 (SystemdUnitDown) resolved: The service unit wmf_auto_restart_virtlogd.service is in failed status on host cloudvirt1065. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1065 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[20:32:37] <jinxer-wm>	 (InterfaceSpeedError) resolved: brq7425e328-56 on cloudvirt1066:9100 has the wrong speed: 1.25e+06. - https://wikitech.wikimedia.org/wiki/Monitoring/check_eth - https://grafana.wikimedia.org/d/000000562 - https://alerts.wikimedia.org/?q=alertname%3DInterfaceSpeedError
[20:51:59] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bookworm completed: - cloud...
[20:52:28] <jinxer-wm>	 (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[21:00:40] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bookworm completed: - cloud...
[21:28:53] <jinxer-wm>	 (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[21:40:23] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[21:40:53] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[21:42:06] <icinga-wm>	 RECOVERY - ensure kvm processes are running on cloudvirt1064 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[23:35:05] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1061.eqiad.wmnet with OS bookworm
[23:40:31] <wikibugs>	 cloud-services-team, decommission-hardware: decommission cloudvirt1025-cloudvirt1030.eqiad.wmnet - https://phabricator.wikimedia.org/T351010 (Andrew)
[23:44:29] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (Andrew) fyi @fnegri, I predicted that draining cloudvirt1061 would be difficult due to mwoffliner4 but since my last attempt i did some tuning of the live-migration se...