[00:08:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:13:03] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:19:32] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[00:33:44] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[00:38:33] (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[00:53:33] (SystemdUnitDown) resolved: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[00:54:51] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[01:01:43] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[01:22:06] PROBLEM - Check unit status of backup_cinder_volumes on cloudbackup2001 is CRITICAL: CRITICAL: Status of the systemd unit backup_cinder_volumes https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:53:15] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcontrol2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[02:56:29] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[03:06:29] (OpenstackAPIResponse) resolved: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[03:48:33] (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[03:53:33] (SystemdUnitDown) resolved: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[03:54:51] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[04:33:45] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[06:51:16] Data-Services, DBA, Data Engineering and Event Platform Team, Data-Platform-SRE: Prepare and check storage layer for fonwiki - https://phabricator.wikimedia.org/T347938 (Marostegui) @BTullis this is also ready
[06:53:15] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcontrol2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[06:54:51] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[07:08:33] (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[07:23:33] (SystemdUnitDown) resolved: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[08:15:27] Tools: is 'img-usage' tool still in use? - https://phabricator.wikimedia.org/T349912 (taavi) Open→Resolved Thanks! I've marked the tool for deletion.
[08:15:30] Toolforge, cloud-services-team: tools-nfs-2 almost out of disk space (October 2023 edition) - https://phabricator.wikimedia.org/T349895 (taavi)
[08:33:45] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[09:03:37] (CephSlowOps) firing: Ceph cluster in eqiad has 8 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[09:03:42] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (phaultfinder)
[09:08:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 8 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[09:54:51] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[09:58:42] (PS1) Majavah: hieradata: fix cloudinfra webproxy password location [labs/private] - https://gerrit.wikimedia.org/r/969689
[09:58:48] (PS1) Majavah: secret: dkim: move wmcs dkim keys to correct location [labs/private] - https://gerrit.wikimedia.org/r/969690
[10:02:07] (PS1) Majavah: hieradata: add fake metricsinfra grafana password [labs/private] - https://gerrit.wikimedia.org/r/969691
[10:15:10] cloud-services-team (FY2023/2024-Q1), Infrastructure-Foundations, Packaging: wmfbackups packages for Debian Bookworm - https://phabricator.wikimedia.org/T347740 (fnegri) @jcrespo We are upgrading more servers to Bookworm and I would like to avoid having too many servers with manually-installed packag...
[10:22:16] cloud-services-team (FY2023/2024-Q1), Infrastructure-Foundations, Packaging: wmfbackups packages for Debian Bookworm - https://phabricator.wikimedia.org/T347740 (jcrespo) I was planning on creating a package for bookworm soon, but I cannot provide any timeline.
[10:37:46] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad1 cluster to Antelope - https://phabricator.wikimedia.org/T348843 (fnegri)
[10:38:29] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (fnegri) Open→In progress
[10:53:15] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcontrol2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[10:53:42] cloud-services-team, Observability-Metrics: Evaluate whether to deploy cloud Prometheus instance to codfw - https://phabricator.wikimedia.org/T350010 (fgiunchedi)
[10:55:31] cloud-services-team, Observability-Metrics: Rename prometheus/labs datasource in Grafana to prometheus/cloud - https://phabricator.wikimedia.org/T350013 (fgiunchedi)
[10:56:34] Cloud-VPS, cloud-services-team, Observability-Metrics: Current status of cloudmetrics and its components - https://phabricator.wikimedia.org/T336774 (fgiunchedi)
[10:56:50] Cloud-VPS, cloud-services-team, Observability-Metrics, User-fgiunchedi: Move labs/wmcs (OpenStack) Prometheus instance off cloudmetrics hosts to prometheus* hosts - https://phabricator.wikimedia.org/T336854 (fgiunchedi)
[10:57:15] Cloud-VPS, cloud-services-team, Observability-Metrics, User-fgiunchedi: Move labs/wmcs (OpenStack) Prometheus instance off cloudmetrics hosts to prometheus* hosts - https://phabricator.wikimedia.org/T336854 (fgiunchedi) Open→Resolved a:fgiunchedi I've opened followup tasks for the rem...
[11:02:18] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (fnegri) As described in https://phabricator.wikimedia.org/T345810#9153935 we need to do an in-place upgrade of MariaDB to the latest version, //before// reimaging the...
[12:08:33] (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[12:08:45] (OpenstackAPIResponse) firing: (7) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[12:23:33] (SystemdUnitDown) resolved: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[12:54:03] (CR) Jgreen: [V: +2 C: +1] Add dummy secrets for community_civicrm [labs/private] - https://gerrit.wikimedia.org/r/967519 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[12:54:06] (CR) Jgreen: [V: +2 C: +2] Add dummy secrets for community_civicrm [labs/private] - https://gerrit.wikimedia.org/r/967519 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[12:54:51] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[13:01:53] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri) Open→In progress
[13:16:52] cloud-services-team, MediaWiki-Engineering: Get platform engineering team green light for Cloud NAT to wikis change - https://phabricator.wikimedia.org/T273738 (Bmueller) Thanks @nskaggs! Just bring this up again once you know more about the timeline for this project, and I'll bring it to the team then!
[14:01:59] Toolforge (Toolforge iteration 02), cloud-services-team, Kubernetes, Patch-For-Review: Upgrade cadvisor - https://phabricator.wikimedia.org/T349795 (CodeReviewBot) taavi merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/122 upgrade cadvisor
[14:02:08] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component wmcs-k8s-metrics
[14:02:25] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component wmcs-k8s-metrics
[14:02:46] Toolforge (Toolforge iteration 02), cloud-services-team, Kubernetes, Patch-For-Review: Upgrade cadvisor - https://phabricator.wikimedia.org/T349795 (CodeReviewBot) taavi opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/122 upgrade cadvisor
[14:02:50] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component wmcs-k8s-metrics
[14:03:09] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component wmcs-k8s-metrics
[14:05:27] Toolforge (Toolforge iteration 02), cloud-services-team, Kubernetes, Patch-For-Review: Upgrade cadvisor - https://phabricator.wikimedia.org/T349795 (taavi) Open→Resolved
[14:05:29] Toolforge (Toolforge iteration 02), cloud-services-team, Kubernetes, Patch-For-Review: Toolforge k8s: Migrate workers to Containerd and Bookworm - https://phabricator.wikimedia.org/T284656 (taavi)
[14:21:37] (CephSlowOps) firing: Ceph cluster in eqiad has 1 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[14:21:45] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (phaultfinder)
[14:26:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 1 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[14:33:02] (PS1) Majavah: alerts: remove unnecessary unique key [cloud/metricsinfra/prometheus-manager] - https://gerrit.wikimedia.org/r/969762
[14:33:20] (CR) Majavah: [C: +2] alerts: remove unnecessary unique key [cloud/metricsinfra/prometheus-manager] - https://gerrit.wikimedia.org/r/969762 (owner: Majavah)
[14:35:07] (Merged) jenkins-bot: alerts: remove unnecessary unique key [cloud/metricsinfra/prometheus-manager] - https://gerrit.wikimedia.org/r/969762 (owner: Majavah)
[14:37:31] (PS1) Majavah: remove unnecessary workaround from migration [cloud/metricsinfra/prometheus-manager] - https://gerrit.wikimedia.org/r/969764
[14:37:42] (CR) Majavah: [C: +2] remove unnecessary workaround from migration [cloud/metricsinfra/prometheus-manager] - https://gerrit.wikimedia.org/r/969764 (owner: Majavah)
[14:38:15] (Merged) jenkins-bot: remove unnecessary workaround from migration [cloud/metricsinfra/prometheus-manager] - https://gerrit.wikimedia.org/r/969764 (owner: Majavah)
[14:51:06] (PS1) Majavah: alertmanager: set auto-generated header [cloud/metricsinfra/prometheus-configurator] - https://gerrit.wikimedia.org/r/969790
[14:51:43] (CR) Majavah: [C: +2] alertmanager: set auto-generated header [cloud/metricsinfra/prometheus-configurator] - https://gerrit.wikimedia.org/r/969790 (owner: Majavah)
[14:52:43] (Merged) jenkins-bot: alertmanager: set auto-generated header [cloud/metricsinfra/prometheus-configurator] - https://gerrit.wikimedia.org/r/969790 (owner: Majavah)
[14:53:15] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcontrol2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:56:09] PROBLEM - Host cloudcephosd1023 is DOWN: PING CRITICAL - Packet loss = 100%
[15:02:43] RECOVERY - Host cloudcephosd1023 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[15:04:53] Cloud-VPS, cloud-services-team, DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Andrew) a:dcaro→Andrew
[15:13:13] Cloud-VPS, cloud-services-team, DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Andrew)
[15:14:03] (InstanceDown) firing: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:14:37] Cloud-VPS, cloud-services-team, DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Jclark-ctr)
[15:20:37] Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (VRiley-WMF) cloudvirt-wdqs1003 has been relocated cloudvirt-wdqs1003 - C 8. U 21. port 18. CableID 4015 Side note, we had to use a 1 Gig connection sinc...
[15:20:57] PROBLEM - Host cloudcephosd1024 is DOWN: PING CRITICAL - Packet loss = 100%
[15:28:29] RECOVERY - Host cloudcephosd1024 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[15:33:08] (CR) Andrew Bogott: [C: +1] ceph: add drain/undrain host and rack cookbooks [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966136 (https://phabricator.wikimedia.org/T329709) (owner: David Caro)
[15:35:10] Data-Services, DBA: Experiment with InnoDB buffer pool size on clouddb1019.eqiad.wmnet - https://phabricator.wikimedia.org/T346464 (Marostegui) I'd suggest dropping this everywhere rather than just two or 3 as I mentioned before. Otherwise it will be a bit of a mess with puppet. Right now clouddb1015 is...
[15:36:15] PROBLEM - Host cloudcephosd1025 is DOWN: PING CRITICAL - Packet loss = 100%
[15:37:21] (CR) Andrew Bogott: [C: +2] ceph: add drain/undrain host and rack cookbooks [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966136 (https://phabricator.wikimedia.org/T329709) (owner: David Caro)
[15:37:33] (CR) Majavah: [C: +2] mypy: skip build directory [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966132 (owner: David Caro)
[15:39:25] (CR) Andrew Bogott: [C: +2] alerts: don't fail if host already downtimed or uptimed [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966133 (owner: David Caro)
[15:40:02] (CR) Andrew Bogott: [C: +2] openstack: don't pass the new project when creating it [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966134 (https://phabricator.wikimedia.org/T346427) (owner: David Caro)
[15:41:27] (CR) Andrew Bogott: [C: +2] ceph: Adapt to multi-level crush tree [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966135 (https://phabricator.wikimedia.org/T331145) (owner: David Caro)
[15:41:46] (Merged) jenkins-bot: mypy: skip build directory [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966132 (owner: David Caro)
[15:42:51] (Merged) jenkins-bot: alerts: don't fail if host already downtimed or uptimed [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966133 (owner: David Caro)
[15:42:53] RECOVERY - Host cloudcephosd1025 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[15:43:04] Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudvirt-wdqs1003.eqiad.wmnet with OS bookworm
[15:43:48] Cloud-VPS, cloud-services-team, DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Jclark-ctr)
[15:43:54] (Merged) jenkins-bot: openstack: don't pass the new project when creating it [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966134 (https://phabricator.wikimedia.org/T346427) (owner: David Caro)
[15:45:10] (Merged) jenkins-bot: ceph: Adapt to multi-level crush tree [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966135 (https://phabricator.wikimedia.org/T331145) (owner: David Caro)
[15:45:20] (Merged) jenkins-bot: ceph: add drain/undrain host and rack cookbooks [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/966136 (https://phabricator.wikimedia.org/T329709) (owner: David Caro)
[15:59:23] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[16:06:30] Quarry: Move away from nfs? - https://phabricator.wikimedia.org/T349690 (rook) Some initial tinkering suggests this may not be in reach in WMCS at the moment: Making a pv: ` apiVersion: v1 kind: PersistentVolume metadata: name: results spec: storageClassName: manual capacity: storage: 1Gi accessM...
[16:07:02] Quarry: Move away from nfs? - https://phabricator.wikimedia.org/T349690 (rook)
[16:07:21] Quarry: Find somewhere else (not NFS) to store Quarry's resultsets - https://phabricator.wikimedia.org/T178520 (rook)
[16:08:45] (OpenstackAPIResponse) firing: (7) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[16:17:42] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[16:17:44] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99)
[16:17:53] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[16:17:54] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99)
[16:18:56] !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[16:19:08] !log taavi@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99)
[16:21:01] !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[16:22:19] PROBLEM - ensure kvm processes are running on cloudvirt-wdqs1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[16:22:35] !log taavi@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[16:23:33] RECOVERY - ensure kvm processes are running on cloudvirt-wdqs1003 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[16:29:44] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node (T348643)
[16:29:45] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) (T348643)
[16:29:50] T348643: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643
[16:30:01] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node (T348643)
[16:30:01] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) (T348643)
[16:30:29] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node (T348643)
[16:30:30] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) (T348643)
[16:37:28] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node (T348643)
[16:37:29] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) (T348643)
[16:37:34] T348643: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643
[16:37:36] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node (T348643)
[16:37:37] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) (T348643)
[16:38:52] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node (T348643)
[16:38:53] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) (T348643)
[16:40:14] (PS1) Andrew Bogott: ceph.py: handle the case where batch_size is None rather than 0 [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/969817
[16:41:02] (PS2) Andrew Bogott: ceph.py: handle the case where batch_size is None rather than 0 [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/969817
[16:42:06] (CR) Majavah: [C: +1] ceph.py: handle the case where batch_size is None rather than 0 [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/969817 (owner: Andrew Bogott)
[16:44:58] (CR) Andrew Bogott: [C: +2] ceph.py: handle the case where batch_size is None rather than 0 [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/969817 (owner: Andrew Bogott)
[16:48:12] (Merged) jenkins-bot: ceph.py: handle the case where batch_size is None rather than 0 [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/969817 (owner: Andrew Bogott)
[16:51:37] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[16:52:07] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)
[16:52:36] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[16:53:08] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)
[16:53:35] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node
[16:54:06] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)
[16:55:13] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[16:55:47] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[16:55:54] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (348643)
[16:56:28] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (348643)
[17:02:56] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[17:04:03] (InstanceDown) resolved: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:04:14] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[17:09:43] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[17:21:11] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (fnegri) The first attempt of upgrading mariadb-server in cloudcontrol1007 failed because of apt pinning. I updated the commands in my previous comment to include `rm /...
[17:37:56] (ToolsToolsDBWritableState) firing: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[17:47:56] (ToolsToolsDBWritableState) resolved: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[17:56:58] Tools, WMDE-TechWish-Maintenance, WMDE-TechWish-Maintenance-2023: Make technischewuensche tool code repository public - https://phabricator.wikimedia.org/T349847 (WMDE-Fisch)
[18:48:45] (OpenstackAPIResponse) firing: (7) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[18:53:15] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcontrol2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[18:57:59] (PuppetConstantChange) resolved: Puppet performing a change on every puppet run on cloudcontrol2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[19:04:23] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[19:04:24] RECOVERY - Check unit status of backup_cinder_volumes on cloudbackup2001 is OK: OK: Status of the systemd unit backup_cinder_volumes https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:08:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[19:48:03] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[19:48:48] Toolforge (Toolforge iteration 02): Decision request – Toolforge CLI consolidation - https://phabricator.wikimedia.org/T348749 (nskaggs) Examples of golang medawiki-related cli: https://gitlab.wikimedia.org/repos/releng/cli + automated documentation https://www.mediawiki.org/wiki/Cli/ref/mw https://github....
[21:13:03] (PuppetAgentFailure) firing: Puppet agent failure detected on instance quarry-bastion in project quarry - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[21:28:03] (PuppetAgentFailure) resolved: Puppet agent failure detected on instance quarry-bastion in project quarry - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[22:04:23] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[22:48:45] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[23:09:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown