[00:01:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:10:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:11:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:15:03] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[01:09:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[03:06:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:11:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:39:41] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[03:43:59] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[03:47:44] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.drain_node (exit_code=97) (348643)
[03:48:05] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[04:09:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[05:06:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:16:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:34:27] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[05:38:14] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643)
[06:14:43] Grid-Engine-to-K8s-Migration: Migrate spi-table-bot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320054 (Mz7) Acknowledging this ticket... I see it's been pending on me for a long time. I will see if I can find some time this week to get this done.
[06:52:18] Data-Services, DBA: Prepare and check storage layer for zghwiki - https://phabricator.wikimedia.org/T350240 (Marostegui) p:Triage→Medium Let us know when the wiki is created so we can sanitize it
[06:52:35] Data-Services, DBA: Prepare and check storage layer for dgawiki - https://phabricator.wikimedia.org/T350228 (Marostegui) p:Triage→Medium Let us know when the wiki is created so we can sanitize it
[06:52:55] Data-Services, DBA: Prepare and check storage layer for bjnwikiquote - https://phabricator.wikimedia.org/T350234 (Marostegui) p:Triage→Medium Let us know when the wiki is created so we can sanitize it
[07:01:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[07:09:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[07:11:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[07:27:36] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[07:28:00] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[08:09:37] (CephSlowOps) firing: Ceph cluster in eqiad has 27 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[08:09:42] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (phaultfinder)
[08:14:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 25 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[08:16:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[08:26:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:34:42] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[10:06:10] Tools, WMDE-TechWish-Maintenance, WMDE-TechWish-Maintenance-2023: Check technischewuensche tool code and publish in a public repo - https://phabricator.wikimedia.org/T350352 (WMDE-Fisch)
[10:08:01] Tools, WMDE-TechWish-Maintenance, WMDE-TechWish-Maintenance-2023: Delete technischewuensche tool code repository in Diffusion - https://phabricator.wikimedia.org/T349847 (WMDE-Fisch) We're taking care if it. I created a subticket and when that's done we can delete the deprecated source. Thanks again...
[10:08:40] Tools, WMDE-TechWish-Maintenance, WMDE-TechWish-Maintenance-2023: Check technischewuensche tool code and publish in a public repo - https://phabricator.wikimedia.org/T350352 (WMDE-Fisch) a:Aklapper→WMDE-Fisch
[10:09:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[10:11:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:16:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:23:10] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Fix ceph-common version in Bookworm - https://phabricator.wikimedia.org/T350188 (fnegri) We pull the Ceph packages from https://mirror.croit.io/debian-octopus but that repo only includes packages for buster and bullseye, not for bookworm. The...
[10:26:57] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudcontrol1006.eqiad.wmnet with OS bookworm
[10:29:19] (HAProxyBackendUnavailable) firing: (5) HAProxy service glance-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[10:29:40] (GaleraClusterSizeMismatch) firing: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[10:29:40] (GaleraClusterSizeMismatch) firing: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[10:30:19] (HAProxyServiceUnavailable) firing: HAProxy service neutron-api_backend has no available backends on cloudlb1002:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[10:30:24] cloud-services-team: HAProxyServiceUnavailable cloudlb1002:9900 HAProxy service neutron-api_backend has no available backends on cloudlb1002:9900 - https://phabricator.wikimedia.org/T350358 (phaultfinder)
[10:34:19] (HAProxyBackendUnavailable) firing: (15) HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[10:35:19] (HAProxyServiceUnavailable) resolved: HAProxy service neutron-api_backend has no available backends on cloudlb1002:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[10:46:03] (PuppetSyncFailure) firing: Failed to update Puppet repository /var/lib/git/operations/puppet on instance toolsbeta-puppetmaster-04 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetSyncFailure
[10:51:03] (PuppetSyncFailure) resolved: Failed to update Puppet repository /var/lib/git/operations/puppet on instance toolsbeta-puppetmaster-04 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetSyncFailure
[11:13:03] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudcontrol1006.eqiad.wmnet with OS bookworm executed with errors: - clo...
[11:18:33] Grid-Engine-to-K8s-Migration: Migrate cewbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319622 (Kanashimi) May I increase the number of Kubernetes pods running at the same time? Actually, the old 16 is not enough, so I split it into 4+1 tools...
[11:20:09] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (fnegri) The reimage failed, logging into the mgmt interface I see this message: `No root file system is defined. Please correct this from the partitioning menu.`
[11:27:20] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[11:33:32] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudcontrol1006.eqiad.wmnet with OS bookworm
[11:34:46] Grid-Engine-to-K8s-Migration: Migrate cewbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319622 (Kanashimi) @nskaggs @komla @Aklapper And I also need to use more memory, maybe 16*8GiB per tool...
[11:41:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:44:15] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643)
[11:47:24] Grid-Engine-to-K8s-Migration: Migrate cewbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319622 (Aklapper) Please see https://phabricator.wikimedia.org/project/view/4834/
[11:51:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:13:19] Data-Services, DBA: Prepare and check storage layer for bbcwiki - https://phabricator.wikimedia.org/T350372 (Marostegui) p:Triage→Medium Let us know when the wiki is created so we can sanitize it
[12:14:19] (HAProxyBackendUnavailable) firing: (13) HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[12:19:19] (HAProxyBackendUnavailable) firing: (13) HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[12:19:40] (GaleraClusterSizeMismatch) resolved: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[12:19:40] (GaleraClusterSizeMismatch) resolved: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[12:20:08] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudcontrol1006.eqiad.wmnet with OS bookworm completed: - cloudcontrol10...
[12:24:20] (HAProxyBackendUnavailable) firing: (13) HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[12:46:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:01:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:09:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[13:16:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:34:42] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[13:41:03] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:47:42] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[13:48:01] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643)
[13:48:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:48:36] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[13:48:43] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Andrew)
[13:58:26] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[14:26:30] VPS-project-Wikistats: Add bbcwiki to wikistats - https://phabricator.wikimedia.org/T350377 (Dzahn) a:Dzahn
[14:38:03] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:45:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:52:13] PROBLEM - Host cloudcephosd1029 is DOWN: PING CRITICAL - Packet loss = 100%
[14:56:07] PROBLEM - Host cloudcephosd1030 is DOWN: PING CRITICAL - Packet loss = 100%
[14:57:33] PROBLEM - Check unit status of purge_vm_backup on cloudbackup1004 is CRITICAL: CRITICAL: Status of the systemd unit purge_vm_backup https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:58:43] RECOVERY - Host cloudcephosd1029 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms
[15:01:34] (SystemdUnitDown) firing: The service unit purge_vm_backup.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:02:41] RECOVERY - Host cloudcephosd1030 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[15:04:16] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Jclark-ctr)
[15:05:03] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:12:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:12:08] Grid-Engine-to-K8s-Migration: Migrate cewbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319622 (nskaggs) >>! In T319622#9301216, @Kanashimi wrote: > @nskaggs @komla @Aklapper And I also need to use more memory, maybe 16*8GiB per tool... It would be nice to have the...
[15:18:02] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (Jclark-ctr) Servers have been boxed up and shipped out
[15:22:22] Toolforge (Toolforge iteration 02), Patch-For-Review: Add `toolforge envvars quota` - https://phabricator.wikimedia.org/T341087 (Raymond_Ndibe) In progress→Resolved
[15:22:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[15:23:25] Toolforge (Toolforge iteration 02), Patch-For-Review: Add `toolforge envvars quota` - https://phabricator.wikimedia.org/T341087 (Raymond_Ndibe) Resolved→Open
[15:23:28] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (Jclark-ctr)
[15:24:55] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (Jclark-ctr) @Andrew @cmooney dc ops is finished with our side
[15:26:28] VPS-project-Wikistats: New wikistats interface takes minutes to load the mediawikis list - https://phabricator.wikimedia.org/T167066 (Dzahn) I am glad to see this resolved - though I have no idea how it was solved :)
[15:26:45] Toolforge (Toolforge iteration 02), Patch-For-Review: Add `toolforge envvars quota` - https://phabricator.wikimedia.org/T341087 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/124 envvars-api: bump to 0.0.32-20231101134104-2436443d
[15:30:59] Toolforge (Toolforge iteration 02): [tools,harbor] Cleanup old production images - https://phabricator.wikimedia.org/T348538 (Raymond_Ndibe) Open→In progress
[15:41:33] Grid-Engine-to-K8s-Migration, MediaWiki-Engineering: Migrate ruprecht from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320021 (daniel) >>! In T320021#9298181, @taavi wrote: > The `ruprecht` tool is still running on the grid engine. If it's no longer used, please stop...
[16:00:55] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643)
[16:09:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[16:12:22] Toolforge (Toolforge iteration 02), User-Raymond_Ndibe: move from single script to multi-script approach in maintain-harbor - https://phabricator.wikimedia.org/T350410 (Raymond_Ndibe)
[16:17:03] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[16:21:46] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[16:23:01] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node (348643)
[16:23:41] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) (348643)
[16:24:34] (HAProxyBackendUnavailable) firing: (2) HAProxy service keystone-admin-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[16:30:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[16:34:20] (HAProxyBackendUnavailable) resolved: (2) HAProxy service keystone-admin-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[16:38:35] VPS-project-Wikistats, User-RhinosF1: remove referata table? - https://phabricator.wikimedia.org/T262148 (Dzahn) @RhinosF1 So referata is a dead project? It still claims "temporary" technical issues but I guess that has been shown for a long time now.
[16:39:27] VPS-project-Wikistats, User-RhinosF1: remove referata table? - https://phabricator.wikimedia.org/T262148 (Dzahn) @RhinosF1 I think we have to remove puppetized systemd timers too?
[16:46:51] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.drain_node (exit_code=97) (348643)
[16:56:33] (SystemdUnitDownForLong) firing: The systemd unit purge_vm_backup.service on node cloudbackup1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[16:56:38] cloud-services-team: SystemdUnitDownForLong cloudbackup1004:9100 Unit purge_vm_backup.service on node cloudbackup1004 has been down for long. - https://phabricator.wikimedia.org/T350415 (phaultfinder)
[16:57:44] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudcontrol1005.eqiad.wmnet with OS bookworm
[17:00:20] (HAProxyBackendUnavailable) firing: (13) HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[17:00:40] (GaleraClusterSizeMismatch) firing: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[17:00:40] (GaleraClusterSizeMismatch) firing: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[17:01:19] (HAProxyServiceUnavailable) firing: (2) HAProxy service neutron-api_backend has no available backends on cloudlb1002:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[17:01:19] (HAProxyServiceUnavailable) firing: (2) HAProxy service neutron-api_backend has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[17:05:08] Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (rook) Looks like the web pod had some db connection issues a little after it started. Restarting seems to have cleared it, though let's see if it comes back. ` [2023-11-01 12:20:32 +0000] [1] [INFO] Starting gunicorn 21.2.0...
[17:05:19] (HAProxyBackendUnavailable) firing: (17) HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[17:06:19] (HAProxyServiceUnavailable) resolved: (2) HAProxy service neutron-api_backend has no available backends on cloudlb1002:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[17:06:19] (HAProxyServiceUnavailable) resolved: (2) HAProxy service neutron-api_backend has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[17:10:08] VPS-project-Wikistats, User-RhinosF1: remove referata table? - https://phabricator.wikimedia.org/T262148 (RhinosF1) >>! In T262148#9302525, @Dzahn wrote: > @RhinosF1 So referata is a dead project? It still claims "temporary" technical issues but I guess that has been shown for a long time now. Temporary...
[17:10:33] (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[17:16:18] (HAProxyBackendUnavailable) firing: (15) HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[17:20:03] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:21:19] (SystemdUnitDown) resolved: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[17:32:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:34:42] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[17:35:19] (HAProxyBackendUnavailable) firing: (13) HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[17:40:19] (HAProxyBackendUnavailable) firing: (13) HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[17:45:19] (HAProxyBackendUnavailable) firing: (13) HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[17:45:22] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudcontrol1005.eqiad.wmnet with OS bookworm completed: - cloudcontrol10...
[17:45:40] (GaleraClusterSizeMismatch) resolved: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [17:45:40] (GaleraClusterSizeMismatch) resolved: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [17:55:20] (HAProxyBackendUnavailable) resolved: (13) HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [17:58:27] (OpenstackAPIResponse) firing: Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [18:14:41] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643) [18:38:12] 10Toolforge, 10Fix-Suggester-Bot: File system access is very slow - https://phabricator.wikimedia.org/T350432 (10kostajh) [18:39:05] 10Toolforge, 10Fix-Suggester-Bot: File system access is very slow - https://phabricator.wikimedia.org/T350432 (10kostajh) [18:47:03] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [18:56:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [19:01:48] (SystemdUnitDown) firing: The service unit purge_vm_backup.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:09:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [19:22:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [19:51:26] 10VPS-project-Wikistats, 10User-RhinosF1: remove referata table? - https://phabricator.wikimedia.org/T262148 (10Dzahn) >>! 
In T262148#9302708, @RhinosF1 wrote: > Temporary seems to have become indefinite. ACK, thought so! thanks for confirming. >> @RhinosF1 I think we have to remove puppetized systemd timers... [20:01:03] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:06:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:07:45] (ProbeDown) firing: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [20:12:45] (ProbeDown) resolved: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [20:37:41] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643) [20:56:48] (SystemdUnitDownForLong) firing: The systemd unit purge_vm_backup.service on node cloudbackup1004 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [20:58:35] (CR) Krinkle: [C: +2] Add homer public repository [labs/codesearch] - https://gerrit.wikimedia.org/r/970852 (owner: Majavah) [20:58:51] (PS2) Krinkle: write_config: Add homer public repository [labs/codesearch] - https://gerrit.wikimedia.org/r/970852 (owner: Majavah) [20:58:55] (CR) Krinkle: write_config: Add homer public repository [labs/codesearch] - https://gerrit.wikimedia.org/r/970852 (owner: Majavah) [20:58:58] (CR) Krinkle: [C: +2] write_config: Add homer public repository [labs/codesearch] - https://gerrit.wikimedia.org/r/970852 (owner: Majavah) [21:00:10] (Merged) jenkins-bot: write_config: Add homer public repository [labs/codesearch] - https://gerrit.wikimedia.org/r/970852 (owner: Majavah) [21:21:03] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [21:27:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [21:34:42] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [21:58:27] (OpenstackAPIResponse) firing: Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [22:07:03] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [22:09:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [22:15:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [23:01:48] (SystemdUnitDown) firing: The service unit purge_vm_backup.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [23:07:55] VPS-project-Wikistats: Automate wikistats commands - https://phabricator.wikimedia.org/T345235 (Dzahn) Is this really automating it though? I mean, sure, putting the commands into a script will make it a little easier but end of the day you are still manually running a (single) command and you have to react... [23:23:14] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [23:44:28] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [23:55:03] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown