[00:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[01:14:34] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[02:24:41] Toolforge (Software install/update): Create a kubernetes container with mono and dotnet - https://phabricator.wikimedia.org/T311466 (Hawkeye7) Using a dotnet buildpack is fairly simple. eg.: $ pack build liftwing --buildpack paketo-buildpacks/dotnet-core --builder paketobuildpacks/builder:base --env BP_DOT...
[03:22:31] Grid-Engine-to-K8s-Migration: Migrate ninthcircuit from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319924 (Legoktm) Open→Declined I've disabled this tool, source code is at https://gerrit.wikimedia.org/g/labs/tools/ninthcircuit if anyone wants to revive it.
[03:28:04] Grid-Engine-to-K8s-Migration: Migrate ci from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319628 (Legoktm) Open→Resolved I've disabled the cron job for now, it was broken for other reasons. This tool did useful stuff in the past so I would like to revive it event...
[03:37:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[03:40:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[03:45:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[03:58:06] Grid-Engine-to-K8s-Migration: Migrate lihaohong-bot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319856 (lihaohong) Open→Resolved
[04:14:34] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[04:16:33] (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[04:21:33] (SystemdUnitDown) resolved: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[05:04:51] RECOVERY - Check unit status of backup_vms on cloudbackup1003 is OK: OK: Status of the systemd unit backup_vms https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_unit_status_of_backup_vms
[06:37:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[07:04:47] RECOVERY - Check unit status of backup_cinder_volumes on cloudbackup2002 is OK: OK: Status of the systemd unit backup_cinder_volumes https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:14:34] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[08:58:40] Toolforge (Toolforge iteration 02): Add `toolforge build quota` command - https://phabricator.wikimedia.org/T341068 (Slst2020) It turns out that robot accounts don't have the necessary permissions to view project quotas. This is still the case when giving a robot account all the possible permission. The user...
[09:37:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[09:59:25] VPS-project-Codesearch, Special:NewLexeme revival, wmde-wikidata-tech: Please add wmde/new-lexeme-special-page to codesearch index - https://phabricator.wikimedia.org/T351938 (Lucas_Werkmeister_WMDE) `services`, I guess? I can’t figure out which group in the codesearch UI that corresponds to, but it’...
[10:02:16] Toolforge (Toolforge iteration 02): Give builds-api access to admin credentials - https://phabricator.wikimedia.org/T352007 (Slst2020)
[10:06:18] Toolforge (Toolforge iteration 02): Give builds-api access to system admin credentials - https://phabricator.wikimedia.org/T352007 (Slst2020)
[10:14:34] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[10:22:16] Tool-bub2: Redesign the FAQs page - https://phabricator.wikimedia.org/T340385 (Aklapper) @Ed-Gah: Hi! This task has been assigned to you a while ago. Could you maybe share an update? Do you still plan to work on this task, or [do you need any help](https://www.mediawiki.org/wiki/New_Developers/Communication_...
[10:56:24] Cloud-VPS, Toolforge, SRE: Some of my tools (eg wikidata-todo) just start throwing 504 errors - https://phabricator.wikimedia.org/T346126 (fnegri) Open→Resolved > Hello, > > https://templatetransclusioncheck.toolforge.org/ > > https://templatetransclusioncheck.toolforge.org/?lang=de&name=Vo...
[11:01:14] cloud-services-team (FY2023/2024-Q1-Q2), Infrastructure-Foundations, Packaging: wmfbackups packages for Debian Bookworm - https://phabricator.wikimedia.org/T347740 (fnegri) Thanks @jcrespo -- I'm not sure who did the upgrade, but I checked in debmonitor and 0.8.3 is now installed on all cloud hosts.
[11:01:28] Toolforge (Quota-requests), Patch-For-Review: Request increased quota for anchor-corrector Toolforge tool - https://phabricator.wikimedia.org/T350484 (CodeReviewBot) taavi opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/140 maintain-kubeusers: bump deployment...
[11:09:17] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers
[11:09:25] Toolforge (Quota-requests), Patch-For-Review: Request increased quota for anchor-corrector Toolforge tool - https://phabricator.wikimedia.org/T350484 (CodeReviewBot) taavi merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/140 maintain-kubeusers: bump deployment...
[11:09:31] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers
[11:11:48] Toolforge: maintain-kubeusers occasionally crashes to a LDAP connection error - https://phabricator.wikimedia.org/T352011 (taavi) p:Triage→Low
[11:12:41] cloud-services-team (FY2023/2024-Q1-Q2), Infrastructure-Foundations, Packaging: wmfbackups packages for Debian Bookworm - https://phabricator.wikimedia.org/T347740 (jcrespo) I'm afraid you don't have the latest version, https://debmonitor.wikimedia.org/packages/python3-wmfbackups you should upgrade t...
[11:14:42] Toolforge (Quota-requests): Request increased quota for anchor-corrector Toolforge tool - https://phabricator.wikimedia.org/T350484 (taavi) >>! In T350484#9357212, @Kanashimi wrote: > @taavi Can you help me increase the limit for continuous jobs? The current quota is clearly not enough. > ` > # toolforge-jo...
[11:39:31] Tool-bub2, Internet-Archive, Outreach-Programs-Projects, Outreachy (Round 27): For PDL, download and stream the PDF if available - https://phabricator.wikimedia.org/T348188 (Maryann-Onyinye) a:Razeetech→DO-NOT-CHANGE
[12:02:35] Grid-Engine-to-K8s-Migration: Migrate spi-table-bot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320054 (Mz7) Open→Resolved Alrighty, sorry again for the severe delay in getting this done. I think I have figured it out now. No longer is spi-table-bot running b...
[12:06:33] PROBLEM - Check unit status of backup_cinder_volumes on cloudbackup2001 is CRITICAL: CRITICAL: Status of the systemd unit backup_cinder_volumes https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:10:40] Toolforge (Toolforge iteration 02): Add `toolforge build quota` command - https://phabricator.wikimedia.org/T341068 (Slst2020) Looking further into this, toolsbeta-harbor and tools-harbor are both set up without a generic system admin user. * gitlab CI/CD and tekton ("image-builder") have separate robot acc...
[12:33:17] PAWS: jupyterlab to 4.0.9 - https://phabricator.wikimedia.org/T351726 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/paws/pull/351
[12:33:18] vivian-rook closed https://github.com/toolforge/paws/pull/351
[12:33:23] PAWS: jupyterlab to 4.0.9 - https://phabricator.wikimedia.org/T351726 (rook) Open→Resolved a:rook
[12:42:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[12:48:20] Toolforge (Toolforge iteration 02): Give builds-api access to system admin credentials - https://phabricator.wikimedia.org/T352007 (Slst2020) Open→Invalid production harbor already has access to the maintain-harbor user's credentials, which although not a system-admin user is enough for this use case...
[12:48:25] Toolforge (Toolforge iteration 02): Add `toolforge build quota` command - https://phabricator.wikimedia.org/T341068 (Slst2020)
[12:53:37] Toolforge (Toolforge iteration 02): [builds-api] Use regular user credentials for Harbor API auth in dev - https://phabricator.wikimedia.org/T352022 (Slst2020)
[13:01:39] RECOVERY - Check unit status of backup_glance_images on cloudbackup1003 is OK: OK: Status of the systemd unit backup_glance_images https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:14:04] (SystemdUnitDownForLong) resolved: The systemd unit backup_glance_images.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[13:14:34] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[13:18:37] (CephSlowOps) firing: Ceph cluster in eqiad has 1 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[13:18:42] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (phaultfinder)
[13:23:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 1 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[13:29:02] (CR) Jforrester: "Nice!" [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/975400 (https://phabricator.wikimedia.org/T350778) (owner: Merlijn van Deen)
[13:29:29] (CR) Jforrester: [C: +1] "<3" [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/977293 (owner: Merlijn van Deen)
[13:34:08] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[14:14:14] Cloud-VPS, cloud-services-team, decommission-hardware: decommission cloudmetrics1003.eqiad.wmnet, cloudmetrics1004.eqiad.wmnet - https://phabricator.wikimedia.org/T351077 (Andrew) @RobH We no longer need these servers but they're not due for rotation until 2026. Hopefully we (or someone) finds a futu...
[14:14:23] Cloud-VPS, cloud-services-team, decommission-hardware: reclaim cloudmetrics1003.eqiad.wmnet, cloudmetrics1004.eqiad.wmnet - https://phabricator.wikimedia.org/T351077 (Andrew)
[14:15:15] Cloud-VPS, cloud-services-team (Hardware), decommission-hardware, ops-eqiad: reclaim cloudmetrics1003.eqiad.wmnet, cloudmetrics1004.eqiad.wmnet - https://phabricator.wikimedia.org/T351077 (taavi)
[14:15:36] Cloud-VPS, cloud-services-team (Hardware), decommission-hardware, ops-eqiad: reclaim cloudmetrics1003.eqiad.wmnet, cloudmetrics1004.eqiad.wmnet - https://phabricator.wikimedia.org/T351077 (taavi)
[14:26:15] cloud-services-team (FY2023/2024-Q1-Q2), Infrastructure-Foundations, Packaging: wmfbackups packages for Debian Bookworm - https://phabricator.wikimedia.org/T347740 (fnegri) Good catch, I only checked if there was anything on <0.8.3 and didn't notice the `u1` vs `u2` difference! I have now upgraded a...
[14:50:13] cloud-services-team (FY2023/2024-Q1-Q2), Infrastructure-Foundations, Packaging: wmfbackups packages for Debian Bookworm - https://phabricator.wikimedia.org/T347740 (jcrespo) Thank you, and sorry for the urgency- normally these kind of packages always keep backwards compatibility (and they did here to...
[14:51:07] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bookworm
[14:59:33] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1053. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1053 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[14:59:34] (SystemdUnitDown) firing: (5) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1049. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1049 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[14:59:39] (SystemdUnitDown) firing: (5) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1050. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1050 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[14:59:44] (SystemdUnitDown) firing: (5) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1043. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1043 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:00:33] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1034. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1034 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:01:33] (SystemdUnitDown) firing: (5) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1038. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1038 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:01:34] (SystemdUnitDown) firing: (5) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1047. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1047 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:01:39] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1060. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1060 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:03:33] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1059. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1059 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:03:34] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirtlocal1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirtlocal1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:04:33] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1049. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1049 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:04:34] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1043. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1043 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:04:39] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1050. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1050 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:06:33] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1056. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1056 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:06:34] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt-wdqs1002. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt-wdqs1002 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:06:39] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1038. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1038 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:06:44] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1047. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1047 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:06:49] (SystemdUnitDown) firing: (5) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1051. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1051 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:07:34] (SystemdUnitDown) firing: (5) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1039. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1039 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:07:34] (SystemdUnitDown) firing: (5) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1042. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1042 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:07:39] (SystemdUnitDown) firing: (5) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1067. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1067 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:08:33] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1035. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1035 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:10:34] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1044. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1044 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:11:33] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1051. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1051 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:11:34] (SystemdUnitDown) firing: (5) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1031. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1031 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:12:34] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1042. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1042 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:12:34] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1067. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1067 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:12:39] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1039. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1039 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:16:34] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1060. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1060 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:16:34] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1031. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1031 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:24:34] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1043. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1043 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:24:34] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1049. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1049 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:24:39] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1053. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1053 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:25:33] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1034. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1034 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:26:34] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1038. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1038 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:27:34] Grid-Engine-to-K8s-Migration: Migrate dbreps from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319665 (Legoktm) The main dbreps job is running k8s now, it's just the `build-rust.sh` script that needs to be ported.
[15:28:33] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1059. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1059 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:28:34] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirtlocal1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirtlocal1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:29:34] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1050. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1050 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:29:34] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1043. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1043 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:29:39] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1049. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1049 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:31:33] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt-wdqs1002. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt-wdqs1002 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:31:34] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1056. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1056 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:31:39] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1047. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1047 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:31:44] (SystemdUnitDown) resolved: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1060. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1060 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:32:34] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1067. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1067 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:32:34] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1042. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1042 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:32:48] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1067. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1067 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:33:03] (SystemdUnitDown) firing: The service unit libvirtd-tls.socket is in failed status on host cloudvirt1060. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1060 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:33:34] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1035. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1035 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:35:34] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1044. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1044 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:36:34] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt-wdqs1002. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt-wdqs1002 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:36:34] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1051. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1051 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:36:39] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1031. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1031 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:36:48] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1031. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1031 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:37:34] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1067. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1067 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:37:34] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1042. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1042 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:37:39] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1039. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1039 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:41:34] (SystemdUnitDown) firing: (6) The service unit libvirtd-admin.socket is in failed status on host cloudvirt1031.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1031 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:42:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [15:46:52] 10Tools: The link in the CropTool interface doesn’t work - https://phabricator.wikimedia.org/T352034 (10JJMC89) 05Open→03Invalid Issues are tracked at https://github.com/danmichaelo/croptool/issues. [15:46:54] 10Cloud-VPS, 10cloud-services-team (Hardware), 10SRE, 10decommission-hardware, 10ops-eqiad: reclaim cloudmetrics1003.eqiad.wmnet, cloudmetrics1004.eqiad.wmnet - https://phabricator.wikimedia.org/T351077 (10Jclark-ctr) [15:50:03] 10Cloud-VPS, 10cloud-services-team (Hardware), 10SRE, 10decommission-hardware, 10ops-eqiad: reclaim cloudmetrics1003.eqiad.wmnet, cloudmetrics1004.eqiad.wmnet - https://phabricator.wikimedia.org/T351077 (10Jclark-ctr) 05Open→03Resolved a:05taavi→03Jclark-ctr [15:53:03] (SystemdUnitDown) resolved: The service unit libvirtd-tls.socket is in failed status on host cloudvirt1060. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1060 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:08:34] (SystemdUnitDown) resolved: The service unit libvirtd-tls.socket is in failed status on host cloudvirt1059. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1059 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:13:59] (PuppetFailure) firing: Puppet has failed on cloudvirt1053:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:14:04] 10cloud-services-team: PuppetFailure cloudvirt1053:9100 Puppet failure on cloudvirt1053:9100 - https://phabricator.wikimedia.org/T352037 (10phaultfinder) [16:14:34] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [16:14:55] 10cloud-services-team, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission cloudvirt1025-cloudvirt1030.eqiad.wmnet - https://phabricator.wikimedia.org/T351010 (10Jclark-ctr) [16:14:59] (PuppetFailure) firing: Puppet has failed on cloudvirt1050:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:14:59] (PuppetFailure) firing: Puppet has failed on cloudvirt1043:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:15:06] 10cloud-services-team, 10SRE, 10decommission-hardware, 10ops-eqiad: decommission cloudvirt1025-cloudvirt1030.eqiad.wmnet - https://phabricator.wikimedia.org/T351010 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [16:15:08] 10cloud-services-team: PuppetFailure cloudvirt1050:9100 Puppet failure on cloudvirt1050:9100 - https://phabricator.wikimedia.org/T352038 (10phaultfinder) [16:15:10] 
10cloud-services-team: PuppetFailure cloudvirt1043:9100 Puppet failure on cloudvirt1043:9100 - https://phabricator.wikimedia.org/T352039 (10phaultfinder) [16:15:59] (PuppetFailure) firing: Puppet has failed on cloudvirt1049:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:15:59] (PuppetFailure) firing: Puppet has failed on cloudvirt1034:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:15:59] (PuppetFailure) firing: Puppet has failed on cloudvirt1038:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:16:05] 10cloud-services-team: PuppetFailure cloudvirt1049:9100 Puppet failure on cloudvirt1049:9100 - https://phabricator.wikimedia.org/T352040 (10phaultfinder) [16:16:07] 10cloud-services-team: PuppetFailure cloudvirt1034:9100 Puppet failure on cloudvirt1034:9100 - https://phabricator.wikimedia.org/T352041 (10phaultfinder) [16:16:09] 10cloud-services-team: PuppetFailure cloudvirt1038:9100 Puppet failure on cloudvirt1038:9100 - https://phabricator.wikimedia.org/T352042 (10phaultfinder) [16:17:06] 10cloud-services-team: PuppetFailure cloudvirt1047:9100 Puppet failure on cloudvirt1047:9100 - https://phabricator.wikimedia.org/T352043 (10phaultfinder) [16:17:14] (PuppetFailure) firing: Puppet has failed on cloudvirt1047:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:17:28] (WidespreadPuppetFailure) firing: Puppet has failed on wmcs cluster - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=wmcs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [16:17:59] (PuppetFailure) firing: Puppet has failed on cloudvirtlocal1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:18:04] 10cloud-services-team: PuppetFailure cloudvirtlocal1003:9100 Puppet failure on cloudvirtlocal1003:9100 - https://phabricator.wikimedia.org/T352044 (10phaultfinder) [16:18:59] (PuppetFailure) firing: Puppet has failed on cloudvirt1059:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:19:05] 10cloud-services-team: PuppetFailure cloudvirt1059:9100 Puppet failure on cloudvirt1059:9100 - https://phabricator.wikimedia.org/T352045 (10phaultfinder) [16:20:59] (PuppetFailure) firing: Puppet has failed on cloudvirt-wdqs1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:20:59] (PuppetFailure) firing: Puppet has failed on cloudvirt1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:21:03] 10cloud-services-team: PuppetFailure cloudvirt-wdqs1002:9100 Puppet failure on cloudvirt-wdqs1002:9100 - https://phabricator.wikimedia.org/T352046 (10phaultfinder) [16:21:05] 10cloud-services-team: PuppetFailure cloudvirt1056:9100 Puppet failure on cloudvirt1056:9100 - https://phabricator.wikimedia.org/T352047 (10phaultfinder) [16:21:59] (PuppetFailure) firing: Puppet has failed on cloudvirt1051:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:22:04] 10cloud-services-team: PuppetFailure cloudvirt1051:9100 Puppet failure on cloudvirt1051:9100 - https://phabricator.wikimedia.org/T352048 (10phaultfinder) [16:22:59] (PuppetFailure) firing: Puppet has failed on cloudvirt1039:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:22:59] (PuppetFailure) firing: Puppet has failed on cloudvirt1035:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:22:59] (PuppetFailure) firing: Puppet has failed on cloudvirt1067:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:23:03] (PuppetFailure) firing: Puppet has failed on cloudvirt1042:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:23:04] 10cloud-services-team: PuppetFailure cloudvirt1039:9100 Puppet failure on cloudvirt1039:9100 - https://phabricator.wikimedia.org/T352049 (10phaultfinder) [16:23:06] 10cloud-services-team: PuppetFailure cloudvirt1042:9100 Puppet failure on cloudvirt1042:9100 - https://phabricator.wikimedia.org/T352051 (10phaultfinder) [16:23:08] 10cloud-services-team: PuppetFailure cloudvirt1067:9100 Puppet failure on cloudvirt1067:9100 - https://phabricator.wikimedia.org/T352050 (10phaultfinder) [16:23:10] 10cloud-services-team: PuppetFailure cloudvirt1035:9100 Puppet failure on cloudvirt1035:9100 - https://phabricator.wikimedia.org/T352052 (10phaultfinder) [16:24:35] (SystemdUnitDown) resolved: The service unit libvirtd-tls.socket is in failed status on host cloudvirt1043. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1043 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:24:59] (PuppetFailure) firing: Puppet has failed on cloudvirt1044:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:25:04] (SystemdUnitDown) firing: The service unit libvirtd-tls.socket is in failed status on host cloudvirt1043. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1043 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:25:04] 10cloud-services-team: PuppetFailure cloudvirt1044:9100 Puppet failure on cloudvirt1044:9100 - https://phabricator.wikimedia.org/T352053 (10phaultfinder) [16:26:59] (PuppetFailure) firing: Puppet has failed on cloudvirt1031:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:27:04] 10cloud-services-team: PuppetFailure cloudvirt1031:9100 Puppet failure on cloudvirt1031:9100 - https://phabricator.wikimedia.org/T352054 (10phaultfinder) [16:28:59] (PuppetFailure) resolved: Puppet has failed on cloudvirt1059:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:34:58] 10Toolforge: Toolforge Kubernetes quota requests.memory was reduced - https://phabricator.wikimedia.org/T352055 (10Bamyers99) [16:34:59] (PuppetFailure) resolved: Puppet has failed on cloudvirt1044:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:35:34] (SystemdUnitDown) resolved: The service unit libvirtd-tls.socket is in failed status on host cloudvirt1044. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1044 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:36:34] (SystemdUnitDown) resolved: The service unit libvirtd-tls.socket is in failed status on host cloudvirt1031. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1031 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:36:59] (PuppetFailure) resolved: Puppet has failed on cloudvirt1031:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:37:34] (SystemdUnitDown) resolved: The service unit libvirtd-tls.socket is in failed status on host cloudvirt1039. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1039 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:37:59] (PuppetFailure) resolved: Puppet has failed on cloudvirt1039:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:52:48] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1-Q2): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirtlocal1001.eqiad.wmnet with OS bookworm completed: - cloudvirt... 
[16:53:59] (PuppetFailure) resolved: Puppet has failed on cloudvirt1053:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:54:34] (SystemdUnitDown) resolved: The service unit libvirtd-tls.socket is in failed status on host cloudvirt1053. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1053 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:54:34] (SystemdUnitDown) resolved: The service unit libvirtd-tls.socket is in failed status on host cloudvirt1049. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1049 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:54:48] (SystemdUnitDown) resolved: The service unit libvirtd-tls.socket is in failed status on host cloudvirt1043. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1043 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:54:59] (PuppetFailure) resolved: Puppet has failed on cloudvirt1050:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:55:33] (SystemdUnitDown) resolved: The service unit libvirtd-tls.socket is in failed status on host cloudvirt1034. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1034 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:55:59] (PuppetFailure) resolved: Puppet has failed on cloudvirt1049:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:55:59] (PuppetFailure) resolved: Puppet has failed on cloudvirt1034:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:55:59] (PuppetFailure) resolved: Puppet has failed on cloudvirt1038:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:56:33] (SystemdUnitDownForLong) firing: The systemd unit libvirtd-tls.socket on node cloudvirt1047 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1047 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [16:56:34] (SystemdUnitDown) resolved: The service unit libvirtd-tls.socket is in failed status on host cloudvirt1038. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1038 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:56:38] 10cloud-services-team: SystemdUnitDownForLong cloudvirt1047:9100 Unit libvirtd-tls.socket on node cloudvirt1047 has been down for long. 
- https://phabricator.wikimedia.org/T352056 (10phaultfinder) [16:57:59] (PuppetFailure) resolved: Puppet has failed on cloudvirtlocal1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:58:34] (SystemdUnitDown) resolved: The service unit libvirtd-tls.socket is in failed status on host cloudvirtlocal1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirtlocal1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:59:34] (SystemdUnitDown) resolved: The service unit libvirtd-tls.socket is in failed status on host cloudvirt1050. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1050 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:59:59] (PuppetFailure) resolved: Puppet has failed on cloudvirt1043:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:00:59] (PuppetFailure) resolved: Puppet has failed on cloudvirt1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:01:34] (SystemdUnitDown) resolved: The service unit libvirtd-tls.socket is in failed status on host cloudvirt1056. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1056 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:01:34] (SystemdUnitDown) resolved: The service unit libvirtd-tls.socket is in failed status on host cloudvirt-wdqs1002. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt-wdqs1002 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:01:39] (SystemdUnitDown) resolved: The service unit libvirtd-tls.socket is in failed status on host cloudvirt1051. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1051 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:02:29] (WidespreadPuppetFailure) resolved: Puppet has failed on wmcs cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=wmcs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:02:33] (SystemdUnitDownForLong) firing: The systemd unit libvirtd-tls.socket on node cloudvirt1067 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1067 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [17:02:34] (SystemdUnitDown) resolved: The service unit libvirtd-tls.socket is in failed status on host cloudvirt1042. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1042 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:02:39] 10cloud-services-team: SystemdUnitDownForLong cloudvirt1067:9100 Unit libvirtd-tls.socket on node cloudvirt1067 has been down for long. - https://phabricator.wikimedia.org/T352057 (10phaultfinder) [17:02:39] (SystemdUnitDown) resolved: The service unit libvirtd-tls.socket is in failed status on host cloudvirt1067. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1067 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:02:40] RECOVERY - Check unit status of backup_vms on cloudbackup1004 is OK: OK: Status of the systemd unit backup_vms https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_unit_status_of_backup_vms [17:02:59] (PuppetFailure) resolved: Puppet has failed on cloudvirt1035:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:02:59] (PuppetFailure) resolved: Puppet has failed on cloudvirt1042:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:03:33] (SystemdUnitDown) resolved: The service unit libvirtd-tls.socket is in failed status on host cloudvirt1035. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1035 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:05:59] (PuppetFailure) resolved: Puppet has failed on cloudvirt-wdqs1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:06:33] (SystemdUnitDownForLong) resolved: The systemd unit libvirtd-tls.socket on node cloudvirt1047 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1047 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [17:06:34] (SystemdUnitDown) resolved: The service unit libvirtd-tls.socket is in failed status on host cloudvirt1047. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1047 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:06:43] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1-Q2): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1002.eqiad.wmnet with OS bookworm [17:06:59] (PuppetFailure) resolved: Puppet has failed on cloudvirt1051:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:07:34] (SystemdUnitDownForLong) resolved: The systemd unit libvirtd-tls.socket on node cloudvirt1067 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1067 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [17:07:59] (PuppetFailure) resolved: Puppet has failed on cloudvirt1067:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:08:45] 10Toolforge: Toolforge Kubernetes quota requests.memory was reduced - https://phabricator.wikimedia.org/T352055 (10Bamyers99) [17:09:42] 10cloud-services-team: SystemdUnitDownForLong cloudvirt1067:9100 Unit libvirtd-tls.socket on node cloudvirt1067 has been down for long. - https://phabricator.wikimedia.org/T352057 (10taavi) 05Open→03Resolved a:03taavi [17:09:45] 10cloud-services-team: SystemdUnitDownForLong cloudvirt1047:9100 Unit libvirtd-tls.socket on node cloudvirt1047 has been down for long. 
- https://phabricator.wikimedia.org/T352056 (10taavi) 05Open→03Resolved a:03taavi [17:09:47] 10cloud-services-team: PuppetFailure cloudvirt1031:9100 Puppet failure on cloudvirt1031:9100 - https://phabricator.wikimedia.org/T352054 (10taavi) 05Open→03Resolved a:03taavi [17:09:49] 10cloud-services-team: PuppetFailure cloudvirt1044:9100 Puppet failure on cloudvirt1044:9100 - https://phabricator.wikimedia.org/T352053 (10taavi) 05Open→03Resolved a:03taavi [17:09:51] 10cloud-services-team: PuppetFailure cloudvirt1042:9100 Puppet failure on cloudvirt1042:9100 - https://phabricator.wikimedia.org/T352051 (10taavi) 05Open→03Resolved a:03taavi [17:09:53] 10cloud-services-team: PuppetFailure cloudvirt1035:9100 Puppet failure on cloudvirt1035:9100 - https://phabricator.wikimedia.org/T352052 (10taavi) 05Open→03Resolved a:03taavi [17:09:56] 10cloud-services-team: PuppetFailure cloudvirt1067:9100 Puppet failure on cloudvirt1067:9100 - https://phabricator.wikimedia.org/T352050 (10taavi) 05Open→03Resolved a:03taavi [17:09:58] 10cloud-services-team: PuppetFailure cloudvirt1051:9100 Puppet failure on cloudvirt1051:9100 - https://phabricator.wikimedia.org/T352048 (10taavi) 05Open→03Resolved a:03taavi [17:10:00] 10cloud-services-team: PuppetFailure cloudvirt1039:9100 Puppet failure on cloudvirt1039:9100 - https://phabricator.wikimedia.org/T352049 (10taavi) 05Open→03Resolved a:03taavi [17:10:02] 10cloud-services-team: PuppetFailure cloudvirt1056:9100 Puppet failure on cloudvirt1056:9100 - https://phabricator.wikimedia.org/T352047 (10taavi) 05Open→03Resolved a:03taavi [17:10:04] 10cloud-services-team: PuppetFailure cloudvirt-wdqs1002:9100 Puppet failure on cloudvirt-wdqs1002:9100 - https://phabricator.wikimedia.org/T352046 (10taavi) 05Open→03Resolved a:03taavi [17:10:06] 10cloud-services-team: PuppetFailure cloudvirt1059:9100 Puppet failure on cloudvirt1059:9100 - https://phabricator.wikimedia.org/T352045 (10taavi) 05Open→03Resolved a:03taavi [17:10:08] 10cloud-services-team: PuppetFailure 
cloudvirtlocal1003:9100 Puppet failure on cloudvirtlocal1003:9100 - https://phabricator.wikimedia.org/T352044 (10taavi) 05Open→03Resolved a:03taavi [17:10:10] 10cloud-services-team: PuppetFailure cloudvirt1034:9100 Puppet failure on cloudvirt1034:9100 - https://phabricator.wikimedia.org/T352041 (10taavi) 05Open→03Resolved a:03taavi [17:10:12] 10cloud-services-team: PuppetFailure cloudvirt1047:9100 Puppet failure on cloudvirt1047:9100 - https://phabricator.wikimedia.org/T352043 (10taavi) 05Open→03Resolved a:03taavi [17:10:14] 10cloud-services-team: PuppetFailure cloudvirt1038:9100 Puppet failure on cloudvirt1038:9100 - https://phabricator.wikimedia.org/T352042 (10taavi) 05Open→03Resolved a:03taavi [17:10:16] 10cloud-services-team: PuppetFailure cloudvirt1049:9100 Puppet failure on cloudvirt1049:9100 - https://phabricator.wikimedia.org/T352040 (10taavi) 05Open→03Resolved a:03taavi [17:10:18] 10cloud-services-team: SystemdUnitDownForLong cloudbackup1003:9100 - https://phabricator.wikimedia.org/T351979 (10taavi) 05Open→03Resolved a:03taavi [17:10:20] 10cloud-services-team: PuppetFailure cloudvirt1050:9100 Puppet failure on cloudvirt1050:9100 - https://phabricator.wikimedia.org/T352038 (10taavi) 05Open→03Resolved a:03taavi [17:10:22] 10cloud-services-team: PuppetFailure cloudvirt1043:9100 Puppet failure on cloudvirt1043:9100 - https://phabricator.wikimedia.org/T352039 (10taavi) 05Open→03Resolved a:03taavi [17:10:24] 10cloud-services-team: PuppetFailure cloudvirt1053:9100 Puppet failure on cloudvirt1053:9100 - https://phabricator.wikimedia.org/T352037 (10taavi) 05Open→03Resolved a:03taavi [17:10:26] 10cloud-services-team: NeutronAgentDown cloudvirt1028 A Neutron agent is down, VMs will have connectivity issues - https://phabricator.wikimedia.org/T351541 (10taavi) 05Open→03Resolved a:03taavi [17:10:29] 10cloud-services-team: NeutronAgentDown cloudvirt1029 A Neutron agent is down, VMs will have connectivity issues - https://phabricator.wikimedia.org/T351540 (10taavi) 
05Open→03Resolved a:03taavi [17:10:31] 10cloud-services-team: PuppetZeroResources cloudcontrol2004-dev:9100 Zero Puppet resources on cloudcontrol2004-dev:9100 - https://phabricator.wikimedia.org/T351739 (10taavi) 05Open→03Resolved a:03taavi [17:10:33] 10cloud-services-team: NeutronAgentDown cloudvirt1025 A Neutron agent is down, VMs will have connectivity issues - https://phabricator.wikimedia.org/T351539 (10taavi) 05Open→03Resolved a:03taavi [17:10:35] 10cloud-services-team: NeutronAgentDown cloudvirt1027 A Neutron agent is down, VMs will have connectivity issues - https://phabricator.wikimedia.org/T351537 (10taavi) 05Open→03Resolved a:03taavi [17:10:37] 10cloud-services-team: NeutronAgentDown cloudvirt1046 A Neutron agent is down, VMs will have connectivity issues - https://phabricator.wikimedia.org/T351538 (10taavi) 05Open→03Resolved a:03taavi [17:10:41] 10cloud-services-team: NeutronAgentDown cloudvirt1026 A Neutron agent is down, VMs will have connectivity issues - https://phabricator.wikimedia.org/T351536 (10taavi) 05Open→03Resolved a:03taavi [17:10:43] 10cloud-services-team: NeutronAgentDown cloudvirt1030 A Neutron agent is down, VMs will have connectivity issues - https://phabricator.wikimedia.org/T351535 (10taavi) 05Open→03Resolved a:03taavi [17:10:45] 10cloud-services-team: SystemdUnitDownForLong cloudvirt1056:9100 Unit systemd-machined.service on node cloudvirt1056 has been down for long. - https://phabricator.wikimedia.org/T351188 (10taavi) 05Open→03Resolved a:03taavi [17:10:47] 10cloud-services-team: SystemdUnitDownForLong cloudvirt1047:9100 Unit systemd-machined.service on node cloudvirt1047 has been down for long. - https://phabricator.wikimedia.org/T351187 (10taavi) 05Open→03Resolved a:03taavi [17:10:49] 10cloud-services-team: SystemdUnitDownForLong cloudvirt1054:9100 Unit systemd-machined.service on node cloudvirt1054 has been down for long. 
- https://phabricator.wikimedia.org/T351185 (10taavi) 05Open→03Resolved a:03taavi [17:10:51] 10cloud-services-team: SystemdUnitDownForLong cloudvirt1050:9100 Unit systemd-machined.service on node cloudvirt1050 has been down for long. - https://phabricator.wikimedia.org/T351186 (10taavi) 05Open→03Resolved a:03taavi [17:10:53] 10cloud-services-team: PuppetFailure cloudvirt2001-dev:9100 Puppet failure on cloudvirt2001-dev:9100 - https://phabricator.wikimedia.org/T351169 (10taavi) 05Open→03Resolved a:03taavi [17:10:55] 10cloud-services-team: PuppetFailure cloudcumin1001:9100 Puppet failure on cloudcumin1001:9100 - https://phabricator.wikimedia.org/T351013 (10taavi) 05Open→03Resolved a:03taavi [17:10:57] 10cloud-services-team: InterfaceSpeedError cloudvirt1066:9100 brq7425e328-56 on cloudvirt1066:9100 has the wrong speed: 1.25e+06. - https://phabricator.wikimedia.org/T351006 (10taavi) 05Open→03Resolved a:03taavi [17:10:59] 10cloud-services-team: SystemdUnitDownForLong cloudvirt1065:9100 Unit wmf_auto_restart_virtlogd.service on node cloudvirt1065 has been down for long. - https://phabricator.wikimedia.org/T351005 (10taavi) 05Open→03Resolved a:03taavi [17:11:01] 10cloud-services-team: SystemdUnitDownForLong cloudbackup1004:9100 Unit purge_vm_backup.service on node cloudbackup1004 has been down for long. 
- https://phabricator.wikimedia.org/T350415 (10taavi) 05Open→03Resolved a:03taavi [17:11:03] 10cloud-services-team: HAProxyServiceUnavailable cloudlb1002:9900 HAProxy service neutron-api_backend has no available backends on cloudlb1002:9900 - https://phabricator.wikimedia.org/T350358 (10taavi) 05Open→03Resolved a:03taavi [17:11:05] 10cloud-services-team: PuppetFailure cloudservices1006:9100 Puppet failure on cloudservices1006:9100 - https://phabricator.wikimedia.org/T350808 (10taavi) 05Open→03Resolved a:03taavi [17:11:07] 10cloud-services-team: PuppetFailure cloudvirt1064:9100 Puppet failure on cloudvirt1064:9100 - https://phabricator.wikimedia.org/T351004 (10taavi) 05Open→03Resolved a:03taavi [17:11:09] 10cloud-services-team: SystemdUnitDownForLong cloudcontrol1007:9100 Unit keystone_rotate_keys.service on node cloudcontrol1007 has been down for long. - https://phabricator.wikimedia.org/T350198 (10taavi) 05Open→03Resolved a:03taavi [17:11:11] 10cloud-services-team: SystemdUnitDownForLong cloudcontrol1007:9100 - https://phabricator.wikimedia.org/T350178 (10taavi) 05Open→03Resolved a:03taavi [17:11:13] 10cloud-services-team: SystemdUnitDownForLong cloudcontrol1005:9100 Unit keystone_rotate_keys.service on node cloudcontrol1005 has been down for long. - https://phabricator.wikimedia.org/T350207 (10taavi) 05Open→03Resolved a:03taavi [17:11:15] 10cloud-services-team: SystemdUnitDownForLong cloudcontrol1007:9100 Unit prometheus-openstack-exporter.service on node cloudcontrol1007 has been down for long. 
- https://phabricator.wikimedia.org/T350146 (10taavi) 05Open→03Resolved a:03taavi [17:11:17] 10cloud-services-team: ProbeDown virt.cloudgw.eqiad1.wikimediacloud.org:0 - https://phabricator.wikimedia.org/T350139 (10taavi) 05Open→03Resolved a:03taavi [17:11:21] 10cloud-services-team: ProbeDown wan.cloudgw.eqiad1.wikimediacloud.org:0 - https://phabricator.wikimedia.org/T350140 (10taavi) 05Open→03Resolved a:03taavi [17:11:25] 10cloud-services-team: SystemdUnitDownForLong cloudcontrol1006:9100 Unit nova-fullstack.service on node cloudcontrol1006 has been down for long. - https://phabricator.wikimedia.org/T350144 (10taavi) 05Open→03Resolved a:03taavi [17:11:29] 10cloud-services-team: HAProxyServiceUnavailable cloudlb1001:9900 - https://phabricator.wikimedia.org/T350128 (10taavi) 05Open→03Resolved a:03taavi [17:11:31] 10cloud-services-team: PuppetFailure cloudcontrol1005:9100 Puppet failure on cloudcontrol1005:9100 - https://phabricator.wikimedia.org/T350115 (10taavi) 05Open→03Resolved a:03taavi [17:11:33] 10cloud-services-team: PuppetFailure clouddumps1002:9100 Puppet failure on clouddumps1002:9100 - https://phabricator.wikimedia.org/T350096 (10taavi) 05Open→03Resolved a:03taavi [17:11:35] 10cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (10taavi) 05Open→03Resolved [17:11:37] 10cloud-services-team: CephSlowOps Ceph cluster in has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349425 (10taavi) 05Open→03Resolved a:03taavi [17:11:39] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1-Q2), 10DC-Ops, 10SRE, 10ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10taavi) [17:11:47] 10cloud-services-team, 10Observability-Alerting: Alertmanager Phabricator integration for WMCS alerts is too spammy - https://phabricator.wikimedia.org/T352059 (10taavi) [17:14:03] (SystemdUnitDown) 
resolved: The service unit backup_vms.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:14:04] (SystemdUnitDownForLong) resolved: The systemd unit backup_vms.service on node cloudbackup1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [17:31:59] (PuppetFailure) resolved: Puppet has failed on cloudvirt1047:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:32:27] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [17:48:30] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1-Q2): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirtlocal1002.eqiad.wmnet with OS bookworm completed: - cloudvirt... 
[18:02:00] PROBLEM - ensure kvm processes are running on cloudvirtlocal1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:10:07] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirtlocal1003.eqiad.wmnet with OS bookworm [18:42:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [18:49:51] cloud-services-team, Observability-Alerting: Automatically close stale alertmanager created tasks - https://phabricator.wikimedia.org/T352079 (taavi) [18:54:27] (CR) Jforrester: "check experimental" [labs/countervandalism/cvn-api] - https://gerrit.wikimedia.org/r/879971 (owner: Krinkle) [18:55:31] (CR) Jforrester: "check experimental" [labs/countervandalism/stillalive] - https://gerrit.wikimedia.org/r/882228 (https://phabricator.wikimedia.org/T306066) (owner: AntiCompositeNumber) [18:55:57] (CR) Jforrester: "check experimental" [labs/tools/blankpages] - https://gerrit.wikimedia.org/r/963431 (owner: Krinkle) [18:56:30] (CR) Jforrester: [C: +1] "check experimental" [labs/tools/coverme] - https://gerrit.wikimedia.org/r/896473 (owner: Libraryupgrader) [18:57:04] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirtlocal1003.eqiad.wmnet with OS bookworm completed: - cloudvirt...
[18:57:08] (CR) Jforrester: "check experimental" [labs/tools/extjsonuploader] - https://gerrit.wikimedia.org/r/904170 (owner: Brian Wolff) [18:57:20] (CR) Jforrester: "check experimental" [labs/tools/force-rebase] - https://gerrit.wikimedia.org/r/881434 (owner: DannyS712) [18:57:46] (CR) Jforrester: "check experimental" [labs/tools/intuition] - https://gerrit.wikimedia.org/r/977640 (owner: L10n-bot) [18:59:57] (CR) Jforrester: "check experimental" [labs/tools/intuition-web] - https://gerrit.wikimedia.org/r/924068 (owner: L10n-bot) [19:00:18] (CR) Jforrester: "check experimental" [labs/tools/usage] - https://gerrit.wikimedia.org/r/908978 (owner: Krinkle) [19:00:20] (CR) Jforrester: "check experimental" [labs/tools/orphantalk] - https://gerrit.wikimedia.org/r/955708 (owner: L10n-bot) [19:00:52] (CR) Jforrester: "check experimental" [labs/tools/guc] - https://gerrit.wikimedia.org/r/977654 (owner: L10n-bot) [19:02:22] RECOVERY - Check unit status of backup_cinder_volumes on cloudbackup2001 is OK: OK: Status of the systemd unit backup_cinder_volumes https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:14:34] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [19:17:37] (CephSlowOps) firing: Ceph cluster in eqiad has 34 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [19:17:43] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T352082 (phaultfinder) [19:20:58] Cloud-VPS,
10cloud-services-team (FY2023/2024-Q1-Q2): [openstack] Upgrade eqiad1 cluster to Antelope - https://phabricator.wikimedia.org/T348843 (10Andrew) [19:21:00] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1-Q2): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10Andrew) 05In progress→03Resolved [19:21:37] (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [19:22:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 103 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [19:26:37] (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [19:35:25] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [19:41:56] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [19:43:37] (CephSlowOps) firing: Ceph cluster in eqiad has 94 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [19:43:41] 10cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow 
ops, which might be blocking some writes - https://phabricator.wikimedia.org/T352082 (10phaultfinder) [19:48:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 94 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [19:50:10] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1025 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [19:53:10] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1026 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [19:53:10] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1030 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [19:53:10] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1027 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [19:53:15] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1029 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - 
https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [19:53:20] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1028 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [19:54:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [19:54:10] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1046 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [19:57:03] (TfInfraTestDestroyFailed) resolved: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [19:57:10] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1046 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [20:00:10] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1025 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [20:00:10] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1028 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures -
https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [20:00:10] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1030 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [20:00:15] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1026 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [20:00:20] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1029 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [20:00:25] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1027 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [20:04:03] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:18:33] 10Data-Services, 10cloud-services-team, 10Data Products: 'mbh' tool cannot log in to srwiki's DB replica on Toolforge - https://phabricator.wikimedia.org/T351316 (10bd808) [20:20:30] 10Data-Services, 10cloud-services-team: 'mbh' tool cannot log in to srwiki's DB replica on Toolforge - https://phabricator.wikimedia.org/T351316 (10VirginiaPoundstone) [20:37:53] 10Data-Services, 10cloud-services-team: 'mbh' tool 
cannot log in to srwiki's DB replica when attempting to connect using the legacy 'srwiki.labsdb' hostname - https://phabricator.wikimedia.org/T351316 (10bd808) [20:38:52] 10Data-Services, 10cloud-services-team, 10User-bd808: 'mbh' tool cannot log in to srwiki's DB replica when attempting to connect using the legacy 'srwiki.labsdb' hostname - https://phabricator.wikimedia.org/T351316 (10bd808) 05Open→03Resolved a:03bd808 The credentials in the /data/project/mbh/replica.m... [20:54:20] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [20:59:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [21:32:42] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [22:03:56] 10Toolforge (Quota-requests): Request increased quota for anchor-corrector Toolforge tool - https://phabricator.wikimedia.org/T350484 (10Kanashimi) @taavi Thank you! I can run these jobs entirely. Also I would like to ask, when I set up a job to run in a schedule and let other jobs run after it, I find that the... 
[22:14:34] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [22:55:38] Tools: Template transclusion count tool support for non-wikipedia wikis - https://phabricator.wikimedia.org/T203962 (Ladsgroup) Is there a ticket to make transclusion count part of mediawiki? That would fix this issue.