[00:04:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[02:13:26] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[02:33:19] (HAProxyBackendUnavailable) firing: HAProxy service nova-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[03:04:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[03:13:19] (HAProxyBackendUnavailable) firing: (2) HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[03:18:19] (HAProxyBackendUnavailable) firing: (2) HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[04:02:27] cloud-services-team: nova-api seems to die after a while - https://phabricator.wikimedia.org/T354483 (Andrew)
[04:02:55] cloud-services-team: nova-api seems to die after a while, complains of a full listen queue - https://phabricator.wikimedia.org/T354483 (Andrew)
[04:03:11] cloud-services-team (FY2023/2024-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [openstack] cloudcontrols getting out of space due to nova-api.log message 'XXX lineno: 104, opcode: 120' - https://phabricator.wikimedia.org/T352635 (Andrew)
[04:03:13] cloud-services-team: nova-api seems to die after a while, complains of a full listen queue - https://phabricator.wikimedia.org/T354483 (Andrew)
[04:08:19] (HAProxyBackendUnavailable) resolved: HAProxy service nova-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[06:04:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[06:13:27] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[06:34:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[06:39:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[09:04:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[10:08:27] (OpenstackAPIResponse) resolved: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[10:08:56] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[10:13:42] (OpenstackAPIResponse) resolved: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[11:18:20] Tool-hitaden, Toolforge Build Service: [buildservice,nodejs] nodejs buildpack does not take envvars into account - https://phabricator.wikimedia.org/T353557 (Lofhi) So, I tried. ` [step-build] 2024-01-07T10:53:05.981377833Z [Installing Node.js distribution] [step-build] 2024-01-07T10:53:05.981862711Z Do...
[12:04:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[13:16:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[13:21:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[13:36:22] Tools, Privacy Engineering, Privacy: 'request' Toolforge tool uses Cloudflare CDN - https://phabricator.wikimedia.org/T354488 (LucasWerkmeister)
[13:36:29] Grid-Engine-to-K8s-Migration: Migrate request from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320003 (LucasWerkmeister) Just mentioning here that @Tkarcher has offered to adopt the tool: https://de.wikipedia.org/wiki/Benutzer_Diskussion:FNDE#Geplante_Adoption_deines_%22...
[13:46:12] Tools, Privacy Engineering, Privacy: 'request' Toolforge tool uses Cloudflare CDN - https://phabricator.wikimedia.org/T354488 (Tkarcher) I didn't even officially request access to the tool yet and already got my first task assigned. 😂 Thanks, I guess...
[13:49:57] Tools, Privacy Engineering, Privacy: 'request' Toolforge tool uses Cloudflare CDN - https://phabricator.wikimedia.org/T354488 (LucasWerkmeister) Not assigned, just CCed so far :P you can use it like “look, there’s work that someone™ needs to do” in your access request ;)
[15:04:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[17:25:05] Tools, Privacy: 'request' Toolforge tool uses Cloudflare CDN - https://phabricator.wikimedia.org/T354488 (JJMC89)
[17:25:11] Toolforge-standards-committee, Tools, Privacy Engineering, Privacy: Hunt for Toolforge tools that load resources from third party sites - https://phabricator.wikimedia.org/T172065 (JJMC89)
[17:25:14] Toolforge, Privacy Engineering, WMF-Legal, Epic, Privacy: [EPIC] Protect end-user privacy by restricting non-consensual third-party browser interactions - https://phabricator.wikimedia.org/T133919 (JJMC89)
[17:25:16] Tools, Privacy: 'request' Toolforge tool uses Cloudflare CDN - https://phabricator.wikimedia.org/T354488 (JJMC89)
[18:04:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[19:15:10] PROBLEM - Host cloudvirt1063 is DOWN: PING CRITICAL - Packet loss = 100%
[19:17:40] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1063 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[19:18:50] (PawsJupyterHubDown) firing: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[19:19:25] (NodeDown) firing: Cloudvirt node cloudvirt1063 is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1063 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[19:19:30] cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T354491 (phaultfinder)
[19:26:31] Hmm... what is paws up to
[19:27:50] cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T354491 (Andrew)
[19:27:56] Cloud-VPS, cloud-services-team (Hardware), SRE, ops-eqiad: Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (Andrew)
[19:28:03] cloud-services-team: NodeDown cloudvirt1063 - https://phabricator.wikimedia.org/T353406 (Andrew)
[19:28:05] cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T352595 (Andrew)
[19:28:09] Cloud-VPS, cloud-services-team (Hardware), SRE, ops-eqiad: Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (Andrew) Resolved→Open
[19:32:38] Cloud-VPS, cloud-services-team (Hardware), SRE, ops-eqiad: Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (Andrew) This host just died again. I've evacuated all non-canary VMs, waiting for it to cool down and restart so I can look at logs.
[19:33:50] (PawsJupyterHubDown) resolved: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[19:58:04] Grid-Engine-to-K8s-Migration: Migrate blame from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319592 (MBH) Open→Resolved I restarted webservice under k8s.
[20:00:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[20:01:01] Toolforge, Privacy Engineering, WMF-Legal, Epic, Privacy: [EPIC] Protect end-user privacy by restricting non-consensual third-party browser interactions - https://phabricator.wikimedia.org/T133919 (Frostly)
[20:05:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[21:04:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[21:14:25] (NodeDownForLong) firing: The node cloudvirt1063 has been unreachable for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1063 - https://alerts.wikimedia.org/?q=alertname%3DNodeDownForLong
[21:14:30] cloud-services-team: NodeDownForLong Node cloudvirt1063 has been down for long. - https://phabricator.wikimedia.org/T354496 (phaultfinder)
[21:16:40] (NeutronAgentDownForLong) firing: Neutron neutron-linuxbridge-agent on cloudvirt1063 has been down for more than 2h - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDownForLong
[21:16:46] cloud-services-team: NeutronAgentDownForLong A Neutron agent has been down for more than 2h, VMs will have connectivity issues - https://phabricator.wikimedia.org/T354497 (phaultfinder)
[23:21:20] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[23:26:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[23:54:31] Grid-Engine-to-K8s-Migration: Migrate request from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320003 (FNDE) Hi folks, thanks for the reminder! Sorry, I missed this conversation here. I think I'd better migrate this now :)
[23:56:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable