[00:01:37] (HAProxyBackendUnavailable) firing: (3) HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[00:01:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[00:01:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[00:02:22] (HAProxyServiceUnavailable) firing: (2) HAProxy service Abuse has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[00:02:27] cloud-services-team: HAProxyServiceUnavailable - https://phabricator.wikimedia.org/T352544 (phaultfinder)
[00:03:52] (HAProxyBackendUnavailable) resolved: (3) HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[00:11:37] (HAProxyBackendUnavailable) firing: (3) HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[00:15:09] (CR) Eugene233: [C: +2] Merge m2c branch to main [labs/tools/Isa] - https://gerrit.wikimedia.org/r/998266 (https://phabricator.wikimedia.org/T356772) (owner: Eugene233)
[00:15:36] (Merged) jenkins-bot: Merge m2c branch to main 
[labs/tools/Isa] - https://gerrit.wikimedia.org/r/998266 (https://phabricator.wikimedia.org/T356772) (owner: Eugene233)
[00:18:52] (HAProxyBackendUnavailable) firing: (3) HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[00:31:37] (HAProxyBackendUnavailable) firing: (3) HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[00:32:22] (HAProxyServiceUnavailable) firing: (3) HAProxy service Abuse has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[00:32:28] cloud-services-team: HAProxyServiceUnavailable - https://phabricator.wikimedia.org/T352544 (phaultfinder)
[00:33:52] (HAProxyBackendUnavailable) firing: (3) HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[00:33:56] (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[00:37:22] (HAProxyServiceUnavailable) firing: (3) HAProxy service Abuse has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[00:38:56] (SystemdUnitDown) firing: (3) The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[00:41:37] (HAProxyBackendUnavailable) resolved: (3) HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[00:42:22] (HAProxyServiceUnavailable) firing: (3) HAProxy service Abuse has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[00:43:56] (SystemdUnitDown) firing: (3) The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[00:47:22] (HAProxyServiceUnavailable) resolved: (3) HAProxy service Abuse has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[00:52:52] (HAProxyBackendUnavailable) firing: (2) HAProxy service neutron-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[00:55:49] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[00:57:05] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[00:57:52] (HAProxyBackendUnavailable) resolved: (2) HAProxy service neutron-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[01:06:07] (HAProxyBackendUnavailable) firing: (3) HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[01:28:56] (SystemdUnitDown) firing: (5) The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[01:31:07] (HAProxyBackendUnavailable) resolved: (3) HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[01:32:23] (HAProxyServiceUnavailable) resolved: (2) HAProxy service neutron-api_backend has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[01:32:32] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[01:32:52] (NeutronAgentDown) firing: (51) Neutron neutron-linuxbridge-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[01:32:59] (MetricsinfraAlertmanagerDown) resolved: Metricsinfra alertmanager is unreachable #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/MetricsinfraAlertmanagerDown - TODO - https://alerts.wikimedia.org/?q=alertname%3DMetricsinfraAlertmanagerDown
[01:37:15] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[01:38:56] (SystemdUnitDown) resolved: The service unit rabbitmq_detect_partition.service is in failed status on host cloudrabbit1003. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudrabbit1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[01:58:45] (NovafullstackSustainedFailures) resolved: Novafullstack tests have been failing for more than 5 hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures
[01:59:21] (NeutronAgentDown) resolved: (51) Neutron neutron-linuxbridge-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[02:47:31] (ToolsToolsDBReplicationLagIsTooHigh) firing: ToolsDB replication on tools-db-2 is lagging behind the primary, the current lag is 36074 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh
[03:01:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[03:01:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[03:12:01] (OpenstackAPIResponse) firing: Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[03:21:26] (SystemdUnitDown) resolved: The systemd unit backup_vms.service on node cloudbackup1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[04:37:00] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[04:47:00] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[05:47:31] (ToolsToolsDBReplicationLagIsTooHigh) firing: ToolsDB replication on tools-db-2 is lagging behind the primary, the current lag is 46874 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh
[06:01:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[06:01:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[06:14:09] Toolforge Build Service: [apt-buildpack] Need local Ubuntu mirror or package cache - https://phabricator.wikimedia.org/T357251 (tstarling)
[06:16:28] Grid-Engine-to-K8s-Migration: Migrate zoomviewer from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320210 (tstarling) This is basically done, but performance seems very bad. Please test to confirm that it's not just me. The server is not showing significant load while I...
[07:25:30] PAWS: New upstream release for OpenRefine - https://phabricator.wikimedia.org/T356448 (LibUp-bot) A new upstream version of OpenRefine is now available: 3.7.9. * https://github.com/OpenRefine/OpenRefine/releases/tag/3.7.9
[07:46:24] (CR) Eugene233: [C: +2] "Thank you so much for this fix." [labs/tools/Isa] - https://gerrit.wikimedia.org/r/998479 (owner: Juniorbesong)
[07:46:50] (Merged) jenkins-bot: BUG: T355466. Solve cannot import name "url_decode" from "werkzeug.urls" [labs/tools/Isa] - https://gerrit.wikimedia.org/r/998479 (owner: Juniorbesong)
[07:52:01] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[07:55:03] (PS1) Eugene233: SQL statement for pre-ping does not execute [labs/tools/Isa] - https://gerrit.wikimedia.org/r/1000297 (https://phabricator.wikimedia.org/T355983)
[07:57:49] (CR) Eugene233: [C: +2] "Basic fix needs urgent testing." 
[labs/tools/Isa] - https://gerrit.wikimedia.org/r/1000297 (https://phabricator.wikimedia.org/T355983) (owner: Eugene233)
[07:58:13] (Merged) jenkins-bot: SQL statement for pre-ping does not execute [labs/tools/Isa] - https://gerrit.wikimedia.org/r/1000297 (https://phabricator.wikimedia.org/T355983) (owner: Eugene233)
[08:07:01] (OpenstackAPIResponse) resolved: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[08:07:31] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[08:11:24] cloud-services-team, wikitech.wikimedia.org: Upgrade cloudweb hosts to Bullseye - https://phabricator.wikimedia.org/T356966 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1002 for host cloudweb1003.wikimedia.org with OS bullseye
[08:27:59] Cloud-VPS: Automatically install Node.js on cloud instances - https://phabricator.wikimedia.org/T356441 (taavi) Open→Declined No, let's not pull Node, NPM and its hundreds of dependencies to all the instances where it would not be used in most of them. > While this might be straightforward for some...
[08:31:31] Cloud-VPS, MediaWiki-Vagrant: Update Vagrant puppet role to work on Bookworm. 
- https://phabricator.wikimedia.org/T356551 (taavi) a:taavi
[08:33:09] Grid-Engine-to-K8s-Migration: Migrate zoomviewer from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320210 (Tacsipacsi) The full page load on https://zoomviewer.toolforge.org/index.php?f=Seattle+7.jpg&flash=no took 48 seconds for me as well, but it didn’t feel very long –...
[08:33:19] Cloud-VPS, cloud-services-team: Gather feedback from users of the 'unmanaged' debian-12.0-nopuppet image - https://phabricator.wikimedia.org/T355963 (taavi)
[08:47:31] (ToolsToolsDBReplicationLagIsTooHigh) firing: ToolsDB replication on tools-db-2 is lagging behind the primary, the current lag is 57674 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh
[08:51:41] Cloud-VPS, MediaWiki-Vagrant, Patch-For-Review: Update Vagrant puppet role to work on Bookworm. - https://phabricator.wikimedia.org/T356551 (taavi) The above patch fixes the Puppet provisioning error, however the vagrant-lxc plugin seems to be broken: `lines=15 taavi@taavi-vagrant:/srv/mediawiki-vag...
[08:54:20] cloud-services-team, wikitech.wikimedia.org: Upgrade cloudweb hosts to Bullseye - https://phabricator.wikimedia.org/T356966 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1002 for host cloudweb1003.wikimedia.org with OS bullseye completed: - cloudweb1003 (**PASS**) - Do... 
[08:59:14] cloud-services-team, wikitech.wikimedia.org: Upgrade cloudweb hosts to Bullseye - https://phabricator.wikimedia.org/T356966 (taavi)
[09:01:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[09:01:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[09:12:45] (OpenstackAPIResponse) resolved: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[09:13:45] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[09:22:31] (OpenstackAPIResponse) resolved: Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[09:23:15] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance, User-dcaro: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2024-02-12 - https://phabricator.wikimedia.org/T357264 (dcaro) p:Triage→High
[09:23:19] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance, User-dcaro: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2024-02-12 - https://phabricator.wikimedia.org/T357264 (dcaro) Open→In progress
[09:27:06] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance, User-dcaro: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2024-02-12 - https://phabricator.wikimedia.org/T357264 (dcaro)
[09:30:40] Cloud-VPS, cloud-services-team: Move cloudcontrol memcached flows to cloud-private - https://phabricator.wikimedia.org/T355417 (taavi) Open→Resolved
[09:31:35] cloud-services-team, wikitech.wikimedia.org, Trust-and-Safety: Account recovery help needed for Developer account Adamham - https://phabricator.wikimedia.org/T348663 (taavi) Any news here? 
[09:32:58] cloud-services-team, Bitu, Infrastructure-Foundations, LDAP, User-MoritzMuehlenhoff: Allocate more available UNIX UIDs for human users - https://phabricator.wikimedia.org/T355663 (MoritzMuehlenhoff)
[09:36:08] Grid-Engine-to-K8s-Migration, Toolforge: "My first Buildpack .NET tool" manual doesn't work due to ERR_CERT_INVALID - https://phabricator.wikimedia.org/T357206 (dcaro) Yep, on toolforge the https endpoint is managed by the proxy, the webservices themselves just have to listen on port `$PORT` using http :)
[09:42:18] Grid-Engine-to-K8s-Migration, Toolforge: Tool user not allowed to read jobs/status in Kubernetes - https://phabricator.wikimedia.org/T357172 (dcaro) Note that there's no stability or availability assurance for any of the k8s APIs. I understand they are way more powerful than the APIs/abstractions that we...
[09:42:43] Toolforge, cloud-services-team, Documentation, Kubernetes: Figure out and document how to call the Kubernetes API as your tool user from inside a pod - https://phabricator.wikimedia.org/T321919 (dcaro) Note that there's no stability or availability assurance for any of the k8s APIs. I understand...
[09:45:25] Toolforge (Quota-requests): Request increased memory quota for wd-shex-infer Toolforge tool - https://phabricator.wikimedia.org/T357209 (dcaro) +1
[09:45:28] Grid-Engine-to-K8s-Migration: Migrate zoomviewer from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320210 (tstarling) It's really not a lot of data -- a previous maintainer turned the JPEG quality down to 50. In Chromium with a viewport width of 1373, reloading with cache... 
[09:46:37] Toolforge Build Service: [apt-buildpack] Add local Ubuntu mirror or package cache - https://phabricator.wikimedia.org/T357251 (dcaro)
[09:46:49] Toolforge Build Service: [apt-buildpack] Add local Ubuntu mirror or package cache - https://phabricator.wikimedia.org/T357251 (dcaro) p:Triage→Medium
[09:46:56] Toolforge Build Service: [apt-buildpack] Add local Ubuntu mirror or package cache - https://phabricator.wikimedia.org/T357251 (dcaro) p:Medium→Low
[09:47:30] Grid-Engine-to-K8s-Migration, Toolforge: Tool user not allowed to read jobs/status in Kubernetes - https://phabricator.wikimedia.org/T357172 (dcaro) p:Triage→Low
[09:50:03] cloud-services-team: MetricsinfraAlertmanagerDown - https://phabricator.wikimedia.org/T357248 (dcaro) This was a hiccup on neutron side: ` 02:45:13 Andrew Bogott Ok, quick wrap-up: It was not a denial of service. Neutron was in a split-brained state which meant it timed out on many operation...
[09:50:12] cloud-services-team: MetricsinfraAlertmanagerDown - https://phabricator.wikimedia.org/T357248 (dcaro) Open→Resolved a:dcaro
[09:51:12] cloud-services-team: HAProxyServiceUnavailable - https://phabricator.wikimedia.org/T352544 (dcaro) Open→Resolved a:dcaro This was a restart due to cloudrabbit instability, making neutron unstable: ` 02:45:13 Andrew Bogott Ok, quick wrap-up: It was not a denial of service. Neutron...
[09:51:27] cloud-services-team: CRITICAL - degraded: The following units failed: check-private-data.service on clouddb1015, 1019, 1021 - https://phabricator.wikimedia.org/T355953 (taavi) Open→Resolved ` Feb 12 05:04:51 clouddb1015 systemd[1]: check-private-data.service: Succeeded. `
[09:58:08] cloud-services-team: SystemdUnitDown Unit backup_vms.service on node cloudbackup1004 has been down for long. 
- https://phabricator.wikimedia.org/T357244 (dcaro) This seems due to the same neutron outage yesterday: ` Feb 11 17:01:37 cloudbackup1004 wmcs-backup[28127]: Grid-Engine-to-K8s-Migration, Toolforge (Toolforge iteration 05): Tool user not allowed to read jobs/status in Kubernetes - https://phabricator.wikimedia.org/T357172 (taavi) a:taavi
[09:58:23] cloud-services-team: SystemdUnitDown Unit backup_vms.service on node cloudbackup1004 has been down for long. - https://phabricator.wikimedia.org/T357244 (dcaro) Open→Resolved a:dcaro
[10:00:53] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance, User-dcaro: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2024-02-12 - https://phabricator.wikimedia.org/T357264 (dcaro) Previous instance of this {T355411}
[10:02:12] Grid-Engine-to-K8s-Migration, Toolforge (Toolforge iteration 05), Patch-For-Review: Tool user not allowed to read jobs/status in Kubernetes - https://phabricator.wikimedia.org/T357172 (CodeReviewBot) taavi opened https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/1...
[10:21:49] Grid-Engine-to-K8s-Migration, Toolforge (Toolforge iteration 05), Patch-For-Review: Tool user not allowed to read jobs/status in Kubernetes - https://phabricator.wikimedia.org/T357172 (CodeReviewBot) taavi merged https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/1...
[10:22:15] cloud-services-team, wikitech.wikimedia.org, Trust-and-Safety: Account recovery help needed for Developer account Adamham - https://phabricator.wikimedia.org/T348663 (Nahid) Open→Declined We have closed the ticket on T&S' end as we were not successful in confirming the identity. I will close... 
[10:32:32] Grid-Engine-to-K8s-Migration, Toolforge (Toolforge iteration 05): Tool user not allowed to read jobs/status in Kubernetes - https://phabricator.wikimedia.org/T357172 (CodeReviewBot) taavi merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/189 maintain-kubeusers:...
[10:32:57] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers
[10:33:09] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers
[10:33:15] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers
[10:33:28] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers
[10:33:46] Grid-Engine-to-K8s-Migration, Toolforge (Toolforge iteration 05): Tool user not allowed to read jobs/status in Kubernetes - https://phabricator.wikimedia.org/T357172 (taavi) Open→Resolved
[10:33:48] Grid-Engine-to-K8s-Migration: Migrate wd-shex-infer from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320140 (taavi)
[10:48:45] !log sstefanova@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-api
[10:49:00] !log sstefanova@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component builds-api
[10:50:58] !log sstefanova@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-api
[10:51:11] !log sstefanova@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component builds-api
[10:55:16] Toolforge (Toolforge iteration 05), Toolforge Build Service: `build quota` fails if tool has no builds - https://phabricator.wikimedia.org/T353701 (Slst2020) In 
progress→Resolved
[10:56:18] Toolforge (Toolforge iteration 05): [Toolforge CLI consolidation] Explore OpenAPI tooling - https://phabricator.wikimedia.org/T356261 (Slst2020) a:Slst2020
[10:57:19] Toolforge (Toolforge iteration 05), Toolforge Build Service: `build quota` fails if tool has no builds - https://phabricator.wikimedia.org/T353701 (taavi) @Slst2020 this does not seem resolved to me? I can still reproduce the issue and there are no patches attached to this task.
[11:14:34] Toolforge (Toolforge iteration 05), Toolforge Build Service: `build quota` fails if tool has no builds - https://phabricator.wikimedia.org/T353701 (CodeReviewBot) sstefanova updated https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/77 quota: show an error if project does not...
[11:18:37] Toolforge (Toolforge iteration 05), Toolforge Build Service: `build quota` fails if tool has no builds - https://phabricator.wikimedia.org/T353701 (Slst2020) >>! In T353701#9533076, @taavi wrote: > @Slst2020 this does not seem resolved to me? I can still reproduce the issue and there are no patches attac...
[11:19:50] Toolforge (Toolforge iteration 05), Toolforge Build Service: `build quota` fails if tool has no builds - https://phabricator.wikimedia.org/T353701 (taavi) Ok, then this task should be stalled and have the robot account permissions task added as a subtask, instead of being marked as Resolved. 
[11:21:22] Toolforge (Toolforge iteration 05), Toolforge Build Service: `build quota` fails if tool has no builds - https://phabricator.wikimedia.org/T353701 (Slst2020) Resolved→Stalled
[11:22:42] Toolforge (Toolforge iteration 05), Toolforge Build Service: `build quota` fails if tool has no builds - https://phabricator.wikimedia.org/T353701 (Slst2020)
[11:22:46] Toolforge (Toolforge iteration 05), Toolforge Build Service: [harbor] upgrade to 2.10.x - https://phabricator.wikimedia.org/T354507 (Slst2020)
[12:01:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[12:01:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[12:13:38] Cloud-VPS, cloud-services-team, Patch-For-Review, User-aborrero: Some VPS instances still using ns-recursor0 - https://phabricator.wikimedia.org/T346426 (taavi) I think we can remove the redirects here. If someone has Puppet broken for months and did not react to the cloud-announce email when thi... 
[12:15:23] Cloud-VPS, cloud-services-team: Use cloud-private and cfssl certs for instance live migrations - https://phabricator.wikimedia.org/T355145 (taavi) p:Triage→Low
[12:15:33] Cloud-VPS, cloud-services-team: Move Cloud VPS internal flows from cloud-hosts to cloud-private - https://phabricator.wikimedia.org/T355416 (taavi) p:Triage→Medium
[12:20:00] Cloud-VPS, cloud-services-team (FY2023/2024-Q3-Q4), Epic, Goal, User-aborrero: openstack eqiad1: introduce cloud-private and cloudlb - https://phabricator.wikimedia.org/T341060 (taavi)
[12:21:27] Cloud-VPS, cloud-services-team, Patch-For-Review, User-aborrero: Some VPS instances still using ns-recursor0 - https://phabricator.wikimedia.org/T346426 (aborrero) >>! In T346426#9533557, @taavi wrote: > I think we can remove the redirects here. If someone has Puppet broken for months and did not...
[12:22:51] Toolforge, cloud-services-team, Documentation, Kubernetes: Figure out and document how to call the Kubernetes API as your tool user from inside a pod - https://phabricator.wikimedia.org/T321919 (Anomie) Provide something better that fits the requirements and I'll look at using it. Last I've heard...
[12:24:41] cloud-services-team: NovafullstackSustainedFailures The automated tests were unable to create, provision and decommission a VM in the last 5h - https://phabricator.wikimedia.org/T357234 (dcaro) Open→Resolved a:dcaro This is running again, might be related to the neutron outage, same as {T357244}
[12:25:25] Toolforge, cloud-services-team: Elasticsearch credential request for capacity-exchange - https://phabricator.wikimedia.org/T357227 (taavi)
[12:34:42] (CR) CI reject: [V: -1] Localisation updates from https://translatewiki.net. 
[labs/tools/map-of-monuments] - https://gerrit.wikimedia.org/r/1002458 (owner: L10n-bot)
[12:35:55] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-52
[12:36:32] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-52
[12:36:38] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-53
[12:37:14] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-53
[12:37:20] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster
[12:38:04] Toolforge, cloud-services-team, Documentation, Kubernetes: Figure out and document how to call the Kubernetes API as your tool user from inside a pod - https://phabricator.wikimedia.org/T321919 (dcaro) >>! In T321919#9533573, @Anomie wrote: > Provide something better that fits the requirements an... 
[12:44:51] !log taavi@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a worker-nfs role in the tools cluster [12:45:59] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.vps.remove_instance for instance tools-k8s-worker-nfs-15 [12:46:14] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance tools-k8s-worker-nfs-15 [12:46:23] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster [12:55:04] 10PAWS: New upstream release for OpenRefine - https://phabricator.wikimedia.org/T356448 (10github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/374 [12:55:23] vivian-rook opened https://github.com/toolforge/paws/pull/374 [12:56:03] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-15.tools.eqiad1.wikimedia.cloud to the cluster [12:56:03] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster [12:57:46] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-54 [12:58:22] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-54 [12:58:37] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-55 [12:59:13] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-55 [12:59:25] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster [13:04:03] (03Abandoned) 10Kosta Harlan: Link to merging patches docs and add as first step [labs/tools/deploy-commands] - 10https://gerrit.wikimedia.org/r/720741 (owner: 10Kosta Harlan) [13:04:08] (03Abandoned) 10Kosta Harlan: Link to docs about verifying 
on mwdebug [labs/tools/deploy-commands] - 10https://gerrit.wikimedia.org/r/720742 (owner: 10Kosta Harlan) [13:09:13] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-16.tools.eqiad1.wikimedia.cloud to the cluster [13:09:13] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster [13:09:23] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-56 [13:10:00] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-56 [13:10:08] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-57 [13:10:46] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-57 [13:12:03] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster [13:12:41] (CloudVPSDesignateLeaks) firing: Detected 6 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [13:16:28] (InstanceDown) firing: Project tools instance tools-k8s-worker-56 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [13:17:41] (CloudVPSDesignateLeaks) firing: (2) Detected 6 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [13:21:28] (InstanceDown) resolved: Project tools instance tools-k8s-worker-56 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [13:22:26] 
!log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-17.tools.eqiad1.wikimedia.cloud to the cluster [13:22:26] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster [13:23:50] 10Toolforge Jobs framework, 10User-aborrero: Support tool-internal networking - https://phabricator.wikimedia.org/T348758 (10aborrero) [13:27:24] 10Toolforge Jobs framework, 10cloud-services-team, 10User-Raymond_Ndibe: Toolforge jobs framework: introduce swagger to the API - https://phabricator.wikimedia.org/T327279 (10aborrero) [13:29:24] 10Toolforge (Toolforge iteration 05), 10Toolforge Jobs framework, 10Patch-For-Review, 10User-aborrero: toolforge: introduce OpenAPI to jobs framework - https://phabricator.wikimedia.org/T356523 (10aborrero) [13:29:48] 10Toolforge Jobs framework, 10cloud-services-team, 10User-aborrero: Toolforge: consider introducing a command line for creating reverse proxies - https://phabricator.wikimedia.org/T337191 (10aborrero) [13:29:50] 10Toolforge Jobs framework, 10cloud-services-team, 10User-aborrero: Toolforge: consider introducing a command line for creating reverse proxies - https://phabricator.wikimedia.org/T337191 (10aborrero) [13:32:52] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-58 [13:33:39] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-58 [13:33:53] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-59 [13:34:13] 10Toolforge, 10cloud-services-team, 10User-aborrero: Toolforge: consider introducing a command line for creating reverse proxies - https://phabricator.wikimedia.org/T337191 (10aborrero) [13:34:32] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-59 [13:34:51] !log 
taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster [13:40:28] (InstanceDown) firing: Project tools instance tools-k8s-worker-59 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [13:43:55] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-18.tools.eqiad1.wikimedia.cloud to the cluster [13:43:55] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster [13:45:28] (InstanceDown) resolved: Project tools instance tools-k8s-worker-59 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [13:46:30] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-60 [13:47:06] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-60 [13:50:22] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [13:54:08] 10Toolforge (Toolforge iteration 05), 10Toolforge Build Service, 10Patch-For-Review, 10User-Raymond_Ndibe: [builds-api] refactor build start response type - https://phabricator.wikimedia.org/T356724 (10CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge... 
[13:55:22] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [14:03:32] 10Toolforge (Toolforge iteration 05), 10Toolforge Build Service, 10Patch-For-Review, 10User-Raymond_Ndibe: [builds-api] refactor build start response type - https://phabricator.wikimedia.org/T356724 (10CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge... [14:10:31] 10Toolforge (Toolforge iteration 05), 10Toolforge Build Service, 10Patch-For-Review, 10User-Raymond_Ndibe: [builds-api] refactor build start response type - https://phabricator.wikimedia.org/T356724 (10CodeReviewBot) project_1317_bot_df3177307bed93c3f34e421e26c86e38 opened https://gitlab.wikimedia.org/repo... [14:25:20] 10Toolforge (Toolforge iteration 05), 10Toolforge Build Service, 10Patch-For-Review, 10User-Raymond_Ndibe: [builds-api] refactor build start response type - https://phabricator.wikimedia.org/T356724 (10CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge... [14:26:10] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-61 [14:26:52] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-61 [14:35:55] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster [14:39:24] (03CR) 10Nikerabbit: [V: 03+2] Localisation updates from https://translatewiki.net. [labs/tools/map-of-monuments] - 10https://gerrit.wikimedia.org/r/1002458 (owner: 10L10n-bot) [14:42:18] (03CR) 10Jforrester: [C: 03+2] "Yeah, we should finish the stylelint 16 upgrade for stylelint-config-wikimedia. Thanks!" 
[labs/libraryupgrader/config] - 10https://gerrit.wikimedia.org/r/999115 (owner: 10Majavah) [14:42:55] (03Merged) 10jenkins-bot: Bump stylelint to 15.10.1 [labs/libraryupgrader/config] - 10https://gerrit.wikimedia.org/r/999115 (owner: 10Majavah) [14:47:16] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-19.tools.eqiad1.wikimedia.cloud to the cluster [14:47:16] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster [14:47:40] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.remove_k8s_node for host tools-k8s-worker-62 [14:48:19] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.remove_k8s_node (exit_code=0) for host tools-k8s-worker-62 [14:48:39] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the tools cluster [14:58:23] !log taavi@cloudcumin1001 tools Added a new k8s worker-nfs tools-k8s-worker-nfs-20.tools.eqiad1.wikimedia.cloud to the cluster [14:58:24] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the tools cluster [15:01:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [15:01:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [15:04:30] 10Toolforge (Toolforge iteration 05), 10Toolforge Build Service, 10Patch-For-Review, 10User-Raymond_Ndibe: alert users when they are about to exceed their harbor quota - https://phabricator.wikimedia.org/T353535 
(10CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/build... [15:29:40] 10cloud-services-team, 10Observability-Alerting, 10SRE Observability (FY2023/2024-Q3): Karma UI shows duplicate alerts - https://phabricator.wikimedia.org/T353457 (10joanna_borun) [15:31:02] 10Cloud-VPS, 10cloud-services-team, 10Infrastructure-Foundations, 10netbox: Netbox device location information not available on the first Puppet run of a device - https://phabricator.wikimedia.org/T347375 (10joanna_borun) p:05Triage→03Medium [15:36:28] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Infrastructure-Foundations: Remove wmcs-admin access from production cumin hosts - https://phabricator.wikimedia.org/T347979 (10MoritzMuehlenhoff) p:05Triage→03Low [15:36:58] 10cloud-services-team, 10Bitu, 10Infrastructure-Foundations, 10LDAP, 10User-MoritzMuehlenhoff: Allocate more available UNIX UIDs for human users - https://phabricator.wikimedia.org/T355663 (10MoritzMuehlenhoff) p:05Triage→03Low [15:42:15] 10PAWS: Upgrade Jupyterlab - https://phabricator.wikimedia.org/T357027 (10rook) 05Resolved→03Open [15:42:43] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudnet.reboot_node [15:43:23] !log taavi@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudnet.reboot_node (exit_code=99) [15:46:09] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot [15:46:30] 10PAWS: Upgrade Jupyterlab - https://phabricator.wikimedia.org/T357027 (10rook) notebook looks like it is downgrading jupyterlab. Notebook is upgrading to 7.1, but is not quite there. Until then it requires jupyterlab of less than 4.1.0. We can wait a little while to see if it resolves. [15:50:30] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) [15:51:44] (InterfaceSpeedError) firing: brq05a5494a-18 on cloudvirt2001-dev:9100 has the wrong speed: 1.25e+06. 
- https://wikitech.wikimedia.org/wiki/Monitoring/check_eth - https://grafana.wikimedia.org/d/000000562 - https://alerts.wikimedia.org/?q=alertname%3DInterfaceSpeedError [15:51:49] 10cloud-services-team: InterfaceSpeedError brq05a5494a-18 on cloudvirt2001-dev:9100 has the wrong speed: 1.25e+06. - https://phabricator.wikimedia.org/T357319 (10phaultfinder) [15:52:42] 10cloud-services-team: InterfaceSpeedError brq05a5494a-18 on cloudvirt2001-dev:9100 has the wrong speed: 1.25e+06. - https://phabricator.wikimedia.org/T357319 (10taavi) a:03taavi Looking as I just rebooted this host. [15:54:46] 10cloud-services-team, 10SRE: ceph: test and decide 1 network interface setup - https://phabricator.wikimedia.org/T325531 (10joanna_borun) [15:56:41] 10PAWS: New upstream release for OpenRefine - https://phabricator.wikimedia.org/T356448 (10github-toolforge-bot) vivian-rook closed https://github.com/toolforge/paws/pull/374 [15:56:44] (InterfaceSpeedError) resolved: brq05a5494a-18 on cloudvirt2001-dev:9100 has the wrong speed: 1.25e+06. - https://wikitech.wikimedia.org/wiki/Monitoring/check_eth - https://grafana.wikimedia.org/d/000000562 - https://alerts.wikimedia.org/?q=alertname%3DInterfaceSpeedError [15:56:51] vivian-rook closed https://github.com/toolforge/paws/pull/374 [15:57:25] 10PAWS: New upstream release for OpenRefine - https://phabricator.wikimedia.org/T356448 (10rook) 05Open→03Resolved a:03rook [15:57:46] 10cloud-services-team: InterfaceSpeedError brq05a5494a-18 on cloudvirt2001-dev:9100 has the wrong speed: 1.25e+06. - https://phabricator.wikimedia.org/T357319 (10taavi) 05Open→03Resolved It fixed itself. 
[15:59:07] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot [16:04:32] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) [16:10:32] 10Toolforge (Toolforge iteration 05), 10Toolforge Jobs framework, 10Patch-For-Review, 10User-aborrero: toolforge: introduce OpenAPI to jobs framework - https://phabricator.wikimedia.org/T356523 (10aborrero) Out of curiosity, I generated the server code using https://openapi-generator.tech/ , and I got this... [16:14:49] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot [16:20:53] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) [16:21:02] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot [16:25:35] 10Toolforge (Toolforge iteration 05), 10Toolforge Build Service, 10Patch-For-Review, 10User-Raymond_Ndibe: [builds-api] refactor build start response type - https://phabricator.wikimedia.org/T356724 (10CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge... [16:26:57] 10Toolforge Build Service, 10Patch-For-Review: builds-cli utils/bump_version.sh fails with '--userns: invalid USER mode.' - https://phabricator.wikimedia.org/T354876 (10CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/56 d/changelog: bump to 0.... 
[16:27:01] 10Toolforge (Toolforge iteration 04), 10Patch-For-Review: [ci] Add shellcheck to pre-commit where missing - https://phabricator.wikimedia.org/T353052 (10CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/56 d/changelog: bump to 0.0.13 [16:27:09] 10Toolforge (Toolforge iteration 05), 10Toolforge Build Service, 10Patch-For-Review, 10User-Raymond_Ndibe: [builds-api] refactor build start response type - https://phabricator.wikimedia.org/T356724 (10CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge... [16:27:11] 10Toolforge Build Service, 10Patch-For-Review: builds-cli utils/bump_version.sh fails with '--userns: invalid USER mode.' - https://phabricator.wikimedia.org/T354876 (10CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/56 d/changelog: bump to 0.... [16:27:16] 10Toolforge (Toolforge iteration 04), 10Patch-For-Review: [ci] Add shellcheck to pre-commit where missing - https://phabricator.wikimedia.org/T353052 (10CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/56 d/changelog: bump to 0.0.13 [16:29:55] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) [16:30:09] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot [16:36:52] !log taavi@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) [16:39:22] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot [16:39:45] !log taavi@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) [16:52:02] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot [16:53:36] 10Tool-gitlab-account-approval: "LDAPInvalidFilterError: malformed filter" error 
checking user https://gitlab.wikimedia.org/haak - https://phabricator.wikimedia.org/T357328 (10bd808) [17:03:57] 10Toolforge (Toolforge iteration 05), 10Toolforge Build Service, 10Patch-For-Review, 10User-Raymond_Ndibe: alert users when they are about to exceed their harbor quota - https://phabricator.wikimedia.org/T353535 (10CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolf... [17:04:01] 10Toolforge (Toolforge iteration 05), 10Toolforge Build Service, 10Patch-For-Review, 10User-Raymond_Ndibe: [builds-api] refactor build start response type - https://phabricator.wikimedia.org/T356724 (10CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-... [17:04:38] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) [17:06:08] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot [17:07:51] 10Toolforge (Toolforge iteration 05), 10Toolforge Build Service, 10Patch-For-Review, 10User-Raymond_Ndibe: alert users when they are about to exceed their harbor quota - https://phabricator.wikimedia.org/T353535 (10Raymond_Ndibe) [17:07:57] 10Toolforge Build Service, 10Documentation: [tbs] Improve Harbor quota handling and docs - https://phabricator.wikimedia.org/T351092 (10Raymond_Ndibe) [17:08:33] 10Toolforge (Toolforge iteration 05), 10Toolforge Build Service, 10Patch-For-Review, 10User-Raymond_Ndibe: [builds-api] refactor build start response type - https://phabricator.wikimedia.org/T356724 (10Raymond_Ndibe) 05In progress→03Resolved [17:09:10] 10Toolforge (Toolforge iteration 05), 10Toolforge Build Service, 10Patch-For-Review, 10User-Raymond_Ndibe: alert users when they are about to exceed their harbor quota - https://phabricator.wikimedia.org/T353535 (10Raymond_Ndibe) 05In progress→03Resolved [17:10:12] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) 
[17:17:41] (CloudVPSDesignateLeaks) firing: (2) Detected 44 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [17:21:42] !log taavi@runko admin START - Cookbook wmcs.openstack.cloudnet.reboot_node [17:21:43] 10Tool-gitlab-account-approval, 10Patch-For-Review, 10User-bd808: "LDAPInvalidFilterError: malformed filter" error checking user https://gitlab.wikimedia.org/haak - https://phabricator.wikimedia.org/T357328 (10CodeReviewBot) bd808 opened https://gitlab.wikimedia.org/toolforge-repos/gitlab-account-approval/-/... [17:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [17:22:09] 10Grid-Engine-to-K8s-Migration: Migrate phetools from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319965 (10Soda) I'm looking into migrating some of the usable aspects (statistics and match + split) of phetools into seperate standalone tools. This might take a while however,... [17:22:57] 10Tool-gitlab-account-approval, 10Patch-For-Review, 10User-bd808: "LDAPInvalidFilterError: malformed filter" error checking user https://gitlab.wikimedia.org/haak - https://phabricator.wikimedia.org/T357328 (10CodeReviewBot) bd808 merged https://gitlab.wikimedia.org/toolforge-repos/gitlab-account-approval/-/... 
[17:23:02] !log taavi@runko admin END (FAIL) - Cookbook wmcs.openstack.cloudnet.reboot_node (exit_code=99) [17:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [17:25:08] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudnet.reboot_node [17:28:00] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudnet.reboot_node (exit_code=0) [17:30:08] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudnet.reboot_node [17:33:24] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudnet.reboot_node (exit_code=0) [17:37:03] 10Tool-gitlab-account-approval, 10User-bd808: "LDAPInvalidFilterError: malformed filter" error checking user https://gitlab.wikimedia.org/haak - https://phabricator.wikimedia.org/T357328 (10bd808) 05In progress→03Resolved [17:40:00] (NovafullstackSustainedFailures) firing: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures [17:40:05] 10cloud-services-team: NovafullstackSustainedFailures The automated tests were unable to create, provision and decommission a VM in the last 5h - https://phabricator.wikimedia.org/T357335 (10phaultfinder) [17:58:46] 10Tool-Global-user-contributions, 10Stewards-and-global-tools, 10Temporary accounts, 10XTools, 10Epic: Investigate: How to make the GUC query performant - https://phabricator.wikimedia.org/T355672 (10Tchanders) Thanks @MusikAnimal, this is really helpful. Noting down some thoughts following a conversati... 
[18:01:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [18:01:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [18:24:45] 10Toolforge: rm'ing a specific file on NFS hangs on (dev|login).toolforge.org - https://phabricator.wikimedia.org/T357340 (10Count_Count) [18:29:01] (ToolsToolsDBReplicationLagIsTooHigh) resolved: ToolsDB replication on tools-db-2 is lagging behind the primary, the current lag is 3661 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh [18:29:40] 10Tool-Global-user-contributions, 10Stewards-and-global-tools, 10Temporary accounts, 10XTools, 10Epic: Investigate: How to make the GUC query performant - https://phabricator.wikimedia.org/T355672 (10MusikAnimal) I didn't elaborate on IP ranges, but doing that is pretty fast as-is, by simply querying `ip... [18:33:20] 10Tool-Pageviews: Massviews is creating URLs which cannot be used - https://phabricator.wikimedia.org/T357087 (10MusikAnimal) p:05Triage→03High [18:33:49] 10Tool-Pageviews: Massviews is creating URLs which cannot be used - https://phabricator.wikimedia.org/T357087 (10MusikAnimal) >>! In T357087#9531455, @Vahurzpu wrote: > I'm having trouble setting up a dev environment on my local machine, but I'm fairly confident that the problem here is with https://github.com/M... 
[18:41:46] 10Grid-Engine-to-K8s-Migration: Migrate women-in-red from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320183 (10Ragesoss) @dcaro I just disabled the cron. [18:42:00] 10Grid-Engine-to-K8s-Migration, 10Growth-Team: Migrate ERANBOT project off of Grid Engine - https://phabricator.wikimedia.org/T306888 (10MusikAnimal) >>! In T306888#9531153, @eranroz wrote: > Beside copyright bot /copypatrol /plagia bot - all jobs of the bot were moved to new toolforge-jobs . > I think we can... [18:46:17] 10Toolforge, 10cloud-services-team: [tools.meta] can't delete file inside cache/wikimedia-wikis.dat - https://phabricator.wikimedia.org/T357098 (10bd808) [18:53:52] 10Data-Services, 10cloud-services-team (FY2023/2024-Q3-Q4): [toolsdb] set gtid_domain_id to 0 - https://phabricator.wikimedia.org/T357341 (10fnegri) [18:54:00] 10Toolforge: rm'ing a specific file on NFS hangs on (dev|login).toolforge.org - https://phabricator.wikimedia.org/T357340 (10Count_Count) [18:54:43] 10Data-Services, 10cloud-services-team: ToolsDB: discard obsolete GTID domains - https://phabricator.wikimedia.org/T334947 (10fnegri) [18:54:46] 10Data-Services, 10cloud-services-team (FY2023/2024-Q3-Q4): [toolsdb] set gtid_domain_id to 0 - https://phabricator.wikimedia.org/T357341 (10fnegri) [18:55:38] 10Toolforge: rm'ing a specific file on NFS hangs on (dev|login).toolforge.org - https://phabricator.wikimedia.org/T357340 (10bd808) ` $ ssh root@tools-nfs-2.tools.eqiad1.wikimedia.cloud $ cd /srv/tools/project/xlinks $ file xlinks xlinks: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linke... [18:59:45] 10Toolforge: rm'ing a specific file on NFS hangs on (dev|login).toolforge.org - https://phabricator.wikimedia.org/T357340 (10bd808) Things seem to hang in the same way as {T357098}: `lang=shell-session root@tools-nfs-2:/srv/tools/project/xlinks# rm xlinks & [1] 3894371 root@tools-nfs-2:/srv/tools/project/xlinks#... 
[19:01:40] 10Data-Services, 10cloud-services-team (FY2023/2024-Q3-Q4), 10Goal: [toolsdb] test creating a new replica host - https://phabricator.wikimedia.org/T344717 (10fnegri) While taking a Cinder snapshot as MariaDB is running //seems// to work (MariaDB will fix corrupted tables when restoring the snapshot), the [of... [19:09:58] 10Data-Services, 10cloud-services-team (FY2023/2024-Q3-Q4): [toolsdb] set gtid_domain_id to 0 - https://phabricator.wikimedia.org/T357341 (10fnegri) p:05Triage→03Medium [19:25:01] 10cloud-services-team (FY2023/2024-Q3-Q4), 10Cloud-Services-Origin-Alert, 10Cloud-Services-Worktype-Maintenance, 10User-dcaro: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2024-02-12 - https://phabricator.wikimedia.org/T357264 (10fnegri) 05In progress→03Resolved Replication lag is back to zero: {F41... [19:28:20] 10Toolforge: Cannot delete directory from incolabot project on Toolforge - https://phabricator.wikimedia.org/T357342 (10Incola) [19:50:18] 10Tool-Global-user-contributions, 10Stewards-and-global-tools, 10Temporary accounts, 10XTools, and 2 others: [Design] Prototype and user testing plan - https://phabricator.wikimedia.org/T356099 (10KColeman-WMF) [20:22:42] (CloudVPSDesignateLeaks) firing: (2) Detected 33 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [20:27:42] (CloudVPSDesignateLeaks) resolved: (2) Detected 33 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [21:01:49] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [21:01:49] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [21:28:56] (SystemdUnitDown) firing: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [21:38:56] (SystemdUnitDown) firing: (2) The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [21:40:01] (NovafullstackSustainedFailures) firing: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures [21:43:56] (SystemdUnitDown) firing: (2) The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [21:48:56] (SystemdUnitDown) firing: (2) The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [22:03:56] (SystemdUnitDown) firing: (2) The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [22:08:56] (SystemdUnitDown) resolved: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [22:11:56] (SystemdUnitDown) firing: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [22:21:56] (SystemdUnitDown) resolved: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [22:26:56] (SystemdUnitDown) firing: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [22:31:56] (SystemdUnitDown) resolved: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1004. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudweb1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown