[00:06:03] (TfInfraTestDestroyFailed) resolved: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [00:07:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [00:12:03] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [00:35:46] 10Toolforge (Toolforge iteration 02): [envvars-cli] use toolforge-weld for error handling - https://phabricator.wikimedia.org/T351459 (10Raymond_Ndibe) 05Open→03In progress [00:35:50] 10Toolforge (Toolforge iteration 02): [envvars-cli] move pytest from tox to pre-commit - https://phabricator.wikimedia.org/T351476 (10Raymond_Ndibe) 05Open→03In progress [00:52:18] 10Toolforge (Toolforge iteration 02), 10Patch-For-Review: [envvars-cli] use toolforge-weld for error handling - https://phabricator.wikimedia.org/T351459 (10CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/13 [envvars-cli] use toolforge-weld f... [01:11:40] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [01:12:58] 10Grid-Engine-to-K8s-Migration: Migrate fountain-test from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319751 (10Leloiandudu) 05Open→03Resolved Done. [01:14:27] 10Grid-Engine-to-K8s-Migration: Migrate fountain from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319750 (10Leloiandudu) 05Open→03Resolved Done. 
[01:14:49] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1-Q2), 10DC-Ops, 10SRE, 10ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Andrew) P53530 [01:25:11] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1-Q2), 10DC-Ops, 10SRE, 10ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Andrew) Here's the change in errors on /dev/sdj since the 31st. ` 4c4 < (1) cloudcephosd1024.eqiad.wmnet 198 Of... [02:02:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [02:07:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [02:58:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [03:03:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [03:05:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [03:10:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [03:15:58] 
10Tool-spacemedia, 10Accessibility: Links do not contrast well with cell backgrounds - https://phabricator.wikimedia.org/T351482 (10Remagoxer) [03:16:26] 10Tool-spacemedia, 10Accessibility: Links do not contrast well with cell backgrounds - https://phabricator.wikimedia.org/T351482 (10Remagoxer) [03:17:34] 10Tool-spacemedia, 10Accessibility: Links do not contrast well with cell backgrounds - https://phabricator.wikimedia.org/T351482 (10Remagoxer) [03:34:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [03:39:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [04:11:40] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [05:27:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [05:32:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [05:44:03] 10Grid-Engine-to-K8s-Migration: Migrate wikitasks from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320168 (10Vort) Run, which executed this night, was successful. So if nothing wrong happens in following days, this task will be closed. 
[06:38:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [06:43:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [07:07:10] 10Grid-Engine-to-K8s-Migration: Migrate steve-adder from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320062 (10Legoktm) 05Open→03Resolved a:03Legoktm Probably me - I had merged the changes in Git but not yet deployed them. But now it's resolved! ` tools.steve-adder@to... [07:11:40] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [08:28:15] 10Toolforge, 10cloud-services-team, 10Puppet (Puppet 7.0): Migrate Toolforge to Puppet 7 - https://phabricator.wikimedia.org/T351494 (10taavi) [08:29:02] (03CR) 10Stevemunene: [C: 03+2] Add dummy keytabs for new druid101[0-1] [labs/private] - 10https://gerrit.wikimedia.org/r/965460 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene) [08:29:13] (03CR) 10Stevemunene: [V: 03+2 C: 03+2] Add dummy keytabs for new druid101[0-1] [labs/private] - 10https://gerrit.wikimedia.org/r/965460 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene) [08:38:19] 10Toolforge (Toolforge iteration 02), 10Documentation: [tbs] Create a tutorial on how to deploy a ruby on rails tool using build service - https://phabricator.wikimedia.org/T347402 (10Slst2020) >>! In T347402#9339543, @bd808 wrote: >>>! 
In T347402#9333908, @Slst2020 wrote: >> It is painfully slow though, which... [09:16:37] (CephSlowOps) firing: Ceph cluster in eqiad has 7 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [09:16:42] 10cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (10phaultfinder) [09:21:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 7 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [09:26:46] 10Cloud-VPS (Debian Buster Deprecation), 10cloud-services-team, 10Puppet (Puppet 7.0): Restrict creation of new Debian Buster VMs - https://phabricator.wikimedia.org/T351499 (10taavi) [09:27:12] 10Cloud-VPS (Debian Buster Deprecation), 10cloud-services-team, 10Puppet (Puppet 7.0): Restrict creation of new Debian Buster VMs - https://phabricator.wikimedia.org/T351499 (10taavi) [09:27:19] 10Cloud-VPS (Debian Buster Deprecation), 10cloud-services-team: Cloud-vps Buster deprecation - https://phabricator.wikimedia.org/T331738 (10taavi) [09:27:22] 10Cloud-VPS (Debian Buster Deprecation), 10cloud-services-team, 10Puppet (Puppet 7.0): Restrict creation of new Debian Buster VMs - https://phabricator.wikimedia.org/T351499 (10taavi) a:03taavi [09:32:38] 10Cloud-VPS (Debian Buster Deprecation), 10cloud-services-team, 10Puppet (Puppet 7.0): Restrict creation of new Debian Buster VMs - https://phabricator.wikimedia.org/T351499 (10taavi) Toolforge might still need new Buster-based k8s workers until I get them running on Bookworm. So, following https://wikitech.... 
[09:39:59] Cloud-VPS (Debian Buster Deprecation), cloud-services-team, Puppet (Puppet 7.0): Restrict creation of new Debian Buster VMs - https://phabricator.wikimedia.org/T351499 (taavi) Open→Resolved
[09:40:01] Cloud-VPS, cloud-services-team, Infrastructure-Foundations, Puppet (Puppet 7.0): Migrate Cloud VPS puppet infrastructure to Puppet 7 - https://phabricator.wikimedia.org/T351450 (taavi)
[09:40:03] Cloud-VPS (Debian Buster Deprecation), cloud-services-team: Cloud-vps Buster deprecation - https://phabricator.wikimedia.org/T331738 (taavi)
[10:01:25] cloud-services-team, Observability-Metrics: Evaluate whether to deploy cloud Prometheus instance to codfw - https://phabricator.wikimedia.org/T350010 (taavi) I think we should do it.
[10:02:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[10:05:27] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[10:07:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[10:11:40] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[10:24:58] cloud-services-team (FY2023/2024-Q1-Q2), Infrastructure-Foundations, Packaging: wmfbackups packages for Debian Bookworm - https://phabricator.wikimedia.org/T347740 (jcrespo) Please know an important update of wmfbackups package for compatibility with Puppet 7 will be pushed soon (wmfbackups 0.8.3 - a...
[10:37:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[10:37:33] Data-Services, cloud-services-team, Data-Persistence-Backup, database-backups: migrate clouddb backups (openstack) from the old mysqldump system to the new wmfbackups (mydumper/mariabackup) - https://phabricator.wikimedia.org/T284483 (jcrespo)
[10:37:37] Cloud-VPS, cloud-services-team: Make a script to backup galera/openstack databases - https://phabricator.wikimedia.org/T316664 (jcrespo)
[10:37:53] Cloud-VPS, cloud-services-team: cloudservices: codfw1dev: fix backups - https://phabricator.wikimedia.org/T339894 (jcrespo)
[10:38:39] Cloud-VPS, cloud-services-team: cloudservices: codfw1dev: fix backups - https://phabricator.wikimedia.org/T339894 (jcrespo) I believe the work on those tickets was done here CC @fnegri But please double check if there are additional dbs that were not part of the migration.
[10:42:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [10:45:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [10:50:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [10:57:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [11:02:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [11:04:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [11:09:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [11:16:29] 10Cloud-Services, 10collaboration-services: VMs in Wikimedia Cloud share the same machine-id - https://phabricator.wikimedia.org/T351507 (10Jelto) The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it wi... 
[11:17:18] 10Cloud-VPS, 10collaboration-services: VMs in Wikimedia Cloud share the same machine-id - https://phabricator.wikimedia.org/T351507 (10Jelto) [11:21:42] 10Cloud-VPS, 10cloud-services-team, 10collaboration-services: VMs in Cloud VPS share the same machine-id - https://phabricator.wikimedia.org/T351507 (10taavi) [11:22:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [11:37:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [11:40:01] 10Cloud-VPS, 10cloud-services-team, 10Infrastructure-Foundations, 10Documentation, 10Puppet (Puppet 7.0): Update Wikitech documentation on per-project Puppet servers - https://phabricator.wikimedia.org/T351509 (10taavi) [11:42:33] 10Cloud-VPS, 10cloud-services-team, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Build new Bullseye and Bookworm base images with Puppet 7 - https://phabricator.wikimedia.org/T351510 (10taavi) [11:42:35] (ProbeDown) firing: (4) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [11:42:42] 10Cloud-VPS, 10cloud-services-team, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Migrate Cloud VPS central puppet server to Puppet 7 - https://phabricator.wikimedia.org/T351451 (10taavi) [11:42:44] 10Cloud-VPS, 10cloud-services-team, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Build new Bullseye and Bookworm base images with Puppet 7 - https://phabricator.wikimedia.org/T351510 (10taavi) [12:02:03] (PuppetAgentStaleLastRun) firing: Last Puppet 
run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [12:07:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [12:23:29] 10PAWS: PAWS terraform to pulumi? - https://phabricator.wikimedia.org/T345387 (10rook) I think we're going to stick with opentofu T351249 [12:23:40] 10PAWS: PAWS terraform to pulumi? - https://phabricator.wikimedia.org/T345387 (10rook) 05Open→03Resolved [12:23:52] 10PAWS: PAWS terraform to pulumi? - https://phabricator.wikimedia.org/T345387 (10github-toolforge-bot) vivian-rook closed https://github.com/toolforge/paws/pull/329 [12:23:57] vivian-rook closed https://github.com/toolforge/paws/pull/329 [12:24:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [12:29:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [12:42:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [12:47:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [12:51:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [12:56:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [13:04:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [13:09:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [13:11:40] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [13:19:40] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1030 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [13:19:40] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1025 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [13:19:40] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1027 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown 
[13:19:45] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1028 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [13:19:50] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1046 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [13:19:54] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1029 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [13:19:59] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1026 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [13:23:12] 10Toolforge (Toolforge iteration 02): [tbs.builds][api, cli] Bring back the ability to specify an image name - https://phabricator.wikimedia.org/T351516 (10Slst2020) [13:24:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [13:26:44] 10Toolforge (Toolforge iteration 02), 10Patch-For-Review: [tbs.builds][api, cli] Bring back the ability to specify an image name - https://phabricator.wikimedia.org/T351516 (10CodeReviewBot) sstefanova opened https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/25 cli: Add 
image-name... [13:29:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [13:31:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [13:31:38] 10Toolforge (Toolforge iteration 02), 10Patch-For-Review: [tbs.builds][api, cli] Bring back the ability to specify an image name - https://phabricator.wikimedia.org/T351516 (10CodeReviewBot) sstefanova opened https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/57 [build.start]: Add... [13:36:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [13:41:04] 10Tools, 10Diffusion-Repository-Administrators, 10Projects-Cleanup, 10Wikimedia-GitHub: Consider archiving Gerrit repository "labs/tools/SuchABot" (20130812) - https://phabricator.wikimedia.org/T351526 (10Aklapper) [13:41:08] 10Tool-anagrimes, 10Diffusion-Repository-Administrators, 10Projects-Cleanup, 10Wikimedia-GitHub: Consider archiving Gerrit repository "wiktionary/anagrimes" (20131123) - https://phabricator.wikimedia.org/T351528 (10Aklapper) [13:47:35] (ProbeDown) resolved: (4) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [13:48:35] (ProbeDown) firing: (4) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:52:33] valhallasw, 👋
[13:52:43] Hello!
[13:52:47] the repo is this one, right? https://gerrit.wikimedia.org/r/admin/repos/labs/tools/wikibugs2,general
[13:52:56] Yes
[13:54:35] alright, so i uploaded this: https://gerrit.wikimedia.org/r/c/labs/tools/wikibugs2/+/975276
[13:54:57] and will mark it as "ready for review" now, which should trigger the message here *fingers crossed*
[13:55:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[13:55:11] (CR) Jon Harald Søby: "This change is ready for review." [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/975276 (owner: Jon Harald Søby)
[13:55:11] Not you, wmcs-alerts :D
[13:55:18] There we go!
[13:55:28] there
[13:55:44] .... {"type":"Verified","description":"Verified","value":"0"},{"type":"Code-Review","description":"Code-Review","value":"0"}],"comment":"Patch Set 1:\n\nThis change is ready for review.","patchSet":
[13:55:51] Ok, that gives me enough to continue digging with
[13:55:58] nice
[13:56:17] I'm a bit confused why it's reporting this in a slightly different way, but such is life
[13:56:37] (Abandoned) Jon Harald Søby: Test commit [DO NOT MERGE] [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/975276 (owner: Jon Harald Søby)
[13:56:56] valhallasw, hehe, yeah
[13:57:43] valhallasw, the setting i have enabled in Gerrit is under https://gerrit.wikimedia.org/r/settings/ and called "Set new changes to "work in progress" by default", if you need to test any further
[13:59:20] And the 'move to review' is the regular button through https://phab.wmfusercontent.org/file/data/kgklzar4xsiohpntgd2g/PHID-FILE-ysiloqxgt5hwqeeotdc2/preview-afbeelding.png ?
[13:59:32] (three dots menu)
[14:00:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[14:00:26] Aha. There is *both* a wip-state-changed *and* a message
[14:03:35] (ProbeDown) resolved: (4) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[14:05:27] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[14:06:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[14:06:36] valhallasw, my usual procedure is to just click the big blue "Start review" button (where the "Reply" button normally is)
[14:07:54] Ah, wait, the screenshot is to go *to* WIP not from. Ok, not entirely sure what's going on there but will poke around a bit further.
[14:08:49] button looks like this: https://phab.wmfusercontent.org/file/data/uvfi5ikfuwptepqwpo77/PHID-FILE-xy6c7qbdla4rmo3vfw5d/Screenshot_20231117_150807.png
[14:09:13] i think it will appear for you too here, for example, even if you're not the patch owner? https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMaintenance/+/965108
[14:10:27] No, I only see a "Reply" button
[14:11:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[14:11:15] But I can make my own changes following your instructions :-)
[14:16:48] Wikibugs: Better message than "This change is ready for review" when patch stops being WIP - https://phabricator.wikimedia.org/T350778 (valhallasw) Did an online test with Jon and we were able to reproduce and log the issue. See attached (anonymized) {F41513945}, resulting in the following IRC entry: `<+wik...
[14:43:39] Toolforge (Toolforge iteration 02), Patch-For-Review: [tbs.builds][api, cli] Bring back the ability to specify an image name - https://phabricator.wikimedia.org/T351516 (Slst2020) Open→In progress
[14:44:59] Toolforge (Toolforge iteration 02), Patch-For-Review: maintain-harbor: code refactor for readability and quality - https://phabricator.wikimedia.org/T351277 (Slst2020)
[14:48:50] Toolforge (Toolforge iteration 02), Patch-For-Review: [tbs.builds][api, cli] Bring back the ability to specify an image name - https://phabricator.wikimedia.org/T351516 (CodeReviewBot) sstefanova merged https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/57 [build.start]: Add...
[14:49:18] Toolforge (Toolforge iteration 02), Patch-For-Review: [tbs.builds][api, cli] Bring back the ability to specify an image name - https://phabricator.wikimedia.org/T351516 (CodeReviewBot) sstefanova merged https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/25 cli: Add image-name...
[14:53:03] 10Toolforge (Toolforge iteration 02), 10Patch-For-Review: [tbs.builds][api, cli] Bring back the ability to specify an image name - https://phabricator.wikimedia.org/T351516 (10Slst2020) 05In progress→03Resolved [14:53:41] 10Toolforge (Toolforge iteration 02), 10Patch-For-Review: [tbs.builds][api, cli] Bring back the ability to specify an image name - https://phabricator.wikimedia.org/T351516 (10CodeReviewBot) taavi opened https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/26 d/changelog: bump to 0.0.5 [14:55:42] 10Toolforge (Toolforge iteration 02), 10Patch-For-Review: [tbs.builds][api, cli] Bring back the ability to specify an image name - https://phabricator.wikimedia.org/T351516 (10CodeReviewBot) taavi merged https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/26 d/changelog: bump to 0.0.5 [14:56:05] !log taavi@cloudcumin2001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-api [14:56:22] !log taavi@cloudcumin2001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component builds-api [14:57:44] !log taavi@cloudcumin2001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-api [14:57:59] !log taavi@cloudcumin2001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component builds-api [15:02:05] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.reboot for all nodes [15:03:42] !log taavi@cloudcumin1001 toolsbeta END (ERROR) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=97) for all nodes [15:06:35] (ProbeDown) firing: (4) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [15:16:35] (ProbeDown) resolved: (4) Service 
toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [15:18:40] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1025 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [15:18:40] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1026 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [15:18:40] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1046 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [15:18:45] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1027 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [15:18:45] 10cloud-services-team: NeutronAgentDown cloudvirt1030 A Neutron agent is down, VMs will have connectivity issues - https://phabricator.wikimedia.org/T351535 (10phaultfinder) [15:18:48] 10cloud-services-team: NeutronAgentDown cloudvirt1029 A Neutron agent is down, VMs will have connectivity issues - https://phabricator.wikimedia.org/T351540 (10phaultfinder) [15:18:50] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on 
cloudvirt1030 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [15:18:50] 10cloud-services-team: NeutronAgentDown cloudvirt1027 A Neutron agent is down, VMs will have connectivity issues - https://phabricator.wikimedia.org/T351537 (10phaultfinder) [15:18:52] 10cloud-services-team: NeutronAgentDown cloudvirt1046 A Neutron agent is down, VMs will have connectivity issues - https://phabricator.wikimedia.org/T351538 (10phaultfinder) [15:18:54] 10cloud-services-team: NeutronAgentDown cloudvirt1026 A Neutron agent is down, VMs will have connectivity issues - https://phabricator.wikimedia.org/T351536 (10phaultfinder) [15:18:54] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1028 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [15:18:56] 10cloud-services-team: NeutronAgentDown cloudvirt1025 A Neutron agent is down, VMs will have connectivity issues - https://phabricator.wikimedia.org/T351539 (10phaultfinder) [15:18:58] 10cloud-services-team: NeutronAgentDown cloudvirt1028 A Neutron agent is down, VMs will have connectivity issues - https://phabricator.wikimedia.org/T351541 (10phaultfinder) [15:18:59] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1029 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [15:30:34] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1-Q2), 10DC-Ops, 10SRE, 10ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - 
https://phabricator.wikimedia.org/T348643 (10Andrew) [15:35:27] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [15:49:51] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component builds-api [15:50:11] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component builds-api [15:50:49] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.apt.copy_to_main_repo for package 'toolforge-builds-cli' version '0.0.5' [15:51:06] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.apt.copy_to_main_repo (exit_code=0) for package 'toolforge-builds-cli' version '0.0.5' [16:08:07] 10Grid-Engine-to-K8s-Migration, 10User-bd808: Migrate officewikibot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319934 (10taavi) [16:08:09] 10Grid-Engine-to-K8s-Migration: Migrate multichill from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319912 (10taavi) [16:09:00] 10Cloud-VPS, 10cloud-services-team, 10collaboration-services: VMs in Cloud VPS share the same machine-id - https://phabricator.wikimedia.org/T351507 (10Dzahn) https://review.opendev.org/c/openstack/tripleo-puppet-elements/+/445173 https://review.opendev.org/c/openstack/tripleo-common/+/445174 [16:09:54] 10Toolforge Jobs framework, 10cloud-services-team (FY2023/2024-Q1-Q2), 10Pywikibot, 10Patch-For-Review, 10User-Raymond_Ndibe: Create Docker image for Toolforge that is purpose built to run pywikibot scripts - https://phabricator.wikimedia.org/T249787 (10taavi) 05Open→03Resolved Image published, and I... 
[16:10:59] 10Cloud-VPS, 10cloud-services-team, 10collaboration-services: VMs in Cloud VPS share the same machine-id - https://phabricator.wikimedia.org/T351507 (10taavi) >>! In T351507#9341302, @Dzahn wrote: > https://review.opendev.org/c/openstack/tripleo-puppet-elements/+/445173 > https://review.opendev.org/c/opensta... [16:11:40] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [16:26:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [16:31:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [16:32:56] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [16:36:48] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [17:01:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [17:06:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [17:19:55] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1046 is down - 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [17:26:47] 10Grid-Engine-to-K8s-Migration, 10User-bd808: Migrate officewikibot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319934 (10RhinosF1) 05Stalled→03Open [17:31:58] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1-Q2): cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171 (10Andrew) a:05fnegri→03RobH Rob will have a go at 1046. @RobH if you get it to reimage reassign this ticket to me so I can put it in service. Thanks! [18:24:21] 10Grid-Engine-to-K8s-Migration, 10User-bd808: Migrate officewikibot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319934 (10bd808) > @RhinosF1 changed the task status from Stalled to Open. The task I blocked on did resolve, but unfortunately it resolved in a way that i... 
[18:27:28] 10Grid-Engine-to-K8s-Migration, 10User-bd808: Migrate officewikibot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319934 (10bd808) p:05Triage→03High [18:28:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [18:33:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [19:01:38] RECOVERY - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2001 is OK: OK: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:11:40] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [19:15:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [19:18:55] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1046 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [19:20:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [19:23:44] PROBLEM - Check unit status of 
remove_dangling_cinder_snapshots on cloudbackup2001 is CRITICAL: CRITICAL: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:31:09] (03PS1) 10Majavah: Also notify Toolforge on new Pywikibot releases [labs/libraryupgrader/config] - 10https://gerrit.wikimedia.org/r/975357 [19:35:27] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [19:45:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [19:55:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [20:15:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [20:20:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [20:26:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [20:31:02] PROBLEM - SSH on cloudvirt1058 is CRITICAL: connect to address 10.64.149.10 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:32:40] (NeutronAgentDown) firing: Neutron 
neutron-linuxbridge-agent on cloudvirt1058 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [20:32:45] (ProbeDown) firing: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [20:34:03] (InstanceDown) firing: (2) Project tools instance tools-k8s-control-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:34:03] (InstanceDown) firing: Project metricsinfra instance metricsinfra-grafana-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:34:03] (InstanceDown) firing: Project gitlab-runners instance runner-1021 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:34:25] (NodeDown) firing: The node cloudvirt1058 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1058 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown [20:34:25] (NodeDown) firing: #page The cloudvirt node cloudvirt1058 is unreachable. 
This is a [20:34:31] 10cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T350998 (10phaultfinder) [20:35:03] (InstanceDown) firing: Project quarry instance quarry-nfs-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:35:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-test-k8s-ingress-5 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:35:03] (InstanceDown) firing: Project clouddb-services instance clouddb-wikireplicas-proxy-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:35:28] PROBLEM - Host cloudvirt1058 is DOWN: PING CRITICAL - Packet loss = 100% [20:36:03] (WidespreadInstanceDown) firing: Widespread instances down in project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [20:36:03] (WidespreadInstanceDown) firing: Widespread instances down in project quarry - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [20:36:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [20:37:22] RECOVERY - Host cloudvirt1058 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [20:37:56] RECOVERY - SSH on cloudvirt1058 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:38:37] (InterfaceSpeedError) firing: brq7425e328-56 on cloudvirt1058:9100 has the wrong speed: 1.25e+06. - https://wikitech.wikimedia.org/wiki/Monitoring/check_eth - https://grafana.wikimedia.org/d/000000562 - https://alerts.wikimedia.org/?q=alertname%3DInterfaceSpeedError [20:38:41] 10cloud-services-team: InterfaceSpeedError cloudvirt1058:9100 brq7425e328-56 on cloudvirt1058:9100 has the wrong speed: 1.25e+06. 
- https://phabricator.wikimedia.org/T351571 (10phaultfinder) [20:39:03] (InstanceDown) firing: (3) Project gitlab-runners instance gitlab-runners-puppetmaster-01 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:39:03] (InstanceDown) firing: (7) Project tools instance tools-k8s-control-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:39:24] PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 177 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [20:39:25] (NodeDown) resolved: The node cloudvirt1058 is unreachable. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1058 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown [20:39:25] (NodeDown) resolved: #page The cloudvirt node cloudvirt1058 is unreachable. 
This is a [20:40:03] (InstanceDown) firing: (3) Project quarry instance quarry-dbbackup-01 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:40:03] (InstanceDown) resolved: Project clouddb-services instance clouddb-wikireplicas-proxy-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:41:03] (WidespreadInstanceDown) resolved: Widespread instances down in project quarry - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [20:42:45] (ProbeDown) resolved: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [20:43:37] (InterfaceSpeedError) resolved: brq7425e328-56 on cloudvirt1058:9100 has the wrong speed: 1.25e+06. - https://wikitech.wikimedia.org/wiki/Monitoring/check_eth - https://grafana.wikimedia.org/d/000000562 - https://alerts.wikimedia.org/?q=alertname%3DInterfaceSpeedError [20:44:03] (InstanceDown) firing: (3) Project gitlab-runners instance gitlab-runners-puppetmaster-01 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:44:03] (InstanceDown) resolved: (7) Project tools instance tools-k8s-control-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:44:03] (InstanceDown) resolved: Project metricsinfra instance metricsinfra-grafana-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:44:46] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.090 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [20:45:03] (InstanceDown) resolved: (3) Project quarry instance quarry-dbbackup-01 is down - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:45:03] (InstanceDown) resolved: Project toolsbeta instance toolsbeta-test-k8s-ingress-5 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:45:16] 10cloud-services-team: InterfaceSpeedError cloudvirt1058:9100 brq7425e328-56 on cloudvirt1058:9100 has the wrong speed: 1.25e+06. - https://phabricator.wikimedia.org/T351571 (10taavi) 05Open→03Resolved a:03taavi [20:45:18] 10cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T350998 (10taavi) 05Open→03Resolved a:03taavi [20:46:03] (WidespreadInstanceDown) resolved: Widespread instances down in project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [20:49:03] (InstanceDown) resolved: (3) Project gitlab-runners instance gitlab-runners-puppetmaster-01 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:57:40] (NeutronAgentDown) resolved: Neutron neutron-linuxbridge-agent on cloudvirt1058 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [21:24:40] (NeutronAgentDown) firing: Neutron neutron-linuxbridge-agent on cloudvirt1046 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [21:29:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [21:34:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [21:44:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [21:54:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [22:11:40] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [22:42:13] 10Grid-Engine-to-K8s-Migration, 10User-bd808: Migrate officewikibot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319934 (10bd808) 05Open→03In progress [22:46:19] 10Grid-Engine-to-K8s-Migration, 10User-bd808: Migrate officewikibot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319934 (10bd808) Work in progress at https://gitlab.wikimedia.org/toolforge-repos/officewikibot-pywikibot/-/tree/work/bd808/botpassword Currently trying to... [23:35:42] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse