[00:31:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[00:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[00:46:16] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on clouddb1019:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[00:53:01] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on clouddb1015:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[01:05:22] PROBLEM - Check unit status of backup_cinder_volumes on cloudbackup2001 is CRITICAL: CRITICAL: Status of the systemd unit backup_cinder_volumes https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:08:53] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[01:43:14] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcontrol2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[02:18:15] (PuppetFailure) firing: Puppet has failed on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:31:21] Toolforge (Quota-requests): Request increased quota for dexbot Toolforge tool - https://phabricator.wikimedia.org/T351051 (Ladsgroup)
[03:31:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[03:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[04:46:16] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on clouddb1019:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[04:53:15] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on clouddb1015:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet -
https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[05:08:53] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[05:33:53] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[05:43:14] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcontrol2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[06:18:15] (PuppetFailure) firing: Puppet has failed on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:31:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[06:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[07:01:46] RECOVERY - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2002 is OK: OK: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:13:36] PROBLEM
- Check unit status of remove_dangling_cinder_snapshots on cloudbackup2002 is CRITICAL: CRITICAL: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:46:16] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on clouddb1019:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[08:46:19] Toolforge (Quota-requests): Request increased quota for dexbot Toolforge tool - https://phabricator.wikimedia.org/T351051 (taavi) a:taavi The CPU increase is within the new defaults from T333979 that haven't been fully rolled out to existing tools yet, so I did that immediately. I'll need to find someone...
[08:51:47] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[08:53:00] (PuppetFailure) resolved: Puppet has failed on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:53:15] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on clouddb1015:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[08:56:33] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[09:08:00] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[09:08:41] Cloud-VPS, cloud-services-team: Instance deletion times out in codfw1dev - https://phabricator.wikimedia.org/T351061 (taavi)
[09:08:57] Cloud-VPS, cloud-services-team: Instance deletion times out in codfw1dev - https://phabricator.wikimedia.org/T351061 (taavi) ` Unexpected API Error.
Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible. (HTTP 500) (Request-ID...
[09:09:06] Toolforge (Toolforge iteration 02), Documentation: Create an ASGI tutorial for buildservice - https://phabricator.wikimedia.org/T350692 (Slst2020) a:Slst2020
[09:11:58] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[09:27:02] Cloud-VPS, cloud-services-team: Instance deletion times out in codfw1dev - https://phabricator.wikimedia.org/T351061 (taavi) That didn't help. I'm trying to restart Galera on cloudcontrol2005-dev, but it seems to be struggling to re-join the cluster.
[09:31:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[09:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[09:33:53] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[09:40:10] wikitech.wikimedia.org, MediaWiki-Blocks, MediaWiki-extensions-OAuth, MW-1.42-notes (1.42.0-wmf.5; 2023-11-14), and 2 others: OAuth login to wikitech fails when running MediaWiki 1.42.0-wmf.4 - https://phabricator.wikimedia.org/T350836 (jnuche) Open→Resolved
[10:13:16] Data-Services, cloud-services-team: Automate maintain-views replica depooling - https://phabricator.wikimedia.org/T300427 (taavi) >>!
In T300427#7661326, @taavi wrote: > I was briefly considering using a [[ https://wikitech.wikimedia.org/wiki/Conftool | conftool ]]-based solution to manage the pooling on...
[10:31:26] Cloud-VPS, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Create OpenID Connect client - https://phabricator.wikimedia.org/T350725 (SLyngshede-WMF) The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile...
[10:31:46] Cloud-VPS, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Create OpenID Connect client - https://phabricator.wikimedia.org/T350725 (SLyngshede-WMF)
[10:33:19] Cloud-VPS, cloud-services-team, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Create OpenID Connect client - https://phabricator.wikimedia.org/T350725 (taavi) a:SLyngshede-WMF→taavi
[10:35:40] Cloud-VPS, cloud-services-team, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Create OpenID Connect client - https://phabricator.wikimedia.org/T350725 (taavi) I've configured the client to idp.wmcloud.org. The client ID is `catalyst` and the client secret is in P53310.
[10:44:38] Data-Services, cloud-services-team, Data-Platform-SRE: Automate maintain-views replica depooling - https://phabricator.wikimedia.org/T300427 (BTullis) @taavi - Thanks so much, that does look really helpful. The only other thing I think would be helpful is if we could somehow also remove the spof on...
[10:47:12] Toolforge (Quota-requests): Request increased quota for dexbot Toolforge tool - https://phabricator.wikimedia.org/T351051 (fnegri) +1
[10:55:11] Data-Services, cloud-services-team, Data-Platform-SRE: Automate maintain-views replica depooling - https://phabricator.wikimedia.org/T300427 (taavi) My change doesn't immediately fix the proxy redundancy issue, but it definitely makes it much easier to solve.as all of the backend configuration will a...
[11:00:27] Toolforge (Quota-requests): Request increased quota for dexbot Toolforge tool - https://phabricator.wikimedia.org/T351051 (taavi) Open→Resolved Bumped the continuous job quota too, you should be all set.
[11:01:22] Cloud-VPS, cloud-services-team, Observability-Metrics: Current status of cloudmetrics and its components - https://phabricator.wikimedia.org/T336774 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by taavi@cumin1001 for hosts: `cloudmetrics[1003-1004].eqiad.wmnet` - cloudmetrics1003.eq...
[11:02:52] Cloud-VPS, cloud-services-team, decommission-hardware: decommission cloudmetrics1003.eqiad.wmnet, cloudmetrics1004.eqiad.wmnet - https://phabricator.wikimedia.org/T351077 (taavi)
[11:12:06] Data-Services, cloud-services-team, Data-Platform-SRE: Automate maintain-views replica depooling - https://phabricator.wikimedia.org/T300427 (taavi) Or we could use the opportunity to do both changes at the same time, and also combine it with moving the load balancing to our new `cloudlb` setup and r...
[11:25:02] Cloud-VPS, cloud-services-team, CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Create OpenID Connect client - https://phabricator.wikimedia.org/T350725 (taavi) In progress→Resolved
[11:25:06] Toolforge (Quota-requests): Request increased quota for dexbot Toolforge tool - https://phabricator.wikimedia.org/T351051 (Ladsgroup) Thanks!
[11:43:58] PROBLEM - Host clouddb1015 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:06] RECOVERY - Host clouddb1015 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[11:46:08] PROBLEM - Check systemd state on clouddb1015 is CRITICAL: CRITICAL - degraded: The following units failed: wmf-pt-kill@s4.service,wmf-pt-kill@s6.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:46:20] PROBLEM - mysqld processes on clouddb1015 is CRITICAL: PROCS CRITICAL: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[11:47:34] RECOVERY - mysqld processes on clouddb1015 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[11:48:29] (PuppetConstantChange) resolved: Puppet performing a change on every puppet run on clouddb1015:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[11:52:24] RECOVERY - Check systemd state on clouddb1015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:59:03] (InstanceDown) firing: Project project-proxy instance proxy-03 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:04:03] (InstanceDown) resolved: Project project-proxy instance proxy-03 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:30:52] Data-Services, cloud-services-team, Data-Platform-SRE, Patch-For-Review: Automate maintain-views replica depooling - https://phabricator.wikimedia.org/T300427 (BTullis) >>! In T300427#9326131, @taavi wrote: > Or we could use the opportunity to do both changes at the same time, and also combine it...
[12:31:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[12:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[12:34:51] PAWS: New upstream release 8.5.1 for Pywikibot - https://phabricator.wikimedia.org/T351015 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/347
[12:34:58] vivian-rook opened https://github.com/toolforge/paws/pull/347
[12:38:14] Data-Services, cloud-services-team, Data-Platform-SRE, Patch-For-Review: Automate maintain-views replica depooling - https://phabricator.wikimedia.org/T300427 (taavi) >>! In T300427#9326322, @BTullis wrote: > That would mean: > * integrating the work on this ticket, correct? {T346947} > * whilst...
[12:46:16] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on clouddb1019:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[13:02:18] Toolforge (Toolforge iteration 02), Documentation: Create an ASGI tutorial for buildservice - https://phabricator.wikimedia.org/T350692 (Slst2020) Tutorial: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service/My_first_Buildpack_Python_ASGI_tool Repo: https://gitlab.wikimedia.org/toolforge-r...
[13:08:51] Toolforge (Toolforge iteration 02), Documentation: Create an ASGI tutorial for buildservice - https://phabricator.wikimedia.org/T350692 (Slst2020) Open→Resolved
[13:08:53] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[13:12:46] Cloud-VPS, cloud-services-team, CAS-SSO, Infrastructure-Foundations: Create OpenID Connect client - https://phabricator.wikimedia.org/T350725 (CCicalese_WMF) Works perfectly! Thank you!
[13:13:53] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[13:18:58] Toolforge (Toolforge iteration 02), Documentation: [tbs] Create a tutorial on compiling static frontend assets at build time - https://phabricator.wikimedia.org/T351082 (Slst2020)
[13:21:37] PAWS: New upstream release 8.5.1 for Pywikibot - https://phabricator.wikimedia.org/T351015 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/paws/pull/347
[13:21:42] vivian-rook closed https://github.com/toolforge/paws/pull/347
[13:22:00] PAWS: New upstream release 8.5.1 for Pywikibot - https://phabricator.wikimedia.org/T351015 (rook) Open→Resolved a:rook
[13:23:25] Toolforge (Toolforge iteration 02), Technical-blog-posts: Publish a blog post about buildservice on the Tech Blog - https://phabricator.wikimedia.org/T350691 (Slst2020)
[13:31:44] Toolforge (Toolforge iteration 02): [tbs] migrate sample tools to Gitlab - https://phabricator.wikimedia.org/T348213 (Slst2020)
[13:47:13] Toolforge (Toolforge iteration 02): [tbs] migrate sample tools to Gitlab
- https://phabricator.wikimedia.org/T348213 (Slst2020) Open→In progress
[14:05:47] Cloud-VPS, cloud-services-team: Migrate cloudlb hosts to nftables - https://phabricator.wikimedia.org/T351087 (taavi)
[14:08:10] Data-Services, cloud-services-team, Patch-For-Review: Move wiki replicas behind cloudlb - https://phabricator.wikimedia.org/T346947 (taavi)
[14:08:12] Data-Services, cloud-services-team, Data-Platform-SRE, Patch-For-Review: Automate maintain-views replica depooling - https://phabricator.wikimedia.org/T300427 (taavi)
[14:25:31] Toolforge (Toolforge iteration 02): [tbs] migrate sample tools to Gitlab - https://phabricator.wikimedia.org/T348213 (Slst2020)
[14:26:15] Toolforge (Toolforge iteration 02): [tbs] migrate sample tools to Gitlab - https://phabricator.wikimedia.org/T348213 (Slst2020) In progress→Resolved
[14:27:13] Toolforge (Toolforge iteration 02), Documentation: [tbs] Create a tutorial on how to deploy a ruby tool using build service - https://phabricator.wikimedia.org/T347402 (Slst2020)
[14:27:35] Toolforge (Toolforge iteration 02), Documentation, Kubernetes: Add a easy way to run a ruby webservice on tools - https://phabricator.wikimedia.org/T141388 (Slst2020)
[14:27:52] Toolforge (Toolforge iteration 02), Documentation: [tbs] Create a tutorial on how to deploy a ruby tool using build service - https://phabricator.wikimedia.org/T347402 (Slst2020) Open→In progress
[14:28:03] PROBLEM - Check systemd state on clouddb1017 is CRITICAL: CRITICAL - degraded: The following units failed: wmf-pt-kill@s1.service,wmf-pt-kill@s3.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:33:12] Toolforge (Toolforge iteration 02), Documentation: [tbs] Create a tutorial on how to deploy a ruby on rails tool using build service - https://phabricator.wikimedia.org/T347402 (Slst2020)
[14:35:34] RECOVERY - Check systemd state on clouddb1017 is OK: OK - running: The system is
fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:35:55] Toolforge (Toolforge iteration 02), Patch-For-Review: [envvars-api] Add statistics - https://phabricator.wikimedia.org/T346228 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/16 [envvars-api]: Add prometheus
[14:37:03] Toolforge (Toolforge iteration 02), cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, and 3 others: [builds-api.start] Add statistics - https://phabricator.wikimedia.org/T337390 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repo...
[14:39:25] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[14:45:47] Toolforge (Toolforge iteration 02), Patch-For-Review: [envvars-api] Add statistics - https://phabricator.wikimedia.org/T346228 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/129 envvars-api: bump to 0.0.34-20231113143549-be1944fa
[14:46:58] Toolforge (Toolforge iteration 02), Patch-For-Review: [envvars-api] avoid invalidating go mod download cache on each code change - https://phabricator.wikimedia.org/T350193 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/129 envva...
[14:51:51] Toolforge (Toolforge iteration 02): [tbs] Improve Harbor quota handling and docs - https://phabricator.wikimedia.org/T351092 (Slst2020)
[14:57:06] Toolforge (Toolforge iteration 02): [tbs] Improve Harbor quota handling and docs - https://phabricator.wikimedia.org/T351092 (tchin) Example error: ` step-export: 2023-11-13T05:41:56.835942824Z ERROR: failed to export: failed to write image to the following tags: [tools-harbor.wmcloud.org/tool-dpe-alerts-das...
[14:57:45] (ProbeDown) firing: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[15:00:17] Toolforge (Toolforge iteration 02): [tbs] Improve Harbor quota handling and docs - https://phabricator.wikimedia.org/T351092 (Slst2020) >>! In T351092#9326750, @tchin wrote: > Example error: > ` > step-export: 2023-11-13T05:41:56.835942824Z ERROR: failed to export: failed to write image to the following tags...
[15:01:09] Data-Services, cloud-services-team, Infrastructure-Foundations: nftables ignores drange filter for IPv6 if drange only has IPv4 addresses - https://phabricator.wikimedia.org/T351094 (taavi)
[15:03:01] Toolforge (Toolforge iteration 02), Documentation: [tbs] Create a tutorial on how to deploy a ruby on rails tool using build service - https://phabricator.wikimedia.org/T347402 (Slst2020)
[15:19:05] !log admin fran@wmf3169 START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1031.eqiad.wmnet' (T345811)
[15:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:19:11] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[15:26:55] vivian-rook opened https://github.com/toolforge/superset-deploy/pull/10
[15:30:15] vivian-rook closed https://github.com/toolforge/superset-deploy/pull/10
[15:31:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[15:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[15:37:54] !log admin fran@wmf3169 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1031.eqiad.wmnet' (T345811)
[15:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:38:00] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[15:55:47] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot)
Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudvirt1031.eqiad.wmnet with OS bookworm
[16:07:13] PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 177 bytes in 0.086 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[16:07:45] (ProbeDown) resolved: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[16:12:23] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.123 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[16:40:35] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudvirt1031.eqiad.wmnet with OS bookworm completed: - cloudvirt1031 (**...
[16:55:59] (PuppetFailure) firing: Puppet has failed on cloudcontrol2005-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:56:04] cloud-services-team: PuppetFailure cloudcontrol2005-dev:9100 Puppet failure on cloudcontrol2005-dev:9100 - https://phabricator.wikimedia.org/T351107 (phaultfinder)
[16:56:51] !log admin fran@wmf3169 START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (T345811)
[16:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[16:56:58] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[16:57:00] !log admin fran@wmf3169 END (FAIL) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=99) (T345811)
[16:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[17:09:15] !log admin fran@wmf3169 START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (T345811)
[17:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[17:09:21] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[17:09:35] !log admin fran@wmf3169 END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) (T345811)
[17:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[17:09:43] PROBLEM - ensure kvm processes are running on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:12:01] (PS2) Eevans: cassandra: password for mediawiki_services_mobileapps role [labs/private] - https://gerrit.wikimedia.org/r/971504 (https://phabricator.wikimedia.org/T348993)
[17:20:24] !log fnegri@cloudcumin1001 cloudvirt-canary START - Cookbook
wmcs.openstack.cloudvirt.lib.ensure_canary
[17:20:54] !log fnegri@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[17:21:47] RECOVERY - ensure kvm processes are running on cloudvirt1031 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:38:39] (CR) Hnowlan: [C: +1] cassandra: password for mediawiki_services_mobileapps role [labs/private] - https://gerrit.wikimedia.org/r/971504 (https://phabricator.wikimedia.org/T348993) (owner: Eevans)
[17:38:53] Toolforge (Toolforge iteration 02), Patch-For-Review: [envvars-api] avoid invalidating go mod download cache on each code change - https://phabricator.wikimedia.org/T350193 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/129 envva...
[17:38:59] Toolforge (Toolforge iteration 02), Patch-For-Review: [envvars-api] Add statistics - https://phabricator.wikimedia.org/T346228 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/129 envvars-api: bump to 0.0.34-20231113143549-be1944fa
[17:39:35] Toolforge (Toolforge iteration 02), cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, and 3 others: [builds-api.start] Add statistics - https://phabricator.wikimedia.org/T337390 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repo...
[17:39:38] Toolforge (Toolforge iteration 02), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, User-dcaro: [builds-api] catch harbor timeout when creating repository - https://phabricator.wikimedia.org/T345903 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolfor...
[18:12:36] PAWS: Remove 123_8 cluster - https://phabricator.wikimedia.org/T350875 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/348
[18:12:43] vivian-rook opened https://github.com/toolforge/paws/pull/348
[18:17:58] vivian-rook closed https://github.com/toolforge/paws/pull/348
[18:18:37] PAWS: Remove 123_8 cluster - https://phabricator.wikimedia.org/T350875 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/paws/pull/348
[18:18:52] PAWS: Remove 123_8 cluster - https://phabricator.wikimedia.org/T350875 (rook) Open→Resolved
[18:27:07] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1032.eqiad.wmnet'
[18:27:08] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1033.eqiad.wmnet'
[18:27:48] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1032.eqiad.wmnet'
[18:27:49] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1033.eqiad.wmnet'
[18:29:21] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain
[18:30:01] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99)
[18:31:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[18:31:10] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain
[18:31:50] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99)
[18:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion -
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [18:33:24] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain [18:39:25] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [18:47:17] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain [18:53:56] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) [19:00:50] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1032.eqiad.wmnet with OS bookworm [19:01:11] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain (T345811) [19:01:16] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 [19:03:17] RECOVERY - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2001 is OK: OK: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:03:47] RECOVERY - Check unit status of backup_cinder_volumes on cloudbackup2001 is OK: OK: Status of the systemd unit backup_cinder_volumes https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:06:03] (TfInfraTestApplyFailed) resolved: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed 
[19:06:33] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[19:16:07] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1032.eqiad.wmnet with OS bookworm executed with errors: - cloudv...
[19:16:15] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99)
[19:16:48] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain
[19:17:23] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) (T345811)
[19:17:28] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[19:18:10] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0)
[19:20:02] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1033.eqiad.wmnet with OS bookworm
[19:20:53] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1032.eqiad.wmnet with OS bookworm
[19:35:50] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1033.eqiad.wmnet with OS bookworm executed with errors: - cloudv...
[19:35:59] (PuppetFailure) resolved: Puppet has failed on cloudcontrol2005-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:37:57] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1032.eqiad.wmnet with OS bookworm executed with errors: - cloudv...
[19:38:54] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[19:38:54] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1032.eqiad.wmnet with OS bookworm
[19:38:58] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1033.eqiad.wmnet with OS bookworm
[19:53:23] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1033.eqiad.wmnet with OS bookworm executed with errors: - cloudv...
[19:56:39] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1033.eqiad.wmnet with OS bookworm
[19:59:54] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1032.eqiad.wmnet with OS bookworm executed with errors: - cloudv...
[20:00:31] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1032.eqiad.wmnet with OS bookworm
[20:09:29] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1033.eqiad.wmnet with OS bookworm executed with errors: - cloudv...
[20:12:59] (PuppetFailure) firing: Puppet has failed on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:16:33] (CR) Eevans: [V: +2 C: +2] cassandra: password for mediawiki_services_mobileapps role [labs/private] - https://gerrit.wikimedia.org/r/971504 (https://phabricator.wikimedia.org/T348993) (owner: Eevans)
[20:18:54] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1033.eqiad.wmnet with OS bookworm
[20:35:55] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: `cloudvirt1025.eqiad.wmnet` - cloudvirt1025.eqiad.wmnet (**PASS**) - Dow...
[20:37:37] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[20:37:55] !log andrew@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99)
[20:40:23] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[20:40:50] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[20:41:17] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1032.eqiad.wmnet with OS bookworm completed: - cloudvirt1032 (**...
[20:43:16] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1034.eqiad.wmnet with OS bookworm
[20:44:07] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: `cloudvirt1026.eqiad.wmnet` - cloudvirt1026.eqiad.wmnet (**PASS**) - Dow...
[20:52:19] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: `cloudvirt1027.eqiad.wmnet` - cloudvirt1027.eqiad.wmnet (**PASS**) - Dow...
[20:57:15] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[20:57:46] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[20:59:22] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1033.eqiad.wmnet with OS bookworm completed: - cloudvirt1033 (**...
[21:00:02] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain
[21:00:14] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain (T345811)
[21:00:19] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[21:02:06] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: `cloudvirt1028.eqiad.wmnet` - cloudvirt1028.eqiad.wmnet (**PASS**) - Dow...
[21:10:38] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: `cloudvirt1029.eqiad.wmnet` - cloudvirt1029.eqiad.wmnet (**PASS**) - Dow...
[21:16:40] cloud-services-team, decommission-hardware, Patch-For-Review: decommission cloudvirt1025-cloudvirt1030.eqiad.wmnet - https://phabricator.wikimedia.org/T351010 (Andrew)
[21:17:22] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0)
[21:18:32] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1035.eqiad.wmnet with OS bookworm
[21:18:53] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: `cloudvirt1030.eqiad.wmnet` - cloudvirt1030.eqiad.wmnet (**PASS**) - Dow...
[21:21:18] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) (T345811)
[21:21:22] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[21:22:46] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[21:23:08] !log andrew@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99)
[21:24:08] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[21:24:30] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[21:25:40] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1034.eqiad.wmnet with OS bookworm completed: - cloudvirt1034 (**...
[21:30:12] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain (T345811)
[21:30:13] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain
[21:30:17] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[21:30:50] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1036.eqiad.wmnet with OS bookworm
[21:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[21:46:50] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) (T345811)
[21:46:56] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[21:54:07] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99)
[22:01:00] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:05:33] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[22:05:44] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1035.eqiad.wmnet with OS bookworm completed: - cloudvirt1035 (**...
[22:05:52] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[22:06:33] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[22:08:48] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain
[22:08:51] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain (T345811)
[22:08:56] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[22:09:31] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) (T345811)
[22:09:38] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99)
[22:10:57] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1037.eqiad.wmnet with OS bookworm
[22:11:14] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain
[22:11:26] (ProbeDown) firing: Service toolserver-proxy-01:443 has failed probes (http_toolserver_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#toolserver-proxy-01:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[22:11:59] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99)
[22:16:26] (ProbeDown) resolved: Service toolserver-proxy-01:443 has failed probes (http_toolserver_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#toolserver-proxy-01:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[22:17:43] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1036.eqiad.wmnet with OS bookworm completed: - cloudvirt1036 (**...
[22:23:59] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1038.eqiad.wmnet with OS bookworm
[22:24:40] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain (T345811)
[22:24:45] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[22:25:17] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) (T345811)
[22:59:49] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1037.eqiad.wmnet with OS bookworm completed: - cloudvirt1037 (**...
[23:10:29] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1038.eqiad.wmnet with OS bookworm completed: - cloudvirt1038 (**...
[23:38:53] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[23:53:53] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse