[00:18:00] Toolforge (Toolforge iteration 02): [tools,harbor] Cleanup old production images - https://phabricator.wikimedia.org/T348538 (Raymond_Ndibe)
[00:18:53] Toolforge (Toolforge iteration 02), Patch-For-Review, User-Raymond_Ndibe: move from single script to multi-script approach in maintain-harbor - https://phabricator.wikimedia.org/T350410 (Raymond_Ndibe) In progress→Resolved
[00:31:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[00:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[03:13:15] Cloud-VPS (Project-requests), WMF-Communications: Request creation of foundationmemory VPS project - https://phabricator.wikimedia.org/T350760 (Varnent) >>! In T350760#9316993, @Andrew wrote: > All set! Wonderful!! Thank you! That was fast. Greatly appreciate it. :)
[03:18:13] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcontrol2006-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[03:31:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[03:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[03:41:27] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[03:43:12] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[04:32:39] PROBLEM - Check unit status of backup_cinder_volumes on cloudbackup2002 is CRITICAL: CRITICAL: Status of the systemd unit backup_cinder_volumes https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:31:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[06:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[07:08:45] RECOVERY - Check unit status of backup_cinder_volumes on cloudbackup2002 is OK: OK: Status of the systemd unit backup_cinder_volumes https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:18:13] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcontrol2006-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[07:41:27] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[07:43:13] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[09:31:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[09:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[10:02:31] Toolforge (Toolforge iteration 02), cloud-services-team, Patch-For-Review: Re-visit Toolforge Kubernetes default quotas (April 2023) - https://phabricator.wikimedia.org/T333979 (CodeReviewBot) taavi merged https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/5 Make...
[10:37:49] Toolforge (Toolforge iteration 02), cloud-services-team, Patch-For-Review: Re-visit Toolforge Kubernetes default quotas (April 2023) - https://phabricator.wikimedia.org/T333979 (CodeReviewBot) taavi merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/126 maint...
[10:37:57] Toolforge (Toolforge iteration 02), cloud-services-team, Patch-For-Review: track and apply Toolforge quota changes via a Git repository - https://phabricator.wikimedia.org/T324558 (CodeReviewBot) taavi opened https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/6 Al...
[10:38:36] Toolforge (Toolforge iteration 02), cloud-services-team, Patch-For-Review: Re-visit Toolforge Kubernetes default quotas (April 2023) - https://phabricator.wikimedia.org/T333979 (CodeReviewBot) taavi opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/126 maint...
[10:39:05] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers
[10:39:19] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers
[10:50:56] Grid-Engine-to-K8s-Migration, MediaWiki-Engineering: Migrate ruprecht from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320021 (daniel) > The ruprecht tool is still running on the grid engine. If it's no longer used, please stop the grid web services and/or properly d...
[11:17:56] (ToolsToolsDBWritableState) firing: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[11:18:52] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcontrol2006-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[11:22:56] (ToolsToolsDBWritableState) resolved: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[11:30:30] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri) Two more OOM crashes in the last two days :/ ` Nov 08 11:08:55 tools-db-1 systemd[1]: mariadb.service: M...
[11:38:52] (PuppetConstantChange) resolved: Puppet performing a change on every puppet run on cloudcontrol2006-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[11:43:52] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[11:43:53] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[11:45:33] Tools, Privacy: enkore.toolforge.org violates Privacy Policy by loading third-party resources - https://phabricator.wikimedia.org/T348445 (Aklapper)
[11:45:40] Toolforge-standards-committee, Tools, Privacy Engineering, Privacy: Hunt for Toolforge tools that load resources from third party sites - https://phabricator.wikimedia.org/T172065 (Aklapper)
[11:48:45] Toolforge-standards-committee, Tools, Privacy Engineering, Privacy: Hunt for Toolforge tools that load resources from third party sites - https://phabricator.wikimedia.org/T172065 (Aklapper)
[11:49:06] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component maintain-kubeusers
[11:49:20] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component maintain-kubeusers
[11:58:10] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri) I have analyzed the slow query log file with [mariadb-dumpslow](https://mariadb.com/kb/en/mariadb-dumpslo...
[12:06:13] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri) Sorting by total time used, the same query is coming on top: ` Count: 408 Time=1889.12s (770760s) Lock...
[12:17:16] Toolforge, cloud-services-team: [toolsdb] Migrate mixnmatch db to Trove - https://phabricator.wikimedia.org/T350862 (fnegri)
[12:17:39] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri) I'm not sure those slow queries are causing the OOM errors, but it's one more reason to move `s51434__mix...
[12:19:38] Toolforge, cloud-services-team: [toolsdb] Migrate mixnmatch db to Trove - https://phabricator.wikimedia.org/T350862 (fnegri)
[12:22:28] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (fnegri)
[12:24:28] Data-Services, cloud-services-team: [toolsdb] Migrate mixnmatch db to Trove - https://phabricator.wikimedia.org/T350862 (fnegri)
[12:25:06] Data-Services, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri)
[12:25:26] Data-Services, Quarry, cloud-services-team (FY2023/2024-Q1): Create db user for Quarry with readonly access to public ToolsDB databases - https://phabricator.wikimedia.org/T348407 (fnegri)
[12:25:43] Data-Services, cloud-services-team (FY2023/2024-Q1), Goal: Trove for some ToolsDB users - https://phabricator.wikimedia.org/T291782 (fnegri)
[12:26:23] Cloud-VPS, Data-Services, cloud-services-team, User-dcaro: [wmcs-cookbooks] Write cookbook for restarting ToolsDB - https://phabricator.wikimedia.org/T328282 (fnegri)
[12:31:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[12:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[13:06:00] Grid-Engine-to-K8s-Migration, MediaWiki-Engineering: Migrate ruprecht from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320021 (taavi) You're looking at the wrong button. The correct one is this one at the bottom of the left sidebar: {F41477519}
[13:51:51] Toolforge (Toolforge iteration 02), Patch-For-Review, User-Raymond_Ndibe: add pre-commit to maintain-harbor - https://phabricator.wikimedia.org/T350452 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/17 [maintain-harbor] add pr...
[14:12:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on clouddb1013:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:12:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on clouddb1016:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:14:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on clouddb1014:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:18:41] Toolforge (Toolforge iteration 02), User-Raymond_Ndibe: add pre-commit to maintain-harbor - https://phabricator.wikimedia.org/T350452 (Raymond_Ndibe) In progress→Resolved
[14:19:21] Toolforge (Toolforge iteration 02), cloud-services-team, Patch-For-Review: track and apply Toolforge quota changes via a Git repository - https://phabricator.wikimedia.org/T324558 (taavi)
[14:19:51] Toolforge (Toolforge iteration 02), cloud-services-team: Automatically apply quota changes to existing tools - https://phabricator.wikimedia.org/T350873 (taavi)
[14:19:59] Toolforge (Toolforge iteration 02), cloud-services-team: Re-visit Toolforge Kubernetes default quotas (April 2023) - https://phabricator.wikimedia.org/T333979 (taavi) In progress→Resolved
[14:20:15] Toolforge (Toolforge iteration 02), cloud-services-team: Automatically apply quota changes to existing tools - https://phabricator.wikimedia.org/T350873 (taavi)
[14:20:31] Toolforge (Toolforge iteration 02), cloud-services-team: Re-visit Toolforge Kubernetes default quotas (April 2023) - https://phabricator.wikimedia.org/T333979 (taavi)
[14:24:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on clouddb1018:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:32:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on clouddb1017:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:33:45] PAWS: Cannot deploy ingress-nginx chart 4.8.0 - https://phabricator.wikimedia.org/T347506 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/paws/pull/346
[14:33:56] vivian-rook closed https://github.com/toolforge/paws/pull/346
[14:35:11] PAWS: Remove 123_8 cluster - https://phabricator.wikimedia.org/T350875 (rook)
[14:35:38] PAWS: Remove 123_8 cluster - https://phabricator.wikimedia.org/T350875 (rook)
[14:35:40] PAWS: Cannot deploy ingress-nginx chart 4.8.0 - https://phabricator.wikimedia.org/T347506 (rook)
[14:40:16] PAWS: Cannot deploy ingress-nginx chart 4.8.0 - https://phabricator.wikimedia.org/T347506 (rook) Open→Resolved a:rook
[14:40:22] PAWS: Remove 123_8 cluster - https://phabricator.wikimedia.org/T350875 (rook)
[14:50:17] !log fnegri@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain (T345811)
[14:50:23] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[14:50:57] !log fnegri@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) (T345811)
[14:52:36] !log admin fran@wmf3169 START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1025.eqiad.wmnet' (T345811)
[14:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[14:54:05] !log admin fran@wmf3169 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1025.eqiad.wmnet' (T345811)
[14:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[14:56:04] (PS1) FNegri: openstack.cloudvirt.drain: add runtime description [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/973175
[14:56:44] (CR) Majavah: [C: +1] openstack.cloudvirt.drain: add runtime description [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/973175 (owner: FNegri)
[15:10:04] Data-Services, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (Magnus) FWIW I just saw this ticket, and added an index to the `log` table which should speed things up. I ca...
[15:12:31] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bookworm
[15:16:30] Data-Services, cloud-services-team: [toolsdb] Migrate mixnmatch db to Trove - https://phabricator.wikimedia.org/T350862 (Magnus) I am open to this, as I already have a Trove instance for baglama2, and it works well (after some initial problems). The difference here is that for baglama2, I (that is, a sc...
[15:17:52] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.k8s.component.deploy for component jobs-api
[15:18:05] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component jobs-api
[15:20:04] Cloud-VPS, Content-Transform-Team, Parsoid, Parsoid-Read-Views: upgrade nodejs on parsing-qa-02 - https://phabricator.wikimedia.org/T349941 (MSantos) Removing #serviceops since it's in the Cloud VPS project.
[15:23:30] Cloud-VPS, Content-Transform-Team-WIP, Parsoid, Parsoid-Read-Views, Maintenance-Worktype: upgrade nodejs on parsing-qa-02 - https://phabricator.wikimedia.org/T349941 (MSantos) p:Triage→Low
[15:26:37] VPS-Projects, Content-Transform-Team-WIP, Parsoid, Parsoid-Read-Views, Maintenance-Worktype: upgrade nodejs on parsing-qa-02 - https://phabricator.wikimedia.org/T349941 (taavi)
[15:31:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[15:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[15:36:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on clouddb1020:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[15:43:53] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[15:43:53] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[15:59:03] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bookworm completed: - cloudvirt1025 (**...
[16:42:45] Tool-phab-ban: Link to audit log wiki page from UI - https://phabricator.wikimedia.org/T350890 (bd808)
[16:43:03] Tool-phab-ban: Link to audit log wiki page from UI - https://phabricator.wikimedia.org/T350890 (bd808) p:Triage→Medium
[16:44:21] Grid-Engine-to-K8s-Migration, MediaWiki-Engineering: Migrate ruprecht from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320021 (daniel) >>! In T320021#9319170, @taavi wrote: > You're looking at the wrong button. The correct one is this one at the bottom of the left si...
[16:44:52] Grid-Engine-to-K8s-Migration, MediaWiki-Engineering: Migrate ruprecht from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320021 (daniel) Open→Declined I disabled the tool.
[16:50:01] Tool-phab-ban: some log_on_wiki assumes h2 headings but also generates h3 headings - https://phabricator.wikimedia.org/T350891 (bd808)
[16:50:23] Tool-phab-ban: some log_on_wiki assumes h2 headings but also generates h3 headings - https://phabricator.wikimedia.org/T350891 (bd808) p:Triage→Medium
[17:03:29] PROBLEM - Check unit status of backup_vms on cloudbackup1004 is CRITICAL: CRITICAL: Status of the systemd unit backup_vms https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_unit_status_of_backup_vms
[17:08:33] (SystemdUnitDown) firing: The service unit backup_vms.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[17:32:51] PROBLEM - ensure kvm processes are running on cloudvirt1025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:34:10] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 Andrew Bogott upgrading to bookworm https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:43:53] (PuppetConstantChange) resolved: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[17:50:10] !log fnegri@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[17:50:12] !log fnegri@cloudcumin1001 cloudvirt-canary END (ERROR) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=97)
[17:50:54] !log fnegri@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (T345811)
[17:51:01] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[17:51:18] !log fnegri@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) (T345811)
[17:51:59] RECOVERY - ensure kvm processes are running on cloudvirt1025 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:52:50] (CR) FNegri: [C: +2] openstack.cloudvirt.drain: add runtime description [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/973175 (owner: FNegri)
[17:56:05] (Merged) jenkins-bot: openstack.cloudvirt.drain: add runtime description [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/973175 (owner: FNegri)
[18:04:37] (CephSlowOps) firing: Ceph cluster in eqiad has 112 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[18:04:42] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (phaultfinder)
[18:08:37] (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[18:21:41] Data-Services, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri) Thanks @Magnus, much appreciated. Fingers crossed this index will also help with the OOM errors, thou...
[18:24:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 13 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[18:25:32] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (fnegri) cloudvirt1025 had no VMs running (only the canary) because it's in the "maintenance" aggregate. Using the [for loop described here](https://wikitech.wikimedia....
[18:28:37] (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[18:31:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[18:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[18:36:06] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (fnegri) Using Cumin I checked the number of VMs currently running on all cloudvirts: ` fnegri@cloudcumin1001:~$ sudo cumin cloudvirt1* 'count=$(virsh list --uuid | gr...
[18:42:23] PROBLEM - ensure kvm processes are running on cloudvirt1059 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:42:33] PROBLEM - ensure kvm processes are running on cloudvirt1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:43:13] PROBLEM - ensure kvm processes are running on cloudvirt1060 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:43:47] PROBLEM - ensure kvm processes are running on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:48:13] PROBLEM - ensure kvm processes are running on cloudvirt1057 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:52:03] (PuppetAgentNoResources) firing: No Puppet resources found on instance tools-sgeweblight-10-16 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [18:56:23] PROBLEM - ensure kvm processes are running on cloudvirt1026 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:02:14] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudvirt1026.eqiad.wmnet with OS bookworm [19:03:21] RECOVERY - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2001 is OK: OK: Status of the systemd unit remove_dangling_cinder_snapshots 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:03:33] (SystemdUnitDownForLong) firing: The systemd unit backup_vms.service on node cloudbackup1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [19:04:15] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudvirt1027.eqiad.wmnet with OS bookworm [19:04:37] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudvirt1057.eqiad.wmnet with OS bookworm [19:04:50] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudvirt1051.eqiad.wmnet with OS bookworm [19:05:04] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudvirt1059.eqiad.wmnet with OS bookworm [19:05:08] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudvirt1060.eqiad.wmnet with OS bookworm [19:12:03] (PuppetAgentNoResources) resolved: No 
Puppet resources found on instance tools-sgeweblight-10-16 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [19:14:50] PROBLEM - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2001 is CRITICAL: CRITICAL: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:16:30] RECOVERY - Check unit status of backup_vms on cloudbackup1004 is OK: OK: Status of the systemd unit backup_vms https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_unit_status_of_backup_vms [19:18:33] (SystemdUnitDown) resolved: The service unit backup_vms.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:18:33] (SystemdUnitDownForLong) resolved: The systemd unit backup_vms.service on node cloudbackup1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [19:26:16] Data-Services, cloud-services-team: [toolsdb] Migrate mixnmatch db to Trove - https://phabricator.wikimedia.org/T350862 (fnegri) Migrating the data will definitely take some time, I explored a few options in {T328691} and eventually used `mydumper` to generate SQL backups and import them into Trove. In... 
[19:32:39] PROBLEM - Check for large files in client bucket on cloudvirt1060 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.149.12: Connection reset by peer https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [19:38:05] RECOVERY - Check for large files in client bucket on cloudvirt1060 is OK: OK: client bucket file ok https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [19:38:21] ACKNOWLEDGEMENT - Disk space on cloudvirt1060 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.149.12: Connection reset by peer Andrew Bogott mid-reimage https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cloudvirt1060&var-datasource=eqiad+prometheus/ops [19:39:15] PROBLEM - Host cloudvirt1051 is DOWN: PING CRITICAL - Packet loss = 100% [19:41:13] RECOVERY - Host cloudvirt1051 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [19:43:46] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudvirt1027.eqiad.wmnet with OS bookworm completed: - cloudvirt1027 (**... [19:43:53] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [19:44:45] PROBLEM - ensure kvm processes are running on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:44:47] PROBLEM - Host cloudvirt1060 is DOWN: PING CRITICAL - Packet loss = 100% [19:46:01] RECOVERY - Host cloudvirt1060 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [19:46:51] PROBLEM - ensure kvm processes are running on cloudvirt1060 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:48:56] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudvirt1051.eqiad.wmnet with OS bookworm completed: - cloudvirt1051 (**... [19:50:36] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudvirt1059.eqiad.wmnet with OS bookworm completed: - cloudvirt1059 (**... [19:50:51] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudvirt1026.eqiad.wmnet with OS bookworm completed: - cloudvirt1026 (**... 
[19:52:51] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudvirt1057.eqiad.wmnet with OS bookworm completed: - cloudvirt1057 (**... [19:54:52] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudvirt1060.eqiad.wmnet with OS bookworm completed: - cloudvirt1060 (**... [19:55:46] !log fnegri@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary [20:11:33] RECOVERY - ensure kvm processes are running on cloudvirt1051 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:11:59] RECOVERY - ensure kvm processes are running on cloudvirt1060 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:12:05] !log fnegri@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) [20:15:58] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (fnegri) Reimaged so far: `cloudvirt[1025-1027,1051,1057,1059-1060].eqiad.wmnet` [20:42:03] (TfInfraTestDestroyFailed) resolved: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [21:10:02] wikitech.wikimedia.org, MediaWiki-Blocks, MediaWiki-extensions-OAuth, Patch-For-Review, Wikimedia-production-error: OAuth login to wikitech fails 
when running MediaWiki 1.42.0-wmf.4 - https://phabricator.wikimedia.org/T350836 (Krinkle) From the incident investigation yesterday in `#wikimedia-... [21:10:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [21:24:23] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [21:31:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [21:33:34] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [21:36:03] PROBLEM - nova-compute proc minimum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:36:19] (HAProxyBackendUnavailable) firing: (4) HAProxy service keystone-admin-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [21:40:15] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [21:41:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [21:41:19] (HAProxyBackendUnavailable) resolved: (7) HAProxy service glance-api_backend backend 
cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [21:45:09] VPS-project-Wikistats, collaboration-services, User-RhinosF1: Add 'wikitide' to wikistats - https://phabricator.wikimedia.org/T349660 (Dzahn) @Reception123 How can we get a list of all the wikitide wikis? Is there an API we can ask for it? [21:46:03] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [21:53:15] VPS-project-Codesearch, VPS-project-Extdist, Gerrit, collaboration-services: Move clients off of gerrit-replica.wikimedia.org back to gerrit.wikimedia.org - https://phabricator.wikimedia.org/T336710 (Dzahn) Open→Stalled stalled since we have no consensus which direction to go if both ger... [21:55:01] RECOVERY - nova-compute proc minimum on cloudvirt1028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:55:12] VPS-project-Wikistats, collaboration-services, User-RhinosF1: Add 'wikitide' to wikistats - https://phabricator.wikimedia.org/T349660 (RhinosF1) >>! In T349660#9321120, @Dzahn wrote: > @Reception123 How can we get a list of all the wikitide wikis? Is there an API we can ask for it? Wikidiscover like... 
[21:58:20] VPS-project-Wikistats, collaboration-services, User-RhinosF1: Add 'wikitide' to wikistats - https://phabricator.wikimedia.org/T349660 (Dzahn) sounds good:) thanks RhinosF1 [22:00:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [22:05:03] (TfInfraTestDestroyFailed) resolved: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [22:10:03] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [22:15:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [22:16:03] (TfInfraTestApplyFailed) resolved: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [22:25:03] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [22:38:12] Toolforge (Quota-requests): Request increased quota for anchor-corrector Toolforge tool - https://phabricator.wikimedia.org/T350484 (Kanashimi) Thank you! I changed the settings so it now looks like I need to increase the number of continuous jobs. ` # toolforge-jobs load ~/wikibot/wikitech/toolforge-jobs-... 
[22:46:00] Tool-phab-ban: log_on_wiki assumes h2 headings but also generates h3 headings - https://phabricator.wikimedia.org/T350891 (bd808) [22:53:39] Tool-phab-ban, User-bd808: log_on_wiki assumes h2 headings but also generates h3 headings - https://phabricator.wikimedia.org/T350891 (bd808) Open→In progress a:bd808 [23:02:37] Tool-phab-ban, User-bd808: Link to audit log wiki page from UI - https://phabricator.wikimedia.org/T350890 (bd808) Open→In progress a:bd808 [23:02:45] wikitech.wikimedia.org, MediaWiki-Blocks, MediaWiki-extensions-OAuth, Patch-For-Review, Wikimedia-production-error: OAuth login to wikitech fails when running MediaWiki 1.42.0-wmf.4 - https://phabricator.wikimedia.org/T350836 (tstarling) I want to reproduce this and test the fix, but I think... [23:03:06] wikitech.wikimedia.org, MediaWiki-Blocks, MediaWiki-extensions-OAuth, Patch-For-Review, Wikimedia-production-error: OAuth login to wikitech fails when running MediaWiki 1.42.0-wmf.4 - https://phabricator.wikimedia.org/T350836 (tstarling) a:tstarling [23:06:15] Tool-phab-ban, Patch-For-Review, User-bd808: log_on_wiki assumes h2 headings but also generates h3 headings - https://phabricator.wikimedia.org/T350891 (CodeReviewBot) bd808 opened https://gitlab.wikimedia.org/toolforge-repos/phab-ban/-/merge_requests/6 Update audit log things [23:06:46] Tool-phab-ban, Patch-For-Review, User-bd808: Link to audit log wiki page from UI - https://phabricator.wikimedia.org/T350890 (CodeReviewBot) bd808 opened https://gitlab.wikimedia.org/toolforge-repos/phab-ban/-/merge_requests/6 Update audit log things [23:08:34] Tool-phab-ban, Patch-For-Review, User-bd808: log_on_wiki assumes h2 headings but also generates h3 headings - https://phabricator.wikimedia.org/T350891 (CodeReviewBot) bd808 merged https://gitlab.wikimedia.org/toolforge-repos/phab-ban/-/merge_requests/6 Update audit log things [23:08:41] Tool-phab-ban, 
Patch-For-Review, User-bd808: Link to audit log wiki page from UI - https://phabricator.wikimedia.org/T350890 (CodeReviewBot) bd808 merged https://gitlab.wikimedia.org/toolforge-repos/phab-ban/-/merge_requests/6 Update audit log things [23:11:00] Tool-phab-ban, User-bd808: Link to audit log wiki page from UI - https://phabricator.wikimedia.org/T350890 (bd808) In progress→Resolved [23:28:44] Tool-phab-ban, User-bd808: log_on_wiki assumes h2 headings but also generates h3 headings - https://phabricator.wikimedia.org/T350891 (bd808) In progress→Resolved [23:35:19] (CR) BryanDavis: [C: +2] Remove extra deduplication from bot.do_phabecho [labs/tools/stashbot] - https://gerrit.wikimedia.org/r/961432 (https://phabricator.wikimedia.org/T340675) (owner: BryanDavis) [23:35:59] (Merged) jenkins-bot: Remove extra deduplication from bot.do_phabecho [labs/tools/stashbot] - https://gerrit.wikimedia.org/r/961432 (https://phabricator.wikimedia.org/T340675) (owner: BryanDavis)