[00:00:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [00:34:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [00:56:48] (SystemdUnitDownForLong) firing: The systemd unit purge_vm_backup.service on node cloudbackup1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [01:09:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [01:10:03] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [01:15:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [01:41:52] 10Toolforge (Toolforge iteration 02), 10User-Raymond_Ndibe: add pre-commit to maintain-harbor - https://phabricator.wikimedia.org/T350452 (10Raymond_Ndibe) [01:43:01] 10Toolforge (Toolforge iteration 02), 10User-Raymond_Ndibe: move from single script to multi-script approach in maintain-harbor - https://phabricator.wikimedia.org/T350410 (10CodeReviewBot) raymond-ndibe updated https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/14 [maintain-h... 
[01:43:40] Toolforge (Toolforge iteration 02): [tools,harbor] Cleanup old production images - https://phabricator.wikimedia.org/T348538 (Raymond_Ndibe) [01:44:05] Toolforge (Toolforge iteration 02), User-Raymond_Ndibe: move from single script to multi-script approach in maintain-harbor - https://phabricator.wikimedia.org/T350410 (Raymond_Ndibe) Open→In progress [01:44:17] Toolforge (Toolforge iteration 02), User-Raymond_Ndibe: move from single script to multi-script approach in maintain-harbor - https://phabricator.wikimedia.org/T350410 (CodeReviewBot) raymond-ndibe closed https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/14 [maintain-ha... [01:45:07] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node (348643) [01:45:47] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) (348643) [01:45:49] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Andrew) [01:45:58] Toolforge (Toolforge iteration 02), Patch-For-Review, User-Raymond_Ndibe: move from single script to multi-script approach in maintain-harbor - https://phabricator.wikimedia.org/T350410 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_req... 
[01:46:05] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node (348643) [01:46:12] 10Toolforge (Toolforge iteration 02), 10User-Raymond_Ndibe: add pre-commit to maintain-harbor - https://phabricator.wikimedia.org/T350452 (10Raymond_Ndibe) 05Open→03In progress [01:46:21] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) (348643) [01:46:53] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643) [01:58:27] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [02:28:27] (OpenstackAPIResponse) resolved: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [02:35:03] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [02:40:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [03:01:48] (SystemdUnitDown) firing: The service unit purge_vm_backup.service is in failed status on host cloudbackup1004. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [03:23:14] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [03:34:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [03:44:28] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [03:50:03] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [03:56:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [04:09:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [04:41:03] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [04:47:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [04:56:48] 
(SystemdUnitDownForLong) firing: The systemd unit purge_vm_backup.service on node cloudbackup1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [04:59:16] 10PAWS: PAWS shell - lack of i18n submodule or files or an outdated submodule - https://phabricator.wikimedia.org/T343676 (10Info-farmer) In my Debian 10 OS KDE, i installed pywikibot. I faced this issue. When i updated the i18n subfolder ('core_stable/scripts/i18n'). Later, i never faced this issue. Thanks inde... [05:38:56] (ToolsToolsDBWritableState) firing: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState [05:48:56] (ToolsToolsDBWritableState) resolved: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState [05:52:58] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (10fnegri) Another occurrence today: ` Nov 03 05:34:07 tools-db-1 systemd[1]: mariadb.service: A process of this un... 
[06:27:03] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [06:34:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [06:35:10] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643) [06:36:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [07:01:48] (SystemdUnitDown) firing: The service unit purge_vm_backup.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [07:09:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [07:23:14] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [07:36:03] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [07:43:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [07:44:21] RECOVERY - Check unit status of purge_vm_backup on 
cloudbackup1004 is OK: OK: Status of the systemd unit purge_vm_backup https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:44:42] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [07:46:34] (SystemdUnitDown) resolved: The service unit purge_vm_backup.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [07:46:34] (SystemdUnitDownForLong) resolved: The systemd unit purge_vm_backup.service on node cloudbackup1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [08:06:16] 10Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (10rook) Looks like it happened again: ` [2023-11-02 17:02:06 +0000] [11] [INFO] Booting worker with pid: 11 [2023-11-02 23:03:43 +0000] [11] [ERROR] Error handling request / Traceback (most recent call last): File "/usr/loc... [08:15:29] 10Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (10rook) Perhaps gunicorn maintains connections differently than uwsgi? 
[08:23:33] 10PAWS: jupyterlab to 4.0.8 - https://phabricator.wikimedia.org/T350459 (10rook) [08:24:48] 10PAWS: jupyterlab to 4.0.8 - https://phabricator.wikimedia.org/T350459 (10github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/344 [08:24:58] vivian-rook opened https://github.com/toolforge/paws/pull/344 [08:38:03] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [08:47:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [09:23:12] 10Grid-Engine-to-K8s-Migration, 10MediaWiki-Engineering: Migrate ruprecht from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320021 (10taavi) The tool's web service seems to be just serving some static HTML and other files, so it should be relatively simple to migrate that t... [09:27:09] 10PAWS: jupyterlab to 4.0.8 - https://phabricator.wikimedia.org/T350459 (10github-toolforge-bot) vivian-rook closed https://github.com/toolforge/paws/pull/344 [09:27:17] vivian-rook closed https://github.com/toolforge/paws/pull/344 [09:34:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [10:09:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [10:14:28] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [10:27:03] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [10:32:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [10:32:37] (CephSlowOps) firing: Ceph cluster in eqiad has 4 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [10:32:38] 10Data-Services, 10DBA, 10Data Engineering and Event Platform Team, 10Data-Platform-SRE: Prepare and check storage layer for fonwiki - https://phabricator.wikimedia.org/T347938 (10Gehel) p:05Triage→03High [10:32:44] 10cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (10phaultfinder) [10:37:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 5 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [10:46:01] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10fnegri) [10:47:18] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Network tests are failing in eqiad - https://phabricator.wikimedia.org/T350466 (10fnegri) [10:51:30] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Network tests are failing in eqiad - 
https://phabricator.wikimedia.org/T350466 (10fnegri) I can run the two failing commands just fine, so maybe the issue is in the SSH connection from cloudcumin to the test hosts? The following commands work a... [10:52:54] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Network tests are failing in eqiad - https://phabricator.wikimedia.org/T350466 (10fnegri) Or rather, in the SSH connection from cloudcontrol1005 to the test hosts, as the test commands are executed on cloudcontrol1005. [10:55:59] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Network tests are failing in eqiad - https://phabricator.wikimedia.org/T350466 (10fnegri) Hmm, I can run the commands successfully from cloudcontrol1005: ` root@cloudcontrol1005:~# /usr/bin/ssh -i /etc/networktests/sshkeyfile -o User=srv-network... [11:01:23] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Network tests are failing in eqiad - https://phabricator.wikimedia.org/T350466 (10fnegri) And now I just did 5 consecutive runs of `sudo cookbook wmcs.openstack.network.tests --cluster-name eqiad1` with no errors at all. 
[11:04:06] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Network tests are failing in eqiad - https://phabricator.wikimedia.org/T350466 (fnegri) Open→Invalid [11:04:10] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (fnegri) [11:08:52] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bookworm [11:23:14] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [11:27:03] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [11:34:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [11:54:53] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bookworm completed: - cloudnet1006 (**PA... [12:00:57] Grid-Engine-to-K8s-Migration: Migrate erinnermich from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319728 (Tkarcher) Open→Resolved Migration completed successfully: Erinnermich is running on Kubernetes now. 
[12:04:03] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [12:13:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [12:34:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [12:48:25] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Host rebooted by fnegri@cumin1001 with reason: Rebooting to test if network bridges come up as expected [12:58:03] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [13:09:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [13:50:25] 10PAWS: jupyterlab to 4.0.8 - https://phabricator.wikimedia.org/T350459 (10rook) 05Open→03Resolved [13:55:46] 10Toolforge (Quota-requests): Request increased quota for anchor-corrector Toolforge tool - https://phabricator.wikimedia.org/T350484 (10JJMC89) [14:14:29] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [14:15:06] 10Toolforge (Quota-requests): Request increased quota for anchor-corrector Toolforge tool - https://phabricator.wikimedia.org/T350484 (10taavi) a:03taavi Hi! The jobs `anchor-corrector` is currently running on the grid engine are cron jobs, not continuous jobs. Are you planning to change from cron jobs to cont... [14:15:14] 10Grid-Engine-to-K8s-Migration: Migrate anchor-corrector from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319555 (10taavi) [14:15:17] 10Toolforge (Quota-requests): Request increased quota for anchor-corrector Toolforge tool - https://phabricator.wikimedia.org/T350484 (10taavi) [14:19:25] 10Toolforge, 10Fix-Suggester-Bot: File system access is very slow - https://phabricator.wikimedia.org/T350432 (10taavi) NFS is known to be slow, yes. `fixsuggesterbot` will likely get much faster if it uses `/tmp` for the temporary working clones instead of something under `/data/project`. [14:37:20] PROBLEM - Host cloudcephosd1031 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:15] 10Toolforge, 10Fix-Suggester-Bot: File system access is very slow - https://phabricator.wikimedia.org/T350432 (10kostajh) 05Open→03Resolved a:03kostajh >>! In T350432#9305015, @taavi wrote: > NFS is known to be slow, yes. `fixsuggesterbot` will likely get much faster if it uses `/tmp` for the temporary w... 
[14:44:04] RECOVERY - Host cloudcephosd1031 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [14:44:57] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1), 10DC-Ops, 10SRE, 10ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Jclark-ctr) [14:49:01] 10Toolforge, 10cloud-services-team: toolforge prometheus servers OOMing - https://phabricator.wikimedia.org/T350227 (10taavi) 05Open→03Resolved [14:49:08] 10Toolforge (Toolforge iteration 02), 10cloud-services-team, 10Kubernetes, 10Patch-For-Review: Upgrade cadvisor - https://phabricator.wikimedia.org/T349795 (10taavi) [15:06:59] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643) [15:23:14] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [15:34:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [15:47:40] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (10fnegri) Turns out that value is already set in `/lib/systemd/system/mariadb.service` and I verified it is applied... 
[16:09:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [16:16:29] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643) [16:19:29] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [16:19:40] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (10fnegri) I asked for help in `#wikimedia-data-persistence` and @Marostegui had some tips: ` [15:55:46] ... [16:23:13] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (10fnegri) I decreased the buffer pool size from 31G to 20G, and enabled slow query logging for queries longer than... [16:26:54] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (10fnegri) Slow queries are being logged to `/srv/labsdb/data/tools-db-1-slow.log` and I verified the logging works... 
[16:28:09] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643) [16:28:28] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643) [16:29:29] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node (348643) [16:30:07] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1), 10DC-Ops, 10SRE, 10ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Andrew) [16:30:09] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) (348643) [17:39:36] 10VPS-project-Wikistats: wikia was renamed to fandom - https://phabricator.wikimedia.org/T221537 (10Dzahn) I saw on another task that Wikiapiary has been readonly for a couple months now but also that there is still work planned to get it back. [17:42:33] 10VPS-project-Wikistats: wikistats does not work for wikia sites - https://phabricator.wikimedia.org/T215534 (10Dzahn) wow, that's an insane percentage of broken ones. thank you for dropping them! I hope this also fixes the "slow to load table" issue and we can make updates work again.. and properly rename it t... [18:34:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [18:43:45] (03PS1) 10Eevans: cassandra: password for mediawiki_services_mobileapps role [labs/private] - 10https://gerrit.wikimedia.org/r/971504 (https://phabricator.wikimedia.org/T348993) [18:51:28] 10cloud-services-team: Need baremetal system(s) with internet access - https://phabricator.wikimedia.org/T349003 (10rook) Putting: ` vars: proxy_env: http_proxy: "http://webproxy:8080" https_proxy: "http://webproxy:8080" ` at the top of a role gets it to work. Though so far kolla seems to ignore more g... 
[18:57:55] 10VPS-project-Wikistats, 10Code-Health-Help-Wanted, 10Performance Issue: wikistats does not work for wikia sites - https://phabricator.wikimedia.org/T215534 (10RhinosF1) So I worked out that the page is causing the browser to OOM, I moved it to generating on machine, apache OOMs. It's 1.1 million lines of H... [19:09:03] 10VPS-project-Wikistats, 10Code-Health-Help-Wanted, 10Performance Issue: wikistats does not work for wikia sites - https://phabricator.wikimedia.org/T215534 (10RhinosF1) @Xqt: do you have a way to get an up to date list of fandom wikis? We simply can't display all but if we had some then we could show somet... [19:09:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [19:14:29] (OpenstackAPIResponse) resolved: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [19:28:00] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [19:51:03] 10VPS-project-Wikistats, 10Code-Health-Help-Wanted, 10Performance Issue: wikistats does not work for wikia sites - https://phabricator.wikimedia.org/T215534 (10RhinosF1) So i replaced wikia.com with fandom.com, updates running. I haven't merged the commit on gitlab yet. 
[19:51:52] VPS-project-Wikistats: wikia was renamed to fandom - https://phabricator.wikimedia.org/T221537 (CodeReviewBot) rhinosf1 updated https://gitlab.wikimedia.org/cloudvps-repos/wikistats/-/merge_requests/5 Fix wikia [19:52:08] VPS-project-Wikistats, Patch-For-Review, User-RhinosF1: wikia was renamed to fandom - https://phabricator.wikimedia.org/T221537 (RhinosF1) Open→In progress a:RhinosF1 https://gitlab.wikimedia.org/cloudvps-repos/wikistats/-/merge_requests/5 [19:52:15] VPS-project-Wikistats, Code-Health-Help-Wanted, Performance Issue: wikistats does not work for wikia sites - https://phabricator.wikimedia.org/T215534 (RhinosF1) [19:53:05] VPS-project-Wikistats, Code-Health-Help-Wanted, Performance Issue: wikistats does not work for wikia sites - https://phabricator.wikimedia.org/T215534 (RhinosF1) I will delete all broken wikis again once we update to using fandom.com, then we'll have to look at how to limit results returned to top X... [20:04:01] VPS-project-Wikistats, Code-Health-Help-Wanted, Performance Issue: wikistats does not work for wikia sites - https://phabricator.wikimedia.org/T215534 (RhinosF1) p:Low→Medium [20:04:12] VPS-project-Wikistats, Patch-For-Review, User-RhinosF1: wikia was renamed to fandom - https://phabricator.wikimedia.org/T221537 (RhinosF1) p:Triage→Medium [20:55:42] VPS-project-Wikistats, Patch-For-Review, User-RhinosF1: wikia was renamed to fandom - https://phabricator.wikimedia.org/T221537 (CodeReviewBot) dzahn merged https://gitlab.wikimedia.org/cloudvps-repos/wikistats/-/merge_requests/5 Fix wikia [21:11:04] VPS-project-Wikistats, Code-Health-Help-Wanted, Performance Issue: wikistats does not work for wikia sites - https://phabricator.wikimedia.org/T215534 (RhinosF1) >>! In T215534#9305772, @RhinosF1 wrote: > I will delete all broken wikis again once we update to using fandom.com, then we'll have to look... 
[21:34:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [21:36:32] 10VPS-project-Wikistats: Warcraft Wiki - https://phabricator.wikimedia.org/T350246 (10Dzahn) ` MariaDB [(none)]> insert into mediawikis (statsurl, method) values ("https://warcraft.wiki.gg/api.php","8"); ` ` /usr/lib/wikistats/update.php mw new .. ---> update mediawikis set total="563149",good="275400",edits=... [21:36:43] 10VPS-project-Wikistats: Warcraft Wiki - https://phabricator.wikimedia.org/T350246 (10Dzahn) 05Open→03Resolved [22:09:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [22:29:58] 10Toolforge (Quota-requests): Request increased quota for anchor-corrector Toolforge tool - https://phabricator.wikimedia.org/T350484 (10Kanashimi) This is my current configuration, maybe you can see if it fits? ` # toolforge-jobs load toolforge-jobs-cewbot.yml # 修正失效的章節標題 Fix broken anchor - name: k8s-tools.... [22:30:46] 10Tools: No space left on device with CropTool - https://phabricator.wikimedia.org/T350475 (10Peachey88) 05Open→03Invalid CropTool manages its tasks on GitHub issues @ https://github.com/danmichaelo/croptool/issues COuld you please refile the task there? 
[22:56:44] Tools: No space left on device with CropTool - https://phabricator.wikimedia.org/T350475 (Yann) Done https://github.com/danmichaelo/croptool/issues/185 [23:28:00] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange