[00:00:03] <wmcs-alerts>	 (InstanceDown) firing: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:34:03] <wmcs-alerts>	 (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[00:56:48] <jinxer-wm>	 (SystemdUnitDownForLong) firing: The systemd unit purge_vm_backup.service on node cloudbackup1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[01:09:24] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[01:10:03] <wmcs-alerts>	 (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[01:15:03] <wmcs-alerts>	 (InstanceDown) firing: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[01:41:52] <wikibugs>	 Toolforge (Toolforge iteration 02), User-Raymond_Ndibe: add pre-commit to maintain-harbor - https://phabricator.wikimedia.org/T350452 (Raymond_Ndibe)
[01:43:01] <wikibugs>	 Toolforge (Toolforge iteration 02), User-Raymond_Ndibe: move from single script to multi-script approach in maintain-harbor - https://phabricator.wikimedia.org/T350410 (CodeReviewBot) raymond-ndibe updated https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/14  [maintain-h...
[01:43:40] <wikibugs>	 Toolforge (Toolforge iteration 02): [tools,harbor] Cleanup old production images - https://phabricator.wikimedia.org/T348538 (Raymond_Ndibe)
[01:44:05] <wikibugs>	 Toolforge (Toolforge iteration 02), User-Raymond_Ndibe: move from single script to multi-script approach in maintain-harbor - https://phabricator.wikimedia.org/T350410 (Raymond_Ndibe) Open→In progress
[01:44:17] <wikibugs>	 Toolforge (Toolforge iteration 02), User-Raymond_Ndibe: move from single script to multi-script approach in maintain-harbor - https://phabricator.wikimedia.org/T350410 (CodeReviewBot) raymond-ndibe closed https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/14  [maintain-ha...
[01:45:07] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node (348643)
[01:45:47] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) (348643)
[01:45:49] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Andrew)
[01:45:58] <wikibugs>	 Toolforge (Toolforge iteration 02), Patch-For-Review, User-Raymond_Ndibe: move from single script to multi-script approach in maintain-harbor - https://phabricator.wikimedia.org/T350410 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_req...
[01:46:05] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node (348643)
[01:46:12] <wikibugs>	 Toolforge (Toolforge iteration 02), User-Raymond_Ndibe: add pre-commit to maintain-harbor - https://phabricator.wikimedia.org/T350452 (Raymond_Ndibe) Open→In progress
[01:46:21] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) (348643)
[01:46:53] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[01:58:27] <jinxer-wm>	 (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[02:28:27] <jinxer-wm>	 (OpenstackAPIResponse) resolved: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[02:35:03] <wmcs-alerts>	 (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[02:40:03] <wmcs-alerts>	 (InstanceDown) firing: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:01:48] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit purge_vm_backup.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[03:23:14] <jinxer-wm>	 (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[03:34:03] <wmcs-alerts>	 (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[03:44:28] <jinxer-wm>	 (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[03:50:03] <wmcs-alerts>	 (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:56:03] <wmcs-alerts>	 (InstanceDown) firing: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[04:09:24] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[04:41:03] <wmcs-alerts>	 (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[04:47:03] <wmcs-alerts>	 (InstanceDown) firing: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[04:56:48] <jinxer-wm>	 (SystemdUnitDownForLong) firing: The systemd unit purge_vm_backup.service on node cloudbackup1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[04:59:16] <wikibugs>	 PAWS: PAWS shell - lack of i18n submodule or files or an outdated submodule - https://phabricator.wikimedia.org/T343676 (Info-farmer) In my Debian 10 OS KDE, I installed pywikibot. I faced this issue. When I updated the i18n subfolder ('core_stable/scripts/i18n'). Later, I never faced this issue. Thanks inde...
[05:38:56] <wmcs-alerts>	 (ToolsToolsDBWritableState) firing: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[05:48:56] <wmcs-alerts>	 (ToolsToolsDBWritableState) resolved: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[05:52:58] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri) Another occurrence today:  ` Nov 03 05:34:07 tools-db-1 systemd[1]: mariadb.service: A process of this un...
[06:27:03] <wmcs-alerts>	 (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[06:34:03] <wmcs-alerts>	 (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[06:35:10] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643)
[06:36:03] <wmcs-alerts>	 (InstanceDown) firing: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[07:01:48] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit purge_vm_backup.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[07:09:24] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[07:23:14] <jinxer-wm>	 (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[07:36:03] <wmcs-alerts>	 (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[07:43:03] <wmcs-alerts>	 (InstanceDown) firing: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[07:44:21] <icinga-wm>	 RECOVERY - Check unit status of purge_vm_backup on cloudbackup1004 is OK: OK: Status of the systemd unit purge_vm_backup https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:44:42] <jinxer-wm>	 (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[07:46:34] <jinxer-wm>	 (SystemdUnitDown) resolved: The service unit purge_vm_backup.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[07:46:34] <jinxer-wm>	 (SystemdUnitDownForLong) resolved: The systemd unit purge_vm_backup.service on node cloudbackup1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[08:06:16] <wikibugs>	 Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (rook) Looks like it happened again: ` [2023-11-02 17:02:06 +0000] [11] [INFO] Booting worker with pid: 11 [2023-11-02 23:03:43 +0000] [11] [ERROR] Error handling request / Traceback (most recent call last):   File "/usr/loc...
[08:15:29] <wikibugs>	 Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (rook) Perhaps gunicorn maintains connections differently than uwsgi?
[08:23:33] <wikibugs>	 PAWS: jupyterlab to 4.0.8 - https://phabricator.wikimedia.org/T350459 (rook)
[08:24:48] <wikibugs>	 PAWS: jupyterlab to 4.0.8 - https://phabricator.wikimedia.org/T350459 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/344
[08:24:58] <notefromgithub>	 vivian-rook opened https://github.com/toolforge/paws/pull/344
[08:38:03] <wmcs-alerts>	 (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[08:47:03] <wmcs-alerts>	 (InstanceDown) firing: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:23:12] <wikibugs>	 Grid-Engine-to-K8s-Migration, MediaWiki-Engineering: Migrate ruprecht from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320021 (taavi) The tool's web service seems to be just serving some static HTML and other files, so it should be relatively simple to migrate that t...
[09:27:09] <wikibugs>	 PAWS: jupyterlab to 4.0.8 - https://phabricator.wikimedia.org/T350459 (github-toolforge-bot) vivian-rook closed https://github.com/toolforge/paws/pull/344
[09:27:17] <notefromgithub>	 vivian-rook closed https://github.com/toolforge/paws/pull/344
[09:34:03] <wmcs-alerts>	 (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[10:09:24] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[10:14:28] <jinxer-wm>	 (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[10:27:03] <wmcs-alerts>	 (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:32:03] <wmcs-alerts>	 (InstanceDown) firing: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:32:37] <jinxer-wm>	 (CephSlowOps) firing: Ceph cluster in eqiad has 4 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[10:32:38] <wikibugs>	 Data-Services, DBA, Data Engineering and Event Platform Team, Data-Platform-SRE: Prepare and check storage layer for fonwiki - https://phabricator.wikimedia.org/T347938 (Gehel) p:Triage→High
[10:32:44] <wikibugs>	 cloud-services-team: CephSlowOps  Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (phaultfinder)
[10:37:37] <jinxer-wm>	 (CephSlowOps) resolved: Ceph cluster in eqiad has 5 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[10:46:01] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (fnegri)
[10:47:18] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Network tests are failing in eqiad - https://phabricator.wikimedia.org/T350466 (fnegri)
[10:51:30] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Network tests are failing in eqiad - https://phabricator.wikimedia.org/T350466 (fnegri) I can run the two failing commands just fine, so maybe the issue is in the SSH connection from cloudcumin to the test hosts?   The following commands work a...
[10:52:54] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Network tests are failing in eqiad - https://phabricator.wikimedia.org/T350466 (fnegri) Or rather, in the SSH connection from cloudcontrol1005 to the test hosts, as the test commands are executed on cloudcontrol1005.
[10:55:59] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Network tests are failing in eqiad - https://phabricator.wikimedia.org/T350466 (fnegri) Hmm, I can run the commands successfully from cloudcontrol1005:  ` root@cloudcontrol1005:~# /usr/bin/ssh -i /etc/networktests/sshkeyfile -o User=srv-network...
[11:01:23] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Network tests are failing in eqiad - https://phabricator.wikimedia.org/T350466 (fnegri) And now I just did 5 consecutive runs of `sudo cookbook wmcs.openstack.network.tests --cluster-name eqiad1` with no errors at all.
[11:04:06] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Network tests are failing in eqiad - https://phabricator.wikimedia.org/T350466 (fnegri) Open→Invalid
[11:04:10] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (fnegri)
[11:08:52] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bookworm
[11:23:14] <jinxer-wm>	 (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[11:27:03] <wmcs-alerts>	 (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:34:03] <wmcs-alerts>	 (InstanceDown) firing: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:54:53] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bookworm completed: - cloudnet1006 (**PA...
[12:00:57] <wikibugs>	 Grid-Engine-to-K8s-Migration: Migrate erinnermich from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319728 (Tkarcher) Open→Resolved Migration completed successfully: Erinnermich is running on Kubernetes now.
[12:04:03] <wmcs-alerts>	 (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:13:03] <wmcs-alerts>	 (InstanceDown) firing: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:34:03] <wmcs-alerts>	 (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[12:48:25] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Host rebooted by fnegri@cumin1001 with reason: Rebooting to test if network bridges come up as expected
[12:58:03] <wmcs-alerts>	 (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:09:24] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[13:50:25] <wikibugs>	 PAWS: jupyterlab to 4.0.8 - https://phabricator.wikimedia.org/T350459 (rook) Open→Resolved
[13:55:46] <wikibugs>	 Toolforge (Quota-requests): Request increased quota for anchor-corrector Toolforge tool - https://phabricator.wikimedia.org/T350484 (JJMC89)
[14:14:29] <jinxer-wm>	 (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[14:15:06] <wikibugs>	 Toolforge (Quota-requests): Request increased quota for anchor-corrector Toolforge tool - https://phabricator.wikimedia.org/T350484 (taavi) a:taavi Hi! The jobs `anchor-corrector` is currently running on the grid engine are cron jobs, not continuous jobs. Are you planning to change from cron jobs to cont...
[14:15:14] <wikibugs>	 Grid-Engine-to-K8s-Migration: Migrate anchor-corrector from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319555 (taavi)
[14:15:17] <wikibugs>	 Toolforge (Quota-requests): Request increased quota for anchor-corrector Toolforge tool - https://phabricator.wikimedia.org/T350484 (taavi)
[14:19:25] <wikibugs>	 Toolforge, Fix-Suggester-Bot: File system access is very slow - https://phabricator.wikimedia.org/T350432 (taavi) NFS is known to be slow, yes. `fixsuggesterbot` will likely get much faster if it uses `/tmp` for the temporary working clones instead of something under `/data/project`.
[14:37:20] <icinga-wm>	 PROBLEM - Host cloudcephosd1031 is DOWN: PING CRITICAL - Packet loss = 100%
[14:43:15] <wikibugs>	 Toolforge, Fix-Suggester-Bot: File system access is very slow - https://phabricator.wikimedia.org/T350432 (kostajh) Open→Resolved a:kostajh >>! In T350432#9305015, @taavi wrote: > NFS is known to be slow, yes. `fixsuggesterbot` will likely get much faster if it uses `/tmp` for the temporary w...
[14:44:04] <icinga-wm>	 RECOVERY - Host cloudcephosd1031 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[14:44:57] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Jclark-ctr)
[14:49:01] <wikibugs>	 Toolforge, cloud-services-team: toolforge prometheus servers OOMing - https://phabricator.wikimedia.org/T350227 (taavi) Open→Resolved
[14:49:08] <wikibugs>	 Toolforge (Toolforge iteration 02), cloud-services-team, Kubernetes, Patch-For-Review: Upgrade cadvisor - https://phabricator.wikimedia.org/T349795 (taavi)
[15:06:59] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[15:23:14] <jinxer-wm>	 (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[15:34:03] <wmcs-alerts>	 (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[15:47:40] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri) Turns out that value is already set in `/lib/systemd/system/mariadb.service` and I verified it is applied...
[16:09:24] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[16:16:29] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643)
[16:19:29] <jinxer-wm>	 (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[16:19:40] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri) I asked for help in `#wikimedia-data-persistence` and @Marostegui had some tips:  ` [15:55:46] <dhinus>...
[16:23:13] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri) I decreased the buffer pool size from 31G to 20G, and enabled slow query logging for queries longer than...
[16:26:54] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (fnegri) Slow queries are being logged to `/srv/labsdb/data/tools-db-1-slow.log` and I verified the logging works...
[16:28:09] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[16:28:28] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643)
[16:29:29] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node (348643)
[16:30:07] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Andrew)
[16:30:09] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) (348643)
[17:39:36] <wikibugs>	 VPS-project-Wikistats: wikia was renamed to fandom - https://phabricator.wikimedia.org/T221537 (Dzahn) I saw on another task that Wikiapiary has been readonly for a couple months now but also that there is still work planned to get it back.
[17:42:33] <wikibugs>	 VPS-project-Wikistats: wikistats does not work for wikia sites - https://phabricator.wikimedia.org/T215534 (Dzahn) wow, that's an insane percentage of broken ones. thank you for dropping them!  I hope this also fixes the "slow to load table" issue and we can make updates work again.. and properly rename it t...
[18:34:03] <wmcs-alerts>	 (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[18:43:45] <wikibugs>	 (PS1) Eevans: cassandra: password for mediawiki_services_mobileapps role [labs/private] - https://gerrit.wikimedia.org/r/971504 (https://phabricator.wikimedia.org/T348993)
[18:51:28] <wikibugs>	 cloud-services-team: Need baremetal system(s) with internet access - https://phabricator.wikimedia.org/T349003 (rook) Putting: ` vars:   proxy_env:     http_proxy: "http://webproxy:8080"     https_proxy: "http://webproxy:8080" ` at the top of a role gets it to work. Though so far kolla seems to ignore more g...
[18:57:55] <wikibugs>	 VPS-project-Wikistats, Code-Health-Help-Wanted, Performance Issue: wikistats does not work for wikia sites - https://phabricator.wikimedia.org/T215534 (RhinosF1) So I worked out that the page is causing the browser to OOM, I moved it to generating on machine, apache OOMs.  It's 1.1 million lines of H...
[19:09:03] <wikibugs>	 VPS-project-Wikistats, Code-Health-Help-Wanted, Performance Issue: wikistats does not work for wikia sites - https://phabricator.wikimedia.org/T215534 (RhinosF1) @Xqt: do you have a way to get an up to date list of fandom wikis?  We simply can't display all but if we had some then we could show somet...
[19:09:24] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[19:14:29] <jinxer-wm>	 (OpenstackAPIResponse) resolved: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[19:28:00] <jinxer-wm>	 (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[19:51:03] <wikibugs>	 VPS-project-Wikistats, Code-Health-Help-Wanted, Performance Issue: wikistats does not work for wikia sites - https://phabricator.wikimedia.org/T215534 (RhinosF1) So I replaced wikia.com with fandom.com, updates running. I haven't merged the commit on gitlab yet.
[19:51:52] <wikibugs>	 VPS-project-Wikistats: wikia was renamed to fandom - https://phabricator.wikimedia.org/T221537 (CodeReviewBot) rhinosf1 updated https://gitlab.wikimedia.org/cloudvps-repos/wikistats/-/merge_requests/5  Fix wikia
[19:52:08] <wikibugs>	 VPS-project-Wikistats, Patch-For-Review, User-RhinosF1: wikia was renamed to fandom - https://phabricator.wikimedia.org/T221537 (RhinosF1) Open→In progress a:RhinosF1 https://gitlab.wikimedia.org/cloudvps-repos/wikistats/-/merge_requests/5
[19:52:15] <wikibugs>	 VPS-project-Wikistats, Code-Health-Help-Wanted, Performance Issue: wikistats does not work for wikia sites - https://phabricator.wikimedia.org/T215534 (RhinosF1)
[19:53:05] <wikibugs>	 VPS-project-Wikistats, Code-Health-Help-Wanted, Performance Issue: wikistats does not work for wikia sites - https://phabricator.wikimedia.org/T215534 (RhinosF1) I will delete all broken wikis again once we update to using fandom.com, then we'll have to look at how to limit results returned to top X...
[20:04:01] <wikibugs>	 VPS-project-Wikistats, Code-Health-Help-Wanted, Performance Issue: wikistats does not work for wikia sites - https://phabricator.wikimedia.org/T215534 (RhinosF1) p:Low→Medium
[20:04:12] <wikibugs>	 VPS-project-Wikistats, Patch-For-Review, User-RhinosF1: wikia was renamed to fandom - https://phabricator.wikimedia.org/T221537 (RhinosF1) p:Triage→Medium
[20:55:42] <wikibugs>	 VPS-project-Wikistats, Patch-For-Review, User-RhinosF1: wikia was renamed to fandom - https://phabricator.wikimedia.org/T221537 (CodeReviewBot) dzahn merged https://gitlab.wikimedia.org/cloudvps-repos/wikistats/-/merge_requests/5  Fix wikia
[21:11:04] <wikibugs>	 VPS-project-Wikistats, Code-Health-Help-Wanted, Performance Issue: wikistats does not work for wikia sites - https://phabricator.wikimedia.org/T215534 (RhinosF1) >>! In T215534#9305772, @RhinosF1 wrote: > I will delete all broken wikis again once we update to using fandom.com, then we'll have to look...
[21:34:03] <wmcs-alerts>	 (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[21:36:32] <wikibugs>	 VPS-project-Wikistats: Warcraft Wiki - https://phabricator.wikimedia.org/T350246 (Dzahn) ` MariaDB [(none)]> insert into mediawikis (statsurl, method) values ("https://warcraft.wiki.gg/api.php","8");  `   ` /usr/lib/wikistats/update.php mw new .. ---> update mediawikis set total="563149",good="275400",edits=...
[21:36:43] <wikibugs>	 VPS-project-Wikistats: Warcraft Wiki - https://phabricator.wikimedia.org/T350246 (Dzahn) Open→Resolved
[22:09:24] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[22:29:58] <wikibugs>	 Toolforge (Quota-requests): Request increased quota for anchor-corrector Toolforge tool - https://phabricator.wikimedia.org/T350484 (Kanashimi) This is my current configuration, maybe you can see if it fits?   ` # toolforge-jobs load toolforge-jobs-cewbot.yml  # 修正失效的章節標題 Fix broken anchor - name: k8s-tools....
[22:30:46] <wikibugs>	 Tools: No space left on device with CropTool - https://phabricator.wikimedia.org/T350475 (Peachey88) Open→Invalid CropTool manages its tasks on GitHub issues @ https://github.com/danmichaelo/croptool/issues  Could you please refile the task there?
[22:56:44] <wikibugs>	 Tools: No space left on device with CropTool - https://phabricator.wikimedia.org/T350475 (Yann) Done https://github.com/danmichaelo/croptool/issues/185
[23:28:00] <jinxer-wm>	 (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange