[00:01:48] (CodesearchConfigWriteFailed) firing: codesearch-write-config.service failed on codesearch8 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCodesearchConfigWriteFailed
[00:07:19] (CodesearchBackendDown) firing: (2) Codesearch backend design is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCodesearchBackendDown
[00:11:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:21:03] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[01:24:57] Tool-bub2: Switching Header.js to Functional stateless components creates an application error because withSession.js HOC expects a Class component - https://phabricator.wikimedia.org/T348471 (Peter_Kampete) @Spykelionel I completed the task, this is he link to the issue on github https://github.com/coderwas...
[02:31:04] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[02:48:04] (PuppetStaleCertificates) firing: Found non-revoked Puppet certificates for 17 deleted instances on integration-puppetmaster-02 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates
[02:58:33] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[03:01:49] (CodesearchConfigWriteFailed) firing: codesearch-write-config.service failed on codesearch8 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCodesearchConfigWriteFailed
[03:07:19] (CodesearchBackendDown) firing: (2) Codesearch backend design is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCodesearchBackendDown
[03:29:42] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[04:14:40] Tool-bub2, Outreach-Programs-Projects, Outreachy (Round 27): Add search bar in queue - https://phabricator.wikimedia.org/T315134 (SharonKMwenda) Hi, @wassan.anmol117 @PMenon-WMF have opened a PR for this: [[ https://github.com/coderwassananmol/BUB2/pull/213/ | 213 ]] I have also left a comment, if...
[04:31:11] Tool-bub2, Internet-Archive, Outreach-Programs-Projects, Outreachy (Round 27): Author is not being sent to Internet Archive for Google Books - https://phabricator.wikimedia.org/T348186 (wassan.anmol117) Open→Resolved Merged and deployed.
[04:31:33] Tool-bub2, Internet-Archive, Outreach-Programs-Projects, Outreachy (Round 27): Add max character limit while creating identifier in Internet Archive and remove some special characters - https://phabricator.wikimedia.org/T348192 (wassan.anmol117) Open→Resolved Merged and deployed.
[04:32:14] Tool-bub2, Internet-Archive, Outreach-Programs-Projects, Outreachy (Round 27): Allow multi-lingual books to be uploaded to Internet Archive - https://phabricator.wikimedia.org/T346388 (wassan.anmol117) Open→Resolved Merged and deployed.
[04:38:43] Toolforge, cloud-services-team, Fiwiki-Wikidata-Commons: Investigate why Toolforge www is slow - https://phabricator.wikimedia.org/T348599 (Zache)
[04:38:56] Cloud-VPS, Documentation, User-Frostly, good first task: Add doc type categories to Cloud VPS user docs - https://phabricator.wikimedia.org/T348049 (Frostly) a:Frostly
[04:39:06] Data-Services, Documentation, User-Frostly: Revise and reformat Portal:Data_Services - https://phabricator.wikimedia.org/T348024 (Frostly) a:Frostly
[04:39:20] Toolforge, Documentation, User-Frostly, good first task: Add doc type categories to Toolforge user docs - https://phabricator.wikimedia.org/T348047 (Frostly) a:Frostly
[04:41:31] Tool-bub2, Outreach-Programs-Projects, Outreachy (Round 27): Title not being shown and placeholder text is getting cut - https://phabricator.wikimedia.org/T348600 (wassan.anmol117)
[04:44:35] Tool-bub2, Outreach-Programs-Projects, Outreachy (Round 27): Title not being shown and placeholder text is getting cut - https://phabricator.wikimedia.org/T348600 (wassan.anmol117) a:DO-NOT-CHANGE
[05:31:04] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[05:48:05] (PuppetStaleCertificates) firing: Found non-revoked Puppet certificates for 17 deleted instances on integration-puppetmaster-02 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates
[06:01:49] (CodesearchConfigWriteFailed) firing: codesearch-write-config.service failed on codesearch8 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCodesearchConfigWriteFailed
[06:07:19] (CodesearchBackendDown) firing: (2) Codesearch backend design is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCodesearchBackendDown
[06:39:51] cloud-services-team, MediaWiki-Engineering: Get platform engineering team green light for Cloud NAT to wikis change - https://phabricator.wikimedia.org/T273738 (larissagaulia)
[06:41:04] !log admin dcaro@urcuchillay START - Cookbook wmcs.ceph.osd.undrain_node
[06:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[06:58:33] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[07:03:55] !log admin dcaro@urcuchillay START - Cookbook wmcs.ceph.osd.undrain_node
[07:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[07:18:37] (CephClusterInWarning) firing: The ceph cluster in is in warning status, that means that it's high availability is compromised, things should still be working as expected. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWar
[07:27:23] Tool-bub2, Outreach-Programs-Projects, Outreachy (Round 27): Title not being shown and placeholder text is getting cut - https://phabricator.wikimedia.org/T348600 (Akanksha.t05) Made PR for it - https://github.com/coderwassananmol/BUB2/pull/215
[07:29:42] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[07:39:07] (CephClusterInWarning) resolved: The ceph cluster in is in warning status, that means that it's high availability is compromised, things should still be working as expected. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInW
[07:59:37] (CephClusterInWarning) firing: The ceph cluster in is in warning status, that means that it's high availability is compromised, things should still be working as expected. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWar
[08:09:37] (CephClusterInWarning) resolved: The ceph cluster in is in warning status, that means that it's high availability is compromised, things should still be working as expected. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInW
[08:24:36] Tool-bub2: Switching Header.js to Functional stateless components creates an application error because withSession.js HOC expects a Class component - https://phabricator.wikimedia.org/T348471 (Spykelionel) In progress→Resolved
[08:31:04] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[08:42:33] Data-Services, cloud-services-team, Data-Platform-SRE: move cloudelastic behind cloudlb - https://phabricator.wikimedia.org/T346946 (Gehel) p:Triage→Low
[08:48:05] (PuppetStaleCertificates) firing: Found non-revoked Puppet certificates for 17 deleted instances on integration-puppetmaster-02 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates
[08:48:08] (CR) Nikerabbit: [V: +2] Localisation updates from https://translatewiki.net. [labs/tools/map-of-monuments] - https://gerrit.wikimedia.org/r/963698 (owner: L10n-bot)
[08:48:08] Toolforge, cloud-services-team, Fiwiki-Wikidata-Commons: Investigate why Toolforge www is slow - https://phabricator.wikimedia.org/T348599 (taavi) Hi! The main factor here is network delay. Toolforge is currently hosted at the WMF's [[ https://wikitech.wikimedia.org/wiki/Eqiad_data_center | eqiad da...
[08:48:35] (CR) Nikerabbit: [V: +2] Localisation updates from https://translatewiki.net. [labs/tools/weapon-of-mass-description] - https://gerrit.wikimedia.org/r/963701 (owner: L10n-bot)
[09:01:49] (CodesearchConfigWriteFailed) firing: codesearch-write-config.service failed on codesearch8 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCodesearchConfigWriteFailed
[09:07:19] (CodesearchBackendDown) firing: (2) Codesearch backend design is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCodesearchBackendDown
[09:08:31] cloud-services-team, Data-Platform-SRE, Dumps-Generation, Patch-For-Review: clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run" - https://phabricator.wikimedia.org/T346165 (BTullis) Open→Resolved
[09:34:57] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (MoritzMuehlenhoff) >>! In T342537#9240999, @Papaul wrote: > looking at the gerrit history about the late command i see also that there where some changes m...
[09:37:06] Tool-bub2, Outreach-Programs-Projects, Outreachy (Round 27): Title not being shown and placeholder text is getting cut - https://phabricator.wikimedia.org/T348600 (PMenon-WMF)
[10:13:40] !log admin dcaro@urcuchillay END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0)
[10:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:31:38] (CR) Muehlenhoff: [V: +2 C: +2] Add dummy keytabs for apt1002/apt2002 [labs/private] - https://gerrit.wikimedia.org/r/964900 (https://phabricator.wikimedia.org/T331613) (owner: Muehlenhoff)
[10:34:17] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), DC-Ops, SRE, ops-codfw: hw troubleshooting: disk failure for cloudvirt2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T348531 (fnegri) > new error popped up after rebooting > T348550 This seems to have resolved on its own? `/usr/local...
[10:36:33] !log fnegri@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (T341285)
[10:36:39] T341285: Upgrade cloud-vps openstack to version 'Antelope' - https://phabricator.wikimedia.org/T341285
[10:42:37] !log fnegri@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) (T341285)
[10:42:42] T341285: Upgrade cloud-vps openstack to version 'Antelope' - https://phabricator.wikimedia.org/T341285
[10:58:33] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[11:07:40] Tool-bub2, Outreach-Programs-Projects, Outreachy (Round 27): Title not being shown and placeholder text is getting cut - https://phabricator.wikimedia.org/T348600 (SharonKMwenda) https://github.com/coderwassananmol/BUB2/pull/216 - please check out this PR
[11:15:13] (DiskSpace) firing: Disk space cloudbackup1004:9100:/ 5.513% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[11:29:42] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[11:30:03] (InstanceDown) firing: Project tools instance tools-puppetdb-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:30:37] (CephClusterInWarning) firing: The ceph cluster in is in warning status, that means that it's high availability is compromised, things should still be working as expected. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWar
[11:31:03] (InstanceDown) firing: Project cloudinfra instance ntp-04 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:31:04] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[11:31:56] (ToolsToolsDBWritableState) firing: There should be exactly one writable MariaDB instance instead of 0 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[11:32:03] (InstanceDown) firing: Project project-proxy instance project-proxy-puppetmaster-01 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:35:38] PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 340 bytes in 60.035 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[11:35:45] (ProbeDown) firing: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[11:37:12] PROBLEM - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/cron - 340 bytes in 60.005 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[11:39:20] RECOVERY - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 55.199 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[11:39:22] RECOVERY - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 24.220 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[11:40:37] (CephClusterInWarning) resolved: The ceph cluster in is in warning status, that means that it's high availability is compromised, things should still be working as expected. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInW
[11:40:45] (ProbeDown) firing: (5) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[11:41:03] (InstanceDown) resolved: Project cloudinfra instance ntp-04 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:41:28] PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 177 bytes in 0.090 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[11:41:56] (ToolsToolsDBWritableState) resolved: There should be exactly one writable MariaDB instance instead of 0 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[11:42:03] (InstanceDown) resolved: Project project-proxy instance project-proxy-puppetmaster-01 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:45:03] (InstanceDown) resolved: Project tools instance tools-puppetdb-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:45:13] (DiskSpace) resolved: Disk space cloudbackup1004:9100:/ 5.953% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[11:46:42] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[11:48:05] (PuppetStaleCertificates) firing: Found non-revoked Puppet certificates for 17 deleted instances on integration-puppetmaster-02 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates
[11:50:45] (ProbeDown) resolved: (5) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[11:59:53] (ProbeDown) firing: Service tools-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[12:01:49] (CodesearchConfigWriteFailed) firing: codesearch-write-config.service failed on codesearch8 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCodesearchConfigWriteFailed
[12:04:53] (ProbeDown) resolved: Service tools-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[12:07:19] (CodesearchBackendDown) firing: (2) Codesearch backend design is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCodesearchBackendDown
[12:11:53] (ProbeDown) firing: Service tools-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[12:15:03] (InstanceDown) firing: Project tools instance tools-k8s-worker-76 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:16:53] (ProbeDown) resolved: Service tools-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[12:18:53] (ProbeDown) firing: Service tools-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[12:20:03] (InstanceDown) resolved: (2) Project tools instance tools-k8s-worker-70 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:23:28] Cloud-VPS, cloud-services-team: ceph slowdown 2023-10-11 - https://phabricator.wikimedia.org/T348634 (aborrero) Found potential SMART disk errors on cloudcephosd1027: `lang=shell-session aborrero@cloudcephosd1027:~ $ sudo journalctl | grep smart_failure | tail -20 Oct 11 09:18:01 cloudcephosd1027 smart_...
[12:23:53] (ProbeDown) resolved: Service tools-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[12:33:03] (PuppetAgentFailure) firing: Puppet agent failure detected on instance tools-sgeweblight-10-17 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[12:38:03] (PuppetAgentFailure) firing: (3) Puppet agent failure detected on instance tools-sgeweblight-10-17 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[12:43:03] (PuppetAgentFailure) firing: (5) Puppet agent failure detected on instance tools-sgeweblight-10-17 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[12:48:03] (PuppetAgentFailure) firing: (5) Puppet agent failure detected on instance tools-sgeweblight-10-17 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[12:53:03] (PuppetAgentFailure) firing: (6) Puppet agent failure detected on instance tools-sgeweblight-10-17 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[12:55:38] Quarry: bastion for quarry - https://phabricator.wikimedia.org/T348642 (rook)
[12:56:11] Quarry: bastion for quarry - https://phabricator.wikimedia.org/T348642 (rook) `quarry-bastion.quarry.eqiad1.wikimedia.cloud` deploying
[12:56:56] Cloud-VPS, cloud-services-team, DC-Ops, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (dcaro)
[12:57:52] Tool-bub2, Outreach-Programs-Projects, Outreachy (Round 27): Change Header.js component to React hooks component - https://phabricator.wikimedia.org/T348415 (SharonKMwenda) I have tackled this task [[ https://github.com/coderwassananmol/BUB2/pull/218 | PR ]]
[12:58:03] (PuppetAgentFailure) firing: (8) Puppet agent failure detected on instance tools-sgeweblight-10-16 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[13:00:15] Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (rook) >>! In T348184#9238635, @SD0001 wrote: > we don't use Puppet at all for this project? Puppet is used* *in the usual confusing puppet ways. The following directories in the puppet repo will do things to quarry: ` ./modules/p...
[13:03:25] Cloud-VPS, cloud-services-team, DC-Ops, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (aborrero)
[13:19:28] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[13:21:53] (ProbeDown) firing: Service tools-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:24:10] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye
[13:24:39] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (Papaul) @MoritzMuehlenhoff thanks
[13:26:53] (ProbeDown) resolved: Service tools-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:31:53] (ProbeDown) firing: Service tools-k8s-haproxy-4:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:36:53] (ProbeDown) resolved: Service tools-k8s-haproxy-4:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:39:25] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance, User-dcaro: [cloudceph] Slow operations - tracking task - https://phabricator.wikimedia.org/T334240 (taavi)
[13:39:38] Cloud-VPS, cloud-services-team: ceph slowdown 2023-10-11 - https://phabricator.wikimedia.org/T348634 (taavi)
[13:39:40] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance, User-dcaro: [cloudceph] Slow operations - tracking task - https://phabricator.wikimedia.org/T334240 (taavi)
[13:40:51] Cloud-VPS, cloud-services-team: ceph slowdown 2023-10-11 - https://phabricator.wikimedia.org/T348634 (taavi) Open→Resolved a:taavi User-facing issues are fixed and immediate issue is over, follow-up is being tracked on subtasks.
[13:40:55] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance, User-dcaro: [cloudceph] Slow operations - tracking task - https://phabricator.wikimedia.org/T334240 (taavi)
[13:41:02] Cloud-VPS, cloud-services-team: ceph slowdown 2023-10-11 - https://phabricator.wikimedia.org/T348634 (taavi)
[13:41:06] Toolforge, cloud-services-team: Monitor the Toolforge API gateway - https://phabricator.wikimedia.org/T348633 (taavi)
[13:43:20] (CR) Nikerabbit: [V: +2] Localisation updates from https://translatewiki.net. [labs/tools/weapon-of-mass-description] - https://gerrit.wikimedia.org/r/964516 (owner: L10n-bot)
[13:44:28] (OpenstackAPIResponse) resolved: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[13:48:14] (Abandoned) Nikerabbit: Localisation updates from https://translatewiki.net. [labs/tools/map-of-monuments] - https://gerrit.wikimedia.org/r/964515 (owner: L10n-bot)
[13:48:31] Cloud-VPS, cloud-services-team, DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (fnegri) If it can be useful, I generated a summary of `Offline_Uncorrectable` sectors per host: https://phabricator.wikimedia.org/P52907
[13:59:01] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye
[14:10:49] Cloud-VPS, cloud-services-team, DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Jclark-ctr) I opened up a ticket with dell for 1 server right now Confirmed: Service Request 177592506 was successfully submitted.
[14:12:45] (ProbeDown) firing: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [14:17:09] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1), 10Infrastructure-Foundations, 10SRE Observability (FY2023/2024-Q2): [wmcs-cookbooks] Downtime alerts from cloudcumins - https://phabricator.wikimedia.org/T347490 (10lmata) [14:18:37] (PuppetAgentFailure) firing: (4) Puppet agent failure detected on instance tools-sgeweblight-10-16 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [14:19:32] !log tools dcaro@urcuchillay START - Cookbook wmcs.openstack.cloudvirt.vm_console [14:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [14:19:53] !log tools dcaro@urcuchillay END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0) [14:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [14:20:23] 10Quarry: bastion for quarry - https://phabricator.wikimedia.org/T348642 (10rook) 05Open→03Resolved [14:20:43] !log tools dcaro@urcuchillay START - Cookbook wmcs.openstack.cloudvirt.vm_console [14:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [14:21:05] !log tools dcaro@urcuchillay END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0) [14:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [14:24:25] !log tools dcaro@urcuchillay START - Cookbook wmcs.toolforge.grid.reboot_workers for weblight nodes (T348634) [14:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [14:24:29] T348634: ceph slowdown 2023-10-11 - https://phabricator.wikimedia.org/T348634 [14:27:45] 10Quarry, 10Patch-For-Review: investigate quarry on k8s - 
https://phabricator.wikimedia.org/T301469 (rook) @SD0001 moving k8s discussion from T348184 to here to match ticket descriptions. I've deployed quarry-bastion.quarry.eqiad1.wikimedia.cloud in T348642 and installed kubectl on it ` KUBE_VERSION="v1.24.... [14:30:08] Cloud-VPS, cloud-services-team: Recommended solution for Terraform state backend - https://phabricator.wikimedia.org/T318360 (taavi) The new object storage service seems to work fine with Terraform 1.5.x. Once we have documentation for using the object storage in general I'll write some docs on how to us... [14:32:45] (ProbeDown) resolved: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [14:33:37] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [14:34:26] Cloud Services Proposals, Toolforge (Toolforge iteration 01), cloud-services-team, Cloud-Services-Origin-Team, and 2 others: Decision request – Toolforge (re)architecture - https://phabricator.wikimedia.org/T346153 (Slst2020) I lean towards Option 1 as my top choice, although transitioning from O...
[14:41:37] !log tools dcaro@urcuchillay END (FAIL) - Cookbook wmcs.toolforge.grid.reboot_workers (exit_code=99) for weblight nodes (T348634) [14:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [14:41:41] T348634: ceph slowdown 2023-10-11 - https://phabricator.wikimedia.org/T348634 [14:41:53] PROBLEM - Check systemd state on cloudcephosd1025 is CRITICAL: CRITICAL - degraded: The following units failed: user@0.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:48:01] Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, and 2 others: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (Jclark-ctr) @taavi What vlan are these going to be I would like to verify with @cmooney that these can go into these racks before i physically move them. [14:55:07] Cloud Services Proposals, Toolforge (Toolforge iteration 01), cloud-services-team, Cloud-Services-Origin-Team, and 2 others: Decision request – Toolforge (re)architecture - https://phabricator.wikimedia.org/T346153 (taavi) For the client I think one important condition is that we eventually want... [14:58:36] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [14:59:54] Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, and 2 others: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (cmooney) Thanks @Jclark-ctr yes these can go in E4 or F4 no problem.
[15:01:08] vivian-rook opened https://github.com/toolforge/quarry/pull/27 [15:01:10] Quarry: git-crypt for config.yaml files - https://phabricator.wikimedia.org/T348476 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/quarry/pull/27 [15:01:31] cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, User-dcaro: [toolforge,nfs] automate cleanup of D state webservices by deleting the stuck pod - https://phabricator.wikimedia.org/T348662 (dcaro) p:Triage→High [15:01:41] Toolforge, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance, User-dcaro: [toolforge,nfs] automate cleanup of D state webservices by deleting the stuck pod - https://phabricator.wikimedia.org/T348662 (dcaro) [15:01:56] (ToolsGridQueueProblem) firing: Grid queue webgrid-lighttpd@tools-sgeweblight-10-14.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [15:03:37] (CodesearchConfigWriteFailed) firing: codesearch-write-config.service failed on codesearch8 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCodesearchConfigWriteFailed [15:08:37] (CodesearchBackendDown) firing: (2) Codesearch backend design is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCodesearchBackendDown [15:11:56] (ToolsGridQueueProblem) firing: (3) Grid queue webgrid-lighttpd@tools-sgeweblight-10-14.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [15:12:29] (CR) Arturo Borrero Gonzalez: [V: +2 C: +2] aborrero: drop access [labs/private] - https://gerrit.wikimedia.org/r/964926 (owner: Arturo Borrero Gonzalez) [15:13:36] 
cloud-services-team, wikitech.wikimedia.org, Trust-and-Safety: Account recovery help needed for Developer account [YOUR USERNAME] - https://phabricator.wikimedia.org/T348663 (Botoxparty) [15:14:02] cloud-services-team, wikitech.wikimedia.org, Trust-and-Safety: Account recovery help needed for Developer account Adamham - https://phabricator.wikimedia.org/T348663 (Botoxparty) [15:19:50] Cloud Services Proposals, Toolforge (Toolforge iteration 01), cloud-services-team, Cloud-Services-Origin-Team, and 2 others: Decision request – Toolforge (re)architecture - https://phabricator.wikimedia.org/T346153 (rook) I vote present. [15:28:37] (PuppetAgentFailure) resolved: (3) Puppet agent failure detected on instance tools-sgeweblight-10-20 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [15:28:45] (ProbeDown) firing: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [15:33:45] (ProbeDown) resolved: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [15:56:11] RECOVERY - Check systemd state on cloudcephosd1025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:03] Cloud-VPS, cloud-services-team: Trove instances not being created or restarted with configuration group applied - https://phabricator.wikimedia.org/T348668 (JJMC89) [16:19:21] Cloud-VPS, cloud-services-team: Trove instances not 
being created or restarted with configuration group applied - https://phabricator.wikimedia.org/T348668 (JJMC89) [16:20:12] Cloud-VPS, cloud-services-team: Trove instances not being created or restarted with configuration group applied - https://phabricator.wikimedia.org/T348668 (JJMC89) None of the trove instances are in use yet, so feel free to make modifications or do tests with them as needed. [16:39:09] Cloud-VPS, cloud-services-team: Trove instances not being created or restarted with configuration group applied - https://phabricator.wikimedia.org/T348668 (fnegri) a:fnegri [16:39:26] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): Trove instances not being created or restarted with configuration group applied - https://phabricator.wikimedia.org/T348668 (fnegri) [16:41:40] Tools: duplicity returns "504 Gateway Time-out" - https://phabricator.wikimedia.org/T341190 (M2k_dewiki) Invalid→Open Hello, today duplicity returns again the error message "504 Gateway Time-out". Example: https://tools.wmflabs.org/wikidata-todo/duplicity.php?wiki=eswiki&norand=1&page=Delphine%5FNk... [16:41:42] Cloud-VPS, Toolforge, SRE: Some of my tools (eg wikidata-todo) just start throwing 504 errors - https://phabricator.wikimedia.org/T346126 (M2k_dewiki) [16:46:35] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): Trove instances not being created or restarted with configuration group applied - https://phabricator.wikimedia.org/T348668 (fnegri) Some additional debugging messages can be found in the [instance details](https://horizon.wikimedia.org/project/databases/5... [16:47:50] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): Trove instances not being created or restarted with configuration group applied - https://phabricator.wikimedia.org/T348668 (JJMC89) That page only shows "No items to display." for me.
[16:48:03] (PuppetAgentFailure) firing: Puppet agent failure detected on instance tools-sgeweblight-10-30 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [16:57:55] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): Trove instances not being created or restarted with configuration group applied - https://phabricator.wikimedia.org/T348668 (fnegri) > That page only shows "No items to display." for me. Interesting, there might be some permissions missing, but that's a s... [17:02:40] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): Trove instances not being created or restarted with configuration group applied - https://phabricator.wikimedia.org/T348668 (fnegri) I rebooted `testdb03` and still I cannot SSH into it (using `trove-debug-key-eqiad1`, only available to admins). I can SSH... [17:08:03] (PuppetAgentFailure) firing: (2) Puppet agent failure detected on instance tools-sgeweblight-10-28 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [17:08:53] (ProbeDown) firing: Service tools-k8s-haproxy-4:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [17:09:31] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): Trove instances not being created or restarted with configuration group applied - https://phabricator.wikimedia.org/T348668 (fnegri) p:Triage→High [17:10:07] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for all workers [17:10:07] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): Trove instances not being created or restarted with configuration group applied - https://phabricator.wikimedia.org/T348668 (fnegri) Open→In progress [17:13:53] (ProbeDown) resolved: 
Service tools-k8s-haproxy-4:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [17:25:45] (ProbeDown) firing: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [17:30:45] (ProbeDown) resolved: (2) Service tools-k8s-haproxy-3:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [17:33:03] (PuppetAgentFailure) firing: (3) Puppet agent failure detected on instance tools-sgeweblight-10-22 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [17:33:37] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [18:03:37] (CodesearchConfigWriteFailed) firing: codesearch-write-config.service failed on codesearch8 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCodesearchConfigWriteFailed [18:05:03] (InstanceDown) firing: Project tools instance tools-k8s-worker-69 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [18:08:37] (CodesearchBackendDown) firing: (2) Codesearch backend design is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCodesearchBackendDown [18:10:03] (InstanceDown) resolved: Project tools instance tools-k8s-worker-69 is down - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [18:11:56] (ToolsGridQueueProblem) firing: (3) Grid queue webgrid-lighttpd@tools-sgeweblight-10-14.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [18:19:08] Tool-Global-user-contributions, Design-Research, IP Masking, Stewards-and-global-tools, and 3 others: [Design research] Understand usage of current GUC tool - https://phabricator.wikimedia.org/T347618 (cwylo) [18:44:19] Tools: duplicity returns "504 Gateway Time-out" - https://phabricator.wikimedia.org/T341190 (M2k_dewiki) At the moment duplicity is working again. Thanks a lot! [18:44:53] Tools: duplicity returns "504 Gateway Time-out" - https://phabricator.wikimedia.org/T341190 (M2k_dewiki) Open→Resolved [18:45:00] Cloud-VPS, Toolforge, SRE: Some of my tools (eg wikidata-todo) just start throwing 504 errors - https://phabricator.wikimedia.org/T346126 (M2k_dewiki) [18:52:53] (ProbeDown) firing: Service tools-k8s-haproxy-4:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [18:57:53] (ProbeDown) resolved: Service tools-k8s-haproxy-4:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [18:59:22] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [19:08:03] (InstanceDown) firing: Project tools instance tools-k8s-worker-41 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [19:13:03] (InstanceDown) resolved: Project tools instance tools-k8s-worker-41 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [19:20:53] (ProbeDown) firing: Service tools-k8s-haproxy-4:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [19:25:53] (ProbeDown) resolved: Service tools-k8s-haproxy-4:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [19:45:33] (InstanceDown) firing: (2) Project tools instance tools-k8s-worker-31 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [19:47:30] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for all workers [19:50:33] (InstanceDown) resolved: (2) Project tools instance tools-k8s-worker-31 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:18:22] Quarry, Patch-For-Review: investigate quarry on k8s - https://phabricator.wikimedia.org/T301469 (rook) I've updated the branch associated with this ticket. And it does seem to get quarry running in minikube. Suggesting that we are close to being able to go to k8s?
Though we probably need T316958 before g... [20:19:45] vivian-rook opened https://github.com/toolforge/quarry/pull/28 [20:19:48] Quarry, Patch-For-Review: investigate quarry on k8s - https://phabricator.wikimedia.org/T301469 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/quarry/pull/28 [20:33:03] (PuppetAgentFailure) firing: (3) Puppet agent failure detected on instance tools-sgeweblight-10-22 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [20:33:37] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [20:34:01] Tool-bub2, Outreach-Programs-Projects, Outreachy (Round 27): Title not being shown and placeholder text is getting cut - https://phabricator.wikimedia.org/T348600 (Okerekechinweotito) I have made a PR that fixes this issue - https://github.com/coderwassananmol/BUB2/pull/219 [21:03:37] (CodesearchConfigWriteFailed) firing: codesearch-write-config.service failed on codesearch8 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCodesearchConfigWriteFailed [21:08:37] (CodesearchBackendDown) firing: (2) Codesearch backend design is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DCodesearchBackendDown [21:11:56] (ToolsGridQueueProblem) firing: (3) Grid queue webgrid-lighttpd@tools-sgeweblight-10-14.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [23:03:33] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [23:09:12] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bullseye [23:23:55] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bullseye [23:33:03] (PuppetAgentFailure) firing: (3) Puppet agent failure detected on instance tools-sgeweblight-10-22 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [23:33:37] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed