[00:26:00] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.set_maintenance
[00:26:23] (ToolforgeKubernetesNodeNotReady) firing: Kubernetes node tools-k8s-worker-96 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[00:27:05] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.set_maintenance (exit_code=99)
[00:31:03] (InstanceDown) resolved: Project tools instance tools-k8s-worker-96 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:35:03] (InstanceDown) resolved: Project project-proxy instance project-proxy-acme-chief-02 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:37:03] (WidespreadPuppetAgentFailure) firing: Widespread puppet agent failures in project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure
[00:41:23] (ToolforgeKubernetesNodeNotReady) resolved: Kubernetes node tools-k8s-worker-96 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[00:45:03] (PuppetSyncFailure) firing: Failed to update Puppet repository /srv/git/labs/private on instance project-proxy-puppetserver-1 in project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetSyncFailure
[00:52:03] (WidespreadPuppetAgentFailure) resolved: Widespread puppet agent failures in project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure
[02:13:56] (ToolsToolsDBReplicationError) firing: ToolsDB replication is broken on tools-db-2 (errno 2003) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationError
[02:13:56] (ToolsToolsDBWritableState) firing: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[02:15:25] (NodeDown) firing: The node cloudvirt1063 has been unreachable for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1063 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[02:19:03] (PuppetAgentFailure) resolved: Puppet agent failure detected on instance tools-sgeweblight-10-24 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[02:23:56] (ToolsToolsDBWritableState) resolved: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[02:23:56] (ToolsToolsDBReplicationError) resolved: ToolsDB replication is broken on tools-db-2 (errno 2003) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationError
[02:35:04] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[02:35:04] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[03:04:03] (PuppetAgentNoResources) firing: No Puppet resources found on instance tools-sgeweblight-10-24 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[03:19:03] (PuppetAgentNoResources) resolved: No Puppet resources found on instance tools-sgeweblight-10-24 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[03:45:03] (PuppetSyncFailure) firing: Failed to update Puppet repository /srv/git/labs/private on instance project-proxy-puppetserver-1 in project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetSyncFailure
[04:04:03] (PuppetAgentFailure) firing: Puppet agent failure detected on instance tools-sgeweblight-10-24 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[04:16:20] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[04:20:40] (NodeDown) firing: Cloudvirt node cloudvirt1063 is down. #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1063 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown
[04:21:20] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[05:35:04] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[05:35:04] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[06:45:03] (PuppetSyncFailure) firing: Failed to update Puppet repository /srv/git/labs/private on instance project-proxy-puppetserver-1 in project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetSyncFailure
[07:04:03] (PuppetAgentFailure) firing: Puppet agent failure detected on instance tools-sgeweblight-10-24 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[08:24:03] (PuppetAgentFailure) resolved: Puppet agent failure detected on instance tools-sgeweblight-10-24 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[08:32:54] Cloud-VPS (Project-requests): Request creation of Adiutor VPS project - https://phabricator.wikimedia.org/T353421 (Vikipolimer)
[08:40:04] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[08:40:04] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[09:02:26] Cloud-VPS (Project-requests), Adiutor: Request creation of Adiutor VPS project - https://phabricator.wikimedia.org/T353421 (Peachey88)
[09:37:46] Toolforge (Toolforge iteration 02), Toolforge Build Service, Patch-For-Review: Add Rust buildpack to Toolforge build service - https://phabricator.wikimedia.org/T337066 (CodeReviewBot) dcaro opened https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/22 rust: add buildp...
[09:39:07] Toolforge (Toolforge iteration 02), Toolforge Build Service, Patch-For-Review: Add Rust buildpack to Toolforge build service - https://phabricator.wikimedia.org/T337066 (dcaro) a:dcaro
[09:39:22] Toolforge (Toolforge iteration 02), Toolforge Build Service, Patch-For-Review: Add Rust buildpack to Toolforge build service - https://phabricator.wikimedia.org/T337066 (dcaro) Open→In progress
[09:40:56] Toolforge (Toolforge iteration 02): [toolforge,gitlab] ensure we have a release before creating the mr on toolforge-deploy - https://phabricator.wikimedia.org/T353425 (dcaro)
[09:40:56] (ToolsToolsDBWritableState) firing: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[09:40:56] (ToolsToolsDBReplicationError) firing: ToolsDB replication is broken on tools-db-2 (errno 2003) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationError
[09:45:03] (PuppetSyncFailure) firing: Failed to update Puppet repository /srv/git/labs/private on instance project-proxy-puppetserver-1 in project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetSyncFailure
[09:45:56] (ToolsToolsDBWritableState) resolved: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[09:45:56] (ToolsToolsDBReplicationError) resolved: ToolsDB replication is broken on tools-db-2 (errno 2003) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationError
[09:52:44] cloud-services-team: NodeDown Node cloudvirt1063 has been down for long. - https://phabricator.wikimedia.org/T353409 (fnegri)
[09:52:46] cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T353406 (fnegri)
[09:52:53] cloud-services-team: NodeDown cloudvirt1063 - https://phabricator.wikimedia.org/T353406 (fnegri)
[10:25:03] (PuppetSyncFailure) resolved: Failed to update Puppet repository /srv/git/labs/private on instance project-proxy-puppetserver-1 in project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetSyncFailure
[11:07:25] cloud-services-team (FY2023/2024-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [toolsdb] MariaDB process is killed by OOM killer (December 2023) - https://phabricator.wikimedia.org/T353093 (fnegri) There was a new OOM crash this morning: ` fnegri@tools-db-1:...
[11:31:17] (CR) Urbanecm: [C: +2] "LGTM, but fails CI, see https://phabricator.wikimedia.org/T348871#9406193 for why." [labs/tools/wikinity] - https://gerrit.wikimedia.org/r/982233 (https://phabricator.wikimedia.org/T310688) (owner: Nikerabbit)
[11:32:03] (CR) CI reject: [V: -1] Remove trailing whitespace [labs/tools/wikinity] - https://gerrit.wikimedia.org/r/982233 (https://phabricator.wikimedia.org/T310688) (owner: Nikerabbit)
[11:36:37] (CephSlowOps) firing: Ceph cluster in eqiad has 5 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[11:36:42] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T352570 (phaultfinder)
[11:40:04] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[11:40:04] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[11:41:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 5 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[12:01:03] Toolforge (Toolforge iteration 02), cloud-services-team (FY2023/2024-Q1-Q2): [tbs] Create a tutorial on how to deploy a Node.js app using Build Service - https://phabricator.wikimedia.org/T353313 (Lofhi) I don't know where to ask, but this is the new recommended way before deploying a web service on Tool...
[12:07:48] Grid-Engine-to-K8s-Migration: Migrate dplbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319701 (dcaro) >> If they can be split in different repositories, things are a bit easier as you would not need to have a multi-stack image. >> In that case you can deploy the...
[12:41:23] (ToolforgeKubernetesNodeNotReady) firing: Kubernetes node tools-k8s-worker-30 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[12:46:23] (ToolforgeKubernetesNodeNotReady) resolved: Kubernetes node tools-k8s-worker-30 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[12:56:06] Toolforge (Toolforge iteration 02), Toolforge Build Service, Patch-For-Review: Add Rust buildpack to Toolforge build service - https://phabricator.wikimedia.org/T337066 (dcaro) Got it working! \o/ (with the example app https://github.com/emk/rust-buildpack-example-rocket)
[13:44:26] Toolforge (Toolforge iteration 02), Patch-For-Review: [builds-cli,builds-api] Allow build service to cleanup images to free quota - https://phabricator.wikimedia.org/T341067 (CodeReviewBot) dcaro opened https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/66 add cleanup
[13:53:36] Toolforge (Toolforge iteration 02), Patch-For-Review: [builds-cli,builds-api] Allow build service to cleanup images to free quota - https://phabricator.wikimedia.org/T341067 (CodeReviewBot) dcaro opened https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/37 cleanup: add subcom...
[14:40:04] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[14:40:04] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[16:02:51] Toolforge, cloud-services-team, Patch-For-Review: Provide tools for disabling the grid for specific tools - https://phabricator.wikimedia.org/T353351 (CodeReviewBot) andrew merged https://gitlab.wikimedia.org/repos/cloud/toolforge/disable-tool/-/merge_requests/2 Add tools for stopping grid jobs with...
[16:09:47] cloud-services-team, Infrastructure-Foundations, Observability-Alerting, SRE Observability (FY2023/2024-Q2): Karma UI shows duplicate alerts - https://phabricator.wikimedia.org/T353457 (fnegri)
[16:12:40] Toolforge (Toolforge iteration 02): [toolforge,gitlab] ensure we have a release before creating the mr on toolforge-deploy - https://phabricator.wikimedia.org/T353425 (dcaro)
[16:12:45] Toolforge (Toolforge iteration 02), Patch-For-Review, User-Raymond_Ndibe: [gitlab,toolforge-deploy] Create a process to open an MR to toolforge-deploy when a new release ofa component happens - https://phabricator.wikimedia.org/T347392 (dcaro)
[16:13:19] Toolforge (Toolforge iteration 02), Patch-For-Review, User-Raymond_Ndibe: [gitlab,toolforge-deploy] Create a process to open an MR to toolforge-deploy when a new release ofa component happens - https://phabricator.wikimedia.org/T347392 (dcaro) In progress→Stalled
[16:28:56] (ToolsToolsDBWritableState) firing: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[16:33:56] (ToolsToolsDBWritableState) resolved: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[16:54:03] cloud-services-team (FY2023/2024-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, User-dcaro: [toolsdb] MariaDB process is killed by OOM killer (December 2023) - https://phabricator.wikimedia.org/T353093 (fnegri) And another crash now: ` [Thu Dec 14 16:22:41 2023] Out of m...
[17:19:51] Cloud-VPS, cloud-services-team, Language-Team (Language-2023-October-December): Rebuild (or upgrade the kernel on) mint.language.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T353185 (Nikerabbit) Open→Resolved
[17:19:55] Cloud-VPS, cloud-services-team: Check Cloud VPS running kernels for ext4 data corruption bug - https://phabricator.wikimedia.org/T353178 (Nikerabbit)
[17:22:01] Toolforge (Toolforge iteration 02), cloud-services-team (FY2023/2024-Q1-Q2): [tbs] Create a tutorial on how to deploy a Node.js app using Build Service - https://phabricator.wikimedia.org/T353313 (bd808) >>! In T353313#9406261, @Lofhi wrote: > I don't know where to ask, but this is the new recommended wa...
[17:24:00] cloud-services-team (FY2023/2024-Q1-Q2), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned, Patch-For-Review, User-dcaro: [toolsdb] MariaDB process is killed by OOM killer (December 2023) - https://phabricator.wikimedia.org/T353093 (fnegri) The log before the crash shows two very...
[17:27:10] Cloud-VPS (Project-requests), cloud-services-team, Adiutor: Request creation of Adiutor VPS project - https://phabricator.wikimedia.org/T353421 (bd808) p:Triage→Medium +1
[17:40:04] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[17:40:04] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[18:10:01] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[18:30:21] Cloud-VPS, Moderator-Tools-Team (Kanban): enable lists.wikimedia.org or wikimedia.org email addresses to receive dmarc reports for *.wmflabs.org - https://phabricator.wikimedia.org/T352902 (jsn.sherman) >>! In T352902#9400607, @jsn.sherman wrote: > hmm; I see that exim is configured to use `root@wmcloud....
[19:14:56] (ToolsToolsDBReplicationLagIsTooHigh) firing: ToolsDB replication on tools-db-2 is lagging behind the primary, the current lag is 3668 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh
[19:59:56] (ToolsToolsDBReplicationLagIsTooHigh) resolved: ToolsDB replication on tools-db-2 is lagging behind the primary, the current lag is 4695 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh
[20:09:28] VPS-project-Codesearch: codesearch crashes firefox tabs - https://phabricator.wikimedia.org/T353480 (taavi)
[20:40:04] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[20:45:04] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[20:51:26] cloud-services-team, SRE, ops-eqiad: Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (Jclark-ctr) Server is in warranty Confirmed: Service Request 181697839 was successfully submitted.
[20:52:47] Cloud-VPS, cloud-services-team (Hardware), SRE, ops-eqiad: Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (taavi)
[20:52:52] cloud-services-team: NodeDown - https://phabricator.wikimedia.org/T352595 (taavi)
[20:52:57] Cloud-VPS, cloud-services-team (Hardware), SRE, ops-eqiad: Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (taavi)
[20:55:12] cloud-services-team (Hardware), Patch-For-Review: cloudvirt1019: hpssacli not found - https://phabricator.wikimedia.org/T313984 (taavi) Stalled→Resolved This host is long gone so I'm assuming this task is no longer relevant.
[22:10:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:48:56] (ToolsToolsDBWritableState) firing: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[23:03:56] (ToolsToolsDBWritableState) resolved: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[23:40:04] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[23:45:04] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed