[00:13:08] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary [00:13:42] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) [00:14:40] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1061.eqiad.wmnet with OS bookworm completed: - cloudvirt1061 (**... [00:31:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [00:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [00:37:39] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot [00:38:20] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) [00:39:28] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1058' [00:40:08] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1058' [00:41:19] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain [00:41:59] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) [00:44:11] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain [00:44:51] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) [00:45:00] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcontrol2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [00:46:18] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain [00:46:18] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) [00:46:26] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain [00:46:26] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) [00:46:50] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain [00:47:31] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) [00:48:27] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain [00:52:42] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [00:54:23] 10Tool-masto-collab: masto-collab - 422 error when trying to approve posts - https://phabricator.wikimedia.org/T351012 (10Peachey88) [01:03:06] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) [01:04:05] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1058.eqiad.wmnet with OS bookworm [01:05:41] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain [01:21:22] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) [01:22:34] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain [01:23:20] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) [01:28:53] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [01:48:07] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1058.eqiad.wmnet with OS bookworm completed: - cloudvirt1058 (**... [01:53:33] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary [01:54:00] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) [01:58:36] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain [02:12:59] (PuppetFailure) firing: Puppet has failed on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:13:04] 10cloud-services-team: PuppetFailure cloudcumin1001:9100 Puppet failure on cloudcumin1001:9100 - https://phabricator.wikimedia.org/T351013 (10phaultfinder) [02:13:57] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) [02:14:20] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain [02:41:07] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) [02:41:46] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain [02:42:50] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) [03:31:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [03:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [04:45:14] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcontrol2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [04:52:42] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [05:28:53] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [06:04:24] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [06:09:24] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [06:12:59] (PuppetFailure) firing: Puppet has failed on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [06:14:24] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [06:19:24] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [06:31:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [06:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [07:39:04] (PuppetAgentNoResources) firing: No Puppet resources found on instance toolsbeta-mail-01 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [07:44:03] (PuppetAgentNoResources) firing: (4) No Puppet resources found on instance toolsbeta-mail-01 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [07:49:03] (PuppetAgentNoResources) firing: (13) No Puppet resources found on instance toolsbeta-harbor-1 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [07:54:03] (PuppetAgentNoResources) firing: (22) No Puppet resources found on instance toolsbeta-cumin-1 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [07:59:03] (PuppetAgentNoResources) firing: (29) No Puppet resources found on instance toolsbeta-acme-chief-01 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [08:04:03] (PuppetAgentNoResources) firing: (35) No Puppet resources found on instance toolsbeta-acme-chief-01 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [08:09:03] (PuppetAgentNoResources) firing: (41) No Puppet resources found on instance toolsbeta-acme-chief-01 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [08:49:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcontrol2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [08:52:42] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [09:31:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [09:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [10:13:14] (PuppetFailure) firing: Puppet has failed on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:19:24] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [11:02:31] 10PAWS: New upstream release 8.5.1 for Pywikibot - https://phabricator.wikimedia.org/T351015 (10Xqt) [11:09:03] (PuppetAgentNoResources) firing: (41) No Puppet resources found on instance toolsbeta-acme-chief-01 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [12:31:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [12:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [12:45:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on clouddb1019:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [12:48:00] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on clouddb1015:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [12:49:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcontrol2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [12:52:42] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [13:19:37] (CephSlowOps) firing: Ceph cluster in eqiad has 2 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [13:19:42] 10cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (10phaultfinder) [13:24:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 2 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [14:09:03] (PuppetAgentNoResources) firing: (41) No Puppet resources found on instance toolsbeta-acme-chief-01 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [14:13:14] (PuppetFailure) firing: Puppet has failed on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:15:26] PROBLEM - Check unit status of purge_vm_backup on cloudbackup1004 is CRITICAL: CRITICAL: Status of the systemd unit purge_vm_backup https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:18:33] (SystemdUnitDown) firing: The service unit purge_vm_backup.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:19:25] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [15:30:15] (03Abandoned) 10Merlijn van Deen: Add GitLab to Redis webhook [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/710665 (https://phabricator.wikimedia.org/T288381) (owner: 10Merlijn van Deen) [15:31:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [15:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [15:37:42] 10Wikibugs: Better message than "This change is ready for review" when patch stops being WIP - https://phabricator.wikimedia.org/T350778 (10valhallasw) Having logged a WIP change, I'm actually a bit confused by the behaviour you're describing. From what I can see, there _is_ now a new event, but that event shoul... [15:41:58] (03Abandoned) 10Merlijn van Deen: Create change test [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/973337 (owner: 10Merlijn van Deen) [16:09:03] (PuppetAgentNoResources) firing: (41) No Puppet resources found on instance toolsbeta-acme-chief-01 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [16:13:34] (SystemdUnitDownForLong) firing: The systemd unit purge_vm_backup.service on node cloudbackup1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [16:14:03] (PuppetAgentNoResources) firing: (41) No Puppet resources found on instance toolsbeta-acme-chief-01 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [16:19:03] (PuppetAgentNoResources) firing: (41) No Puppet resources found on instance toolsbeta-acme-chief-01 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [16:24:03] (PuppetAgentNoResources) firing: (41) No Puppet resources found on instance toolsbeta-acme-chief-01 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [16:29:03] (PuppetAgentNoResources) firing: (40) No Puppet resources found on instance toolsbeta-acme-chief-01 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [16:34:03] (PuppetAgentNoResources) firing: (32) No Puppet resources found on instance toolsbeta-cumin-1 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [16:39:03] (PuppetAgentNoResources) resolved: (26) No Puppet resources found on instance toolsbeta-cumin-1 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [16:46:16] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on clouddb1019:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [16:48:15] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on clouddb1015:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [16:49:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcontrol2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [16:52:42] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [17:14:59] (PuppetConstantChange) resolved: Puppet performing a change on every puppet run on cloudcontrol2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [18:13:14] (PuppetFailure) firing: Puppet has failed on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:18:48] (SystemdUnitDown) firing: The service unit purge_vm_backup.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:19:25] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [18:31:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [18:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [19:05:12] RECOVERY - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2001 is OK: OK: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:17:00] PROBLEM - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2001 is CRITICAL: CRITICAL: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:17:37] (CephSlowOps) firing: Ceph cluster in eqiad has 8 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [19:17:42] 10cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (10phaultfinder) [19:22:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 8 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [20:13:48] (SystemdUnitDownForLong) firing: The systemd unit purge_vm_backup.service on node cloudbackup1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [20:46:16] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on clouddb1019:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [20:48:15] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on clouddb1015:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [20:52:42] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [21:31:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [21:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [22:13:14] (PuppetFailure) firing: Puppet has failed on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [22:18:49] (SystemdUnitDown) firing: The service unit purge_vm_backup.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [22:23:53] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse