[00:14:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [00:19:03] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [00:34:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [00:34:03] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [00:39:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [00:44:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [00:44:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [00:49:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [00:49:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources 
[00:50:29] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [00:54:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [00:59:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [01:04:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [01:09:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [01:14:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [01:14:33] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [01:19:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [01:34:03] (PuppetAgentStaleLastRun) firing: 
(3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [01:39:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [01:54:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [01:54:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [01:59:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [01:59:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [02:19:03] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [02:24:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [02:24:03] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [02:34:03] 
(PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [02:40:29] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [02:53:35] (HAProxyBackendUnavailable) firing: (2) HAProxy service nova-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [03:04:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [03:04:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-haproxy-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [03:09:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [03:14:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [03:14:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [03:19:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on 
instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [03:19:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [03:49:03] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [04:03:20] (HAProxyBackendUnavailable) firing: (3) HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [04:08:20] (HAProxyBackendUnavailable) firing: (3) HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [04:11:37] (CephSlowOps) firing: Ceph cluster in eqiad has 105 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [04:11:42] 10cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (10phaultfinder) [04:14:33] (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [04:14:33] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [04:15:37] (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [04:16:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 14 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [04:20:37] (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [04:24:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [04:24:03] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [04:24:33] (SystemdUnitDown) resolved: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [04:34:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [04:41:56] (ToolsToolsDBReplicationError) firing: ToolsDB replication is broken on tools-db-2 (errno 1595) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationError [04:42:56] (ToolsToolsDBReplicationMissing) firing: ToolsDB replication is not running on tools-db-1 (errno 0) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationMissing [04:50:19] PAWS, Pywikibot, Documentation, Pywikibot-Documentation, good first task: Move Pywikibot PAWS tutorial out of Jupyter notebook onto wiki - https://phabricator.wikimedia.org/T342397 (Enag2000) The draft has been completed, pending a few minor formatting edits. If approved, I will move it out o... 
[04:54:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [05:04:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [05:14:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [05:14:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-haproxy-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [05:19:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [05:19:04] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-haproxy-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [05:34:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [05:34:04] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-haproxy-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [05:39:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [05:43:34] (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [05:44:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [05:53:34] (SystemdUnitDown) resolved: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [06:14:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [06:14:04] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [06:19:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [06:24:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [06:24:04] (PuppetAgentNoResources) firing: (2) No Puppet resources found on 
instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [06:29:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [06:29:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [06:33:33] (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [06:40:45] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [06:53:33] (SystemdUnitDown) resolved: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [07:01:02] RECOVERY - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2002 is OK: OK: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:04:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [07:04:04] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [07:04:33] (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [07:09:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [07:12:14] PROBLEM - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2002 is CRITICAL: CRITICAL: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:14:33] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [07:24:33] (SystemdUnitDown) resolved: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [07:41:56] (ToolsToolsDBReplicationError) firing: ToolsDB replication is broken on tools-db-2 (errno 1595) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationError [07:42:56] (ToolsToolsDBReplicationMissing) firing: ToolsDB replication is not running on tools-db-1 (errno 0) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationMissing [07:49:03] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [08:04:04] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [08:04:04] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [08:08:35] (HAProxyBackendUnavailable) firing: (2) HAProxy service nova-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [08:09:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [08:12:26] 10Tools: 
khanamalumat has a job that puts a lot of text in a log file when not doing any changes - https://phabricator.wikimedia.org/T278199 (Aklapper) a:Ameen.Akbar→None @Ameen.Akbar: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting th... [08:12:44] cloud-services-team, Dumps-Generation, User-dcaro: Restructure rsync of XML/SQL dumps and dumpsdata space/network/disk use - https://phabricator.wikimedia.org/T289048 (Aklapper) a:ArielGlenn→None @ArielGlenn: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/As... [08:13:08] Cloud-VPS, cloud-services-team: Investigate new roles and policies in openstack Xena - https://phabricator.wikimedia.org/T276018 (Aklapper) a:Andrew→None @Andrew: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of th... [08:13:11] Cloud-VPS, Data-Services, cloud-services-team, User-Marostegui: Investigate, adjust default access policies for Trove and trove-dashboard - https://phabricator.wikimedia.org/T281655 (Aklapper) a:Andrew→None @Andrew: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_ma... [08:13:23] Tools: Move cdnjs' cron job to the beta toolforge jobs api - https://phabricator.wikimedia.org/T286804 (Aklapper) a:Bstorm→None @Bstorm: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of this task because there has not... [08:13:32] Tools: Shut down certmon? - https://phabricator.wikimedia.org/T284947 (Aklapper) a:Bstorm→None @Bstorm: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of this task because there has not been progress lately (please corr... 
[08:14:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [08:15:02] Tools: zoomviewer taking up a lot of NFS space -- please clean up - https://phabricator.wikimedia.org/T285018 (Aklapper) a:dschwen→None @dschwen: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of this task because there... [08:17:53] Tool-bodh: Enable adding or editing senses with lang codes - https://phabricator.wikimedia.org/T285639 (Aklapper) a:Jay-A2K→None @Jay-A2K: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of this task because there has no... [08:17:57] Tool-bodh: Support monolingual texts in Bodh tool - https://phabricator.wikimedia.org/T285654 (Aklapper) a:Jay-A2K→None @Jay-A2K: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of this task because there has not been pr... [08:18:01] Tool-bodh: Support media file in bodh tool - https://phabricator.wikimedia.org/T285655 (Aklapper) a:Jay-A2K→None @Jay-A2K: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of this task because there has not been progress... 
[08:19:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [08:23:13] 10cloud-services-team, 10Infrastructure-Foundations, 10SRE, 10Puppet: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (10Aklapper) a:05Paladox→03None @Paladox: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cl... [08:34:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [08:38:20] (HAProxyBackendUnavailable) firing: (3) HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [08:39:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [08:43:20] (HAProxyBackendUnavailable) firing: (3) HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [08:43:33] (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [08:44:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [08:44:04] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [08:53:33] (SystemdUnitDown) resolved: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [08:54:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [09:04:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [09:09:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [09:14:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [09:19:03] 
(PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [09:34:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [09:43:33] (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [09:44:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [09:47:20] 10Cloud-Services: Linting problems found for NovafullstackSustainedFailures - https://phabricator.wikimedia.org/T351698 (10fgiunchedi) The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more spec... [09:49:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [09:49:04] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [09:53:34] (SystemdUnitDown) resolved: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [09:59:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [09:59:04] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [10:03:33] (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [10:04:03] (PuppetAgentStaleLastRun) firing: (2) Last Puppet run was over 24 hours ago on instance metricsinfra-haproxy-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [10:04:04] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-haproxy-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [10:04:54] 10Cloud-VPS, 10cloud-services-team: Linting problems found for NovafullstackSustainedFailures - https://phabricator.wikimedia.org/T351698 (10taavi) a:03taavi [10:09:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [10:10:15] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [10:11:39] !log 
taavi@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.restart_openstack (exit_code=99) [10:13:33] (SystemdUnitDown) resolved: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [10:13:58] 10Cloud-VPS, 10cloud-services-team: Linting problems found for NovafullstackSustainedFailures - https://phabricator.wikimedia.org/T351698 (10taavi) AFAICT this is because the metrics are being exported via node-exporter which ends up in the `ops` instance but the alert is configured on the `cloud` instance. Gi... [10:14:33] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [10:18:20] (HAProxyBackendUnavailable) resolved: (2) HAProxy service nova-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [10:27:56] (ToolsToolsDBReplicationMissing) resolved: ToolsDB replication is not running on tools-db-1 (errno 0) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationMissing [10:28:56] (ToolsToolsDBReplicationLagIsTooHigh) firing: ToolsDB replication on tools-db-2 is lagging behind the primary, the current lag is 20254 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh [10:31:56] (ToolsToolsDBReplicationError) 
resolved: ToolsDB replication is broken on tools-db-2 (errno 1595) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationError [10:44:03] (PuppetAgentNoResources) resolved: No Puppet resources found on instance metricsinfra-puppetmaster-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [10:45:03] (PuppetAgentNoResources) firing: No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [10:45:29] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [10:50:03] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [10:53:34] 10Toolforge (Quota-requests): Request increased quota for anchor-corrector Toolforge tool - https://phabricator.wikimedia.org/T350484 (10taavi) This is a bug in the error handling apparently, but the error seems to be: ` The CronJob "k8s-tools.anchor-corrector-20201008.fix-anchor.archives.en" is invalid: metadat... 
[10:54:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [10:54:41] Toolforge Jobs framework: Better validate job names - https://phabricator.wikimedia.org/T351705 (taavi) [10:54:45] Toolforge Jobs framework: Better validate job names - https://phabricator.wikimedia.org/T351705 (taavi) a:taavi [10:59:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [11:24:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [11:29:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [11:30:29] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [12:24:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [12:25:03] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [12:34:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [13:04:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [13:05:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-haproxy-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [13:09:04] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [13:10:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [13:14:33] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [13:23:34] Cloud-VPS, cloud-services-team, Sustainability (Incident Followup), User-dcaro: Move Cloud VPS auth.logs to central logging - https://phabricator.wikimedia.org/T127717 (jbond) [13:24:03] Cloud-VPS, cloud-services-team, SRE, observability, and 3 others: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (jbond) Open→Resolved a:jbond All systems have now been migrated to ossl [13:24:09] Toolforge (Toolforge iteration 02), Patch-For-Review: [envvars-cli] use toolforge-weld for error handling - https://phabricator.wikimedia.org/T351459 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/13 [envvars-cli] use toolforge-weld f... [13:28:54] cloud-services-team: PuppetFailure cloudcontrol2005-dev:9100 Puppet failure on cloudcontrol2005-dev:9100 - https://phabricator.wikimedia.org/T351107 (taavi) [13:28:58] Cloud-VPS, cloud-services-team: Instance deletion times out in codfw1dev - https://phabricator.wikimedia.org/T351061 (taavi) [13:29:02] Cloud-VPS, cloud-services-team: Instance deletion times out in codfw1dev - https://phabricator.wikimedia.org/T351061 (taavi) Open→Resolved a:taavi [13:29:06] cloud-services-team: PuppetFailure cloudcontrol2005-dev:9100 Puppet failure on cloudcontrol2005-dev:9100 - https://phabricator.wikimedia.org/T351107 (taavi) Open→Resolved a:taavi [13:29:10] cloud-services-team: galera lock-up in codfw1dev - https://phabricator.wikimedia.org/T351281 (taavi) Open→Resolved a:taavi [13:32:52] Cloud-VPS, SRE, observability: ossl rsyslog post-migration - https://phabricator.wikimedia.org/T351710 (fgiunchedi) [13:35:40] Cloud-VPS, SRE, observability: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (fgiunchedi) [13:43:56] Cloud-VPS, SRE, observability: ossl 
rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (fgiunchedi) On the rsyslog side these are the errors: ` Nov 21 13:42:58 centrallog2002 rsyslogd[2845781]: nsd_ossl:TLS session terminated with remote syslog server. [v8.2102.0] Nov 21 13:42... [13:50:03] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [14:25:03] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [14:25:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [14:29:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [14:30:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [14:34:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [14:35:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-haproxy-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [14:39:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours 
ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [14:54:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [14:58:12] 10PAWS: jupyterlab to 4.0.9 - https://phabricator.wikimedia.org/T351726 (10rook) [15:00:51] 10Cloud-VPS, 10SRE, 10observability, 10Patch-For-Review: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (10Vgutierrez) @fgiunchedi seems like a mismatch on configured curves between clients and servers, could I suggest providing a more detailed TLS configuration for both rsy... [15:04:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [15:20:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-haproxy-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:35:29] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [15:40:03] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:44:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [15:49:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [16:01:59] (PuppetZeroResources) firing: Puppet has failed generate resources on cloudcontrol2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [16:02:06] 10cloud-services-team: PuppetZeroResources cloudcontrol2004-dev:9100 Zero Puppet resources on cloudcontrol2004-dev:9100 - https://phabricator.wikimedia.org/T351739 (10phaultfinder) [16:04:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [16:05:03] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [16:06:45] (ProbeDown) firing: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [16:07:35] 10Cloud-VPS, 10SRE, 10observability, 10Patch-For-Review: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (10fgiunchedi) Thank you @Vgutierrez for the suggestion, I've dug a little bit into the situation and the code and I believe the message is a red-herring, in the sense tha... [16:09:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [16:11:45] (ProbeDown) resolved: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [16:11:59] 10Cloud-VPS, 10SRE, 10observability, 10Patch-For-Review: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (10Vgutierrez) nice, but please set a sane TLS configuration :) ideally nothing lower than TLSv1.2 and solid ciphersuites [16:14:33] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [16:16:59] (PuppetZeroResources) resolved: Puppet has failed generate resources on cloudcontrol2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [16:34:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on 
instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [16:39:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [16:44:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [16:45:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [16:49:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [16:50:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [17:04:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [17:09:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [17:10:41] 10cloud-services-team, 10Infrastructure-Foundations, 10SRE, 10Puppet: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (10Dzahn) 05Open→03Resolved a:03Dzahn I am going to be bold and call it resolved. 
Based on my previous comments. We created a Hiera k... [17:18:25] PAWS, Pywikibot, Documentation, Pywikibot-Documentation, good first task: Move Pywikibot PAWS tutorial out of Jupyter notebook onto wiki - https://phabricator.wikimedia.org/T342397 (rook) @Enag2000 this is looking quite good, thank you for working on it! The only thing that pops out to me is... [17:50:03] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [17:54:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [17:55:03] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [18:04:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [18:05:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-haproxy-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [18:09:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [18:31:33] (SystemdUnitDown) firing: The service unit export_smart_data_dump.service is in failed status on host clouddumps1001. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:40:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [18:50:03] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [18:55:03] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [18:59:04] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [19:04:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [19:06:25] RECOVERY - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2001 is OK: OK: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:14:33] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [19:18:11] PROBLEM - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2001 is CRITICAL: CRITICAL: Status of the systemd unit 
remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:18:37] (CephSlowOps) firing: Ceph cluster in eqiad has 9 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [19:18:42] 10cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (10phaultfinder) [19:20:37] (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [19:24:04] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [19:26:33] (SystemdUnitDown) resolved: The service unit export_smart_data_dump.service is in failed status on host clouddumps1001. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:28:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 4 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [19:30:37] (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [19:34:03] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [19:35:46] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [19:36:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [19:41:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [19:54:04] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [20:02:53] 10Tool-inteGraality: Internal Server Error on InteGraality - https://phabricator.wikimedia.org/T351574 (10JeanFred) 05Open→03Resolved a:03JeanFred This was due to {T326266}: as the `cloudmetrics0003` host was removed, and pystatsd has the interesting behaviour of crashing out if the statsd host is unavaila... 
[20:04:04] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [20:05:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-haproxy-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [20:09:04] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [20:25:30] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [20:34:04] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [20:35:30] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [20:39:04] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [20:40:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [20:52:02] Grid-Engine-to-K8s-Migration: Migrate srwiki from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320057 (komla) @dungodung T249787 is now resolved! The updated help page can be found here: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Running_Pywikibot_scripts [20:52:54] Grid-Engine-to-K8s-Migration: Migrate robokobot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320014 (komla) @Thibaut120094 T249787 is now resolved! The updated help page is here: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Running_Pywikibot_scripts [20:53:35] Grid-Engine-to-K8s-Migration: Migrate multichill from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319912 (komla) @Multichill T249787 is now resolved! The updated help page is here: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Running_Pywikibot_scripts [20:54:19] Grid-Engine-to-K8s-Migration: Migrate yfdyh-bot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320197 (komla) @YFdyh000 and @Kizule T249787 is now resolved! 
The updated help page is here: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Running_Pywikibot_scripts [21:04:04] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [21:05:05] Grid-Engine-to-K8s-Migration: Migrate musikbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319914 (komla) @MusikAnimal have you taken another look at this T254636 was resolved? [21:09:04] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [21:11:37] Grid-Engine-to-K8s-Migration: Migrate isprangefinder from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319820 (komla) @SQL have you had a chance to take a look at this? 
[21:12:45] (ProbeDown) firing: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [21:14:04] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [21:15:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [21:17:45] (ProbeDown) resolved: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [21:19:04] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [21:20:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [21:21:45] (ProbeDown) firing: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [21:24:04] (PuppetAgentStaleLastRun) 
firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [21:25:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [21:26:45] (ProbeDown) resolved: Service tools-k8s-haproxy-4:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [21:34:04] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [21:39:04] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [22:04:04] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [22:09:04] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [22:10:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [22:14:33] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on 
tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [22:24:04] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [22:25:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [22:34:04] (PuppetAgentStaleLastRun) firing: (3) Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [22:50:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance metricsinfra-haproxy-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [23:10:03] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance metricsinfra-alertmanager-1 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources