[00:10:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:19:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:20:03] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:23:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[00:34:03] (InstanceDown) resolved: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:35:37] (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[00:48:45] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[00:53:45] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[00:54:32] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[00:55:37] (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[01:03:45] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[01:04:23] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[01:04:32] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[01:07:44] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643)
[01:07:48] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643)
[01:07:48] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643)
[01:09:32] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[01:29:26] Grid-Engine-to-K8s-Migration: Migrate huji from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319800 (Huji) Open→In progress
[01:29:40] Grid-Engine-to-K8s-Migration, User-Huji: Migrate huji from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319800 (Huji)
[01:30:38] Grid-Engine-to-K8s-Migration: Migrate checkdictation-fa from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319623 (Huji) @Ladsgroup any chance you could take this on?
[01:44:32] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[01:51:59] (PuppetFailure) firing: Puppet has failed on clouddumps1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:52:04] cloud-services-team: PuppetFailure clouddumps1002:9100 Puppet failure on clouddumps1002:9100 - https://phabricator.wikimedia.org/T350096 (phaultfinder)
[01:59:06] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[01:59:12] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[01:59:54] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Andrew)
[02:14:06] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (Andrew) This happened yesterday, and again today. Oct 29 17:23:09 Oct 30 17:32:29
[02:23:45] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[02:26:53] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[02:27:30] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.drain_node (exit_code=97) (348643)
[02:27:37] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node (348643)
[02:27:55] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) (348643)
[02:49:32] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[03:28:59] (PuppetConstantChange) resolved: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[04:04:23] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[04:13:37] (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[04:33:37] (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[05:09:03] (InstanceDown) firing: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:09:37] (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[05:14:37] (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[05:20:10] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643)
[05:29:37] (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[05:52:14] (PuppetFailure) firing: Puppet has failed on clouddumps1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:16:00] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (348643)
[06:49:33] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[07:04:23] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[08:14:03] (InstanceDown) firing: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[08:19:03] (InstanceDown) resolved: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[08:57:59] (PuppetFailure) firing: Puppet has failed on cloudcontrol1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:58:04] cloud-services-team: PuppetFailure cloudcontrol1005:9100 Puppet failure on cloudcontrol1005:9100 - https://phabricator.wikimedia.org/T350115 (phaultfinder)
[09:01:59] (PuppetFailure) resolved: Puppet has failed on clouddumps1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:02:59] (PuppetFailure) resolved: Puppet has failed on cloudcontrol1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:08:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:12:02] Grid-Engine-to-K8s-Migration: Migrate checkdictation-fa from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319623 (Ladsgroup) It's a really old code that I don't know how it even works. I also really think it needs a rewrite to use [[http://hunspell.github.io/|hunspell-fa]...
[09:29:37] (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[09:34:37] (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[09:56:19] (HAProxyBackendUnavailable) firing: HAProxy service mysql backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[10:01:19] (HAProxyBackendUnavailable) resolved: HAProxy service mysql backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[10:04:23] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[10:32:16] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (fnegri) mariadb-server has been updated on all 3 hosts, and the cluster is looking fine: `SHOW STATUS LIKE "wsrep_local_state_comment";` returns `Synced` on all hosts,...
[10:37:36] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudcontrol1007.eqiad.wmnet with OS bookworm
[10:40:19] (HAProxyBackendUnavailable) firing: (14) HAProxy service cinder-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[10:40:40] (GaleraClusterSizeMismatch) firing: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[10:40:40] (GaleraClusterSizeMismatch) firing: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[10:41:19] (HAProxyServiceUnavailable) firing: (2) HAProxy service neutron-api_backend has no available backends on cloudlb1002:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[10:41:19] (HAProxyServiceUnavailable) firing: (2) HAProxy service neutron-api_backend has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[10:41:25] cloud-services-team: HAProxyServiceUnavailable cloudlb1002:9900 - https://phabricator.wikimedia.org/T350127 (phaultfinder)
[10:41:27] cloud-services-team: HAProxyServiceUnavailable cloudlb1001:9900 - https://phabricator.wikimedia.org/T350128 (phaultfinder)
[10:45:19] (HAProxyBackendUnavailable) firing: (17) HAProxy service cinder-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[10:46:19] (HAProxyServiceUnavailable) resolved: (2) HAProxy service neutron-api_backend has no available backends on cloudlb1002:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[10:46:19] (HAProxyServiceUnavailable) resolved: (2) HAProxy service neutron-api_backend has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[10:53:45] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[11:01:33] (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[11:07:28] Cloud-VPS, cloud-services-team: WMCS public range diffscan - https://phabricator.wikimedia.org/T206653 (taavi) a:taavi
[11:11:37] (CephSlowOps) firing: Ceph cluster in eqiad has 5 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[11:11:48] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (phaultfinder)
[11:15:19] (HAProxyBackendUnavailable) firing: (13) HAProxy service cinder-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[11:15:37] (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[11:20:19] (HAProxyBackendUnavailable) firing: (13) HAProxy service cinder-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[11:20:37] (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[11:21:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 2 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[11:23:45] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[11:24:31] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudcontrol1007.eqiad.wmnet with OS bookworm completed: - cloudcontrol10...
[11:25:19] (HAProxyBackendUnavailable) firing: (13) HAProxy service cinder-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[11:25:40] (GaleraClusterSizeMismatch) resolved: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[11:25:40] (GaleraClusterSizeMismatch) resolved: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[11:32:03] Cloud-VPS, cloud-services-team, Security: cloudvirt hosts ssh and node-exporter ports are reachable from instances via cloud-private - https://phabricator.wikimedia.org/T350130 (taavi) Open→Resolved Patch deployed.
[11:32:09] Cloud-VPS, cloud-services-team, Security: cloudvirt hosts ssh and node-exporter ports are reachable from instances via cloud-private - https://phabricator.wikimedia.org/T350130 (taavi)
[11:38:16] Cloud-VPS, cloud-services-team, Infrastructure-Foundations, netops: Restrict traffic from instances to private IPs on cloudgw level - https://phabricator.wikimedia.org/T350132 (taavi)
[11:53:03] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:58:45] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[12:08:45] (OpenstackAPIResponse) firing: (5) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[12:13:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:22:48] Tools, Commons: ZoomViewer produces a 503 error - https://phabricator.wikimedia.org/T343796 (WMDE-Fisch) The brokenness of this tool made it to #wmde-techwish's current experimental "We try to help fixing some tools" working mode. And I had a look to figure out if we could help here. >>! In T343796#9119...
[12:53:04] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:53:37] cloud-services-team (FY2023/2024-Q1), Infrastructure-Foundations, Packaging: wmfbackups packages for Debian Bookworm - https://phabricator.wikimedia.org/T347740 (jcrespo) ` reprepro changes: add bookworm-wikimedia deb main amd64 wmfbackups 0.8.3+deb12u1 -- pool/main/w/wmfbackups/wmfbackups_0.8.3+deb1...
[12:56:33] (SystemdUnitDown) firing: The service unit prometheus-openstack-exporter.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[12:56:33] (SystemdUnitDownForLong) firing: The systemd unit nova-fullstack.service on node cloudcontrol1006 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[12:56:39] cloud-services-team: SystemdUnitDownForLong cloudcontrol1006:9100 Unit nova-fullstack.service on node cloudcontrol1006 has been down for long. - https://phabricator.wikimedia.org/T350144 (phaultfinder)
[12:57:03] (WidespreadPuppetAgentFailure) firing: Widespread puppet agent failures in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure
[12:57:03] (WidespreadPuppetAgentFailure) firing: Widespread puppet agent failures in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure
[13:00:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:02:03] (WidespreadPuppetAgentFailure) resolved: Widespread puppet agent failures in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure
[13:02:03] (WidespreadPuppetAgentFailure) resolved: Widespread puppet agent failures in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure
[13:05:03] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:09:23] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[13:12:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:17:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:19:32] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[13:21:33] (SystemdUnitDownForLong) firing: The systemd unit prometheus-openstack-exporter.service on node cloudcontrol1007 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[13:21:38] cloud-services-team: SystemdUnitDownForLong cloudcontrol1007:9100 Unit prometheus-openstack-exporter.service on node cloudcontrol1007 has been down for long. - https://phabricator.wikimedia.org/T350146 (phaultfinder)
[13:27:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:59:31] Grid-Engine-to-K8s-Migration: Migrate lahitools from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319851 (Lahi) @nskaggs apologies for the delay I have disabled the tool so it can be archived and deleted. There is little value into migrating the task since it was created...
[14:07:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:10:10] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[14:11:24] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node (348643)
[14:12:01] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) (348643)
[14:14:23] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Andrew)
[14:17:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:32:56] (ToolsGridQueueProblem) firing: Grid queue webgrid-lighttpd@tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[14:34:37] (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[14:35:58] Grid-Engine-to-K8s-Migration: Migrate lahitools from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319851 (komla) >>! In T319851#9295001, @Lahi wrote: > @nskaggs apologies for the delay > > I have disabled the tool so it can be archived and deleted. There is little value...
[14:47:41] (PS1) Giuseppe Lavagetto: Add fake ssh private key for docker::builder [labs/private] - https://gerrit.wikimedia.org/r/970393
[14:47:50] Tools, Commons: ZoomViewer produces a 503 error - https://phabricator.wikimedia.org/T343796 (dschwen) This is weird: for ` https://zoomviewer.toolforge.org/fcgi-bin/iipsrv.fcgi?FIF=cache/779543aa14d92a2dff180a4cbc0eb2f6.tif&obj=IIP,1.0&obj=Max-size&obj=Tile-size&obj=Resolution-number ` I get ` Unable...
[14:48:04] (CR) Giuseppe Lavagetto: [V: +2 C: +2] Add fake ssh private key for docker::builder [labs/private] - https://gerrit.wikimedia.org/r/970393 (owner: Giuseppe Lavagetto)
[14:49:37] (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[14:56:05] (PS1) Giuseppe Lavagetto: docker::builder: strings must be strings in yaml [labs/private] - https://gerrit.wikimedia.org/r/970395
[14:56:47] (CR) Giuseppe Lavagetto: [V: +2 C: +2] docker::builder: strings must be strings in yaml [labs/private] - https://gerrit.wikimedia.org/r/970395 (owner: Giuseppe Lavagetto)
[14:58:03] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (fnegri)
[15:01:48] (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:25:19] (HAProxyBackendUnavailable) firing: (2) HAProxy service keystone-admin-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[15:29:32] (OpenstackAPIResponse) firing: (7) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[15:33:35] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[15:36:03] (PuppetAgentDisabled) firing: (2) Puppet agent disabled on instance quarry-dev-03 in project quarry - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentDisabled
[15:52:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[16:02:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[16:05:20] (HAProxyBackendUnavailable) resolved: (2) HAProxy service keystone-admin-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[16:06:33] (SystemdUnitDown) firing: (2) The service unit keystone_rotate_keys.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[16:09:23] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[16:15:54] Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm
[16:16:56] Cloud Services Proposals, Toolforge (Toolforge iteration 02): Decision request – Toolforge CLI consolidation - https://phabricator.wikimedia.org/T348749 (fnegri)
[16:27:39] Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm executed with...
[16:28:59] 10Horizon: Unable to retrieve instances - https://phabricator.wikimedia.org/T350172 (10TheresNoTime) [16:31:40] 10Horizon, 10cloud-services-team: Unable to retrieve instances - https://phabricator.wikimedia.org/T350172 (10TheresNoTime) [16:34:04] 10Grid-Engine-to-K8s-Migration: Migrate mgp-cewbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319890 (10Leranjun) a:05Leranjun→03Kanashimi Reassigning this issue to @Kanashimi as this tool is linked to cewbot in T319622 [16:35:09] 10Horizon, 10cloud-services-team: Unable to retrieve instances - https://phabricator.wikimedia.org/T350172 (10TheresNoTime) nb. also sometimes getting an error popup with `Error: Project switch failed for user "Samtar".` [16:36:33] (SystemdUnitDown) firing: The service unit labs-ip-alias-dump.service is in failed status on host cloudservices1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudservices1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:36:33] (SystemdUnitDown) firing: The service unit labs-ip-alias-dump.service is in failed status on host cloudservices1005. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudservices1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:36:41] PROBLEM - Check systemd state on cloudservices1005 is CRITICAL: CRITICAL - degraded: The following units failed: labs-ip-alias-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:37:02] 10Grid-Engine-to-K8s-Migration: Migrate mgp-cewbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319890 (10Leranjun) [16:37:04] 10Grid-Engine-to-K8s-Migration: Migrate cewbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319622 (10Leranjun) [16:43:47] 10Grid-Engine-to-K8s-Migration: Migrate lihaohong-bot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319856 (10Leranjun) a:03lihaohong [16:44:33] (SystemdUnitDown) firing: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:49:33] (SystemdUnitDown) resolved: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:52:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [16:55:55] PROBLEM - Check systemd state on cloudservices1006 is CRITICAL: CRITICAL - degraded: The following units failed: labs-ip-alias-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:56:40] (03PS1) 10David Caro: some fixes, to sort out [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/970414 [16:56:48] (SystemdUnitDownForLong) firing: The systemd unit nova-fullstack.service on node cloudcontrol1006 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [16:57:33] (SystemdUnitDown) firing: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:58:28] (03CR) 10David Caro: "@andrewq, you can use this to manage the ceph downyime stuff, the one i was usin is the 'drain_node' cookbook passing more than one node" [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/970414 (owner: 10David Caro) [17:00:24] (03CR) 10CI reject: [V: 04-1] some fixes, to sort out [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/970414 (owner: 10David Caro) [17:01:14] 10Horizon, 10cloud-services-team (FY2023/2024-Q1): Unable to retrieve instances - https://phabricator.wikimedia.org/T350172 (10fnegri) 05Open→03In progress p:05Triage→03High [17:01:20] 10Horizon, 10cloud-services-team (FY2023/2024-Q1): Unable to retrieve instances - https://phabricator.wikimedia.org/T350172 (10fnegri) a:03fnegri [17:02:33] (SystemdUnitDown) resolved: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:06:29] PROBLEM - Check unit status of backup_vms on cloudbackup1004 is CRITICAL: CRITICAL: Status of the systemd unit backup_vms https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_unit_status_of_backup_vms [17:07:33] (SystemdUnitDown) firing: The service unit backup_vms.service is in failed status on host cloudbackup1004. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:12:33] (SystemdUnitDown) firing: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:17:33] (SystemdUnitDown) resolved: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:19:19] (HAProxyBackendUnavailable) firing: HAProxy service keystone-public-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [17:21:48] (SystemdUnitDownForLong) firing: The systemd unit prometheus-openstack-exporter.service on node cloudcontrol1007 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [17:24:19] (HAProxyBackendUnavailable) resolved: HAProxy service keystone-public-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [17:27:33] (SystemdUnitDown) firing: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:30:09] RECOVERY - Check systemd state on cloudservices1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:30:23] 10Horizon, 10cloud-services-team (FY2023/2024-Q1): Unable to retrieve instances - https://phabricator.wikimedia.org/T350172 (10fnegri) I regenerated the Fernet tokens as described at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Procedures_and_operations#Rotating_or_revoking_keystone_fernet_tokens... [17:32:33] (SystemdUnitDown) resolved: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:32:56] (ToolsToolsDBWritableState) firing: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState [17:32:56] (ToolsGridQueueProblem) firing: Grid queue webgrid-lighttpd@tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [17:34:21] PROBLEM - Check systemd state on cloudservices1005 is CRITICAL: CRITICAL - degraded: The following units failed: labs-ip-alias-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:36:31] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (10fnegri) And one more time, seems to be roughly around the same time every day: ` Oct 31 17:27:00 tools-db-1 syst... [17:37:56] (ToolsToolsDBWritableState) resolved: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState [17:44:33] (SystemdUnitDown) firing: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:46:23] 10Horizon, 10cloud-services-team (FY2023/2024-Q1): Unable to retrieve instances - https://phabricator.wikimedia.org/T350172 (10fnegri) 05In progress→03Resolved Resolving for now, please reopen if you see more issues. [17:47:56] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (10taavi) It seems like there's some scheduled job that's running a query that's causing MariaDB to crash? [17:49:33] (SystemdUnitDown) resolved: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:51:44] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (10fnegri) That's also my assumption, but I'm not sure how to find it! [17:54:02] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10fnegri) The Systemctl unit calls `/usr/local/sbin/prometheus-openstack-exporter-wrapper` that in turn calls `/usr/bin/prometheus-openstack-exporter`. The latter... [17:57:33] (SystemdUnitDown) firing: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:01:33] (SystemdUnitDownForLong) firing: (2) The systemd unit keystone_rotate_keys.service on node cloudcontrol1007 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [18:01:39] 10cloud-services-team: SystemdUnitDownForLong cloudcontrol1007:9100 - https://phabricator.wikimedia.org/T350178 (10phaultfinder) [18:02:33] (SystemdUnitDown) resolved: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:03:35] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10taavi) I think the current exporter version we're supposed to be using is written in Go, so that seems very wrong. [18:06:28] 10Horizon, 10cloud-services-team (FY2023/2024-Q1): Unable to retrieve instances - https://phabricator.wikimedia.org/T350172 (10TheresNoTime) 05Resolved→03Open ` Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible. 
10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10fnegri) There's a very old version installed of that package in cloudcontrol1007: ` ii prometheus-openstack-exporter 0.1.4-2.2 all Prometheus expo... [18:20:54] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10taavi) Looks like we pull the new version from a local apt.wm.o component: `lang=shell-session taavi@cloudcontrol1006 ~ $ apt-cache policy prometheus-openstack-... [18:23:31] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10fnegri) Related: {https://phabricator.wikimedia.org/T302178} [18:24:25] 10Toolforge (Toolforge iteration 02), 10Patch-For-Review: Add `toolforge envvars quota` - https://phabricator.wikimedia.org/T341087 (10CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/17 [envvars-api.quota] create quota endpoint [18:27:38] 10Horizon, 10cloud-services-team (FY2023/2024-Q1): Unable to retrieve instances - https://phabricator.wikimedia.org/T350172 (10TheresNoTime) 05Open→03Resolved ` 18:13:17 this is on cloudcontrol1007: 2023-10-31 18:13:00.079 236796 ERROR keystone PermissionError: [Errno 13] Permission denied: '/etc/k... [18:30:07] RECOVERY - Check systemd state on cloudservices1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:31:13] RECOVERY - Check systemd state on cloudservices1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:31:33] (SystemdUnitDown) resolved: The service unit labs-ip-alias-dump.service is in failed status on host cloudservices1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudservices1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:31:33] (SystemdUnitDown) resolved: The service unit labs-ip-alias-dump.service is in failed status on host cloudservices1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudservices1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:33:35] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [18:34:38] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10fnegri) I guess the package was built and uploaded only to bullseye-wikimedia, we need to have the same package under bookworm-wikimedia. Now if only I could f... [18:36:50] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10fnegri) This might be what I was looking for: https://wikitech.wikimedia.org/wiki/Reprepro#Copying_between_distributions [18:39:32] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [18:41:12] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643) [18:43:17] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10taavi) Yes, but you need to define the component in the reprepro config file first. [18:48:33] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10fnegri) ah-ha here's why the reprepro command was failing! [18:57:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [19:01:48] (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:02:34] (SystemdUnitDownForLong) firing: The systemd unit backup_vms.service on node cloudbackup1004 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [19:05:13] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1), 10Patch-For-Review: [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10fnegri) This seems to have worked: ` root@apt1001:~# reprepro -C component/prometheus-openstack-exporter copy bookworm-wikimedia bullseye... [19:09:23] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [19:10:03] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1), 10Patch-For-Review: [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10fnegri) Re-run puppet on cloudcontrol1007 and it's looking good: ` root@cloudcontrol1007:~# systemctl status prometheus-openstack-exporte... [19:11:33] (SystemdUnitDown) firing: (2) The service unit keystone_rotate_keys.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:11:34] (SystemdUnitDownForLong) firing: (2) The systemd unit keystone_rotate_keys.service on node cloudcontrol1007 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [19:17:03] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [19:23:17] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Fix ceph-common version in Bookworm - https://phabricator.wikimedia.org/T350188 (10fnegri) [19:23:43] 10Toolforge (Toolforge iteration 02), 10Patch-For-Review: Add `toolforge envvars quota` - https://phabricator.wikimedia.org/T341087 (10CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/9 [envvars_quota] add toolforge envvars quota command [19:23:48] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Fix ceph-common version in Bookworm - https://phabricator.wikimedia.org/T350188 (10fnegri) 05Open→03In progress p:05Triage→03High [19:23:52] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10fnegri) [19:24:45] 10Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (10rook) https://quarry-test.wmcloud.org offers a running, but not working, quarry on k8s. When I run a query it is giving: ` Can't connect to MySQL server on 'enwiki' ([Errno -2] Name or service not known) ` Presumably someth... 
[19:25:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [19:26:02] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (fnegri) Open→Resolved [19:26:06] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (fnegri) [19:27:51] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (fnegri) [19:28:03] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (fnegri) Resolved→Open Reopening because I want to enable the exporter in codfw as well, so we will catch similar issues in the future when testing in co... [19:28:44] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (fnegri) Open→In progress [19:28:48] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (fnegri) [19:29:16] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (fnegri) p:Triage→High [19:42:17] VPS-project-Wikistats, collaboration-services, User-RhinosF1: Add 'wikitide' to wikistats - https://phabricator.wikimedia.org/T349660 (Dzahn) @RhinosF1 As far as I remember our import script deleted the whole table and then freshly added all wikis it gets from the Miraheze API call. Shouldn't that me... 
[19:42:57] VPS-project-Wikistats, collaboration-services, User-RhinosF1: Add 'wikitide' to wikistats - https://phabricator.wikimedia.org/T349660 (Dzahn) Let's talk about this in a scheduled time slot for wikistats work (per IRC chat today), maybe next week ? [19:55:51] Toolforge (Toolforge iteration 02): [envvars-api] avoid invalidating go mod download cache on each code change - https://phabricator.wikimedia.org/T350193 (Raymond_Ndibe) [20:06:55] Toolforge (Toolforge iteration 02), Patch-For-Review: [envvars-api] avoid invalidating go mod download cache on each code change - https://phabricator.wikimedia.org/T350193 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/18 [envvars-ap... [20:08:26] Toolforge (Toolforge iteration 02), Patch-For-Review: [envvars-api] avoid invalidating go mod download cache on each code change - https://phabricator.wikimedia.org/T350193 (Raymond_Ndibe) Open→In progress [20:15:34] Grid-Engine-to-K8s-Migration: Migrate pb from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319957 (Euku) Open→Resolved As the link above (https://grid-deprecation.toolforge.org/t/pb) shows the tool was already migrated (weeks ago). Setting ticket to "Resolved". [20:16:00] Grid-Engine-to-K8s-Migration: Migrate mp from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319909 (Euku) Open→Resolved As the link above (https://grid-deprecation.toolforge.org/t/mp) shows the tool was already migrated (weeks ago). Setting ticket to "Resolved". 
[20:20:03] (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:29:03] (InstanceDown) firing: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:32:56] (ToolsGridQueueProblem) firing: Grid queue webgrid-lighttpd@tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [20:33:08] 10Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (10SD0001) @rook This is due to misconfigured db config. I can see config.yaml has `REPLICA_DOMAIN: ''` which could be overriding the valid value provided a few lines above it. [20:53:14] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [20:53:50] 10Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (10rook) >>! In T349032#9296495, @SD0001 wrote: > @rook This is due to misconfigured db config. I can see config.yaml has `REPLICA_DOMAIN: ''` which could be overriding the valid value provided a few lines above it. ooo so it... [20:56:48] (SystemdUnitDownForLong) firing: The systemd unit nova-fullstack.service on node cloudcontrol1006 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [21:01:11] 10Grid-Engine-to-K8s-Migration: Migrate lahitools from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319851 (10Aklapper) @Lahi: https://toolsadmin.wikimedia.org/ should allow you to delete the tool [21:07:48] (SystemdUnitDown) firing: The service unit backup_vms.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [21:24:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [21:33:35] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [22:01:53] 10cloud-services-team: SystemdUnitDownForLong cloudcontrol1007:9100 Unit keystone_rotate_keys.service on node cloudcontrol1007 has been down for long. - https://phabricator.wikimedia.org/T350198 (10phaultfinder) [22:09:23] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [22:43:46] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [23:02:48] (SystemdUnitDownForLong) firing: The systemd unit backup_vms.service on node cloudbackup1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [23:03:46] (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [23:06:33] (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [23:11:48] (SystemdUnitDown) firing: The service unit keystone_rotate_keys.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [23:11:48] (SystemdUnitDownForLong) firing: The systemd unit keystone_rotate_keys.service on node cloudcontrol1007 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [23:24:56] 10Tool-Pageviews, 10Data-Engineering, 10Data-Engineering-Wikistats, 10Inuka-Team, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (10Htriedman) 05Open→03Resolved a:03Htriedman [23:28:46] (OpenstackAPIResponse) firing: (7) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [23:32:56] (ToolsGridQueueProblem) firing: Grid queue webgrid-lighttpd@tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [23:44:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [23:45:15] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643) [23:49:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [23:53:46] (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse