[00:10:03] <wmcs-alerts>	 (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:19:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:20:03] <wmcs-alerts>	 (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:23:59] <jinxer-wm>	 (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[00:34:03] <wmcs-alerts>	 (InstanceDown) resolved: Project tools instance tools-prometheus-7 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:35:37] <jinxer-wm>	 (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[00:48:45] <jinxer-wm>	 (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[00:53:45] <jinxer-wm>	 (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[00:54:32] <jinxer-wm>	 (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[00:55:37] <jinxer-wm>	 (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[01:03:45] <jinxer-wm>	 (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[01:04:23] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[01:04:32] <jinxer-wm>	 (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[01:07:44] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643)
[01:07:48] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643)
[01:07:48] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643)
[01:09:32] <jinxer-wm>	 (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[01:29:26] <wikibugs>	 10Grid-Engine-to-K8s-Migration: Migrate huji from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319800 (10Huji) 05Open→03In progress
[01:29:40] <wikibugs>	 10Grid-Engine-to-K8s-Migration, 10User-Huji: Migrate huji from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319800 (10Huji)
[01:30:38] <wikibugs>	 10Grid-Engine-to-K8s-Migration: Migrate checkdictation-fa from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319623 (10Huji) @Ladsgroup any chance you could take this on?
[01:44:32] <jinxer-wm>	 (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[01:51:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on clouddumps1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:52:04] <wikibugs>	 10cloud-services-team: PuppetFailure clouddumps1002:9100 Puppet failure on clouddumps1002:9100 - https://phabricator.wikimedia.org/T350096 (10phaultfinder)
[01:59:06] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[01:59:12] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[01:59:54] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1), 10DC-Ops, 10SRE, 10ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Andrew)
[02:14:06] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (10Andrew) This happened yesterday, and again today.  Oct 29 17:23:09 Oct 30 17:32:29
[02:23:45] <jinxer-wm>	 (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[02:26:53] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[02:27:30] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.drain_node (exit_code=97) (348643)
[02:27:37] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node (348643)
[02:27:55] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) (348643)
[02:49:32] <jinxer-wm>	 (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[03:28:59] <jinxer-wm>	 (PuppetConstantChange) resolved: Puppet performing a change on every puppet run on cloudweb2002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[04:04:23] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[04:13:37] <jinxer-wm>	 (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[04:33:37] <jinxer-wm>	 (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[05:09:03] <wmcs-alerts>	 (InstanceDown) firing: Project tools instance tools-prometheus-7 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:09:37] <jinxer-wm>	 (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[05:14:37] <jinxer-wm>	 (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[05:20:10] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643)
[05:29:37] <jinxer-wm>	 (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[05:52:14] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on clouddumps1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:16:00] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (348643)
[06:49:33] <jinxer-wm>	 (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[07:04:23] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[08:14:03] <wmcs-alerts>	 (InstanceDown) firing: Project tools instance tools-prometheus-7 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[08:19:03] <wmcs-alerts>	 (InstanceDown) resolved: Project tools instance tools-prometheus-7 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[08:57:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on cloudcontrol1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:58:04] <wikibugs>	 10cloud-services-team: PuppetFailure cloudcontrol1005:9100 Puppet failure on cloudcontrol1005:9100 - https://phabricator.wikimedia.org/T350115 (10phaultfinder)
[09:01:59] <jinxer-wm>	 (PuppetFailure) resolved: Puppet has failed on clouddumps1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:02:59] <jinxer-wm>	 (PuppetFailure) resolved: Puppet has failed on cloudcontrol1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:08:03] <wmcs-alerts>	 (InstanceDown) firing: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:12:02] <wikibugs>	 10Grid-Engine-to-K8s-Migration: Migrate checkdictation-fa from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319623 (10Ladsgroup) It's a really old code that I don't know how it even works. I also really think it needs a rewrite to use [[http://hunspell.github.io/|hunspell-fa]...
[09:29:37] <jinxer-wm>	 (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[09:34:37] <jinxer-wm>	 (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[09:56:19] <jinxer-wm>	 (HAProxyBackendUnavailable) firing: HAProxy service mysql backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[10:01:19] <jinxer-wm>	 (HAProxyBackendUnavailable) resolved: HAProxy service mysql backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[10:04:23] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[10:32:16] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10fnegri) mariadb-server has been updated on all 3 hosts, and the cluster is looking fine: `SHOW STATUS LIKE "wsrep_local_state_comment";` returns `Synced` on all hosts,...
[10:37:36] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudcontrol1007.eqiad.wmnet with OS bookworm
[10:40:19] <jinxer-wm>	 (HAProxyBackendUnavailable) firing: (14) HAProxy service cinder-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[10:40:40] <jinxer-wm>	 (GaleraClusterSizeMismatch) firing: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[10:40:40] <jinxer-wm>	 (GaleraClusterSizeMismatch) firing: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[10:41:19] <jinxer-wm>	 (HAProxyServiceUnavailable) firing: (2) HAProxy service neutron-api_backend has no available backends on cloudlb1002:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[10:41:19] <jinxer-wm>	 (HAProxyServiceUnavailable) firing: (2) HAProxy service neutron-api_backend has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[10:41:25] <wikibugs>	 10cloud-services-team: HAProxyServiceUnavailable cloudlb1002:9900 - https://phabricator.wikimedia.org/T350127 (10phaultfinder)
[10:41:27] <wikibugs>	 10cloud-services-team: HAProxyServiceUnavailable cloudlb1001:9900 - https://phabricator.wikimedia.org/T350128 (10phaultfinder)
[10:45:19] <jinxer-wm>	 (HAProxyBackendUnavailable) firing: (17) HAProxy service cinder-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[10:46:19] <jinxer-wm>	 (HAProxyServiceUnavailable) resolved: (2) HAProxy service neutron-api_backend has no available backends on cloudlb1002:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[10:46:19] <jinxer-wm>	 (HAProxyServiceUnavailable) resolved: (2) HAProxy service neutron-api_backend has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable
[10:53:45] <jinxer-wm>	 (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[11:01:33] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[11:07:28] <wikibugs>	 10Cloud-VPS, 10cloud-services-team: WMCS public range diffscan - https://phabricator.wikimedia.org/T206653 (10taavi) a:03taavi
[11:11:37] <jinxer-wm>	 (CephSlowOps) firing: Ceph cluster in eqiad has 5 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[11:11:48] <wikibugs>	 10cloud-services-team: CephSlowOps  Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (10phaultfinder)
[11:15:19] <jinxer-wm>	 (HAProxyBackendUnavailable) firing: (13) HAProxy service cinder-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[11:15:37] <jinxer-wm>	 (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[11:20:19] <jinxer-wm>	 (HAProxyBackendUnavailable) firing: (13) HAProxy service cinder-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[11:20:37] <jinxer-wm>	 (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[11:21:37] <jinxer-wm>	 (CephSlowOps) resolved: Ceph cluster in eqiad has 2 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[11:23:45] <jinxer-wm>	 (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[11:24:31] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudcontrol1007.eqiad.wmnet with OS bookworm completed: - cloudcontrol10...
[11:25:19] <jinxer-wm>	 (HAProxyBackendUnavailable) firing: (13) HAProxy service cinder-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[11:25:40] <jinxer-wm>	 (GaleraClusterSizeMismatch) resolved: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[11:25:40] <jinxer-wm>	 (GaleraClusterSizeMismatch) resolved: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[11:32:03] <wikibugs>	 10Cloud-VPS, 10cloud-services-team, 10Security: cloudvirt hosts ssh and node-exporter ports are reachable from instances via cloud-private - https://phabricator.wikimedia.org/T350130 (10taavi) 05Open→03Resolved Patch deployed.
[11:32:09] <wikibugs>	 10Cloud-VPS, 10cloud-services-team, 10Security: cloudvirt hosts ssh and node-exporter ports are reachable from instances via cloud-private - https://phabricator.wikimedia.org/T350130 (10taavi)
[11:38:16] <wikibugs>	 10Cloud-VPS, 10cloud-services-team, 10Infrastructure-Foundations, 10netops: Restrict traffic from instances to private IPs on cloudgw level - https://phabricator.wikimedia.org/T350132 (10taavi)
[11:53:03] <wmcs-alerts>	 (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:58:45] <jinxer-wm>	 (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[12:08:45] <jinxer-wm>	 (OpenstackAPIResponse) firing: (5) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[12:13:03] <wmcs-alerts>	 (InstanceDown) firing: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:22:48] <wikibugs>	 10Tools, 10Commons: ZoomViewer produces a 503 error - https://phabricator.wikimedia.org/T343796 (10WMDE-Fisch) The brokenness of this tool made it to #wmde-techwish's current experimental "We try to help fixing some tools" working mode. And I had a look to figure out if we could help here.  >>! In T343796#9119...
[12:53:04] <wmcs-alerts>	 (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:53:37] <wikibugs>	 10cloud-services-team (FY2023/2024-Q1), 10Infrastructure-Foundations, 10Packaging: wmfbackups packages for Debian Bookworm - https://phabricator.wikimedia.org/T347740 (10jcrespo) ` reprepro changes: add bookworm-wikimedia deb main amd64 wmfbackups 0.8.3+deb12u1 -- pool/main/w/wmfbackups/wmfbackups_0.8.3+deb1...
[12:56:33] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit prometheus-openstack-exporter.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[12:56:33] <jinxer-wm>	 (SystemdUnitDownForLong) firing: The systemd unit nova-fullstack.service on node cloudcontrol1006 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[12:56:39] <wikibugs>	 10cloud-services-team: SystemdUnitDownForLong cloudcontrol1006:9100 Unit nova-fullstack.service on node cloudcontrol1006 has been down for long. - https://phabricator.wikimedia.org/T350144 (10phaultfinder)
[12:57:03] <wmcs-alerts>	 (WidespreadPuppetAgentFailure) firing: Widespread puppet agent failures in project tools   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure
[12:57:03] <wmcs-alerts>	 (WidespreadPuppetAgentFailure) firing: Widespread puppet agent failures in project toolsbeta   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure
[13:00:03] <wmcs-alerts>	 (InstanceDown) firing: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:02:03] <wmcs-alerts>	 (WidespreadPuppetAgentFailure) resolved: Widespread puppet agent failures in project tools   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure
[13:02:03] <wmcs-alerts>	 (WidespreadPuppetAgentFailure) resolved: Widespread puppet agent failures in project toolsbeta   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure
[13:05:03] <wmcs-alerts>	 (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:09:23] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[13:12:03] <wmcs-alerts>	 (InstanceDown) firing: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:17:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:19:32] <jinxer-wm>	 (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[13:21:33] <jinxer-wm>	 (SystemdUnitDownForLong) firing: The systemd unit prometheus-openstack-exporter.service on node cloudcontrol1007 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[13:21:38] <wikibugs>	 10cloud-services-team: SystemdUnitDownForLong cloudcontrol1007:9100 Unit prometheus-openstack-exporter.service on node cloudcontrol1007 has been down for long. - https://phabricator.wikimedia.org/T350146 (10phaultfinder)
[13:27:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:59:31] <wikibugs>	 10Grid-Engine-to-K8s-Migration: Migrate lahitools from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319851 (10Lahi) @nskaggs apologies for the delay  I have disabled the tool so it can be archived and deleted. There is little value into migrating the task since it was created...
[14:07:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:10:10] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[14:11:24] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node (348643)
[14:12:01] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) (348643)
[14:14:23] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1), 10DC-Ops, 10SRE, 10ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Andrew)
[14:17:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:32:56] <wmcs-alerts>	 (ToolsGridQueueProblem) firing: Grid queue webgrid-lighttpd@tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[14:34:37] <jinxer-wm>	 (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[14:35:58] <wikibugs>	 10Grid-Engine-to-K8s-Migration: Migrate lahitools from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319851 (10komla) >>! In T319851#9295001, @Lahi wrote: > @nskaggs apologies for the delay >  > I have disabled the tool so it can be archived and deleted. There is little value...
[14:47:41] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Add fake ssh private key for docker::builder [labs/private] - 10https://gerrit.wikimedia.org/r/970393
[14:47:50] <wikibugs>	 10Tools, 10Commons: ZoomViewer produces a 503 error - https://phabricator.wikimedia.org/T343796 (10dschwen) This is weird:  for  ` https://zoomviewer.toolforge.org/fcgi-bin/iipsrv.fcgi?FIF=cache/779543aa14d92a2dff180a4cbc0eb2f6.tif&obj=IIP,1.0&obj=Max-size&obj=Tile-size&obj=Resolution-number `  I get  ` Unable...
[14:48:04] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Add fake ssh private key for docker::builder [labs/private] - 10https://gerrit.wikimedia.org/r/970393 (owner: 10Giuseppe Lavagetto)
[14:49:37] <jinxer-wm>	 (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[14:56:05] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: docker::builder: strings must be strings in yaml [labs/private] - 10https://gerrit.wikimedia.org/r/970395
[14:56:47] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] docker::builder: strings must be strings in yaml [labs/private] - 10https://gerrit.wikimedia.org/r/970395 (owner: 10Giuseppe Lavagetto)
[14:58:03] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10fnegri)
[15:01:48] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[15:25:19] <jinxer-wm>	 (HAProxyBackendUnavailable) firing: (2) HAProxy service keystone-admin-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[15:29:32] <jinxer-wm>	 (OpenstackAPIResponse) firing: (7) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[15:33:35] <wmcs-alerts>	 (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[15:36:03] <wmcs-alerts>	 (PuppetAgentDisabled) firing: (2) Puppet agent disabled on instance quarry-dev-03 in project quarry   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentDisabled
[15:52:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[16:02:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[16:05:20] <jinxer-wm>	 (HAProxyBackendUnavailable) resolved: (2) HAProxy service keystone-admin-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[16:06:33] <jinxer-wm>	 (SystemdUnitDown) firing: (2) The service unit keystone_rotate_keys.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[16:09:23] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[16:15:54] <wikibugs>	 10Cloud-VPS, 10cloud-services-team, 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm
[16:16:56] <wikibugs>	 10Cloud Services Proposals, 10Toolforge (Toolforge iteration 02): Decision request – Toolforge CLI consolidation - https://phabricator.wikimedia.org/T348749 (10fnegri)
[16:27:39] <wikibugs>	 10Cloud-VPS, 10cloud-services-team, 10Data-Platform-SRE, 10SRE, 10ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm executed with...
[16:28:59] <wikibugs>	 10Horizon: Unable to retrieve instances - https://phabricator.wikimedia.org/T350172 (10TheresNoTime)
[16:31:40] <wikibugs>	 10Horizon, 10cloud-services-team: Unable to retrieve instances - https://phabricator.wikimedia.org/T350172 (10TheresNoTime)
[16:34:04] <wikibugs>	 10Grid-Engine-to-K8s-Migration: Migrate mgp-cewbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319890 (10Leranjun) a:05Leranjun→03Kanashimi Reassigning this issue to @Kanashimi as this tool is linked to cewbot in T319622
[16:35:09] <wikibugs>	 10Horizon, 10cloud-services-team: Unable to retrieve instances - https://phabricator.wikimedia.org/T350172 (10TheresNoTime) nb. also sometimes getting an error popup with `Error: Project switch failed for user "Samtar".`
[16:36:33] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit labs-ip-alias-dump.service is in failed status on host cloudservices1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudservices1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[16:36:33] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit labs-ip-alias-dump.service is in failed status on host cloudservices1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudservices1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[16:36:41] <icinga-wm>	 PROBLEM - Check systemd state on cloudservices1005 is CRITICAL: CRITICAL - degraded: The following units failed: labs-ip-alias-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:37:02] <wikibugs>	 10Grid-Engine-to-K8s-Migration: Migrate mgp-cewbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319890 (10Leranjun)
[16:37:04] <wikibugs>	 10Grid-Engine-to-K8s-Migration: Migrate cewbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319622 (10Leranjun)
[16:43:47] <wikibugs>	 10Grid-Engine-to-K8s-Migration: Migrate lihaohong-bot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319856 (10Leranjun) a:03lihaohong
[16:44:33] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[16:49:33] <jinxer-wm>	 (SystemdUnitDown) resolved: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[16:52:59] <jinxer-wm>	 (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[16:55:55] <icinga-wm>	 PROBLEM - Check systemd state on cloudservices1006 is CRITICAL: CRITICAL - degraded: The following units failed: labs-ip-alias-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:56:40] <wikibugs>	 (03PS1) 10David Caro: some fixes, to sort out [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/970414
[16:56:48] <jinxer-wm>	 (SystemdUnitDownForLong) firing: The systemd unit nova-fullstack.service on node cloudcontrol1006 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[16:57:33] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[16:58:28] <wikibugs>	 (03CR) 10David Caro: "@andrewq, you can use this to manage the ceph downyime stuff, the one i was usin is the 'drain_node' cookbook passing more than one node" [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/970414 (owner: 10David Caro)
[17:00:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] some fixes, to sort out [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/970414 (owner: 10David Caro)
[17:01:14] <wikibugs>	 10Horizon, 10cloud-services-team (FY2023/2024-Q1): Unable to retrieve instances - https://phabricator.wikimedia.org/T350172 (10fnegri) 05Open→03In progress p:05Triage→03High
[17:01:20] <wikibugs>	 10Horizon, 10cloud-services-team (FY2023/2024-Q1): Unable to retrieve instances - https://phabricator.wikimedia.org/T350172 (10fnegri) a:03fnegri
[17:02:33] <jinxer-wm>	 (SystemdUnitDown) resolved: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[17:06:29] <icinga-wm>	 PROBLEM - Check unit status of backup_vms on cloudbackup1004 is CRITICAL: CRITICAL: Status of the systemd unit backup_vms https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_unit_status_of_backup_vms
[17:07:33] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit backup_vms.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[17:12:33] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[17:17:33] <jinxer-wm>	 (SystemdUnitDown) resolved: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[17:19:19] <jinxer-wm>	 (HAProxyBackendUnavailable) firing: HAProxy service keystone-public-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[17:21:48] <jinxer-wm>	 (SystemdUnitDownForLong) firing: The systemd unit prometheus-openstack-exporter.service on node cloudcontrol1007 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[17:24:19] <jinxer-wm>	 (HAProxyBackendUnavailable) resolved: HAProxy service keystone-public-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[17:27:33] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[17:30:09] <icinga-wm>	 RECOVERY - Check systemd state on cloudservices1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:30:23] <wikibugs>	 10Horizon, 10cloud-services-team (FY2023/2024-Q1): Unable to retrieve instances - https://phabricator.wikimedia.org/T350172 (10fnegri) I regenerated the Fernet tokens as described at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Procedures_and_operations#Rotating_or_revoking_keystone_fernet_tokens...
[17:32:33] <jinxer-wm>	 (SystemdUnitDown) resolved: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[17:32:56] <wmcs-alerts>	 (ToolsToolsDBWritableState) firing: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[17:32:56] <wmcs-alerts>	 (ToolsGridQueueProblem) firing: Grid queue webgrid-lighttpd@tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[17:34:21] <icinga-wm>	 PROBLEM - Check systemd state on cloudservices1005 is CRITICAL: CRITICAL - degraded: The following units failed: labs-ip-alias-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:36:31] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (10fnegri) And one more time, seems to be roughly around the same time every day:  ` Oct 31 17:27:00 tools-db-1 syst...
[17:37:56] <wmcs-alerts>	 (ToolsToolsDBWritableState) resolved: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[17:44:33] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[17:46:23] <wikibugs>	 10Horizon, 10cloud-services-team (FY2023/2024-Q1): Unable to retrieve instances - https://phabricator.wikimedia.org/T350172 (10fnegri) 05In progress→03Resolved Resolving for now, please reopen if you see more issues.
[17:47:56] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (10taavi) It seems like there's some scheduled job that's running a query that's causing MariaDB to crash?
[17:49:33] <jinxer-wm>	 (SystemdUnitDown) resolved: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[17:51:44] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (10fnegri) That's also my assumption, but I'm not sure how to find it!
[17:54:02] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10fnegri) The Systemctl unit calls `/usr/local/sbin/prometheus-openstack-exporter-wrapper` that in turn calls `/usr/bin/prometheus-openstack-exporter`. The latter...
[17:57:33] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[18:01:33] <jinxer-wm>	 (SystemdUnitDownForLong) firing: (2) The systemd unit keystone_rotate_keys.service on node cloudcontrol1007 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[18:01:39] <wikibugs>	 10cloud-services-team: SystemdUnitDownForLong cloudcontrol1007:9100 - https://phabricator.wikimedia.org/T350178 (10phaultfinder)
[18:02:33] <jinxer-wm>	 (SystemdUnitDown) resolved: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[18:03:35] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10taavi) I think the current exporter version we're supposed to be using is written in Go, so that seems very wrong.
[18:06:28] <wikibugs>	 10Horizon, 10cloud-services-team (FY2023/2024-Q1): Unable to retrieve instances - https://phabricator.wikimedia.org/T350172 (10TheresNoTime) 05Resolved→03Open ` Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible. <class 'neutronclient.common....
[18:19:26] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10fnegri) There's a very old version installed of that package in cloudcontrol1007:  ` ii  prometheus-openstack-exporter 0.1.4-2.2    all          Prometheus expo...
[18:20:54] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10taavi) Looks like we pull the new version from a local apt.wm.o component: `lang=shell-session taavi@cloudcontrol1006 ~ $ apt-cache policy prometheus-openstack-...
[18:23:31] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10fnegri) Related: {https://phabricator.wikimedia.org/T302178}
[18:24:25] <wikibugs>	 10Toolforge (Toolforge iteration 02), 10Patch-For-Review: Add `toolforge envvars quota` - https://phabricator.wikimedia.org/T341087 (10CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/17  [envvars-api.quota] create quota endpoint
[18:27:38] <wikibugs>	 10Horizon, 10cloud-services-team (FY2023/2024-Q1): Unable to retrieve instances - https://phabricator.wikimedia.org/T350172 (10TheresNoTime) 05Open→03Resolved ` 18:13:17 <taavi> this is on cloudcontrol1007: 2023-10-31 18:13:00.079 236796 ERROR keystone PermissionError: [Errno 13] Permission denied: '/etc/k...
[18:30:07] <icinga-wm>	 RECOVERY - Check systemd state on cloudservices1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:31:13] <icinga-wm>	 RECOVERY - Check systemd state on cloudservices1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:31:33] <jinxer-wm>	 (SystemdUnitDown) resolved: The service unit labs-ip-alias-dump.service is in failed status on host cloudservices1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudservices1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[18:31:33] <jinxer-wm>	 (SystemdUnitDown) resolved: The service unit labs-ip-alias-dump.service is in failed status on host cloudservices1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudservices1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[18:33:35] <wmcs-alerts>	 (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[18:34:38] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10fnegri) I guess the package was built and uploaded only to bullseye-wikimedia, we need to have the same package under bookworm-wikimedia.  Now if only I could f...
[18:36:50] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10fnegri) This might be what I was looking for: https://wikitech.wikimedia.org/wiki/Reprepro#Copying_between_distributions
[18:39:32] <jinxer-wm>	 (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[18:41:12] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643)
[18:43:17] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10taavi) Yes, but you need to define the component in the reprepro config file first.
[18:48:33] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10fnegri) ah-ha here's why the reprepro command was failing!
[18:57:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[19:01:48] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[19:02:34] <jinxer-wm>	 (SystemdUnitDownForLong) firing: The systemd unit backup_vms.service on node cloudbackup1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[19:05:13] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1), 10Patch-For-Review: [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10fnegri) This seems to have worked:  ` root@apt1001:~# reprepro -C component/prometheus-openstack-exporter copy bookworm-wikimedia bullseye...
[19:09:23] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[19:10:03] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1), 10Patch-For-Review: [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10fnegri) Re-run puppet on cloudcontrol1007 and it's looking good:  ` root@cloudcontrol1007:~# systemctl status prometheus-openstack-exporte...
[19:11:33] <jinxer-wm>	 (SystemdUnitDown) firing: (2) The service unit keystone_rotate_keys.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[19:11:34] <jinxer-wm>	 (SystemdUnitDownForLong) firing: (2) The systemd unit keystone_rotate_keys.service on node cloudcontrol1007 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[19:17:03] <wmcs-alerts>	 (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[19:23:17] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Fix ceph-common version in Bookworm - https://phabricator.wikimedia.org/T350188 (10fnegri)
[19:23:43] <wikibugs>	 10Toolforge (Toolforge iteration 02), 10Patch-For-Review: Add `toolforge envvars quota` - https://phabricator.wikimedia.org/T341087 (10CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/9  [envvars_quota] add toolforge envvars quota command
[19:23:48] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Fix ceph-common version in Bookworm - https://phabricator.wikimedia.org/T350188 (10fnegri) 05Open→03In progress p:05Triage→03High
[19:23:52] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10fnegri)
[19:24:45] <wikibugs>	 10Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (10rook) https://quarry-test.wmcloud.org offers a running, but not working, quarry on k8s. When I run a query it is giving: ` Can't connect to MySQL server on 'enwiki' ([Errno -2] Name or service not known) ` Presumably someth...
[19:25:03] <wmcs-alerts>	 (InstanceDown) firing: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[19:26:02] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10fnegri) 05Open→03Resolved
[19:26:06] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10fnegri)
[19:27:51] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10fnegri)
[19:28:03] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10fnegri) 05Resolved→03Open Reopening because I want to enable the exporter in codfw as well, so we will catch similar issues in the future when testing in co...
[19:28:44] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10fnegri) 05Open→03In progress
[19:28:48] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10fnegri)
[19:29:16] <wikibugs>	 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] prometheus exporter broken in bookworm - https://phabricator.wikimedia.org/T350154 (10fnegri) p:05Triage→03High
[19:42:17] <wikibugs>	 10VPS-project-Wikistats, 10collaboration-services, 10User-RhinosF1: Add 'wikitide' to wikistats - https://phabricator.wikimedia.org/T349660 (10Dzahn) @RhinosF1 As far as I remember our import script deleted the whole table and then freshly added all wikis it gets from the Miraheze API call. Shouldn't that me...
[19:42:57] <wikibugs>	 10VPS-project-Wikistats, 10collaboration-services, 10User-RhinosF1: Add 'wikitide' to wikistats - https://phabricator.wikimedia.org/T349660 (10Dzahn) Let's talk about this in a scheduled time slot for wikistats work (per IRC chat today), maybe next week ?
[19:55:51] <wikibugs>	 10Toolforge (Toolforge iteration 02): [envvars-api] avoid invalidating go mod download cache on each code change - https://phabricator.wikimedia.org/T350193 (10Raymond_Ndibe)
[20:06:55] <wikibugs>	 10Toolforge (Toolforge iteration 02), 10Patch-For-Review: [envvars-api] avoid invalidating go mod download cache on each code change - https://phabricator.wikimedia.org/T350193 (10CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/18  [envvars-ap...
[20:08:26] <wikibugs>	 10Toolforge (Toolforge iteration 02), 10Patch-For-Review: [envvars-api] avoid invalidating go mod download cache on each code change - https://phabricator.wikimedia.org/T350193 (10Raymond_Ndibe) 05Open→03In progress
[20:15:34] <wikibugs>	 10Grid-Engine-to-K8s-Migration: Migrate pb from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319957 (10Euku) 05Open→03Resolved As the link above (https://grid-deprecation.toolforge.org/t/pb) shows the tool was already migrated (weeks ago). Setting ticket to "Resolved".
[20:16:00] <wikibugs>	 10Grid-Engine-to-K8s-Migration: Migrate mp from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319909 (10Euku) 05Open→03Resolved As the link above (https://grid-deprecation.toolforge.org/t/mp) shows the tool was already migrated (weeks ago). Setting ticket to "Resolved".
[20:20:03] <wmcs-alerts>	 (InstanceDown) resolved: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:29:03] <wmcs-alerts>	 (InstanceDown) firing: Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:32:56] <wmcs-alerts>	 (ToolsGridQueueProblem) firing: Grid queue webgrid-lighttpd@tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[20:33:08] <wikibugs>	 10Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (10SD0001) @rook This is due to misconfigured db config. I can see config.yaml has `REPLICA_DOMAIN: ''` which could be overriding the valid value provided a few lines above it.
[20:53:14] <jinxer-wm>	 (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[20:53:50] <wikibugs>	 10Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (10rook) >>! In T349032#9296495, @SD0001 wrote: > @rook This is due to misconfigured db config. I can see config.yaml has `REPLICA_DOMAIN: ''` which could be overriding the valid value provided a few lines above it.  ooo so it...
[20:56:48] <jinxer-wm>	 (SystemdUnitDownForLong) firing: The systemd unit nova-fullstack.service on node cloudcontrol1006 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[21:01:11] <wikibugs>	 10Grid-Engine-to-K8s-Migration: Migrate lahitools from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319851 (10Aklapper) @Lahi: https://toolsadmin.wikimedia.org/ should allow you to delete the tool
[21:07:48] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit backup_vms.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[21:24:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[21:33:35] <wmcs-alerts>	 (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[22:01:53] <wikibugs>	 10cloud-services-team: SystemdUnitDownForLong cloudcontrol1007:9100 Unit keystone_rotate_keys.service on node cloudcontrol1007 has been down for long. - https://phabricator.wikimedia.org/T350198 (10phaultfinder)
[22:09:23] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[22:43:46] <jinxer-wm>	 (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[23:02:48] <jinxer-wm>	 (SystemdUnitDownForLong) firing: The systemd unit backup_vms.service on node cloudbackup1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[23:03:46] <jinxer-wm>	 (OpenstackAPIResponse) firing: (8) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[23:06:33] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[23:11:48] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit keystone_rotate_keys.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[23:11:48] <jinxer-wm>	 (SystemdUnitDownForLong) firing: The systemd unit keystone_rotate_keys.service on node cloudcontrol1007 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[23:24:56] <wikibugs>	 10Tool-Pageviews, 10Data-Engineering, 10Data-Engineering-Wikistats, 10Inuka-Team, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (10Htriedman) 05Open→03Resolved a:03Htriedman
[23:28:46] <jinxer-wm>	 (OpenstackAPIResponse) firing: (7) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[23:32:56] <wmcs-alerts>	 (ToolsGridQueueProblem) firing: Grid queue webgrid-lighttpd@tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[23:44:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[23:45:15] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[23:49:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[23:53:46] <jinxer-wm>	 (OpenstackAPIResponse) firing: (6) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse