[00:06:33] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit keystone_rotate_keys.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[00:10:03] <wmcs-alerts>	 (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:19:32] <jinxer-wm>	 (OpenstackAPIResponse) firing: (5) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[00:33:35] <wmcs-alerts>	 (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[00:43:46] <jinxer-wm>	 (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[00:52:23] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643)
[00:53:15] <jinxer-wm>	 (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[00:56:48] <jinxer-wm>	 (SystemdUnitDownForLong) firing: The systemd unit nova-fullstack.service on node cloudcontrol1006 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[00:58:40] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[00:59:32] <jinxer-wm>	 (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[01:07:48] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit backup_vms.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[01:09:23] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[01:10:03] <wmcs-alerts>	 (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[01:39:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[01:44:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[01:59:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[02:01:34] <jinxer-wm>	 (SystemdUnitDownForLong) firing: The systemd unit keystone_rotate_keys.service on node cloudcontrol1005 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[02:01:39] <wikibugs>	 cloud-services-team: SystemdUnitDownForLong cloudcontrol1005:9100 Unit keystone_rotate_keys.service on node cloudcontrol1005 has been down for long. - https://phabricator.wikimedia.org/T350207 (phaultfinder)
[02:03:46] <jinxer-wm>	 (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[02:09:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[02:17:50] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643)
[02:26:27] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Andrew)
[02:32:56] <wmcs-alerts>	 (ToolsGridQueueProblem) firing: Grid queue webgrid-lighttpd@tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[03:02:48] <jinxer-wm>	 (SystemdUnitDownForLong) firing: The systemd unit backup_vms.service on node cloudbackup1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[03:06:48] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[03:11:10] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Andrew) In order to see if things are degrading, here's a point in time slice:  P53122
[03:11:48] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit keystone_rotate_keys.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[03:11:48] <jinxer-wm>	 (SystemdUnitDownForLong) firing: The systemd unit keystone_rotate_keys.service on node cloudcontrol1007 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[03:29:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:33:35] <wmcs-alerts>	 (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[03:49:03] <wmcs-alerts>	 (InstanceDown) resolved: Project tools instance tools-prometheus-7 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:55:33] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[04:00:33] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[04:06:48] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit keystone_rotate_keys.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[04:09:23] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[04:53:15] <jinxer-wm>	 (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[04:56:48] <jinxer-wm>	 (SystemdUnitDownForLong) firing: The systemd unit nova-fullstack.service on node cloudcontrol1006 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[05:07:48] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit backup_vms.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[05:32:56] <wmcs-alerts>	 (ToolsGridQueueProblem) firing: Grid queue webgrid-lighttpd@tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[06:01:48] <jinxer-wm>	 (SystemdUnitDownForLong) firing: The systemd unit keystone_rotate_keys.service on node cloudcontrol1005 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[06:03:46] <jinxer-wm>	 (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[06:20:33] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[06:30:33] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[06:33:35] <wmcs-alerts>	 (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[06:48:46] <jinxer-wm>	 (OpenstackAPIResponse) resolved: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[07:02:48] <jinxer-wm>	 (SystemdUnitDownForLong) firing: The systemd unit backup_vms.service on node cloudbackup1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[07:05:33] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[07:06:49] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[07:09:24] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[07:11:48] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit keystone_rotate_keys.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[07:11:49] <jinxer-wm>	 (SystemdUnitDownForLong) firing: The systemd unit keystone_rotate_keys.service on node cloudcontrol1007 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[07:15:33] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[08:02:37] <jinxer-wm>	 (CephSlowOps) firing: Ceph cluster in eqiad has 4 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[08:02:44] <wikibugs>	 cloud-services-team: CephSlowOps  Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (phaultfinder)
[08:06:48] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit keystone_rotate_keys.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[08:07:37] <jinxer-wm>	 (CephSlowOps) resolved: Ceph cluster in eqiad has 4 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[08:32:56] <wmcs-alerts>	 (ToolsGridQueueProblem) firing: Grid queue webgrid-lighttpd@tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[08:36:33] <jinxer-wm>	 (SystemdUnitDown) firing: (2) The service unit keystone_rotate_keys.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[08:53:15] <jinxer-wm>	 (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[08:56:49] <jinxer-wm>	 (SystemdUnitDownForLong) firing: The systemd unit nova-fullstack.service on node cloudcontrol1006 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[09:06:33] <jinxer-wm>	 (SystemdUnitDownForLong) resolved: The systemd unit keystone_rotate_keys.service on node cloudcontrol1007 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[09:06:34] <jinxer-wm>	 (SystemdUnitDownForLong) resolved: The systemd unit keystone_rotate_keys.service on node cloudcontrol1005 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[09:06:38] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.grid.cleanup_queue_errors
[09:06:39] <jinxer-wm>	 (SystemdUnitDown) resolved: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[09:06:40] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.grid.cleanup_queue_errors (exit_code=99)
[09:06:44] <jinxer-wm>	 (SystemdUnitDownForLong) resolved: The systemd unit nova-fullstack.service on node cloudcontrol1006 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[09:06:49] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.grid.cleanup_queue_errors
[09:06:52] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.grid.cleanup_queue_errors (exit_code=0)
[09:07:48] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit backup_vms.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[09:08:00] <jinxer-wm>	 (PuppetConstantChange) resolved: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[09:11:33] <jinxer-wm>	 (SystemdUnitDown) resolved: (2) The service unit keystone_rotate_keys.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[09:11:33] <jinxer-wm>	 (SystemdUnitDown) resolved: The service unit keystone_rotate_keys.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[09:12:56] <wmcs-alerts>	 (ToolsGridQueueProblem) resolved: Grid queue webgrid-lighttpd@tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem
[09:15:33] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:25:33] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:33:20] <wikibugs>	 Toolforge: Cannot edit files of a tool as a user anymore - https://phabricator.wikimedia.org/T349687 (Magnus) Resolved→Open Aaaaand its back again.  Example: mix-n-match tool /data/project/mix-n-match/classes/UpdateCatalog.php  Can someone please restart the `sssd-nss.service` and maybe put an auto-r...
[09:33:35] <wmcs-alerts>	 (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:43:51] <wikibugs>	 Toolforge: Cannot edit files of a tool as a user anymore - https://phabricator.wikimedia.org/T349687 (Magnus) Open→Resolved Seems to be fixed now?
[09:48:34] <wikibugs>	 Toolforge: Cannot edit files of a tool as a user anymore - https://phabricator.wikimedia.org/T349687 (taavi) Yeah, I restarted it. I'm working on a patch to automate that.
[09:53:35] <wmcs-alerts>	 (ProbeDown) resolved: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[10:09:24] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[10:50:33] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:00:33] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:02:48] <jinxer-wm>	 (SystemdUnitDownForLong) firing: The systemd unit backup_vms.service on node cloudbackup1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[11:03:24] <wikibugs>	 Toolforge, cloud-services-team: toolforge prometheus servers OOMing - https://phabricator.wikimedia.org/T350227 (taavi) p:Triage→High
[11:05:33] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:11:37] <wikibugs>	 Tools, WMDE-TechWish-Maintenance, WMDE-TechWish-Maintenance-2023: Make technischewuensche tool code repository public - https://phabricator.wikimedia.org/T349847 (WMDE-Fisch) a:WMDE-Fisch
[11:13:47] <wikibugs>	 VPS-project-Wikistats: Warcraft Wiki - https://phabricator.wikimedia.org/T350246 (RobiH)
[11:14:03] <wikibugs>	 Toolforge, cloud-services-team: toolforge prometheus servers OOMing - https://phabricator.wikimedia.org/T350227 (taavi) The instances are using `g3.cores8.ram36.disk20`, so I'm a bit surprised they're running out of RAM.
[11:18:38] <wikibugs>	 Tools, WMDE-TechWish-Maintenance, WMDE-TechWish-Maintenance-2023: Make technischewuensche tool code repository public - https://phabricator.wikimedia.org/T349847 (WMDE-Fisch) Hej @Aklapper and thanks for the ping. Honestly the repo is not really actively used by #wmde-techwish . A former colleague on...
[11:30:33] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:47:27] <wikibugs>	 Toolforge Jobs framework: webservice script broken - https://phabricator.wikimedia.org/T350250 (Magnus)
[11:48:54] <wikibugs>	 Toolforge Jobs framework: webservice script broken - https://phabricator.wikimedia.org/T350250 (Magnus) Webservice is not running, returns only blank page, no error
[11:53:40] <wikibugs>	 Toolforge: webservice script broken - https://phabricator.wikimedia.org/T350250 (taavi) `webservice` is supposed to give a better error message, but this is happening because you forgot to `become` a tool account before running the command.
[12:00:33] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:10:33] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:49:26] <wikibugs>	 Toolforge: webservice script broken - https://phabricator.wikimedia.org/T350250 (Magnus) Open→Resolved a:Magnus D'oh!
[13:04:47] <wikibugs>	 Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/quarry/pull/31
[13:04:56] <notefromgithub>	 vivian-rook opened https://github.com/toolforge/quarry/pull/31
[13:07:48] <jinxer-wm>	 (SystemdUnitDown) firing: The service unit backup_vms.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:09:01] <wikibugs>	 Grid-Engine-to-K8s-Migration, MediaWiki-Engineering: Migrate ruprecht from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320021 (daniel) Open→Declined We'll want a fresh start on dependency analysis. The Ruprecht project was an early proof of concept.
[13:09:24] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[13:14:36] <wikibugs>	 Cloud-VPS, cloud-services-team: Check if nfs-maps.wikimedia.org is still in use - https://phabricator.wikimedia.org/T350259 (taavi)
[13:23:39] <wikibugs>	 VPS-project-Codesearch, User-MarcoAurelio: Include a "Report bug" type link in CodeSearch footer - https://phabricator.wikimedia.org/T346073 (Aklapper) (IMHO ideally there would be two separate links: To report a bug at https://phabricator.wikimedia.org/maniphest/task/edit/form/43/ and to request a featu...
[13:25:07] <wikibugs>	 Grid-Engine-to-K8s-Migration, MediaWiki-Engineering: Migrate ruprecht from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320021 (taavi) Declined→Open The `ruprecht` tool is still running on the grid engine. If it's no longer used, please stop the grid web servi...
[13:41:04] <wikibugs>	 Toolforge (Toolforge iteration 02), Patch-For-Review: Add `toolforge envvars quota` - https://phabricator.wikimedia.org/T341087 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/17  [envvars-api.quota] create quota endpoint
[13:44:39] <wikibugs>	 Toolforge (Toolforge iteration 02), Patch-For-Review: Add `toolforge envvars quota` - https://phabricator.wikimedia.org/T341087 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/9  [envvars_quota] add toolforge envvars quota command
[13:50:17] <wikibugs>	 wikitech.wikimedia.org, TimedMediaHandler: Support video on wikitech wiki - https://phabricator.wikimedia.org/T174476 (TheDJ)
[13:52:08] <wikibugs>	 Toolforge (Toolforge iteration 02), Patch-For-Review: Add `toolforge envvars quota` - https://phabricator.wikimedia.org/T341087 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/124  envvars-api: bump to 0.0.32-20231101134104-2436443d
[13:52:14] <wikibugs>	 Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm
[14:03:53] <wikibugs>	 Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (cmooney) @VRiley-WMF cloudvirt-wdqs1002 is showing a media/cable failure when it tries to boot over network:  {F41426317,width=600}  That could be that the...
[14:14:21] <wikibugs>	 cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (Papaul) @Jclark-ctr @VRiley-WMF those are ready now for OS install. Thanks
[14:14:42] <icinga-wm>	 PROBLEM - Host cloudcephosd1026 is DOWN: PING CRITICAL - Packet loss = 100%
[14:19:31] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component wmcs-k8s-metrics
[14:19:39] <wikibugs>	 Toolforge, cloud-services-team, Patch-For-Review: toolforge prometheus servers OOMing - https://phabricator.wikimedia.org/T350227 (CodeReviewBot) taavi merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/125  wmcs-k8s-metrics: rollback tools
[14:19:48] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component wmcs-k8s-metrics
[14:20:15] <wikibugs>	 Toolforge, cloud-services-team, Patch-For-Review: toolforge prometheus servers OOMing - https://phabricator.wikimedia.org/T350227 (CodeReviewBot) taavi opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/125  wmcs-k8s-metrics: rollback tools
[14:20:58] <icinga-wm>	 RECOVERY - Host cloudcephosd1026 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[14:21:50] <wikibugs>	 Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (cmooney) >>! In T346948#9291643, @VRiley-WMF wrote: > cloudvirt-wdqs1003 has been relocated  >  > cloudvirt-wdqs1003 - C 8. U 21. port 18. CableID 4015 >...
[14:22:19] <wikibugs>	 Toolforge (Toolforge iteration 02), cloud-services-team, Kubernetes, Patch-For-Review: Upgrade cadvisor - https://phabricator.wikimedia.org/T349795 (taavi) Resolved→Open
[14:22:24] <wikibugs>	 Toolforge (Toolforge iteration 02), cloud-services-team, Kubernetes, Patch-For-Review: Toolforge k8s: Migrate workers to Containerd and Bookworm - https://phabricator.wikimedia.org/T284656 (taavi)
[14:22:27] <wikibugs>	 Toolforge, cloud-services-team, Patch-For-Review: toolforge prometheus servers OOMing - https://phabricator.wikimedia.org/T350227 (taavi)
[14:22:29] <wikibugs>	 Toolforge (Toolforge iteration 02), cloud-services-team, Kubernetes, Patch-For-Review: Upgrade cadvisor - https://phabricator.wikimedia.org/T349795 (taavi)
[14:25:33] <wmcs-alerts>	 (InstanceDown) resolved: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:26:20] <wikibugs>	 Tools: 'deletion-notification-bot-2' tool uses an unreasonable amount of disk space - https://phabricator.wikimedia.org/T349898 (mdaniels5757) Yikes! Thanks for catching this. I'll probably be able to implement some sort of fix in a few days.
[14:26:40] <icinga-wm>	 PROBLEM - Host cloudcephosd1027 is DOWN: PING CRITICAL - Packet loss = 100%
[14:31:33] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:32:38] <icinga-wm>	 RECOVERY - Host cloudcephosd1027 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms
[14:34:26] <wikibugs>	 Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (VRiley-WMF) @cmooney I have replaced the DAC cable and updated Netbox with the CableID; also I reseated the NIC for good measure. It is plugged into the sa...
[14:36:33] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:36:38] <icinga-wm>	 PROBLEM - Host cloudcephosd1028 is DOWN: PING CRITICAL - Packet loss = 100%
[14:43:08] <icinga-wm>	 RECOVERY - Host cloudcephosd1028 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms
[14:44:07] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Jclark-ctr)
[14:47:04] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node (348643)
[14:47:47] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) (348643)
[14:48:47] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node (348643)
[14:49:28] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) (348643)
[14:49:35] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node (348643)
[14:50:11] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Andrew)
[14:50:17] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) (348643)
[14:52:15] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[14:52:56] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (348643)
[14:55:18] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[15:02:49] <jinxer-wm>	 (SystemdUnitDownForLong) firing: The systemd unit backup_vms.service on node cloudbackup1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[15:10:49] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[15:12:27] <wikibugs>	 Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm executed wit...
[15:16:33] <wmcs-alerts>	 (InstanceDown) resolved: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:23:23] <wikibugs>	 Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm
[15:25:37] <jinxer-wm>	 (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[15:26:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:40:34] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Andrew) slice from this morning:  P53124  (by the way, I'm generating that with    ` cumin1001:~$ sudo cumin "P{clou...
[15:56:54] <wikibugs>	 Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (cmooney) @Jclark-ctr had a look at the NIC riser card wasn't properly seated.  After re-seating the card the server connection seems to be working, current...
[15:57:21] <wikibugs>	 wikitech.wikimedia.org, TimedMediaHandler: Support video on wikitech wiki - https://phabricator.wikimedia.org/T174476 (bd808) Probably best fixed after {T237773}/{T292707}
[16:05:08] <wikibugs>	 Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b9bd4e38-25ed-4ed0-bdf7-47bd52027bdc) set by cmooney@cumin1001 for 1:00:00 on 1 host(s) an...
[16:09:24] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[16:09:30] <wikibugs>	 Tool-ducttape, Abstract Wikipedia team, function-evaluator, function-orchestrator, Abstract Wikipedia Fix-It tasks: Improve CI for Wikifunctions services to better test like reality - https://phabricator.wikimedia.org/T350284 (Jdforrester-WMF)
[16:50:18] <wikibugs>	 Cloud-VPS, cloud-services-team: Check if nfs-maps.wikimedia.org is still in use - https://phabricator.wikimedia.org/T350259 (bd808) I think that {T300694} would have replaced the need for the ips and hostname.
[17:00:37] <jinxer-wm>	 (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[17:01:05] <icinga-wm>	 RECOVERY - Check unit status of backup_vms on cloudbackup1004 is OK: OK: Status of the systemd unit backup_vms https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_unit_status_of_backup_vms
[17:02:34] <jinxer-wm>	 (SystemdUnitDown) resolved: The service unit backup_vms.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[17:02:34] <jinxer-wm>	 (SystemdUnitDownForLong) resolved: The systemd unit backup_vms.service on node cloudbackup1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong
[17:15:42] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[17:16:31] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99)
[17:17:57] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[17:18:24] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[17:20:16] <wikibugs>	 VPS-project-Wikistats: Warcraft Wiki - https://phabricator.wikimedia.org/T350246 (Dzahn) a:Dzahn
[17:20:33] <wikibugs>	 VPS-project-Wikistats: Add dgawiki to wikistats - https://phabricator.wikimedia.org/T350233 (Dzahn) a:Dzahn
[17:20:55] <wikibugs>	 Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm completed: -...
[17:25:38] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[17:25:56] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99)
[17:26:56] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[17:27:23] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[17:36:40] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[17:37:11] <logmsgbot_cloud>	 !log taavi@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[17:39:20] <wikibugs>	 Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (taavi) I believe this is all done. Thank you everyone!
[17:40:45] <wikibugs>	 Cloud-VPS, cloud-services-team: cloudvirt: eqiad1: connect them to cloud-private - https://phabricator.wikimedia.org/T346651 (taavi)
[17:40:48] <wikibugs>	 Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (taavi) Open→Resolved
[17:40:54] <wikibugs>	 cloud-services-team (FY2023/2024-Q1), SRE, ops-eqiad, Goal: cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (taavi)
[17:43:17] <wikibugs>	 VPS-project-Wikistats: Add bjnwikiquote to wikistats - https://phabricator.wikimedia.org/T350239 (Dzahn) a:Dzahn
[17:51:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[18:01:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[18:36:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[18:46:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[18:59:27] <wikibugs>	 Tools, WMDE-TechWish-Maintenance, WMDE-TechWish-Maintenance-2023: Make technischewuensche tool code repository public - https://phabricator.wikimedia.org/T349847 (Aklapper) @WMDE-Fisch: Before I attempt to delete (not sure if being in a Space will get into the way though I think it's unlikely), can y...
[19:00:58] <wikibugs>	 Tools, WMDE-TechWish-Maintenance, WMDE-TechWish-Maintenance-2023: Delete technischewuensche tool code repository in Diffusion - https://phabricator.wikimedia.org/T349847 (Aklapper) p:Triage→Low a:WMDE-Fisch→Aklapper
[19:01:39] <wikibugs>	 Tools, WMDE-TechWish-Maintenance, WMDE-TechWish-Maintenance-2023: Delete technischewuensche tool code repository in Diffusion - https://phabricator.wikimedia.org/T349847 (Aklapper)
[19:02:20] <wikibugs>	 Tools, WMDE-TechWish-Maintenance, WMDE-TechWish-Maintenance-2023: Delete technischewuensche tool code repository in Diffusion - https://phabricator.wikimedia.org/T349847 (Aklapper) > and start using GitLab in the first place with a new re-viewed submission of the code that can be shared.  Let's make...
[19:09:24] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[19:09:24] <wikibugs>	 Tools, WMDE-TechWish-Maintenance, WMDE-TechWish-Maintenance-2023: Delete technischewuensche tool code repository in Diffusion - https://phabricator.wikimedia.org/T349847 (thiemowmde) I think there is only the live copy on https://technischewuensche.toolforge.org. Might be better to push it to GitLab...
[19:27:11] <wikibugs>	 Tools, WMDE-TechWish-Maintenance, WMDE-TechWish-Maintenance-2023: Delete technischewuensche tool code repository in Diffusion - https://phabricator.wikimedia.org/T349847 (Dzahn) It would be easy to import it into WMF Gitlab just by pasting the URL... IF... we could make it public.  If there are conce...
[19:27:58] <wikibugs>	 Tools, WMDE-TechWish-Maintenance, WMDE-TechWish-Maintenance-2023: Delete technischewuensche tool code repository in Diffusion - https://phabricator.wikimedia.org/T349847 (Dzahn) How large is it? Is it really that much work to check for private data?
[19:39:26] <jinxer-wm>	 (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[19:40:39] <wikibugs>	 (PS1) Majavah: Add homer public repository [labs/codesearch] - https://gerrit.wikimedia.org/r/970852
[19:54:13] <wikibugs>	 (PS1) BryanDavis: dev(Makefile): Prefer Docker Compose v2 [labs/striker] - https://gerrit.wikimedia.org/r/970853
[19:54:15] <wikibugs>	 (PS1) BryanDavis: dev: Bump GitLab container to v16.3.6 [labs/striker] - https://gerrit.wikimedia.org/r/970854
[19:54:17] <wikibugs>	 (PS1) BryanDavis: gitlab: Handle error response JSON decode failures gracefully [labs/striker] - https://gerrit.wikimedia.org/r/970855
[19:55:31] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (348643)
[20:11:00] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (348643)
[20:11:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:16:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:26:23] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[20:33:56] <wmcs-alerts>	 (ToolsToolsDBWritableState) firing: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[20:36:07] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (Andrew) again!  But three hours late this time.  Nov  1 20:29:33
[20:38:56] <wmcs-alerts>	 (ToolsToolsDBWritableState) resolved: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState
[20:46:07] <wikibugs>	 Toolforge Build Service (Beta release): [buildservice] Bug - .m2 folder (local maven repository) is not cached between builds - https://phabricator.wikimedia.org/T350307 (Don-vip)
[20:46:37] <wikibugs>	 Toolforge Build Service (Beta release): [buildservice] Bug - .m2 folder (local maven repository) is not cached between builds - https://phabricator.wikimedia.org/T350307 (Don-vip)
[21:14:57] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643)
[21:15:06] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[21:15:24] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643)
[21:15:35] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643)
[21:26:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[21:41:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[21:55:12] <wikibugs>	 Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (bd808) We should tell systemd to tell oomkiller not to kill our mariadb process. I think this can be done by usin...
[22:09:24] <wmcs-alerts>	 (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[22:36:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[22:46:03] <wmcs-alerts>	 (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down   - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[22:52:47] <wikibugs>	 Cloud-VPS, cloud-services-team, Traffic-Icebox: Get traffic team green light for Cloud NAT to wikis change - https://phabricator.wikimedia.org/T273737 (BCornwall) Open→Stalled @BBlack is this something we still want to pursue?
[22:52:50] <wikibugs>	 Cloud-VPS, cloud-services-team, Patch-Needs-Improvement: Change routing to ensure that traffic originating from Cloud VPS is seen as non-private IPs by Wikimedia wikis - https://phabricator.wikimedia.org/T209011 (BCornwall)
[23:00:07] <wikibugs>	 Tools: QuickStatements anti-abuse measure (rate limit?) - Cannot automatically assign ID - https://phabricator.wikimedia.org/T350262 (Aklapper) https://toolsadmin.wikimedia.org/tools/id/quickstatements has no info where the QuickStatements issue tracker is but I'd assume this must be reported at https://gith...
[23:13:24] <logmsgbot_cloud>	 !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (348643)
[23:15:51] <wikibugs>	 Tools, WMDE-TechWish-Maintenance, WMDE-TechWish-Maintenance-2023: Delete technischewuensche tool code repository in Diffusion - https://phabricator.wikimedia.org/T349847 (Aklapper) > How large is it? Is it really that much work to check for private data?  I'd say that's irrelevant. Statement was that...
[23:39:41] <jinxer-wm>	 (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[23:52:35] <wikibugs>	 Tools: QuickStatements anti-abuse measure (rate limit?) - Cannot automatically assign ID - https://phabricator.wikimedia.org/T350262 (M2k_dewiki) Also see https://github.com/magnusmanske/quickstatements/issues/51