[00:06:33] (SystemdUnitDown) firing: The service unit keystone_rotate_keys.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [00:10:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [00:19:32] (OpenstackAPIResponse) firing: (5) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [00:33:35] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [00:43:46] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [00:52:23] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643) [00:53:15] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [00:56:48] (SystemdUnitDownForLong) firing: The systemd unit nova-fullstack.service on node cloudcontrol1006 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [00:58:40] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643) [00:59:32] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [01:07:48] (SystemdUnitDown) firing: The service unit backup_vms.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [01:09:23] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [01:10:03] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [01:39:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [01:44:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [01:59:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [02:01:34] (SystemdUnitDownForLong) firing: The systemd unit keystone_rotate_keys.service 
on node cloudcontrol1005 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [02:01:39] cloud-services-team: SystemdUnitDownForLong cloudcontrol1005:9100 Unit keystone_rotate_keys.service on node cloudcontrol1005 has been down for long. - https://phabricator.wikimedia.org/T350207 (phaultfinder) [02:03:46] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [02:09:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [02:17:50] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643) [02:26:27] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Andrew) [02:32:56] (ToolsGridQueueProblem) firing: Grid queue webgrid-lighttpd@tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [03:02:48] (SystemdUnitDownForLong) firing: The systemd unit backup_vms.service on node cloudbackup1004 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [03:06:48] (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [03:11:10] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Andrew) In order to see if things are degrading, here's a point in time slice: P53122 [03:11:48] (SystemdUnitDown) firing: The service unit keystone_rotate_keys.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [03:11:48] (SystemdUnitDownForLong) firing: The systemd unit keystone_rotate_keys.service on node cloudcontrol1007 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [03:29:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [03:33:35] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [03:49:03] (InstanceDown) resolved: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [03:55:33] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [04:00:33] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [04:06:48] (SystemdUnitDown) firing: The service unit keystone_rotate_keys.service is in failed status on host cloudcontrol1005. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [04:09:23] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [04:53:15] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [04:56:48] (SystemdUnitDownForLong) firing: The systemd unit nova-fullstack.service on node cloudcontrol1006 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [05:07:48] (SystemdUnitDown) firing: The service unit backup_vms.service is in failed status on host cloudbackup1004. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [05:32:56] (ToolsGridQueueProblem) firing: Grid queue webgrid-lighttpd@tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [06:01:48] (SystemdUnitDownForLong) firing: The systemd unit keystone_rotate_keys.service on node cloudcontrol1005 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [06:03:46] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [06:20:33] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [06:30:33] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [06:33:35] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [06:48:46] (OpenstackAPIResponse) resolved: Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [07:02:48] (SystemdUnitDownForLong) firing: The systemd unit backup_vms.service on node cloudbackup1004 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [07:05:33] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [07:06:49] (SystemdUnitDown) firing: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [07:09:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [07:11:48] (SystemdUnitDown) firing: The service unit keystone_rotate_keys.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [07:11:49] (SystemdUnitDownForLong) firing: The systemd unit keystone_rotate_keys.service on node cloudcontrol1007 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [07:15:33] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [08:02:37] (CephSlowOps) firing: Ceph cluster in eqiad has 4 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [08:02:44] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (phaultfinder) [08:06:48] (SystemdUnitDown) firing: The service unit keystone_rotate_keys.service is in failed status on host cloudcontrol1005. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [08:07:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 4 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [08:32:56] (ToolsGridQueueProblem) firing: Grid queue webgrid-lighttpd@tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [08:36:33] (SystemdUnitDown) firing: (2) The service unit keystone_rotate_keys.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [08:53:15] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [08:56:49] (SystemdUnitDownForLong) firing: The systemd unit nova-fullstack.service on node cloudcontrol1006 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [09:06:33] (SystemdUnitDownForLong) resolved: The systemd unit keystone_rotate_keys.service on node cloudcontrol1007 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [09:06:34] (SystemdUnitDownForLong) resolved: The systemd unit keystone_rotate_keys.service on node cloudcontrol1005 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [09:06:38] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.grid.cleanup_queue_errors [09:06:39] (SystemdUnitDown) resolved: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [09:06:40] !log taavi@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.grid.cleanup_queue_errors (exit_code=99) [09:06:44] (SystemdUnitDownForLong) resolved: The systemd unit nova-fullstack.service on node cloudcontrol1006 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [09:06:49] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.grid.cleanup_queue_errors [09:06:52] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.grid.cleanup_queue_errors (exit_code=0) [09:07:48] (SystemdUnitDown) firing: The service unit backup_vms.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [09:08:00] (PuppetConstantChange) resolved: Puppet performing a change on every puppet run on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [09:11:33] (SystemdUnitDown) resolved: (2) The service unit keystone_rotate_keys.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [09:11:33] (SystemdUnitDown) resolved: The service unit keystone_rotate_keys.service is in failed status on host cloudcontrol1007. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [09:12:56] (ToolsGridQueueProblem) resolved: Grid queue webgrid-lighttpd@tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud is in state E - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsGridQueueProblem - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsGridQueueProblem [09:15:33] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [09:25:33] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [09:33:20] Toolforge: Cannot edit files of a tool as a user anymore - https://phabricator.wikimedia.org/T349687 (Magnus) Resolved→Open Aaaaand its back again. Example: mix-n-match tool /data/project/mix-n-match/classes/UpdateCatalog.php Can someone please restart the `sssd-nss.service` and maybe put an auto-r... [09:33:35] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [09:43:51] Toolforge: Cannot edit files of a tool as a user anymore - https://phabricator.wikimedia.org/T349687 (Magnus) Open→Resolved Seems to be fixed now? [09:48:34] Toolforge: Cannot edit files of a tool as a user anymore - https://phabricator.wikimedia.org/T349687 (taavi) Yeah, I restarted it. I'm working on a patch to automate that. 
[09:53:35] (ProbeDown) resolved: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [10:09:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [10:50:33] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [11:00:33] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [11:02:48] (SystemdUnitDownForLong) firing: The systemd unit backup_vms.service on node cloudbackup1004 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [11:03:24] Toolforge, cloud-services-team: toolforge prometheus servers OOMing - https://phabricator.wikimedia.org/T350227 (taavi) p:Triage→High [11:05:33] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [11:11:37] Tools, WMDE-TechWish-Maintenance, WMDE-TechWish-Maintenance-2023: Make technischewuensche tool code repository public - https://phabricator.wikimedia.org/T349847 (WMDE-Fisch) a:WMDE-Fisch [11:13:47] VPS-project-Wikistats: Warcraft Wiki - https://phabricator.wikimedia.org/T350246 (RobiH) [11:14:03] Toolforge, cloud-services-team: toolforge prometheus servers OOMing - https://phabricator.wikimedia.org/T350227 (taavi) The instances are using `g3.cores8.ram36.disk20`, so I'm a bit surprised they're running out of RAM. [11:18:38] Tools, WMDE-TechWish-Maintenance, WMDE-TechWish-Maintenance-2023: Make technischewuensche tool code repository public - https://phabricator.wikimedia.org/T349847 (WMDE-Fisch) Hej @Aklapper and thanks for the ping. Honestly the repo is not really actively used by #wmde-techwish . A former colleague on... 
[11:30:33] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [11:47:27] Toolforge Jobs framework: webservice script broken - https://phabricator.wikimedia.org/T350250 (Magnus) [11:48:54] Toolforge Jobs framework: webservice script broken - https://phabricator.wikimedia.org/T350250 (Magnus) Webservice is not running, returns only blank page, no error [11:53:40] Toolforge: webservice script broken - https://phabricator.wikimedia.org/T350250 (taavi) `webservice` is supposed to give a better error message, but this is happening because you forgot to `become` a tool account before running the command. [12:00:33] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [12:10:33] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [12:49:26] Toolforge: webservice script broken - https://phabricator.wikimedia.org/T350250 (Magnus) Open→Resolved a:Magnus D'oh! [13:04:47] Quarry: Deploy magnum cluster for quarry - https://phabricator.wikimedia.org/T349032 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/quarry/pull/31 [13:04:56] vivian-rook opened https://github.com/toolforge/quarry/pull/31 [13:07:48] (SystemdUnitDown) firing: The service unit backup_vms.service is in failed status on host cloudbackup1004. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [13:09:01] Grid-Engine-to-K8s-Migration, MediaWiki-Engineering: Migrate ruprecht from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320021 (daniel) Open→Declined We'll want a fresh start on dependency analysis. The Ruprecht projet was an early proof of concept. [13:09:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [13:14:36] Cloud-VPS, cloud-services-team: Check if nfs-maps.wikimedia.org is still in use - https://phabricator.wikimedia.org/T350259 (taavi) [13:23:39] VPS-project-Codesearch, User-MarcoAurelio: Include a "Report bug" type link in CodeSearch footer - https://phabricator.wikimedia.org/T346073 (Aklapper) (IMHO ideally there would be two separate links: To report a bug at https://phabricator.wikimedia.org/maniphest/task/edit/form/43/ and to request a featu... [13:25:07] Grid-Engine-to-K8s-Migration, MediaWiki-Engineering: Migrate ruprecht from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320021 (taavi) Declined→Open The `ruprecht` tool is still running on the grid engine. If it's no longer used, please stop the grid web servi... 
[13:41:04] Toolforge (Toolforge iteration 02), Patch-For-Review: Add `toolforge envvars quota` - https://phabricator.wikimedia.org/T341087 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/17 [envvars-api.quota] create quota endpoint [13:44:39] Toolforge (Toolforge iteration 02), Patch-For-Review: Add `toolforge envvars quota` - https://phabricator.wikimedia.org/T341087 (CodeReviewBot) raymond-ndibe merged https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/9 [envvars_quota] add toolforge envvars quota command [13:50:17] wikitech.wikimedia.org, TimedMediaHandler: Support video on wikitech wiki - https://phabricator.wikimedia.org/T174476 (TheDJ) [13:52:08] Toolforge (Toolforge iteration 02), Patch-For-Review: Add `toolforge envvars quota` - https://phabricator.wikimedia.org/T341087 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/124 envvars-api: bump to 0.0.32-20231101134104-2436443d [13:52:14] Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm [14:03:53] Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (cmooney) @VRiley-WMF cloudvirt-wdqs1002 is showing a media/cable failure when it tries to boot over network: {F41426317,width=600} That could be that the... [14:14:21] cloud-services-team (Hardware), DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (Papaul) @Jclark-ctr @VRiley-WMF those are ready now for OS install. 
Thanks [14:14:42] PROBLEM - Host cloudcephosd1026 is DOWN: PING CRITICAL - Packet loss = 100% [14:19:31] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component wmcs-k8s-metrics [14:19:39] Toolforge, cloud-services-team, Patch-For-Review: toolforge prometheus servers OOMing - https://phabricator.wikimedia.org/T350227 (CodeReviewBot) taavi merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/125 wmcs-k8s-metrics: rollback tools [14:19:48] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component wmcs-k8s-metrics [14:20:15] Toolforge, cloud-services-team, Patch-For-Review: toolforge prometheus servers OOMing - https://phabricator.wikimedia.org/T350227 (CodeReviewBot) taavi opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/125 wmcs-k8s-metrics: rollback tools [14:20:58] RECOVERY - Host cloudcephosd1026 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [14:21:50] Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (cmooney) >>! In T346948#9291643, @VRiley-WMF wrote: > cloudvirt-wdqs1003 has been relocated > > cloudvirt-wdqs1003 - C 8. U 21. port 18. CableID 4015 >... 
[14:22:19] Toolforge (Toolforge iteration 02), cloud-services-team, Kubernetes, Patch-For-Review: Upgrade cadvisor - https://phabricator.wikimedia.org/T349795 (taavi) Resolved→Open [14:22:24] Toolforge (Toolforge iteration 02), cloud-services-team, Kubernetes, Patch-For-Review: Toolforge k8s: Migrate workers to Containerd and Bookworm - https://phabricator.wikimedia.org/T284656 (taavi) [14:22:27] Toolforge, cloud-services-team, Patch-For-Review: toolforge prometheus servers OOMing - https://phabricator.wikimedia.org/T350227 (taavi) [14:22:29] Toolforge (Toolforge iteration 02), cloud-services-team, Kubernetes, Patch-For-Review: Upgrade cadvisor - https://phabricator.wikimedia.org/T349795 (taavi) [14:25:33] (InstanceDown) resolved: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:26:20] Tools: 'deletion-notification-bot-2' tool uses an unreasonable amount of disk space - https://phabricator.wikimedia.org/T349898 (mdaniels5757) Yikes! Thanks for catching this. I'll probably be able to implement some sort of fix in a few days. [14:26:40] PROBLEM - Host cloudcephosd1027 is DOWN: PING CRITICAL - Packet loss = 100% [14:31:33] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:32:38] RECOVERY - Host cloudcephosd1027 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [14:34:26] Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (VRiley-WMF) @cmooney I have replaced the DAC cable and updated Netbox with the CableID; also I reseated the NIC for good measure. It is plugged into the sa... 
[14:36:33] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [14:36:38] PROBLEM - Host cloudcephosd1028 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:08] RECOVERY - Host cloudcephosd1028 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [14:44:07] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Jclark-ctr) [14:47:04] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node (348643) [14:47:47] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) (348643) [14:48:47] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node (348643) [14:49:28] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) (348643) [14:49:35] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node (348643) [14:50:11] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Andrew) [14:50:17] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) (348643) [14:52:15] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643) [14:52:56] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (348643) [14:55:18] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643) [15:02:49] (SystemdUnitDownForLong) firing: The systemd unit backup_vms.service on node cloudbackup1004 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [15:10:49] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643) [15:12:27] Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm executed wit... [15:16:33] (InstanceDown) resolved: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:23:23] Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm [15:25:37] (CephClusterInWarning) firing: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [15:26:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:40:34] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), DC-Ops, SRE, ops-eqiad: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Andrew) slice from this morning: P53124 (by the way, I'm generating that with ` cumin1001:~$ sudo cumin "P{clou... 
[15:56:54] Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (cmooney) @Jclark-ctr had a look at the NIC riser card wasn't properly seated. After re-seating the card the server connection seems to be working, current... [15:57:21] wikitech.wikimedia.org, TimedMediaHandler: Support video on wikitech wiki - https://phabricator.wikimedia.org/T174476 (bd808) Probably best fixed after {T237773}/{T292707} [16:05:08] Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b9bd4e38-25ed-4ed0-bdf7-47bd52027bdc) set by cmooney@cumin1001 for 1:00:00 on 1 host(s) an... [16:09:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [16:09:30] Tool-ducttape, Abstract Wikipedia team, function-evaluator, function-orchestrator, Abstract Wikipedia Fix-It tasks: Improve CI for Wikifunctions services to better test like reality - https://phabricator.wikimedia.org/T350284 (Jdforrester-WMF) [16:50:18] Cloud-VPS, cloud-services-team: Check if nfs-maps.wikimedia.org is still in use - https://phabricator.wikimedia.org/T350259 (bd808) I think that {T300694} would have replaced the need for the ips and hostname. 
[17:00:37] (CephClusterInWarning) resolved: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [17:01:05] RECOVERY - Check unit status of backup_vms on cloudbackup1004 is OK: OK: Status of the systemd unit backup_vms https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_unit_status_of_backup_vms [17:02:34] (SystemdUnitDown) resolved: The service unit backup_vms.service is in failed status on host cloudbackup1004. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [17:02:34] (SystemdUnitDownForLong) resolved: The systemd unit backup_vms.service on node cloudbackup1004 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [17:15:42] !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary [17:16:31] !log taavi@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99) [17:17:57] !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary [17:18:24] !log taavi@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) [17:20:16] VPS-project-Wikistats: Warcraft Wiki - https://phabricator.wikimedia.org/T350246 (Dzahn) a:Dzahn [17:20:33] VPS-project-Wikistats: Add dgawiki to wikistats - https://phabricator.wikimedia.org/T350233 (Dzahn) a:Dzahn [17:20:55] Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudvirt-wdqs1002.eqiad.wmnet with OS bookworm completed: -... 
[17:25:38] !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary [17:25:56] !log taavi@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99) [17:26:56] !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary [17:27:23] !log taavi@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) [17:36:40] !log taavi@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary [17:37:11] !log taavi@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) [17:39:20] Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (taavi) I believe this is all done. Thank you everyone! [17:40:45] Cloud-VPS, cloud-services-team: cloudvirt: eqiad1: connect them to cloud-private - https://phabricator.wikimedia.org/T346651 (taavi) [17:40:48] Cloud-VPS, cloud-services-team, Data-Platform-SRE, SRE, ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (taavi) Open→Resolved [17:40:54] cloud-services-team (FY2023/2024-Q1), SRE, ops-eqiad, Goal: cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (taavi) [17:43:17] VPS-project-Wikistats: Add bjnwikiquote to wikistats - https://phabricator.wikimedia.org/T350239 (Dzahn) a:Dzahn [17:51:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [18:01:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [18:36:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [18:46:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [18:59:27] Tools, WMDE-TechWish-Maintenance, WMDE-TechWish-Maintenance-2023: Make technischewuensche tool code repository public - https://phabricator.wikimedia.org/T349847 (Aklapper) @WMDE-Fisch: Before I attempt to delete (not sure if being in a Space will get into the way though I think it's unlikely), can y... [19:00:58] Tools, WMDE-TechWish-Maintenance, WMDE-TechWish-Maintenance-2023: Delete technischewuensche tool code repository in Diffusion - https://phabricator.wikimedia.org/T349847 (Aklapper) p:Triage→Low a:WMDE-Fisch→Aklapper [19:01:39] Tools, WMDE-TechWish-Maintenance, WMDE-TechWish-Maintenance-2023: Delete technischewuensche tool code repository in Diffusion - https://phabricator.wikimedia.org/T349847 (Aklapper) [19:02:20] Tools, WMDE-TechWish-Maintenance, WMDE-TechWish-Maintenance-2023: Delete technischewuensche tool code repository in Diffusion - https://phabricator.wikimedia.org/T349847 (Aklapper) > and start using GitLab in the first place with a new re-viewed submission of the code that can be shared. Let's make... [19:09:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [19:09:24] Tools, WMDE-TechWish-Maintenance, WMDE-TechWish-Maintenance-2023: Delete technischewuensche tool code repository in Diffusion - https://phabricator.wikimedia.org/T349847 (thiemowmde) I think there is only the live copy on https://technischewuensche.toolforge.org. Might be better to push it to GitLab... 
[19:27:11] Tools, WMDE-TechWish-Maintenance, WMDE-TechWish-Maintenance-2023: Delete technischewuensche tool code repository in Diffusion - https://phabricator.wikimedia.org/T349847 (Dzahn) It would be easy to import it into WMF Gitlab just by pasting the URL... IF... we could make it public. If there are conce... [19:27:58] Tools, WMDE-TechWish-Maintenance, WMDE-TechWish-Maintenance-2023: Delete technischewuensche tool code repository in Diffusion - https://phabricator.wikimedia.org/T349847 (Dzahn) How large is it? Is it really that much work to check for private data? [19:39:26] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [19:40:39] (PS1) Majavah: Add homer public repository [labs/codesearch] - https://gerrit.wikimedia.org/r/970852 [19:54:13] (PS1) BryanDavis: dev(Makefile): Prefer Docker Compose v2 [labs/striker] - https://gerrit.wikimedia.org/r/970853 [19:54:15] (PS1) BryanDavis: dev: Bump GitLab container to v16.3.6 [labs/striker] - https://gerrit.wikimedia.org/r/970854 [19:54:17] (PS1) BryanDavis: gitlab: Handle error response JSON decode failures gracefully [labs/striker] - https://gerrit.wikimedia.org/r/970855 [19:55:31] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (348643) [20:11:00] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (348643) [20:11:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:16:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:26:23] !log andrew@cloudcumin1001 admin 
START - Cookbook wmcs.ceph.osd.drain_node (348643) [20:33:56] (ToolsToolsDBWritableState) firing: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState [20:36:07] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (Andrew) again! But three hours late this time. Nov 1 20:29:33 [20:38:56] (ToolsToolsDBWritableState) resolved: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState [20:46:07] Toolforge Build Service (Beta release): [buildservice] Bug - .m2 folder (local maven repository) is not cached between builds - https://phabricator.wikimedia.org/T350307 (Don-vip) [20:46:37] Toolforge Build Service (Beta release): [buildservice] Bug - .m2 folder (local maven repository) is not cached between builds - https://phabricator.wikimedia.org/T350307 (Don-vip) [21:14:57] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643) [21:15:06] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643) [21:15:24] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (348643) [21:15:35] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.drain_node (348643) [21:26:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [21:41:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [21:55:12] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (bd808) We should tell systemd to tell oomkiller not to kill our mariadb process. I think this can be done by usin... [22:09:24] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [22:36:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [22:46:03] (InstanceDown) firing: (2) Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [22:52:47] Cloud-VPS, cloud-services-team, Traffic-Icebox: Get traffic team green light for Cloud NAT to wikis change - https://phabricator.wikimedia.org/T273737 (BCornwall) Open→Stalled @BBlack is this something we still want to pursue? [22:52:50] Cloud-VPS, cloud-services-team, Patch-Needs-Improvement: Change routing to ensure that traffic originating from Cloud VPS is seen as non-private IPs by Wikimedia wikis - https://phabricator.wikimedia.org/T209011 (BCornwall) [23:00:07] Tools: QuickStatements anti-abuse measure (rate limit?) - Cannot automatically assign ID - https://phabricator.wikimedia.org/T350262 (Aklapper) https://toolsadmin.wikimedia.org/tools/id/quickstatements has no info where the QuickStatements issue tracker is but I'd assume this must be reported at https://gith... 
[23:13:24] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (348643) [23:15:51] Tools, WMDE-TechWish-Maintenance, WMDE-TechWish-Maintenance-2023: Delete technischewuensche tool code repository in Diffusion - https://phabricator.wikimedia.org/T349847 (Aklapper) > How large is it? Is it really that much work to check for private data? I'd say that's irrelevant. Statement was that... [23:39:41] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [23:52:35] Tools: QuickStatements anti-abuse measure (rate limit?) - Cannot automatically assign ID - https://phabricator.wikimedia.org/T350262 (M2k_dewiki) Also see https://github.com/magnusmanske/quickstatements/issues/51