[00:03:54] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[00:09:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:11:15] PROBLEM - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2001 is CRITICAL: CRITICAL: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:13:14] (PuppetFailure) firing: Puppet has failed on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:23:53] (OpenstackAPIResponse) firing: (4) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[00:32:03] PROBLEM - ensure kvm processes are running on cloudvirt1037 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[00:43:43] PROBLEM - ensure kvm processes are running on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[01:06:33] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[01:56:40] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[01:57:03] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[01:57:33] RECOVERY - ensure kvm processes are running on cloudvirt1038 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[01:58:23] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[01:58:50] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[01:59:41] RECOVERY - ensure kvm processes are running on cloudvirt1037 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:00:24] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain (T345811)
[02:00:29] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[02:00:32] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain
[02:01:15] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:17:24] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99)
[02:23:26] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) (T345811)
[02:23:32] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[02:26:21] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1039.eqiad.wmnet with OS bookworm
[02:26:38] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain
[02:27:35] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99)
[02:32:55] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain
[02:34:29] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain (T345811)
[02:34:37] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[02:44:35] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99)
[02:46:55] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain
[02:52:03] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) (T345811)
[02:52:09] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[02:55:35] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99)
[02:59:01] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain (T345811)
[02:59:03] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain
[02:59:07] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[02:59:44] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0)
[02:59:48] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) (T345811)
[03:00:07] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1040.eqiad.wmnet with OS bookworm
[03:04:30] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[03:04:47] !log andrew@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99)
[03:06:53] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[03:07:15] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[03:08:12] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain (T345811)
[03:08:17] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[03:08:55] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1039.eqiad.wmnet with OS bookworm completed: - cloudvirt1039 (**...
[03:09:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:09:23] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1041.eqiad.wmnet with OS bookworm
[03:25:56] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) (T345811)
[03:26:08] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[03:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[03:33:41] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1042.eqiad.wmnet with OS bookworm
[03:34:39] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0)
[03:43:19] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1040.eqiad.wmnet with OS bookworm completed: - cloudvirt1040 (**...
[03:46:17] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1042.eqiad.wmnet with OS bookworm executed with errors: - cloudv...
[03:46:42] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[03:47:09] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[03:47:20] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[03:47:29] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[03:48:15] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1042.eqiad.wmnet with OS bookworm
[03:48:28] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain (T345811)
[03:48:33] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[03:48:34] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain
[03:49:09] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) (T345811)
[03:49:10] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1041.eqiad.wmnet with OS bookworm completed: - cloudvirt1041 (**...
[03:49:53] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1043.eqiad.wmnet with OS bookworm
[03:50:17] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain (T345811)
[04:11:27] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) (T345811)
[04:11:33] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[04:11:34] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[04:13:14] (PuppetFailure) firing: Puppet has failed on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:14:21] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain (T345811)
[04:15:17] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) (T345811)
[04:18:07] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99)
[04:18:32] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain (T345811)
[04:18:34] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain
[04:19:13] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) (T345811)
[04:19:15] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99)
[04:22:33] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1043.eqiad.wmnet with OS bookworm executed with errors: - cloudv...
[04:23:00] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1043.eqiad.wmnet with OS bookworm
[04:23:54] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[04:24:33] (SystemdUnitDown) firing: The service unit kiwix-mirror-update.service is in failed status on host clouddumps1001. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[04:25:03] (InstanceDown) firing: Project tools instance tools-k8s-worker-65 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[04:30:03] (InstanceDown) resolved: Project tools instance tools-k8s-worker-65 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[04:30:48] PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 177 bytes in 0.089 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[04:31:40] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1042.eqiad.wmnet with OS bookworm completed: - cloudvirt1042 (**...
[04:31:57] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[04:32:15] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[04:40:18] Toolforge (Toolforge iteration 02): [tbs] Improve Harbor quota handling and docs - https://phabricator.wikimedia.org/T351092 (tchin) I think the per-image quota should probably be increased. I tested building a few projects locally and a project with NodeJS and 0 dependencies results in a built image that's...
[04:46:59] (PuppetFailure) firing: Puppet has failed on cloudvirt2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:47:04] cloud-services-team: PuppetFailure cloudvirt2001-dev:9100 Puppet failure on cloudvirt2001-dev:9100 - https://phabricator.wikimedia.org/T351169 (phaultfinder)
[04:57:34] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[04:58:21] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1044.eqiad.wmnet with OS bookworm
[05:02:28] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain (T345811)
[05:02:34] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[05:03:25] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) (T345811)
[05:03:56] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1045.eqiad.wmnet with OS bookworm
[05:05:13] Cloud-VPS, cloud-services-team: cloudvirt1043 reimage failures - https://phabricator.wikimedia.org/T351171 (Andrew)
[05:05:25] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1043.eqiad.wmnet with OS bookworm executed with errors: - cloudv...
[05:14:36] Tool-iw, Toolforge: iw.toolforge.org does not support URL-encoded query parameters ([[toolforge:foo?bar]]) - https://phabricator.wikimedia.org/T345783 (Legoktm) The far easier solution would be to just have the tool in question not use a query string for parameters, aka `[[toolforge:scholia/Q1513315]]`,...
[05:18:29] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1044.eqiad.wmnet with OS bookworm executed with errors: - cloudv...
[05:18:43] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1044.eqiad.wmnet with OS bookworm
[05:40:49] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[05:41:08] !log andrew@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99)
[05:41:23] Cloud-VPS, cloud-services-team: cloudvirt1043 + cloudvirt1044 reimage failures - https://phabricator.wikimedia.org/T351171 (Andrew)
[05:41:59] Cloud-VPS, cloud-services-team: cloudvirt1043 + cloudvirt1044 reimage failures - https://phabricator.wikimedia.org/T351171 (Andrew)
[05:42:02] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[05:42:22] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[05:42:57] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1045.eqiad.wmnet with OS bookworm completed: - cloudvirt1045 (**...
[05:45:40] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1044.eqiad.wmnet with OS bookworm executed with errors: - cloudv...
[06:01:15] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:09:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [06:19:33] (SystemdUnitDownForLong) firing: The systemd unit kiwix-mirror-update.service on node clouddumps1001 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [06:30:03] (PuppetAgentNoResources) firing: No Puppet resources found on instance tools-sgeweblight-10-22 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [06:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [07:03:49] RECOVERY - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2002 is OK: OK: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:11:33] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [08:10:33] (SystemdUnitDown) firing: The service unit systemd-machined.service is in failed status on host cloudvirt1047. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1047 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [08:10:33] (SystemdUnitDown) firing: The service unit systemd-machined.service is in failed status on host cloudvirt1054. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1054 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [08:10:39] (SystemdUnitDown) firing: The service unit systemd-machined.service is in failed status on host cloudvirt1056. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1056 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [08:10:44] (SystemdUnitDown) firing: The service unit systemd-machined.service is in failed status on host cloudvirt1050. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1050 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [08:13:14] (PuppetFailure) firing: Puppet has failed on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:24:26] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [08:24:48] (SystemdUnitDown) firing: The service unit kiwix-mirror-update.service is in failed status on host clouddumps1001. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [08:39:21] 10Toolforge (Toolforge iteration 02): [tbs] Improve Harbor quota handling and docs - https://phabricator.wikimedia.org/T351092 (10Slst2020) Some notes on the current setup: * A Harbor project corresponds to a tool * Each project has one repository * Storage limits are applied on a per-project basis only. Harbor... [08:40:30] 10Toolforge (Toolforge iteration 02): [tbs] Improve Harbor quota handling and docs - https://phabricator.wikimedia.org/T351092 (10Slst2020) [08:40:34] 10Toolforge (Toolforge iteration 02): Add `toolforge build quota` command - https://phabricator.wikimedia.org/T341068 (10Slst2020) [08:40:36] 10Toolforge (Toolforge iteration 02): [builds-cli,builds-api] Allow build service to cleanup images to free quota - https://phabricator.wikimedia.org/T341067 (10Slst2020) [08:46:45] 10Toolforge (Toolforge iteration 02): [tbs] Improve Harbor quota handling and docs - https://phabricator.wikimedia.org/T351092 (10Slst2020) [08:46:47] 10Toolforge (Toolforge iteration 02), 10Cloud-Services-Origin-Team, 10Cloud-Services-Worktype-Project, 10User-dcaro: [tbs.maintain-harbor] Document current setup and admin procedures - https://phabricator.wikimedia.org/T329176 (10Slst2020) [08:47:14] (PuppetFailure) firing: Puppet has failed on cloudvirt2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:51:59] (PuppetFailure) firing: Puppet has failed on cloudcontrol2005-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:52:21] 10Toolforge (Toolforge iteration 02): [tbs] Give 
a meaningful error message when a user exceeds their Harbor quota - https://phabricator.wikimedia.org/T351178 (10Slst2020) [09:09:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [09:30:03] (PuppetAgentNoResources) firing: No Puppet resources found on instance tools-sgeweblight-10-22 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [09:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [10:01:15] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:05:33] (SystemdUnitDownForLong) firing: The systemd unit systemd-machined.service on node cloudvirt1056 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1056 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [10:05:33] (SystemdUnitDownForLong) firing: The systemd unit systemd-machined.service on node cloudvirt1054 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1054 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [10:05:39] 10cloud-services-team: SystemdUnitDownForLong cloudvirt1054:9100 Unit systemd-machined.service on node cloudvirt1054 has been down for long. - https://phabricator.wikimedia.org/T351185 (10phaultfinder) [10:05:39] (SystemdUnitDownForLong) firing: The systemd unit systemd-machined.service on node cloudvirt1047 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1047 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [10:05:41] 10cloud-services-team: SystemdUnitDownForLong cloudvirt1050:9100 Unit systemd-machined.service on node cloudvirt1050 has been down for long. - https://phabricator.wikimedia.org/T351186 (10phaultfinder) [10:05:43] 10cloud-services-team: SystemdUnitDownForLong cloudvirt1047:9100 Unit systemd-machined.service on node cloudvirt1047 has been down for long. - https://phabricator.wikimedia.org/T351187 (10phaultfinder) [10:05:44] (SystemdUnitDownForLong) firing: The systemd unit systemd-machined.service on node cloudvirt1050 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1050 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [10:05:45] 10cloud-services-team: SystemdUnitDownForLong cloudvirt1056:9100 Unit systemd-machined.service on node cloudvirt1056 has been down for long. 
- https://phabricator.wikimedia.org/T351188 (10phaultfinder) [10:11:33] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [10:14:33] (SystemdUnitDown) resolved: The service unit kiwix-mirror-update.service is in failed status on host clouddumps1001. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [10:14:34] (SystemdUnitDownForLong) resolved: The systemd unit kiwix-mirror-update.service on node clouddumps1001 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [10:15:33] (SystemdUnitDownForLong) resolved: The systemd unit systemd-machined.service on node cloudvirt1050 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1050 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [10:15:34] (SystemdUnitDown) resolved: The service unit systemd-machined.service is in failed status on host cloudvirt1050. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1050 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [10:17:59] (PuppetFailure) resolved: Puppet has failed on cloudcumin1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:20:03] (PuppetAgentNoResources) resolved: No Puppet resources found on instance tools-sgeweblight-10-22 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [10:21:41] 10Tool-nlwikibots: When a category is nominated on TBP (Dutch AfD) the cat's creator's talk page is categorized - https://phabricator.wikimedia.org/T351190 (10FrankGeerlings) [10:23:21] 10Tool-nlwikibots: When a category is nominated on TBP (Dutch AfD) the cat's creator's talk page is categorized - https://phabricator.wikimedia.org/T351190 (10FrankGeerlings) p:05Triage→03Low [10:24:03] (InstanceDown) firing: Project tools instance tools-k8s-worker-35 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [10:34:03] (InstanceDown) resolved: Project tools instance tools-k8s-worker-35 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [11:11:11] PROBLEM - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2002 is CRITICAL: CRITICAL: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:31:44] (03PS1) 10FNegri: set_maintenance: do not downtime host [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/974161 [11:41:20] 10Tool-Global-user-contributions: GUC unreachable - https://phabricator.wikimedia.org/T351194 (10Jeff_G) [11:44:44] !log admin fran@wmf3169 START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1046.eqiad.wmnet' 
(T345811) [11:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [11:44:50] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 [12:00:33] (SystemdUnitDownForLong) resolved: The systemd unit systemd-machined.service on node cloudvirt1056 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1056 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [12:00:34] (SystemdUnitDown) resolved: The service unit systemd-machined.service is in failed status on host cloudvirt1056. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1056 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [12:03:11] !log admin fran@wmf3169 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1046.eqiad.wmnet' (T345811) [12:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [12:03:17] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 [12:09:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [12:10:34] (SystemdUnitDown) firing: The service unit systemd-machined.service is in failed status on host cloudvirt1054. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1054 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [12:10:48] (SystemdUnitDown) firing: The service unit systemd-machined.service is in failed status on host cloudvirt1047. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1047 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [12:21:46] Cloud-VPS, cloud-services-team: systemd-machined crashing on some cloudvirts - https://phabricator.wikimedia.org/T351203 (taavi) [12:24:27] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [12:32:04] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [12:42:05] Tool-Global-user-contributions: GUC unreachable - https://phabricator.wikimedia.org/T351194 (Jeff_G) An hour later, it is responding. 
[12:47:14] (PuppetFailure) firing: Puppet has failed on cloudvirt2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:51:59] (PuppetFailure) firing: Puppet has failed on cloudcontrol2005-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:54:04] (CephSlowOps) firing: Ceph cluster in eqiad has 18 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [12:54:04] 10cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (10phaultfinder) [12:58:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 1 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps [13:05:23] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudvirt1046.eqiad.wmnet with OS bookworm [13:10:33] (SystemdUnitDownForLong) resolved: The systemd unit systemd-machined.service on node cloudvirt1054 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1054 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [13:10:34] (SystemdUnitDown) resolved: The service unit systemd-machined.service is in failed status on host cloudvirt1054. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1054 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [13:11:33] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [13:30:33] (SystemdUnitDown) resolved: The service unit systemd-machined.service is in failed status on host cloudvirt1047. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1047 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [13:30:33] (SystemdUnitDownForLong) resolved: The systemd unit systemd-machined.service on node cloudvirt1047 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDownForLong - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudvirt1047 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDownForLong [13:41:32] 10Data-Services, 10cloud-services-team, 10Infrastructure-Foundations, 10Patch-For-Review: nftables ignores drange filter for IPv6 if drange only has IPv4 addresses - https://phabricator.wikimedia.org/T351094 (10jbond) > However, the IPv6 rule should not be there, right now it's incorrectly allowing v6 traf... 
[14:01:15] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:05:41] 10Tool-bub2: Use .jsx for files that containt JSX syntax - https://phabricator.wikimedia.org/T348505 (10Aklapper) a:05SamMintah→03None [14:18:39] 10Tool-iw, 10Toolforge: iw.toolforge.org does not support URL-encoded query parameters ([[toolforge:foo?bar]]) - https://phabricator.wikimedia.org/T345783 (10Mike_Peel) Thanks @Legoktm - that seems to work for Scholia. However, it doesn't work for Resonator - try https://iw.toolforge.org/reasonator/Q1513315 vs... [14:27:39] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudvirt1046.eqiad.wmnet with OS bookworm executed with errors: - cloudv... [14:31:56] 10Cloud-VPS, 10cloud-services-team: cloudvirt1043 + cloudvirt1044 reimage failures - https://phabricator.wikimedia.org/T351171 (10fnegri) clouvirt1046 also shows a blank screen forever in `console com2`. Full output of the reimage cookbook for cloudvirt1046: https://phabricator.wikimedia.org/P53419 [14:32:38] 10Cloud-VPS, 10cloud-services-team: cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171 (10fnegri) [15:09:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [15:11:03] 10Cloud-VPS, 10cloud-services-team: cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171 (10Andrew) We may find more of these as we roll through the remaining dozen cloudvirts. For now, though, let's start with FW updates for these hosts. 
[15:25:27] 10Cloud-VPS, 10cloud-services-team: cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171 (10fnegri) > let's start with FW updates for these hosts. what is the procedure for FW updates? [15:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [15:48:36] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1044.eqiad.wmnet with OS bookworm [15:48:42] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1043.eqiad.wmnet with OS bookworm [16:00:04] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1044.eqiad.wmnet with OS bookworm executed with errors: - cloudv... [16:00:11] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1043.eqiad.wmnet with OS bookworm executed with errors: - cloudv... 
[16:00:38] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1044.eqiad.wmnet with OS bookworm [16:00:43] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1043.eqiad.wmnet with OS bookworm [16:11:33] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [16:14:16] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1046.eqiad.wmnet with OS bookworm [16:14:43] 10Striker: Security review of Tool Labs console application - https://phabricator.wikimedia.org/T135784 (10bd808) [16:28:55] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [16:37:27] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1046.eqiad.wmnet with OS bookworm [16:40:47] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171 (10fnegri) 05Open→03In progress p:05Triage→03High [16:40:51] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10fnegri) [16:42:51] (03PS2) 10FNegri: upgrade_openstack_node: add runtime description [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/969172 [16:44:39] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1044.eqiad.wmnet with OS bookworm completed: - cloudvirt1044 (**... [16:47:14] (PuppetFailure) firing: Puppet has failed on cloudvirt2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:05:34] 10Cloud-VPS, 10cloud-services-team: systemd-machined crashing on some cloudvirts - https://phabricator.wikimedia.org/T351203 (10taavi) [17:05:36] 10cloud-services-team: SystemdUnitDownForLong cloudvirt1056:9100 Unit systemd-machined.service on node cloudvirt1056 has been down for long. 
- https://phabricator.wikimedia.org/T351188 (taavi) [17:05:38] cloud-services-team: SystemdUnitDownForLong cloudvirt1047:9100 Unit systemd-machined.service on node cloudvirt1047 has been down for long. - https://phabricator.wikimedia.org/T351187 (taavi) [17:05:40] cloud-services-team: SystemdUnitDownForLong cloudvirt1050:9100 Unit systemd-machined.service on node cloudvirt1050 has been down for long. - https://phabricator.wikimedia.org/T351186 (taavi) [17:05:42] cloud-services-team: SystemdUnitDownForLong cloudvirt1054:9100 Unit systemd-machined.service on node cloudvirt1054 has been down for long. - https://phabricator.wikimedia.org/T351185 (taavi) [17:12:35] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1046.eqiad.wmnet with OS bookworm executed with errors: - cloudv... [17:12:42] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1046.eqiad.wmnet with OS bookworm [17:14:10] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Goal: Upgrade cloud-vps openstack to version 'Antelope' - https://phabricator.wikimedia.org/T341285 (fnegri) In progress→Stalled [17:14:36] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Goal: Upgrade cloud-vps openstack to version 'Antelope' - https://phabricator.wikimedia.org/T341285 (fnegri) [17:15:35] Cloud-VPS, cloud-services-team (FY2023/2024-Q1), Goal: Upgrade cloud-vps openstack to version 'Antelope' - https://phabricator.wikimedia.org/T341285 (fnegri) Stalled→In progress [17:17:35] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary [17:18:00] 
!log andrew@cloudcumin1001 cloudvirt-canary END (FAIL) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=99) [17:18:55] (OpenstackAPIResponse) firing: (3) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [17:21:55] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1043.eqiad.wmnet with OS bookworm executed with errors: - cloudv... [17:22:24] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary [17:22:42] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) [17:24:16] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1047.eqiad.wmnet' (T345811) [17:24:22] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 [17:24:56] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1047.eqiad.wmnet' (T345811) [17:26:30] 10Data-Services, 10cloud-services-team (FY2023/2024-Q1), 10Data-Persistence: [toolsdb] no alert if replication stops because of IO error - https://phabricator.wikimedia.org/T350943 (10fnegri) p:05Triage→03Medium @taavi @ABran-WMF could you please review the alerts in my previous comment and let me know i... [17:30:14] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171 (10Andrew) @Jclark-ctr did firmware upgrades. 
[] 1043 has the same grub prompt issue as before [x] 1044 is now working properly and back in service. [] 1046 has the same 'hangs a... [17:30:45] (CR) Andrew Bogott: [C: +2] set_maintenance: do not downtime host [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/974161 (owner: FNegri) [17:33:59] (Merged) jenkins-bot: set_maintenance: do not downtime host [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/974161 (owner: FNegri) [17:39:48] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1047.eqiad.wmnet' (T345811) [17:39:54] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 [17:43:13] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1048.eqiad.wmnet' (T345811) [17:53:27] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171 (Andrew) btw those hosts (cloudvirt1043 and cloudvirt1046) are fully out of service and can be restarted or reimaged at any time. 
[17:58:35] !log taavi@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [17:59:04] !log taavi@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [18:01:19] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1047.eqiad.wmnet' (T345811) [18:01:24] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 [18:02:30] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1047.eqiad.wmnet' (T345811) [18:03:45] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1047.eqiad.wmnet' (T345811) [18:04:27] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bookworm [18:06:01] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:06:23] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1049.eqiad.wmnet' (T345811) [18:06:28] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 [18:09:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [18:10:32] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1048.eqiad.wmnet' (T345811) [18:11:20] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] 
Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1048.eqiad.wmnet with OS bookworm [18:31:07] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=97) on host 'cloudvirt1049.eqiad.wmnet' (T345811) [18:31:13] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 [18:32:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [18:33:46] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1046.eqiad.wmnet with OS bookworm executed with errors: - cloudv... [18:38:29] 10PAWS: PAWS terraform to opentofu? - https://phabricator.wikimedia.org/T351249 (10rook) [18:51:01] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bookworm completed: - cloudvirt1047 (**... [18:53:35] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1048.eqiad.wmnet with OS bookworm completed: - cloudvirt1048 (**... 
[18:54:51] PROBLEM - ensure kvm processes are running on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:55:21] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary [18:55:47] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) [18:55:53] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary [18:56:18] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) [18:57:02] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1049.eqiad.wmnet with OS bookworm [18:57:07] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1050.eqiad.wmnet' (T345811) [18:57:11] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 [18:57:33] RECOVERY - ensure kvm processes are running on cloudvirt1048 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:00:03] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1052.eqiad.wmnet' (T345811) [19:03:39] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.component.deploy for component jobs-api [19:03:52] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.component.deploy (exit_code=0) for component jobs-api [19:04:03] RECOVERY - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2001 is OK: OK: Status of the systemd unit 
remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:11:33] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [19:17:25] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1050.eqiad.wmnet' (T345811) [19:17:31] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 [19:18:10] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1052.eqiad.wmnet' (T345811) [19:18:16] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1050.eqiad.wmnet with OS bookworm [19:21:51] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1053.eqiad.wmnet' (T345811) [19:27:02] PROBLEM - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2001 is CRITICAL: CRITICAL: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:38:59] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary [19:39:05] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) [19:40:30] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1049.eqiad.wmnet 
with OS bookworm completed: - cloudvirt1049 (**... [19:41:39] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1052.eqiad.wmnet with OS bookworm [19:47:23] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1053.eqiad.wmnet' (T345811) [19:47:29] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 [19:49:03] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1053.eqiad.wmnet' (T345811) [19:50:02] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1053.eqiad.wmnet' (T345811) [19:51:23] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171 (10Andrew) Reimage process is e.g. 
andrew@cumin1001:~$ sudo cookbook sre.hosts.reimage --new --puppet 5 --os bookworm -t T345811 cloudvirt1043 [19:54:29] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1053.eqiad.wmnet' (T345811) [19:54:35] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 [19:55:18] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1053.eqiad.wmnet' (T345811) [19:58:05] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary [19:58:12] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1054.eqiad.wmnet' (T345811) [19:58:26] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) [19:59:23] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary [19:59:26] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) [19:59:32] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1053.eqiad.wmnet' (T345811) [19:59:38] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 [20:00:01] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1050.eqiad.wmnet with OS bookworm completed: - cloudvirt1050 (**... 
[20:00:10] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1053.eqiad.wmnet' (T345811) [20:05:35] 10PAWS: Remove db_password variable - https://phabricator.wikimedia.org/T351255 (10rook) [20:05:50] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1053.eqiad.wmnet' (T345811) [20:05:56] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 [20:06:34] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1053.eqiad.wmnet' (T345811) [20:09:32] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cloudvirt1043.eqiad.wmnet with OS bullseye [20:09:38] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1053.eqiad.wmnet' (T345811) [20:10:37] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1053.eqiad.wmnet' (T345811) [20:11:14] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1053.eqiad.wmnet with OS bookworm [20:12:03] 10Data-Services, 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (10fnegri) After I lowered `innodb_buffer_pool_size` from 31G to 10G [[ https://phabricator.wikimedia.org/T34969... 
[20:12:44] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1055.eqiad.wmnet' (T345811) [20:12:50] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 [20:15:33] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1054.eqiad.wmnet' (T345811) [20:22:57] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary [20:23:19] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171 (10RobH) After chatting with @andrew in IRC I decided to take a look at this to help out: * checked all firmware versions were indeed updated correctly, yep * checked all bios se... [20:23:19] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) [20:24:56] 10Cloud-VPS, 10cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1052.eqiad.wmnet with OS bookworm completed: - cloudvirt1052 (**... 
[20:26:04] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1054.eqiad.wmnet with OS bookworm
[20:26:33] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1056.eqiad.wmnet' (T345811)
[20:26:38] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[20:28:48] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1055.eqiad.wmnet' (T345811)
[20:30:35] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1055.eqiad.wmnet with OS bookworm
[20:32:11] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cloudvirt1043.eqiad.wmnet with OS bullseye executed with errors: - cloudvirt1043 (**...
[20:33:41] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cloudvirt1043.eqiad.wmnet with OS bullseye
[20:43:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on cloudcontrol2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[20:47:14] (PuppetFailure) firing: Puppet has failed on cloudvirt2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:55:05] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1053.eqiad.wmnet with OS bookworm completed: - cloudvirt1053 (**...
[20:57:36] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[20:57:43] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[21:02:51] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[21:03:10] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[21:04:03] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1054.eqiad.wmnet with OS bookworm completed: - cloudvirt1054 (**...
[21:08:03] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[21:08:24] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[21:08:58] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1056.eqiad.wmnet' (T345811)
[21:09:03] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[21:09:10] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[21:09:28] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1056.eqiad.wmnet with OS bookworm
[21:09:37] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1055.eqiad.wmnet with OS bookworm completed: - cloudvirt1055 (**...
[21:10:30] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[21:10:52] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[21:12:18] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cloudvirt1043.eqiad.wmnet with OS bullseye completed: - cloudvirt1043 (**PASS**) -...
[21:13:31] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171 (Andrew) @Jclark-ctr did firmware upgrades. [x] 1043 reimaged just fine as soon as Rob did it instead of me [x] 1044 is now working properly and back in service. [] 1046 has th...
[21:16:27] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host cloudvirt1043.eqiad.wmnet with OS bookworm
[21:18:55] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[21:23:32] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1046.eqiad.wmnet with OS bookworm
[21:32:10] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[21:39:41] PAWS: PAWS terraform to opentofu? - https://phabricator.wikimedia.org/T351249 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/349
[21:39:49] vivian-rook opened https://github.com/toolforge/paws/pull/349
[21:52:07] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[21:52:28] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[21:52:28] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1056.eqiad.wmnet with OS bookworm completed: - cloudvirt1056 (**...
[21:59:02] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[21:59:14] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[22:00:35] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host cloudvirt1043.eqiad.wmnet with OS bookworm completed: - cloudvirt1043 (**WARN**) -...
[22:05:46] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1046.eqiad.wmnet with OS bookworm executed with errors: - cloudv...
[22:06:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:06:34] Cloud-VPS, cloud-services-team (FY2023/2024-Q1): cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171 (Andrew) [x] 1043 works fine when Rob reimages it. [x] 1044 is now working properly and back in service. [] 1046 has the same 'hangs at a blank screen during reboot' issue
[22:11:40] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[23:04:07] Tools: 'wikitanvirbot' tool missing pywikibot config - https://phabricator.wikimedia.org/T349916 (Wikitanvir) Open→Resolved
[23:04:12] Toolforge, cloud-services-team: tools-nfs-2 almost out of disk space (October 2023 edition) - https://phabricator.wikimedia.org/T349895 (Wikitanvir)
[23:33:43] PROBLEM - ensure kvm processes are running on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[23:36:15] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary
[23:36:39] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0)
[23:37:48] Toolforge (Toolforge iteration 02), Patch-For-Review: [tools,harbor] Cleanup old production images - https://phabricator.wikimedia.org/T348538 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/18 [maintain-harbor] minor readability r...
[23:37:49] RECOVERY - ensure kvm processes are running on cloudvirt1043 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[23:39:09] Toolforge: Tool (k8s-status or a new one) to display details about buildservice pipelines and Harbor images - https://phabricator.wikimedia.org/T336133 (Raymond_Ndibe)
[23:39:12] Toolforge (Toolforge iteration 02): [envvars-api] Add statistics - https://phabricator.wikimedia.org/T346228 (Raymond_Ndibe) Stalled→Resolved
[23:39:14] Toolforge: Tool (k8s-status or a new one) to display details about buildservice pipelines and Harbor images - https://phabricator.wikimedia.org/T336133 (Raymond_Ndibe)
[23:40:30] Toolforge (Toolforge iteration 02), cloud-services-team (FY2023/2024-Q1), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, and 3 others: [builds-api.start] Add statistics - https://phabricator.wikimedia.org/T337390 (Raymond_Ndibe) Stalled→Resolved
[23:50:20] Toolforge (Toolforge iteration 02), Patch-For-Review, User-Raymond_Ndibe: [apis] nginx fails to reload on config change - https://phabricator.wikimedia.org/T350928 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/56 [builds-api] forc...
[23:50:55] Toolforge (Toolforge iteration 02), Patch-For-Review, User-Raymond_Ndibe: [apis] nginx fails to reload on config change - https://phabricator.wikimedia.org/T350928 (CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/19 [envvars-api] fo...
[23:51:45] Toolforge (Toolforge iteration 02), Patch-For-Review, User-Raymond_Ndibe: [apis] nginx fails to reload on config change - https://phabricator.wikimedia.org/T350928 (Raymond_Ndibe) Open→In progress