[00:10:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:15:03] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:41:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[00:42:58] cloud-services-team (FY2023/2024-Q1), wikitech.wikimedia.org: [wikitech] administrator rights for WMCS - https://phabricator.wikimedia.org/T347557 (bd808) >>! In T347557#9335787, @bd808 wrote: > We should probably just import some nicer 'protectedpagetext' message overrides from [[https://meta.wikimedia....
[00:47:57] cloud-services-team (FY2023/2024-Q1), wikitech.wikimedia.org, User-bd808: [wikitech] administrator rights for WMCS - https://phabricator.wikimedia.org/T347557 (bd808) Open→Resolved a:bd808 >>! In T347557#9231642, @nskaggs wrote: > We need a bureaucrat to grant permissions: https://wikitec...
[00:51:59] (PuppetFailure) firing: Puppet has failed on cloudcontrol2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:52:59] (PuppetFailure) firing: Puppet has failed on cloudvirt2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:56:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[01:01:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[01:11:40] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[01:12:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[01:22:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[01:46:39] Grid-Engine-to-K8s-Migration: Migrate zumraband from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320211 (komla) Hello, This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need to...
[01:46:41] Grid-Engine-to-K8s-Migration: Migrate zkbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320208 (komla) Hello, This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need to mig...
[01:46:43] Grid-Engine-to-K8s-Migration: Migrate zimmerbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320207 (komla) Hello, This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need to...
[01:52:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[01:57:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[02:02:54] Grid-Engine-to-K8s-Migration: Migrate zayenbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320201 (komla) This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need to migrate...
[02:02:56] Grid-Engine-to-K8s-Migration: Migrate xslack from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320192 (komla) This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need to migrate t...
[02:02:58] Grid-Engine-to-K8s-Migration: Migrate wugbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320190 (komla) This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need to migrate t...
[02:03:00] Grid-Engine-to-K8s-Migration: Migrate wscontest from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320189 (komla) This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need to migrat...
[02:03:02] Grid-Engine-to-K8s-Migration: Migrate wnegar from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320182 (komla) This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need to migrate t...
[02:03:04] Grid-Engine-to-K8s-Migration: Migrate wmtran from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320181 (komla) This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need to migrate t...
[02:03:06] Grid-Engine-to-K8s-Migration: Migrate wmds-archive from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320179 (komla) This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need to mig...
[02:03:09] Grid-Engine-to-K8s-Migration: Migrate wmch from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320175 (komla) This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need to migrate to...
[02:03:11] Grid-Engine-to-K8s-Migration: Migrate wm-metrics from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320174 (komla) This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need to migra...
[02:03:13] Grid-Engine-to-K8s-Migration: Migrate wle from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320171 (komla) This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need to migrate to T...
[02:03:15] Grid-Engine-to-K8s-Migration: Migrate wikitasks from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320168 (komla) This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need to migrat...
[02:03:17] Grid-Engine-to-K8s-Migration: Migrate wikitanvirbot from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320167 (komla) This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need to mi...
[02:03:19] Grid-Engine-to-K8s-Migration: Migrate wikintu from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320161 (komla) This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need to migrate...
[02:03:21] Grid-Engine-to-K8s-Migration: Migrate wikihistory from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320157 (komla) This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need to migr...
[02:03:23] Grid-Engine-to-K8s-Migration, Tool-wikiloves: Migrate wikiloves from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320160 (komla) This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining...
[02:03:25] Grid-Engine-to-K8s-Migration: Migrate wikidata-timeline from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320154 (komla) This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need t...
[02:03:27] Grid-Engine-to-K8s-Migration, User-Dereckson: Migrate wikidata-nolabels from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320152 (komla) This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all re...
[02:03:29] Grid-Engine-to-K8s-Migration: Migrate wikidata-compare from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320150 (komla) This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need to...
[02:03:31] Grid-Engine-to-K8s-Migration: Migrate welcomebots-bn from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320145 (komla) This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need to m...
[02:03:34] Grid-Engine-to-K8s-Migration: Migrate wdvaliditycheck from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320144 (komla) This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need to...
[02:03:36] Grid-Engine-to-K8s-Migration: Migrate wahldiagramm from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320132 (komla) This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need to mig...
[02:03:38] Grid-Engine-to-K8s-Migration: Migrate vtwo from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320130 (komla) This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need to migrate to...
[02:03:40] Grid-Engine-to-K8s-Migration: Migrate vocabulary-index from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320128 (komla) This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need to...
[02:03:42] Grid-Engine-to-K8s-Migration: Migrate vltools from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320127 (komla) This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need to migrate...
[02:03:45] Grid-Engine-to-K8s-Migration: Migrate videoconvert from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320126 (komla) This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need to mig...
[02:03:47] Grid-Engine-to-K8s-Migration: Migrate vectorizer from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320120 (komla) This is a reminder that the tool for which this ticket is created is still running on the Grid. The grid is deprecated and all remaining tools need to migra...
[02:18:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[02:21:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:38:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[02:40:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[02:50:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[02:52:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[03:02:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[03:05:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[03:10:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[03:19:21] cloud-services-team: galera lock-up in codfw1dev - https://phabricator.wikimedia.org/T351281 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudcontrol2001-dev.codfw.wmnet with OS bookworm
[03:25:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[03:35:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[03:38:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[03:41:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[03:48:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[03:48:59] (PuppetFailure) firing: Puppet has failed on cloudcontrol2005-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:57:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[04:02:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[04:05:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[04:06:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[04:10:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[04:11:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[04:11:40] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[04:14:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[04:19:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[04:41:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[04:42:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[04:46:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[04:47:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[04:51:59] (PuppetFailure) firing: Puppet has failed on cloudcontrol2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:52:59] (PuppetFailure) firing: Puppet has failed on cloudvirt2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:57:56] cloud-services-team: galera lock-up in codfw1dev - https://phabricator.wikimedia.org/T351281 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudcontrol2001-dev.codfw.wmnet with OS bookworm completed: - cloudcontrol2001-dev (**WARN**) - Downtimed on Icinga/Al...
[05:02:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[05:12:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[05:17:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[05:22:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[05:28:59] (PuppetFailure) resolved: Puppet has failed on cloudcontrol2005-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:46:59] (PuppetFailure) resolved: Puppet has failed on cloudcontrol2004-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:49:04] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack
[05:51:02] (NodeTextfileStale) resolved: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:51:33] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0)
[05:57:59] (PuppetFailure) resolved: Puppet has failed on cloudvirt2001-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:04:35] (ProbeDown) firing: (4) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[06:09:35] (ProbeDown) resolved: (4) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[06:36:35] (ProbeDown) firing: (4) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[06:41:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[06:41:35] (ProbeDown) resolved: (4) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[07:08:23] RECOVERY - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2002 is OK: OK: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:10:28] Grid-Engine-to-K8s-Migration: Migrate wikitasks from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320168 (Vort) > This is a reminder No need for reminder until T295220 is fixed. Am I talking to bots here?
[07:11:40] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[07:52:41] Grid-Engine-to-K8s-Migration: Migrate phetools from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319965 (Xover) Oh, somehow missed that this task got created and assigned to me. The short status for this is that I inherited phetools when the original contributor (Phe) we...
[08:05:20] PROBLEM - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2002 is CRITICAL: CRITICAL: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:19:51] (ProbeDown) firing: Service toolsbeta-test-k8s-haproxy-4:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#toolsbeta-test-k8s-haproxy-4:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[08:24:51] (ProbeDown) firing: (4) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[08:32:34] Toolforge (Quota-requests): Request increased quota for anchor-corrector Toolforge tool - https://phabricator.wikimedia.org/T350484 (Kanashimi) @komla I want this task to run continuously, so I removed the schedule setting and left continuous: true. As for the last three tasks, they are set to run on a sched...
[08:34:51] (ProbeDown) resolved: (4) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[08:39:19] Tools, WMDE-TechWish-Maintenance, WMDE-TechWish-Maintenance-2023: Check technischewuensche tool code and publish in a public repo - https://phabricator.wikimedia.org/T350352 (WMDE-Fisch) Thanks, we'll be able to talk it through with the remaining involved role at the beginning of next week. So I'm op...
[09:04:25] Toolforge (Toolforge iteration 02): [tbs] Improve Harbor quota handling and docs - https://phabricator.wikimedia.org/T351092 (Slst2020)
[09:16:49] Grid-Engine-to-K8s-Migration: Migrate wle from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320171 (RLuts) Open→Resolved
[09:17:06] Grid-Engine-to-K8s-Migration: Migrate wle from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320171 (RLuts) done
[09:34:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[09:35:35] (ProbeDown) firing: (4) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:39:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[09:40:35] (ProbeDown) resolved: (4) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:41:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed
[09:41:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[09:46:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[09:46:37] (CephSlowOps) firing: Ceph cluster in eqiad has 101 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[09:46:42] cloud-services-team: CephSlowOps Ceph cluster in eqiad has slow ops, which might be blocking some writes - https://phabricator.wikimedia.org/T349502 (phaultfinder)
[09:48:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[09:51:37] (CephSlowOps) resolved: Ceph cluster in eqiad has 6 slow ops - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephSlowOps - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephSlowOps
[09:53:00] PROBLEM - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/cron - 177 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[09:53:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[10:11:40] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[10:24:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[10:34:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[10:40:03] (PuppetAgentFailure) firing: Puppet agent failure detected on instance tools-sgebastion-11 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[10:46:55] cloud-services-team (FY2023/2024-Q1), wikitech.wikimedia.org, User-bd808: [wikitech] administrator rights for WMCS - https://phabricator.wikimedia.org/T347557 (fnegri) Thanks @bd808, the new message is so much better!
[10:59:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [11:04:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [11:19:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [11:29:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [11:43:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [11:48:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [11:51:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [12:05:50] !log taavi@cloudcumin2001 admin START - Cookbook wmcs.openstack.restart_openstack [12:09:24] !log taavi@cloudcumin2001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [12:11:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [12:11:36] 10Cloud-VPS, 10cloud-services-team, 10decommission-hardware: decommission cloudmetrics1003.eqiad.wmnet, cloudmetrics1004.eqiad.wmnet - https://phabricator.wikimedia.org/T351077 (10taavi) [12:23:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [12:26:17] 10Grid-Engine-to-K8s-Migration: Migrate wikitasks from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320168 (10Aklapper) [12:26:21] 10Toolforge: CERTIFICATE_VERIFY_FAILED error when trying to access Wikipedia API with Mono - https://phabricator.wikimedia.org/T295220 (10Aklapper) [12:33:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [12:36:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [12:41:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [12:45:03] (PuppetAgentFailure) resolved: Puppet agent failure detected on instance tools-sgebastion-11 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [12:45:57] 10Toolforge: CERTIFICATE_VERIFY_FAILED error when trying to access Wikipedia API with Mono - https://phabricator.wikimedia.org/T295220 (10taavi) I ran the provided test case and it seems to be working: `lang=shell-session :# on the 
bastion tools.taavi-test-tool@tools-sgebastion-11:~$ mono T295220/WikiTLSTest.exe... [12:46:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [13:11:40] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [13:19:25] punithnayak opened https://github.com/toolforge/paws/pull/350 [13:20:40] 10PAWS, 10good first task: Remove variables if unused. - https://phabricator.wikimedia.org/T350812 (10rook) https://github.com/toolforge/paws/pull/350 [13:20:50] 10PAWS, 10good first task: Remove variables if unused. - https://phabricator.wikimedia.org/T350812 (10rook) 05Open→03Resolved [13:31:41] 10PAWS: update opentofu version - https://phabricator.wikimedia.org/T351402 (10rook) [13:33:28] 10Cloud-VPS (Project-requests), 10cloud-services-team, 10GitLab, 10Release-Engineering-Team (Quid Pro Crow 🦃), 10User-brennen: Request creation of devel-stats VPS project - https://phabricator.wikimedia.org/T351330 (10rook) 05Open→03In progress a:03rook [13:38:25] 10Cloud-VPS (Project-requests), 10cloud-services-team, 10GitLab, 10Release-Engineering-Team (Quid Pro Crow 🦃), 10User-brennen: Request creation of devel-stats VPS project - https://phabricator.wikimedia.org/T351330 (10rook) C'est fait! ` root@cloudcontrol1005:~# openstack project create --description 'd... 
[13:38:39] 10Cloud-VPS (Project-requests), 10cloud-services-team, 10GitLab, 10Release-Engineering-Team (Quid Pro Crow 🦃), 10User-brennen: Request creation of devel-stats VPS project - https://phabricator.wikimedia.org/T351330 (10rook) 05In progress→03Resolved [13:51:33] RECOVERY - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [14:11:53] 10Toolforge: CERTIFICATE_VERIFY_FAILED error when trying to access Wikipedia API with Mono - https://phabricator.wikimedia.org/T295220 (10Vort) @taavi how to apply this fix to my Toolforge account? I probably have some old data stuck somewhere. This is what I'm getting right now: {F41510918} [14:15:00] 10Toolforge: [toolsdb] Can't authenticate with Toolsdb - https://phabricator.wikimedia.org/T351410 (10Slst2020) [14:16:49] 10Toolforge: CERTIFICATE_VERIFY_FAILED error when trying to access Wikipedia API with Mono - https://phabricator.wikimedia.org/T295220 (10taavi) I deleted some Mono certificate cache directories from your tool's home directory: `lang=shell-session tools.wikitasks@tools-sgebastion-10:~$ rm -rf .config/.mono/certs... [14:19:43] 10Toolforge: CERTIFICATE_VERIFY_FAILED error when trying to access Wikipedia API with Mono - https://phabricator.wikimedia.org/T295220 (10Vort) 05Open→03Resolved a:03Vort Thank you. I will proceed with learning of how to migrate my tools next. [14:19:45] 10Grid-Engine-to-K8s-Migration: Migrate wikitasks from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320168 (10Vort) [14:33:55] 10Toolforge Jobs framework, 10cloud-services-team (FY2023/2024-Q1), 10Pywikibot, 10Patch-For-Review, 10User-Raymond_Ndibe: Create Docker image for Toolforge that is purpose built to run pywikibot scripts - https://phabricator.wikimedia.org/T249787 (10taavi) >>! 
In T249787#9253386, @taavi wrote: > * How t... [14:55:33] 10Data-Services: [toolsdb] Can't authenticate with Toolsdb - https://phabricator.wikimedia.org/T351410 (10JJMC89) [15:28:39] 10Toolforge (Quota-requests): Request increased quota for anchor-corrector Toolforge tool - https://phabricator.wikimedia.org/T350484 (10komla) @Kanashimi so you want only this job to be a continuous job? then you are good to go. [15:32:27] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [15:41:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [15:44:23] 10Grid-Engine-to-K8s-Migration: Migrate wikitasks from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320168 (10Vort) I started conversion process, but it will take several days to set up everything correctly and check if tools work stable enough. My tools are launched once a d... [15:59:07] 10Tools: HarvestTemplates not available - https://phabricator.wikimedia.org/T351427 (10M2k_dewiki) [16:03:32] 10Tools: HarvestTemplates not available - https://phabricator.wikimedia.org/T351427 (10JJMC89) 05Open→03Invalid Issues are tracked at https://github.com/Pascalco/harvesttemplates/issues. 
[16:11:40] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [16:23:31] 10Toolforge (Toolforge iteration 02), 10User-Raymond_Ndibe: [gitlab,toolforge-deploy] Create a process to open an MR to toolforge-deploy when a new release of a component happens - https://phabricator.wikimedia.org/T347392 (10Raymond_Ndibe) a:03Raymond_Ndibe [17:44:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [17:49:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [17:58:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [18:03:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [18:08:21] (OpenstackAPIResponse) resolved: Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [18:21:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [18:31:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [18:33:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [18:41:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [18:43:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [18:43:58] 10Data-Services, 10cloud-services-team (FY2023/2024-Q1), 10Cloud-Services-Origin-Alert: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695 (10fnegri) [18:44:04] 10Data-Services, 10cloud-services-team (FY2023/2024-Q1), 10Data-Persistence: [toolsdb] no alert if replication stops because of IO error - https://phabricator.wikimedia.org/T350943 (10fnegri) 05In progress→03Resolved Resolving for now, we can open a new task if we find edge cases where the current alerts... 
[18:48:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [18:48:07] 10Cloud-VPS, 10cloud-services-team, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Migrate Cloud VPS puppet infrastructure to Puppet 7 - https://phabricator.wikimedia.org/T351450 (10taavi) [18:48:34] 10Cloud-VPS, 10cloud-services-team, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Migrate Cloud VPS central puppet server to Puppet 7 - https://phabricator.wikimedia.org/T351451 (10taavi) [18:49:13] 10VPS-Projects, 10cloud-services-team, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Migrate per-project Puppet servers to Puppet 7 - https://phabricator.wikimedia.org/T351452 (10taavi) [18:50:39] 10VPS-Projects, 10cloud-services-team, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Migrate Puppet servers in Cloud Services team managed projects to Puppet 7 - https://phabricator.wikimedia.org/T351453 (10taavi) [18:50:55] 10VPS-Projects, 10cloud-services-team, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Migrate per-project Puppet servers to Puppet 7 - https://phabricator.wikimedia.org/T351452 (10taavi) [18:53:04] 10Cloud-VPS, 10cloud-services-team, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Write script or cookbook to migrate data from a Puppet 5 puppetmaster to a Puppet 7 puppetserver - https://phabricator.wikimedia.org/T351454 (10taavi) [18:53:38] 10Cloud-VPS, 10cloud-services-team, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Update designate-sink cert cleaning hook to work with Puppet 7 CA changes - https://phabricator.wikimedia.org/T351455 (10taavi) [19:03:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [19:05:24] 
RECOVERY - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2001 is OK: OK: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:09:57] 10Toolforge (Toolforge iteration 02), 10Patch-For-Review, 10User-Raymond_Ndibe: [gitlab,toolforge-deploy] Create a process to open an MR to toolforge-deploy when a new release of a component happens - https://phabricator.wikimedia.org/T347392 (10CodeReviewBot) raymond-ndibe opened https://gitlab.wikimedia.org... [19:11:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [19:11:20] 10Toolforge (Toolforge iteration 02), 10Patch-For-Review, 10User-Raymond_Ndibe: [gitlab,toolforge-deploy] Create a process to open an MR to toolforge-deploy when a new release of a component happens - https://phabricator.wikimedia.org/T347392 (10Raymond_Ndibe) 05Open→03In progress [19:11:40] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [19:15:56] (ToolsToolsDBReplicationLagIsTooHigh) firing: ToolsDB replication on tools-db-2 is lagging behind the primary, the current lag is 32469 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh [19:17:10] PROBLEM - Check unit status of remove_dangling_cinder_snapshots on cloudbackup2001 is CRITICAL: CRITICAL: Status of the systemd unit remove_dangling_cinder_snapshots https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:21:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on 
instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [19:25:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [19:30:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [19:36:35] (ProbeDown) firing: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#toolsbeta-test-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [19:40:59] 10Data-Services, 10cloud-services-team (FY2023/2024-Q1-Q2): [toolsdb] Replication stopped because of invalid event - https://phabricator.wikimedia.org/T351457 (10fnegri) [19:41:35] (ProbeDown) resolved: (2) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#toolsbeta-test-k8s-haproxy-3:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [19:42:39] 10Data-Services, 10cloud-services-team (FY2023/2024-Q1-Q2): [toolsdb] Replication stopped because of invalid event - https://phabricator.wikimedia.org/T351457 (10fnegri) 05Open→03Resolved I have updated the runbook at https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication#If... 
[19:43:35] (ProbeDown) firing: (4) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [19:48:35] (ProbeDown) resolved: (4) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [19:49:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [19:51:35] (ProbeDown) firing: (4) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [19:54:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [19:54:42] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for all workers [19:56:35] (ProbeDown) resolved: (4) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [20:15:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [20:20:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [20:22:03] (InstanceDown) firing: Project tools instance tools-k8s-worker-70 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:22:34] 10Tool-ducttape, 10Abstract Wikipedia team, 10Wikifunctions: New function orchestrator patches are tested with DUCT. - https://phabricator.wikimedia.org/T333191 (10Jdforrester-WMF) [20:23:04] 10Tool-ducttape, 10Abstract Wikipedia team, 10Wikifunctions: Run DUCT's end-to-end testing system on new function-orchestrator patches on GitLab, like we do on WikiLambda on gerrit - https://phabricator.wikimedia.org/T333191 (10Jdforrester-WMF) [20:27:03] (InstanceDown) resolved: Project tools instance tools-k8s-worker-70 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:39:26] (ToolsToolsDBReplicationLagIsTooHigh) resolved: ToolsDB replication on tools-db-2 is lagging behind the primary, the current lag is 4043 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBReplicationLagIsTooHigh [20:40:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [20:45:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [20:50:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance 
metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [20:51:20] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [20:55:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [20:56:20] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [21:07:03] (InstanceDown) firing: Project tools instance tools-k8s-worker-30 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [21:08:33] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for all workers [21:12:03] (InstanceDown) resolved: Project tools instance tools-k8s-worker-30 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [21:29:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [21:34:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [21:36:03] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [21:36:33] 10Cloud-VPS, 10cloud-services-team, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Andrew tries to make a cloud-vps puppet7 server - https://phabricator.wikimedia.org/T351468 (10Andrew) [21:38:27] 10Cloud-VPS, 10cloud-services-team, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Andrew tries to make a cloud-vps puppet7 server - https://phabricator.wikimedia.org/T351468 (10Andrew) I'm a couple of patches in (https://gerrit.wikimedia.org/r/c/operations/puppet/+/975075 and https://gerrit.wikimedi... [21:41:03] (TfInfraTestDestroyFailed) firing: Terraform failed to destroy the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestDestroyFailed [21:44:53] 10Cloud-VPS, 10cloud-services-team, 10Infrastructure-Foundations, 10Puppet (Puppet 7.0): Andrew tries to make a cloud-vps puppet7 server - https://phabricator.wikimedia.org/T351468 (10Andrew) That seems to be because there's nothing in /srv/puppet_code/environments where I would expect the puppet source to... [21:46:03] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance metricsinfra-alertmanager-1 in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [22:11:40] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resources on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed [22:33:05] 10Toolforge (Quota-requests): Request increased quota for anchor-corrector Toolforge tool - https://phabricator.wikimedia.org/T350484 (10Kanashimi) @taavi So I think I need to increase the quota for performing continuous tasks... 
[22:34:58] 10Toolforge (Toolforge iteration 02): [envvars-cli] move pytest from tox to pre-commit - https://phabricator.wikimedia.org/T351476 (10Raymond_Ndibe) [23:10:35] (ProbeDown) firing: (4) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [23:40:35] (ProbeDown) resolved: (4) Service toolsbeta-test-k8s-haproxy-3:30000 has failed probes (http_admin_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [23:49:03] 10Toolforge (Toolforge iteration 02), 10Documentation: [tbs] Create a tutorial on how to deploy a ruby on rails tool using build service - https://phabricator.wikimedia.org/T347402 (10bd808) >>! In T347402#9333908, @Slst2020 wrote: > It is painfully slow though, which is surprising considering that it's just a...