[00:08:56] RECOVERY - Check unit status of purge_vm_backup on cloudbackup1003 is OK: OK: Status of the systemd unit purge_vm_backup https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:10:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[00:10:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:13:59] (SystemdUnitDown) resolved: The service unit purge_vm_backup.service is in failed status on host cloudbackup1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[00:20:03] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:40:21] (CR) Krinkle: [C: +2] Protect subtitles of audio files too [labs/tools/fileprotectionsync] - https://gerrit.wikimedia.org/r/977271 (owner: Legoktm)
[00:40:47] (Merged) jenkins-bot: Protect subtitles of audio files too [labs/tools/fileprotectionsync] - https://gerrit.wikimedia.org/r/977271 (owner: Legoktm)
[02:14:57] (OpenstackAPIResponse) firing: Openstack API average response time is too high.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[03:10:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[06:10:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[06:14:58] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[07:30:18] Cloud-VPS (Quota-requests), MinT, Language-Team (Language-2023-October-December): Create large instance for MinT - https://phabricator.wikimedia.org/T352136 (KartikMistry)
[07:30:37] Cloud-VPS (Quota-requests), MinT, Language-Team (Language-2023-October-December): Create large instance for MinT - https://phabricator.wikimedia.org/T352136 (KartikMistry)
[08:18:19] (HAProxyBackendUnavailable) firing: HAProxy service neutron-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[08:23:19] (HAProxyBackendUnavailable) resolved: HAProxy service neutron-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[08:48:16]
cloud-services-team, sre-alert-triage: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342757 (LSobanski) Open→Resolved a:LSobanski Resolving as the alert is no longer active.
[09:10:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[09:50:15] Toolforge: Quota / webservice resource change? - https://phabricator.wikimedia.org/T352251 (DamianZaremba)
[10:08:17] (CR) Hashar: Add CORS header for the json API (1 comment) [labs/tools/train-blockers] - https://gerrit.wikimedia.org/r/978139 (owner: Hashar)
[10:19:00] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[10:19:04] Toolforge (Toolforge iteration 02): Add `toolforge build quota` command - https://phabricator.wikimedia.org/T341068 (Slst2020)
[10:19:06] Toolforge (Toolforge iteration 02): [builds-api] Use admin user credentials for Harbor API auth in dev - https://phabricator.wikimedia.org/T352022 (Slst2020) Open→Resolved
[10:41:41] Grid-Engine-to-K8s-Migration: Migrate listpages from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319860 (Aka) Open→Resolved
[10:47:34] Grid-Engine-to-K8s-Migration: Migrate inactiveadmins from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319808 (Aka) Open→Resolved
[10:48:27] (OpenstackAPIResponse) firing: Openstack API average response time is too high.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[11:19:34] (SystemdUnitDown) firing: The systemd unit backup_glance_images.service on node cloudbackup1003 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[11:19:39] cloud-services-team: SystemdUnitDown Unit backup_glance_images.service on node cloudbackup1003 has been down for long. - https://phabricator.wikimedia.org/T352261 (phaultfinder)
[11:31:31] (Abandoned) Jbond: gerrit: add mock secrets [labs/private] - https://gerrit.wikimedia.org/r/832264 (owner: Jbond)
[11:36:42] Toolforge: Quota / webservice resource change? - https://phabricator.wikimedia.org/T352251 (taavi) This is a result of a recent change to the default quotas (T333979) that accidentally lowered `requests.memory` from `6Gi` to `4Gi` (T352055). The idea behind these changes was to eliminate the two different `r...
[11:36:47] Toolforge: Toolforge Kubernetes quota requests.memory was reduced - https://phabricator.wikimedia.org/T352055 (taavi)
[11:36:56] Toolforge: Quota / webservice resource change? - https://phabricator.wikimedia.org/T352251 (taavi)
[11:43:27] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[11:47:37] (Abandoned) Jbond: used to test Ifa35d19910c9c162ef25c59da55b1588d281bccd [labs/private] - https://gerrit.wikimedia.org/r/852994 (owner: Jbond)
[11:52:41] Grid-Engine-to-K8s-Migration: Migrate revertstat from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T320005 (Aka) Open→Resolved
[12:03:27] (OpenstackAPIResponse) firing: (2) Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[12:10:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[12:49:38] PAWS: tofu state file to object storage - https://phabricator.wikimedia.org/T352164 (github-toolforge-bot) vivian-rook opened https://github.com/toolforge/paws/pull/354
[12:49:44] vivian-rook opened https://github.com/toolforge/paws/pull/354
[13:04:34] (SystemdUnitDown) resolved: The systemd unit backup_glance_images.service on node cloudbackup1003 has been failing for more than two hours.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[13:08:49] RECOVERY - Check unit status of backup_glance_images on cloudbackup1003 is OK: OK: Status of the systemd unit backup_glance_images https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:28:28] (OpenstackAPIResponse) resolved: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[14:08:01] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2), Goal: Upgrade cloud-vps openstack to version 'Antelope' - https://phabricator.wikimedia.org/T341285 (fnegri)
[14:08:03] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2): [openstack] Upgrade eqiad1 cluster to Antelope - https://phabricator.wikimedia.org/T348843 (fnegri) Open→In progress
[14:19:57] (OpenstackAPIResponse) firing: Openstack API average response time is too high.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[14:30:03] (PuppetAgentNoResources) firing: No Puppet resources found on instance runner-1027 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[14:35:03] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance runner-1025 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[14:36:03] (PuppetAgentNoResources) firing: No Puppet resources found on instance toolsbeta-puppetdb-02 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[14:38:03] (PuppetAgentNoResources) firing: No Puppet resources found on instance cloudinfra-acme-chief-01 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[14:40:03] (PuppetAgentNoResources) firing: (4) No Puppet resources found on instance runner-1025 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[14:42:18] Cloud-VPS, cloud-services-team (FY2023/2024-Q1-Q2), Patch-For-Review: [openstack] Upgrade eqiad1 cluster to Antelope - https://phabricator.wikimedia.org/T348843 (fnegri)
[14:43:03] (PuppetAgentNoResources) firing: (4) No Puppet resources found on instance cloud-puppetmaster-03 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[14:45:03] (PuppetAgentNoResources) firing: (7) No Puppet resources found on instance runner-1021 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[14:48:03] (PuppetAgentNoResources) firing: (7) No Puppet resources found on instance cloud-puppetmaster-03 on project cloudinfra -
https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[14:50:03] (PuppetAgentNoResources) firing: (9) No Puppet resources found on instance runner-1021 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[14:50:12] !log admin fran@wmf3169 START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (T348843)
[14:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[14:50:20] T348843: [openstack] Upgrade eqiad1 cluster to Antelope - https://phabricator.wikimedia.org/T348843
[14:53:03] (PuppetAgentNoResources) firing: (8) No Puppet resources found on instance cloud-puppetmaster-03 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[14:54:03] (PuppetAgentNoResources) firing: No Puppet resources found on instance tools-puppetdb-1 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[14:54:18] (CR) Andrew Bogott: [C: +1] upgrade_openstack_node: add runtime description [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/969172 (owner: FNegri)
[14:55:03] (PuppetAgentNoResources) firing: (10) No Puppet resources found on instance runner-1021 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[14:55:14] (CR) FNegri: [C: +2] upgrade_openstack_node: add runtime description [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/969172 (owner: FNegri)
[14:56:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance toolsbeta-acme-chief-01 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[14:58:03] (PuppetAgentNoResources) firing: (8) No Puppet resources found on instance cloud-puppetmaster-03 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[14:58:19]
(HAProxyBackendUnavailable) firing: HAProxy service designate-api_backend backend cloudservices1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[14:59:03] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance tools-acme-chief-01 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[14:59:57] !log admin fran@wmf3169 END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) (T348843)
[15:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:00:03] T348843: [openstack] Upgrade eqiad1 cluster to Antelope - https://phabricator.wikimedia.org/T348843
[15:00:03] (PuppetAgentNoResources) firing: (10) No Puppet resources found on instance runner-1021 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[15:00:26] PROBLEM - Check DNS auth via UDP of www.wmcloud.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[15:00:34] PROBLEM - Check DNS auth via UDP of tools-sgegrid-master.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[15:00:36] PROBLEM - Check systemd state on cloudservices1005 is CRITICAL: CRITICAL - degraded: The following units failed: anycast-healthchecker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:00:44] PROBLEM - Check DNS auth via UDP of login.toolforge.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while
executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[15:00:48] PROBLEM - Check DNS auth via TCP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[15:00:54] PROBLEM - Check DNS auth via UDP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[15:01:03] Cloud-VPS: [wmcs-cookbooks] unify upgrade_openstack cookbooks - https://phabricator.wikimedia.org/T352297 (fnegri)
[15:02:24] RECOVERY - Check DNS auth via UDP of www.wmcloud.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.011 seconds response time (www.wmcloud.org. 3600 IN CNAME wmcloud.org.) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[15:02:30] RECOVERY - Check DNS auth via UDP of tools-sgegrid-master.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.041 seconds response time (tools-sgegrid-master.tools.eqiad1.wikimedia.cloud. 60 IN A 172.16.5.129) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[15:02:38] RECOVERY - Check DNS auth via UDP of login.toolforge.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.018 seconds response time (login.toolforge.org. 3600 IN CNAME bastion.toolforge.org.)
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[15:02:38] RECOVERY - Check systemd state on cloudservices1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:02:42] RECOVERY - Check DNS auth via TCP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.040 seconds response time (k8s.svc.tools.eqiad1.wikimedia.cloud. 300 IN A 172.16.6.113) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[15:02:48] RECOVERY - Check DNS auth via UDP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.060 seconds response time (k8s.svc.tools.eqiad1.wikimedia.cloud. 300 IN A 172.16.6.113) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[15:03:03] (PuppetAgentNoResources) firing: (8) No Puppet resources found on instance cloud-puppetmaster-03 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[15:03:19] (HAProxyBackendUnavailable) resolved: HAProxy service designate-api_backend backend cloudservices1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[15:05:03] (PuppetAgentNoResources) firing: (10) No Puppet resources found on instance runner-1021 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[15:08:03] (PuppetAgentNoResources) resolved: (8) No Puppet resources found on instance cloud-puppetmaster-03 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[15:09:03] (PuppetAgentNoResources) firing: (3) No Puppet resources found on instance tools-acme-chief-01 on project tools -
https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[15:09:44] (PS3) FNegri: upgrade_openstack_node: add runtime description [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/969172
[15:09:51] (CR) FNegri: [V: +2] upgrade_openstack_node: add runtime description [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/969172 (owner: FNegri)
[15:10:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[15:10:03] (PuppetAgentNoResources) firing: (10) No Puppet resources found on instance runner-1021 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[15:11:03] (PuppetAgentNoResources) firing: (2) No Puppet resources found on instance toolsbeta-acme-chief-01 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[15:13:35] (Merged) jenkins-bot: upgrade_openstack_node: add runtime description [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/969172 (owner: FNegri)
[15:14:03] (PuppetAgentNoResources) resolved: (3) No Puppet resources found on instance tools-acme-chief-01 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[15:15:03] (PuppetAgentNoResources) firing: (10) No Puppet resources found on instance runner-1021 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[15:15:07] !log admin fran@wmf3169 START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1006.eqiad.wmnet' (T348843)
[15:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:15:16] T348843: [openstack] Upgrade eqiad1 cluster to Antelope -
https://phabricator.wikimedia.org/T348843
[15:20:03] (PuppetAgentNoResources) firing: (6) No Puppet resources found on instance runner-1022 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[15:21:03] (PuppetAgentNoResources) resolved: (2) No Puppet resources found on instance toolsbeta-acme-chief-01 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[15:25:03] (PuppetAgentNoResources) resolved: (5) No Puppet resources found on instance runner-1025 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[15:25:19] (HAProxyBackendUnavailable) firing: HAProxy service designate-api_backend backend cloudservices1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[15:26:58] !log admin fran@wmf3169 END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudservices1006.eqiad.wmnet' (T348843)
[15:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[15:27:04] T348843: [openstack] Upgrade eqiad1 cluster to Antelope - https://phabricator.wikimedia.org/T348843
[15:28:31] PROBLEM - Check DNS auth via UDP of www.wmcloud.org on server ns1.openstack.eqiad1.wikimediacloud.org on cloudservices1006 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[15:28:31] PROBLEM - Check DNS auth via TCP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns1.openstack.eqiad1.wikimediacloud.org on cloudservices1006 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[15:29:49] RECOVERY - Check DNS auth via UDP of www.wmcloud.org on server ns1.openstack.eqiad1.wikimediacloud.org
on cloudservices1006 is OK: DNS OK - 0.013 seconds response time (www.wmcloud.org. 3600 IN CNAME wmcloud.org.) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[15:29:49] RECOVERY - Check DNS auth via TCP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns1.openstack.eqiad1.wikimediacloud.org on cloudservices1006 is OK: DNS OK - 0.021 seconds response time (k8s.svc.tools.eqiad1.wikimedia.cloud. 300 IN A 172.16.6.113) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[15:30:19] (HAProxyBackendUnavailable) resolved: HAProxy service designate-api_backend backend cloudservices1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[15:33:09] PROBLEM - Check systemd state on cloudservices1005 is CRITICAL: CRITICAL - degraded: The following units failed: labs-ip-alias-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:33:59] PROBLEM - Check systemd state on cloudservices1006 is CRITICAL: CRITICAL - degraded: The following units failed: labs-ip-alias-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:36:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-bastion-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:36:33] (SystemdUnitDown) firing: (2) The service unit labs-ip-alias-dump.service is in failed status on host cloudservices1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[16:59:07] Cloud-VPS (Quota-requests), MinT, Language-Team (Language-2023-October-December): Create large instance for MinT - https://phabricator.wikimedia.org/T352136 (Andrew) Hi! We would like to help, but it's hard for us to understand what you're asking here for. - If this is for something already running...
[16:59:50] Toolforge (Toolforge iteration 02), Patch-For-Review: Add `toolforge build quota` command - https://phabricator.wikimedia.org/T341068 (CodeReviewBot) sstefanova opened https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/62 api: add quota endpoint
[17:17:33] Cloud-VPS (Quota-requests), MinT, Language-Team (Language-2023-October-December): Create large instance for MinT - https://phabricator.wikimedia.org/T352136 (bd808) https://openstack-browser.toolforge.org/project/language is the Cloud VPS project that owns the named https://translate.wmcloud.org/ proxy.
[17:30:05] RECOVERY - Check systemd state on cloudservices1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:31:34] (SystemdUnitDown) firing: (2) The systemd unit labs-ip-alias-dump.service on node cloudservices1005 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[17:31:39] cloud-services-team: SystemdUnitDown - https://phabricator.wikimedia.org/T352323 (phaultfinder)
[17:33:05] PROBLEM - Check systemd state on cloudservices1006 is CRITICAL: CRITICAL - degraded: The following units failed: labs-ip-alias-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:10:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[18:19:58] (OpenstackAPIResponse) firing: Openstack API average response time is too high.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[18:36:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-bastion-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:30:05] RECOVERY - Check systemd state on cloudservices1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:31:34] (SystemdUnitDown) resolved: (2) The service unit labs-ip-alias-dump.service is in failed status on host cloudservices1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[20:36:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:37:45] PAWS: tofu state file to object storage - https://phabricator.wikimedia.org/T352164 (rook) Putting ` backend "s3" { access_key = secret_key = ... } ` In the config works, Though feeding them in with variables `tofu init -backend-config access_key="${ACCESS_KEY}" -backend-con...
[20:44:04] (SystemdUnitDown) resolved: (2) The systemd unit labs-ip-alias-dump.service on node cloudservices1005 has been failing for more than two hours.
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[20:46:03] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:47:11] RECOVERY - Check systemd state on cloudservices1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:09:39] Wikibugs, Phabricator: Case of wikibugs displaying unrelated user when Herald performed an action - https://phabricator.wikimedia.org/T116477 (Aklapper) Open→Invalid Using https://phabricator.wikimedia.org/conduit/method/maniphest.gettasktransactions/ with `ids` set to `["72869"]`, or https://pha...
[21:10:03] (TfInfraTestApplyFailed) firing: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[21:25:03] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[21:30:03] (TfInfraTestApplyFailed) resolved: Terraform failed to apply/create the resounces on tf-bastion - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTfInfraTestApplyFailed
[21:35:03] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[21:36:03] (InstanceDown) firing: Project toolsbeta instance toolsbeta-bastion-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[21:46:45] Cloud-VPS, cloud-services-team: Investigate new roles and policies in openstack Xena - https://phabricator.wikimedia.org/T276018 (Andrew)
Open→Resolved a:Andrew This is partially done, and partially blocked by upstream dithering. Either way, this task can be closed.
[21:47:24] Cloud-VPS, Data-Services, cloud-services-team, User-Marostegui: Investigate, adjust default access policies for Trove and trove-dashboard - https://phabricator.wikimedia.org/T281655 (Andrew) Open→Resolved a:Andrew
[21:47:30] Cloud-VPS, Data-Services, cloud-services-team (Kanban), User-Marostegui: [Feature request] Database as a Service (Trove) for Cloud VPS projects - https://phabricator.wikimedia.org/T212595 (Andrew)
[22:19:58] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse
[22:49:01] (OpenstackAPIResponse) resolved: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse