[00:07:28] FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tf-infra-test in project tf-infra-test - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[00:12:28] RESOLVED: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tf-infra-test in project tf-infra-test - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[01:34:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[01:46:18] Toolforge (Toolforge iteration 14): [builds-cli,builds-api] `build quota` fails if tool has no builds - https://phabricator.wikimedia.org/T353701#10083518 (Raymond_Ndibe) >>! In T353701#9887644, @Slst2020 wrote: > This is unlikely to be fixed even with new robot permissions. Only admin accounts can access sy...
[02:57:18] Toolforge-standards-committee: Adoption request for Yapperbot - https://phabricator.wikimedia.org/T361426#10083579 (DavidTornheim) Stalled→In progress I'm going to make this a priority. A user just reported that the Feedback Request Service has been shut down here: https://en.wikipedia.org/w/index.p...
[03:09:29] FIRING: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:19:29] RESOLVED: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:42:28] FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[04:05:22] Cloud Services Proposals, cloud-services-team, Toolforge: Decision Request: To strictly enforce semantic versioning rules for toolforge services' APIs or not - https://phabricator.wikimedia.org/T373072 (Raymond_Ndibe) NEW
[04:05:47] Cloud Services Proposals, cloud-services-team, Toolforge: Decision Request: To strictly enforce semantic versioning rules for toolforge services' APIs or not - https://phabricator.wikimedia.org/T373072#10083668 (Raymond_Ndibe)
[04:06:14] FIRING: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[04:06:14] FIRING: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[04:06:22] Cloud Services Proposals, cloud-services-team, Toolforge: Decision Request: To strictly enforce semantic versioning rules for toolforge services' APIs or not - https://phabricator.wikimedia.org/T373072#10083671 (Raymond_Ndibe)
[04:07:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[04:09:04] Cloud Services Proposals, cloud-services-team, Toolforge: Decision Request: To strictly enforce semantic versioning rules for toolforge services' APIs or not - https://phabricator.wikimedia.org/T373072#10083672 (Raymond_Ndibe)
[04:11:13] RESOLVED: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[04:11:14] RESOLVED: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[04:17:28] RESOLVED: PuppetSyncFailure: Failed to update Puppet repository /srv/git/labs/private on instance paws-puppetserver-1 in project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetSyncFailure
[04:21:35] Toolforge (Toolforge iteration 14): something is wrong with pre-commit on builds-api - https://phabricator.wikimedia.org/T372601#10083673 (Raymond_Ndibe) a:Raymond_Ndibe
[04:21:51] Toolforge (Toolforge iteration 14): something is wrong with pre-commit on golang repos - https://phabricator.wikimedia.org/T372601#10083674 (Raymond_Ndibe)
[04:22:20] Toolforge (Toolforge iteration 14): something is wrong with pre-commit on golang repos - https://phabricator.wikimedia.org/T372601#10083676 (Raymond_Ndibe) Open→Resolved
[04:24:29] Toolforge (Toolforge iteration 14): Possible error in jobs and cronjobs quotas in maintain-kubeusers - https://phabricator.wikimedia.org/T372720#10083677 (Raymond_Ndibe) >>! In T372720#10071887, @JJMC89 wrote: > The current quota is fine. 50 cronjobs can be scheduled but only 15 jobs can run concurrently. T...
[04:29:19] Toolforge: [jobs-cli] Add a new output format for toolforge jobs list command which returns the input command for scheduled jobs - https://phabricator.wikimedia.org/T356581#10083682 (Raymond_Ndibe) Open→Resolved
[04:29:40] Toolforge: [jobs-cli] Add a new output format for toolforge jobs list command which returns the input command for scheduled jobs - https://phabricator.wikimedia.org/T356581#10083681 (Raymond_Ndibe) closing because this has already been implemented and deployed. See the above linked task by Taavi for more...
[04:41:29] Toolforge: [maintain-harbor] Move to become a toolforge component - https://phabricator.wikimedia.org/T358225#10083687 (Raymond_Ndibe) If this becomes a toolforge component, I'm thinking about discarding the toolforge-jobs abstraction that we use here and scheduling maintain-harbor jobs directly on k8s throu...
[04:44:02] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge, Goal: [harbor] Deploy with Helm - https://phabricator.wikimedia.org/T356301#10083688 (Raymond_Ndibe) a:Raymond_Ndibe
[06:10:42] cloud-services-team, wikitech.wikimedia.org, Epic: Set up a bitu instance for codfw1dev - https://phabricator.wikimedia.org/T360795#10083697 (SLyngshede-WMF) @Andrew Sorry, this is still pending. I need to figure out what to do about the Redis dependency. The correct solution would be to build a Redi...
[06:27:56] FIRING: [2x] SystemdUnitDown: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[06:32:56] RESOLVED: [2x] SystemdUnitDown: The service unit wikitech_run_jobs.service is in failed status on host cloudweb1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[07:38:56] cloud-services-team, Cloud-VPS (Debian Buster Deprecation), Beta-Cluster-Infrastructure, Patch-For-Review: Remove or replace deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud (Buster deprecation) - https://phabricator.wikimedia.org/T370460#10083783 (Jgiannelos) Could it be something wr...
[07:50:29] cloud-services-team, Cloud-VPS (Debian Buster Deprecation), Beta-Cluster-Infrastructure, Patch-For-Review: Remove or replace deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud (Buster deprecation) - https://phabricator.wikimedia.org/T370460#10083799 (Jgiannelos) I think I found the issu...
[08:54:56] cloud-services-team: toolforge: prometheus server died - https://phabricator.wikimedia.org/T370143#10083948 (fnegri) Available RAM graphs for the past 24 hours: {F57285792} {F57285794} It goes down to zero very rapidly, there must be some event that triggers it but I have no idea what it could be.
[09:08:29] FIRING: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:13:29] RESOLVED: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:18:15] cloud-services-team: toolforge: prometheus server died - https://phabricator.wikimedia.org/T370143#10084013 (fnegri) ` [08:53:57] some cloud prometheus servers have started crashing with OOM errors, did you ever see something similar in prod? T370143 [08:53:58] T370143: toolforge: prome...
[09:25:49] Cloud-Services, Cloud-Services-Origin-User: Horizon nova generated ssh key blocks project usage - https://phabricator.wikimedia.org/T373082 (ABran-WMF) NEW The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832...
[09:26:02] Cloud-VPS, Cloud-Services-Origin-User: Horizon nova generated ssh key blocks project usage - https://phabricator.wikimedia.org/T373082#10084035 (ABran-WMF)
[09:32:41] Cloud-VPS, Cloud-Services-Origin-User: Horizon nova generated ssh key blocks project usage - https://phabricator.wikimedia.org/T373082#10084048 (fnegri) @Andrew maybe you have an idea of what's happening here? I tried creating a new VM in that project and under "key pair" I don't see any "Allocated" or "...
[09:36:06] Toolforge: Java application redeploys several times until it starts - https://phabricator.wikimedia.org/T372092#10084053 (Benjavalero) Open→Resolved a:Benjavalero I have moved some initial tasks to be performed just after the Tomcat server has started, and it seems the problem is now solved....
[09:59:41] FIRING: PrometheusRestarted: Prometheus/cloud restarted: beware monitoring artifacts. - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fcloud - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted
[10:22:56] Cloud-VPS, Cloud-Services-Origin-User: Horizon nova generated ssh key blocks project usage - https://phabricator.wikimedia.org/T373082#10084182 (fnegri) I created 4 new VMs and both me and @ABran-WMF can successfully ssh to those. We can also ssh to the `test-wo` and `test-with` instances that Arnaud cre...
[10:24:41] RESOLVED: PrometheusRestarted: Prometheus/cloud restarted: beware monitoring artifacts. - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_was_restarted - https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?var-datasource=codfw%20prometheus%2Fcloud - https://alerts.wikimedia.org/?q=alertname%3DPrometheusRestarted
[10:25:18] Cloud-VPS, Cloud-Services-Origin-User: Horizon nova generated ssh key blocks project usage - https://phabricator.wikimedia.org/T373082#10084183 (fnegri) p:Triage→Low Setting priority to Low as Arnaud is no longer blocked by this.
[11:10:28] FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:23:28] FIRING: InstanceDown: Project cloudinfra instance cloudinfra-cloudvps-puppetserver-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:48:28] RESOLVED: InstanceDown: Project cloudinfra instance cloudinfra-cloudvps-puppetserver-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:50:13] cloud-services-team, Openstack-Magnum: Hide fedora images from human Horizon users - https://phabricator.wikimedia.org/T356547#10084394 (fnegri) This is what a normal user sees (captured by @ABran-WMF as part of a different task): {F57286121}
[11:56:04] Cloud-VPS: cloudinfra-cloudvps-puppetserver-1 became unresponsive - https://phabricator.wikimedia.org/T373092 (fnegri) NEW
[11:56:09] Cloud-VPS: cloudinfra-cloudvps-puppetserver-1 became unresponsive - https://phabricator.wikimedia.org/T373092#10084414 (fnegri) Open→Resolved
[12:03:43] Quarry: Update cluster to 1.26 - https://phabricator.wikimedia.org/T373093 (rook) NEW
[12:05:22] cloud-services-team (FY2024/2025-Q1-Q2): toolforge: prometheus server died - https://phabricator.wikimedia.org/T370143#10084468 (fnegri) tools-prometheus-7 was down for 1 hour. Grafana shows a spike in disk activity before it stopped reporting. {F57286135} Soft-rebooting it from Horizon brought it back onl...
[12:05:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[12:05:46] cloud-services-team (FY2024/2025-Q1-Q2): toolforge: prometheus server died - https://phabricator.wikimedia.org/T370143#10084471 (fnegri) Open→In progress
[12:06:01] Cloud-Services: Redeploy bastion for tf-infra-test in codfw1dev - https://phabricator.wikimedia.org/T373094 (rook) NEW The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific pro...
[12:06:56] VPS-Projects: Redeploy bastion for tf-infra-test in codfw1dev - https://phabricator.wikimedia.org/T373094#10084489 (rook)
[12:07:26] cloud-services-team (FY2024/2025-Q1-Q2): toolforge: prometheus server died - https://phabricator.wikimedia.org/T370143#10084491 (fnegri) We're already running the latest version available in our Debian repository: ` fnegri@apt1002:~$ sudo -i reprepro ls prometheus prometheus | 2.24.1+ds-1+wmf2 | buster-wi...
[12:11:02] cloud-services-team (FY2024/2025-Q1-Q2): toolforge: prometheus server died - https://phabricator.wikimedia.org/T370143#10084519 (fnegri) Maybe just a coincidence, but we had another instance becoming unresponsive today, not related to Prometheus: {T373092} There was also an OOM error yesterday on `tools-db-...
[12:25:58] Cloud-VPS, Cloud-Services-Origin-User: Horizon nova generated ssh key blocks project usage - https://phabricator.wikimedia.org/T373082#10084621 (ABran-WMF)
[12:26:08] Cloud-VPS, Cloud-Services-Origin-User: Horizon nova generated ssh key undeletable - https://phabricator.wikimedia.org/T373082#10084624 (ABran-WMF)
[12:26:40] Cloud-VPS, Cloud-Services-Origin-User: Horizon nova generated ssh key undeletable - https://phabricator.wikimedia.org/T373082#10084629 (ABran-WMF) >>! In T373082#10084183, @fnegri wrote: > Setting priority to Low as Arnaud is no longer blocked by this. edited the task to match the current state
[12:26:54] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Goal: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424#10084630 (fnegri) All hosts are now running Bookworm, I will keep this task open until I've also upgraded MariaDB to version 10.6.19. @Marostegui what is the p...
[12:29:30] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services, Goal: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424#10084648 (Marostegui) What I normally do is: - Stop slave on each instance - Stop each instance's daemon (never all of them at the same time): `systemctl stop m...
[12:48:39] FIRING: [3x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[12:53:39] RESOLVED: [3x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[13:02:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-26 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[13:24:12] cloud-services-team (FY2024/2025-Q1-Q2), Quarry: Allow Quarry to query its own database - https://phabricator.wikimedia.org/T367415#10084871 (github-toolforge-bot) dhinus closed https://github.com/toolforge/quarry/pull/61
[13:24:19] dhinus closed https://github.com/toolforge/quarry/pull/61
[13:34:02] cloud-services-team (FY2024/2025-Q1-Q2), Data-Services: [wikireplicas] frequent replag spikes in clouddb hosts - https://phabricator.wikimedia.org/T367778#10084899 (fnegri)
[13:35:47] cloud-services-team (FY2024/2025-Q1-Q2), Quarry: Allow Quarry to query its own database - https://phabricator.wikimedia.org/T367415#10084893 (fnegri) In progress→Resolved This is working and you can now type `quarry` or `quarry_p` in the "db name" input. `quarry_p` will always be used even if...
[13:44:54] superset.wmcloud.org: Store tofu state in Object Storage - https://phabricator.wikimedia.org/T373111 (fnegri) NEW
[13:52:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-26 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[14:22:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-26 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[14:27:07] dhinus opened https://github.com/toolforge/quarry/pull/66
[14:33:07] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE, Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10085178 (Jclark-ctr) Open→Resolved verified cable and link lights
[14:35:18] cloud-services-team (FY2024/2025-Q1-Q2), superset.wmcloud.org: Allow Superset to query ToolsDB public databases - https://phabricator.wikimedia.org/T367393#10085187 (fnegri) > to solve the problem of manually listing the dbs, we could just call the db "ToolsDB" in Superset, and ask users to start all the...
[14:39:02] cloud-services-team (FY2024/2025-Q1-Q2), superset.wmcloud.org: Allow Superset to query ToolsDB public databases - https://phabricator.wikimedia.org/T367393#10085201 (fnegri) There's one last quirk before we can resolve this task, I updated the database name to `ToolsDB`, but SuperSet has cached the old n...
[14:43:45] FIRING: ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_toolserver_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[14:47:22] cloud-services-team (FY2024/2025-Q1-Q2), superset.wmcloud.org: Allow Superset to query ToolsDB public databases - https://phabricator.wikimedia.org/T367393#10085222 (KCVelaga_WMF) @fnegri `s55986__automod_metrics_p` was just for testing, there is no real data in there. If it helps, I can delete it.
[14:48:45] RESOLVED: [2x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[14:49:27] (CR) FNegri: [C:+1] toolforge.component.deploy: remove the k8s prefix [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1059890 (owner: David Caro)
[14:50:07] cloud-services-team (FY2024/2025-Q1-Q2), superset.wmcloud.org: Allow Superset to query ToolsDB public databases - https://phabricator.wikimedia.org/T367393#10085244 (rook) How does it look now?
[14:55:24] Toolforge: [toolforge] [envvars] TOOL_REPLICA_USER and TOOL_TOOLSDB_USER missing for new tool - https://phabricator.wikimedia.org/T372640#10085269 (fnegri) @dcaro apologies but I didn't find the time to do further investigations this week.
[15:02:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Kubernetes worker tools-k8s-worker-nfs-26 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[15:09:51] cloud-services-team (FY2024/2025-Q1-Q2): toolforge: prometheus server died - https://phabricator.wikimedia.org/T370143#10085293 (Andrew) This happened again last night with -6, syslog was totally silent during the outage as though the VM was frozen.
[15:11:28] FIRING: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:20:42] cloud-services-team (FY2024/2025-Q1-Q2), superset.wmcloud.org: Allow Superset to query ToolsDB public databases - https://phabricator.wikimedia.org/T367393#10085335 (fnegri) @rook it's fixed! Thank you :)
[15:21:20] cloud-services-team (FY2024/2025-Q1-Q2), superset.wmcloud.org: Allow Superset to query ToolsDB public databases - https://phabricator.wikimedia.org/T367393#10085336 (fnegri) @KCVelaga_WMF the UI issue is fixed, but it's always good to clean up unused dbs, so feel free to delete it at your convenience!
[15:21:23] cloud-services-team (FY2024/2025-Q1-Q2), superset.wmcloud.org: Allow Superset to query ToolsDB public databases - https://phabricator.wikimedia.org/T367393#10085337 (fnegri) In progress→Resolved
[15:28:13] cloud-services-team, Cloud-VPS (Debian Buster Deprecation), Beta-Cluster-Infrastructure, Patch-For-Review: Remove or replace deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud (Buster deprecation) - https://phabricator.wikimedia.org/T370460#10085369 (Eevans) I found this in the logs: `...
[15:45:40] PAWS, Security: update ingress-nginx for CVE-2024-7646 - https://phabricator.wikimedia.org/T373124 (rook) NEW
[15:45:50] PAWS, Security: update ingress-nginx for CVE-2024-7646 - https://phabricator.wikimedia.org/T373124#10085445 (rook)
[16:00:09] cloud-services-team, Cloud-VPS (Debian Buster Deprecation), Beta-Cluster-Infrastructure, Patch-For-Review: Remove or replace deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud (Buster deprecation) - https://phabricator.wikimedia.org/T370460#10085514 (Eevans) > @Jgiannelos is there no wa...
[16:15:29] cloud-services-team (FY2024/2025-Q1-Q2), superset.wmcloud.org: Allow Superset to query ToolsDB public databases - https://phabricator.wikimedia.org/T367393#10085629 (github-toolforge-bot) dhinus closed https://github.com/toolforge/superset-deploy/pull/27
[16:15:43] dhinus closed https://github.com/toolforge/superset-deploy/pull/27
[16:16:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[16:36:28] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudweb.set_maintenance (T369044)
[16:36:34] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044
[16:38:30] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudweb.set_maintenance (exit_code=0) (T369044)
[16:39:04] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T369044)
[16:44:20] cloud-services-team (FY2024/2025-Q1-Q2), Quarry: Support queries against Quarry's own database and ToolsDB - https://phabricator.wikimedia.org/T151158#10085688 (fnegri) All the subtasks are now completed and you can use Quarry to query ToolsDB's public dbs and Quarry's internal db. I will keep this task...
[16:50:21] PROBLEM - Host cloudservices1005 is DOWN: PING CRITICAL - Packet loss = 100%
[16:52:07] RECOVERY - Host cloudservices1005 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[16:52:09] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudservices1005.eqiad.wmnet' (T369044)
[16:52:15] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044
[16:52:29] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T369044)
[16:53:13] PROBLEM - Check DNS auth via UDP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[16:53:13] PROBLEM - Check DNS auth via UDP of tools-puppetserver-01.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[16:54:05] PROBLEM - Check DNS auth via TCP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[16:54:11] PROBLEM - Bird Internet Routing Daemon on cloudservices1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[16:54:19] PROBLEM - Check DNS auth via TCP of login.toolforge.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[16:54:19] PROBLEM - Check DNS auth via UDP of www.wmcloud.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[16:54:57] RECOVERY - Check DNS auth via TCP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.035 seconds response time (k8s.svc.tools.eqiad1.wikimedia.cloud. 300 IN A 172.16.6.113) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[16:55:03] RECOVERY - Check DNS auth via UDP of tools-puppetserver-01.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.020 seconds response time (tools-puppetserver-01.tools.eqiad1.wikimedia.cloud. 60 IN A 172.16.3.13) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[16:55:03] RECOVERY - Check DNS auth via UDP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.024 seconds response time (k8s.svc.tools.eqiad1.wikimedia.cloud. 300 IN A 172.16.6.113) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[16:55:09] RECOVERY - Check DNS auth via TCP of login.toolforge.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.026 seconds response time (login.toolforge.org. 3600 IN CNAME bastion.toolforge.org.) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[16:55:09] RECOVERY - Check DNS auth via UDP of www.wmcloud.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.026 seconds response time (www.wmcloud.org. 3600 IN CNAME wmcloud.org.) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[16:55:11] RECOVERY - Bird Internet Routing Daemon on cloudservices1005 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[17:02:39] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudservices1005.eqiad.wmnet' (T369044)
[17:02:45] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044
[17:04:13] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1006.eqiad.wmnet' (T369044)
[17:08:53] cloud-services-team, Toolforge: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.26 - https://phabricator.wikimedia.org/T327025#10085754 (fnegri)
[17:11:28] FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:13:24] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1006.eqiad.wmnet' (T369044)
[17:13:30] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044
[17:15:26] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1006.eqiad.wmnet' (T369044)
[17:16:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:18:28] FIRING: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:25:09] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudservices1006.eqiad.wmnet' (T369044)
[17:25:15] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044
[17:25:40] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudcontrol1005.eqiad.wmnet' (T369044)
[17:28:17] vivian-rook opened https://github.com/toolforge/paws/pull/449
[17:38:20] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudcontrol1005.eqiad.wmnet' (T369044)
[17:38:25] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044
[17:41:43] PAWS: PR not posting to phabricator - https://phabricator.wikimedia.org/T373134 (rook) NEW
[17:42:49] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudcontrol1005.eqiad.wmnet' (T369044)
[17:43:19] vivian-rook closed https://github.com/toolforge/paws/pull/449
[17:49:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[17:51:10] FIRING: GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[17:51:22] FIRING: [14x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[17:52:44] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudcontrol1005.eqiad.wmnet' (T369044)
[17:52:55] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044
[17:54:56] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudcontrol1006.eqiad.wmnet' (T369044)
[17:56:10] RESOLVED: GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch
[17:56:22] FIRING: [14x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[18:01:22] RESOLVED: [14x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[18:10:48] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudcontrol1006.eqiad.wmnet' (T369044)
[18:10:54] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044
[18:11:26] RESOLVED: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[18:11:37] FIRING: [28x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[18:11:52] FIRING: [28x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[18:12:59] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudcontrol1007.eqiad.wmnet' (T369044)
[18:16:37] RESOLVED: [14x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[18:18:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed
status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:23:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [18:28:10] FIRING: GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [18:28:52] FIRING: [24x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [18:31:37] FIRING: [19x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [18:32:29] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudcontrol1007.eqiad.wmnet' (T369044) [18:32:34] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044 [18:33:00] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudnet1006.eqiad.wmnet' (T369044) [18:33:10] FIRING: [2x] GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - 
https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [18:53:49] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudnet1005.eqiad.wmnet' (T369044) [18:53:54] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044 [18:54:15] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [18:57:20] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [19:00:14] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1052' (T369044) [19:00:14] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=99) on host 'cloudvirt1052' (T369044) [19:00:19] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044 [19:00:38] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1033' (T369044) [19:00:38] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=99) on host 'cloudvirt1033' (T369044) [19:01:22] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1033.eqiad.wmnet' (T369044) [19:08:52] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1033.eqiad.wmnet' (T369044) [19:08:58] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044 [19:13:58] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1035' (T369044) [19:13:58] !log 
andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=99) on host 'cloudvirt1035' (T369044) [19:13:59] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1037' (T369044) [19:14:00] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=99) on host 'cloudvirt1037' (T369044) [19:14:00] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1039' (T369044) [19:14:01] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=97) on host 'cloudvirt1039' (T369044) [19:14:03] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044 [19:14:31] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1035.eqiad.wmnet' (T369044) [19:15:02] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudweb.unset_maintenance (T369044) [19:17:04] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudweb.unset_maintenance (exit_code=0) (T369044) [19:20:41] FIRING: CloudVPSDesignateLeaks: Detected 8 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [19:21:39] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1035.eqiad.wmnet' (T369044) [19:21:39] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1037.eqiad.wmnet' (T369044) [19:21:49] T369044: Upgrade cloud-vps openstack to version 'Caracal' - 
https://phabricator.wikimedia.org/T369044 [19:28:41] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1037.eqiad.wmnet' (T369044) [19:28:42] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1039.eqiad.wmnet' (T369044) [19:28:49] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044 [19:29:12] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [19:29:36] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [19:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 8 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [19:35:02] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1039.eqiad.wmnet' (T369044) [19:35:03] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1034.eqiad.wmnet' (T369044) [19:35:07] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044 [19:42:40] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1034.eqiad.wmnet' (T369044) [19:42:41] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1036.eqiad.wmnet' (T369044) [19:42:46] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044 [19:49:25] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook 
wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1036.eqiad.wmnet' (T369044) [19:49:26] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1040.eqiad.wmnet' (T369044) [19:49:30] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044 [19:56:02] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1040.eqiad.wmnet' (T369044) [19:56:03] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1045.eqiad.wmnet' (T369044) [19:56:08] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044 [20:02:52] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1045.eqiad.wmnet' (T369044) [20:02:53] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1043.eqiad.wmnet' (T369044) [20:02:57] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044 [20:05:44] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [20:06:59] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [20:09:21] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1043.eqiad.wmnet' (T369044) [20:09:22] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1046.eqiad.wmnet' (T369044) [20:09:26] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044 [20:15:56] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook 
wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1046.eqiad.wmnet' (T369044) [20:15:57] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1041.eqiad.wmnet' (T369044) [20:16:01] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044 [20:16:22] FIRING: HAProxyBackendUnavailable: HAProxy service designate-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [20:21:22] RESOLVED: HAProxyBackendUnavailable: HAProxy service designate-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [20:22:33] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1041.eqiad.wmnet' (T369044) [20:22:34] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1044.eqiad.wmnet' (T369044) [20:22:39] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044 [20:29:17] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1044.eqiad.wmnet' (T369044) [20:29:18] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1042.eqiad.wmnet' (T369044) [20:29:23] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044 [20:36:25] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1042.eqiad.wmnet' (T369044) [20:36:25] !log 
andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1038.eqiad.wmnet' (T369044) [20:36:30] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044 [20:43:46] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1038.eqiad.wmnet' (T369044) [20:43:47] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1047.eqiad.wmnet' (T369044) [20:43:52] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044 [20:46:08] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=97) on host 'cloudvirt1047.eqiad.wmnet' (T369044) [21:03:16] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [21:03:39] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [21:14:17] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudweb.set_maintenance (T369044) [21:14:23] T369044: Upgrade cloud-vps openstack to version 'Caracal' - https://phabricator.wikimedia.org/T369044 [21:16:19] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudweb.set_maintenance (exit_code=0) (T369044) [21:43:21] 06Toolforge-standards-committee: Adoption request for Yapperbot - https://phabricator.wikimedia.org/T361426#10086509 (10bd808) >>! In T361426#10083579, @DavidTornheim wrote: > I'm going to make this a priority. A user just reported that the Feedback Request Service has been shut down here: > https://en.wikipedi... [21:44:15] 06Toolforge-standards-committee: Adoption request for Yapperbot - https://phabricator.wikimedia.org/T361426#10086510 (10bd808) >>! In T361426#10086509, @bd808 wrote: > Someone wrote to the page that causes the bot to halt. 
The edit looks like a vandal: https://en.wikipedia.org/w/index.php?title=User:Yapperbot/ki... [21:49:49] 10PAWS, 07SecTeam-Processed, 07Security, 07Vuln-VulnComponent: update ingress-nginx for CVE-2024-7646 - https://phabricator.wikimedia.org/T373124#10086518 (10sbassett) [21:51:17] 10PAWS, 07SecTeam-Processed, 07Security, 07Vuln-VulnComponent: update ingress-nginx for CVE-2024-7646 - https://phabricator.wikimedia.org/T373124#10086519 (10sbassett) p:05Triage→03Medium [21:51:44] 10PAWS, 07SecTeam-Processed, 07Security, 07Vuln-VulnComponent: update ingress-nginx for CVE-2024-7646 - https://phabricator.wikimedia.org/T373124#10086534 (10sbassett) p:05Medium→03High [23:08:28] FIRING: InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [23:12:00] FIRING: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures [23:12:06] 06cloud-services-team: NovafullstackSustainedFailures The automated tests were unable to create, provision and decommission a VM in the last 5h - https://phabricator.wikimedia.org/T373155 (10phaultfinder) 03NEW [23:13:28] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [23:20:41] FIRING: CloudVPSDesignateLeaks: Detected 4 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [23:28:28] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [23:38:28] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [23:41:05] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack [23:41:38] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) [23:54:00] FIRING: OpenstackAPIResponse: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [23:59:22] FIRING: [3x] HAProxyBackendUnavailable: HAProxy service neutron-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable