[01:52:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [02:05:19] (update) ahecht: Draft: Cache database queries [toolforge-repos/afdstats] - https://gitlab.wikimedia.org/toolforge-repos/afdstats/-/merge_requests/3 [02:07:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [03:34:42] Tool-fa-speed: [Roblox] Welcome screen becomes undissmissable - https://phabricator.wikimedia.org/T390809 (derenrich) NEW [03:36:12] Tool-fa-speed: [Roblox] Welcome screen becomes undissmissable - https://phabricator.wikimedia.org/T390809#10702155 (derenrich) [04:07:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [04:27:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - 
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [05:26:41] Tool-ldap: Display WMF cluster permissions granted to a Developer account - https://phabricator.wikimedia.org/T389885#10702220 (Legoktm) a:Legoktm Let's do it. Here's what I rigged up locally (will push to GitLab in a bit). {F58963666} Still need to create the group pages. For the compare interface, I... [05:27:19] Tool-ldap: Display WMF cluster permissions granted to a Developer account - https://phabricator.wikimedia.org/T389885#10702223 (Legoktm) Also, is "puppet groups" a good label for this? Is there a better term? [08:01:14] Tool-ldap: Display WMF cluster permissions granted to a Developer account - https://phabricator.wikimedia.org/T389885#10702404 (taavi) I've often used "admin groups" based on the name of the Puppet module handling that. [08:13:35] cloud-services-team, Cloud-VPS: Options/thoughts for faster VM provisioning - https://phabricator.wikimedia.org/T390822 (fgiunchedi) NEW [08:21:38] cloud-services-team, Cloud-VPS: Options/thoughts for faster VM provisioning - https://phabricator.wikimedia.org/T390822#10702507 (aborrero) a:Andrew Good idea, maybe @Andrew will be interested in looking into this. [08:21:46] cloud-services-team, Cloud-VPS: Options/thoughts for faster VM provisioning - https://phabricator.wikimedia.org/T390822#10702509 (aborrero) p:Triage→Medium [08:36:51] Cloud-Services: On WMCS linux-perf must be installed from backports to be in sync with linux-image package - https://phabricator.wikimedia.org/T390824 (hashar) NEW The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile... 
[08:38:52] Cloud-Services: On WMCS linux-perf must be installed from backports to be in sync with linux-image package - https://phabricator.wikimedia.org/T390824#10702572 (hashar) [08:43:37] Cloud-Services: On WMCS linux-perf must be installed from backports to be in sync with linux-image package - https://phabricator.wikimedia.org/T390824#10702589 (hashar) The instance that got recently created run on 5.10.0 while the older are on 6.1.0: ` $ sudo cumin --force 'name:docker' 'uname -r' 26 hosts... [08:50:09] Cloud-Services: On WMCS linux-perf must be installed from backports to be in sync with linux-image package - https://phabricator.wikimedia.org/T390824#10702610 (hashar) On integration-agent-docker-1044: ` reboot system boot 6.1.0-0.deb11.7- Tue Jul 4 14:30:48 2023 - Tue Jul 4 14:32:14 2023 (00:01) hash... [09:01:58] cloud-services-team, Data-Services: Remove the compatibility layer of block schema in wikireplicas - https://phabricator.wikimedia.org/T390767#10702661 (Aklapper) [09:05:21] cloud-services-team, Cloud-VPS: On WMCS linux-perf must be installed from backports to be in sync with linux-image package - https://phabricator.wikimedia.org/T390824#10702688 (taavi) [09:18:48] cloud-services-team, Data-Services: Remove the compatibility layer of block schema in wikireplicas - https://phabricator.wikimedia.org/T390767#10702742 (fnegri) @Ladsgroup +1 from me. [09:20:08] cloud-services-team, Cloud-VPS: On WMCS linux-perf must be installed from backports to be in sync with linux-image package - https://phabricator.wikimedia.org/T390824#10702747 (hashar) a:hashar I'll clean them up. 
[09:52:02] (update) aborrero: tofu-infra: refactor flavors into newer shape [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/163 (https://phabricator.wikimedia.org/T375283) [09:55:33] (update) aborrero: tofu-infra: refactor flavors into newer shape [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/163 (https://phabricator.wikimedia.org/T375283) [09:58:05] cloud-services-team, Toolforge, Wikimedia-Hackathon-2025: [Session] Introducing and exploring Toolforge UI with prospective users - https://phabricator.wikimedia.org/T383149#10702940 (Sarai-WMF) [09:59:39] (update) aborrero: tofu-infra: refactor flavors into newer shape [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/163 (https://phabricator.wikimedia.org/T375283) [10:00:30] cloud-services-team, Toolforge, Wikimedia-Hackathon-2025: [Session] Introducing and exploring Toolforge UI with prospective users - https://phabricator.wikimedia.org/T383149#10702944 (Sarai-WMF) [10:03:11] cloud-services-team, Toolforge, Wikimedia-Hackathon-2025: [Session] Introducing and exploring Toolforge UI with prospective users - https://phabricator.wikimedia.org/T383149#10702946 (Sarai-WMF) Thank you, @debt! What about Friday at 4pm, tentatively? @dcaro is unfortunately on sick leave right now,... 
[10:03:51] (update) aborrero: tofu-infra: refactor flavors into newer shape [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/163 (https://phabricator.wikimedia.org/T375283) [10:27:05] (update) aborrero: tofu-infra: refactor flavors into newer shape [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/163 (https://phabricator.wikimedia.org/T375283) [10:28:24] (update) aborrero: tofu-infra: refactor flavors into newer shape [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/163 (https://phabricator.wikimedia.org/T375283) [10:34:59] cloud-services-team, Dumps-Generation, Patch-For-Review: Restructure rsyncs of dumps to the labstore boxes - https://phabricator.wikimedia.org/T254856#10703080 (BTullis) →Duplicate dup:T389784 [10:35:51] cloud-services-team, Dumps-Generation, Patch-For-Review: Restructure rsyncs of dumps to the labstore boxes - https://phabricator.wikimedia.org/T254856#10703086 (BTullis) Closing and merging into {T389784}, which is part of a wider project to move dumps to Airflow and Kubernetes. We will be restru... [10:49:48] cloud-services-team, Cloud-VPS: On WMCS linux-perf must be installed from backports to be in sync with linux-image package - https://phabricator.wikimedia.org/T390824#10703142 (hashar) Open→Resolved This is the runbook I have been using: ` # Find grub entries based on https://wiki.debian.org... 
[11:22:31] cloud-services-team, Toolforge: Environment variables are not being passed - https://phabricator.wikimedia.org/T390845 (Nokib_Sarkar) NEW [11:23:36] cloud-services-team, Toolforge: Environment variables are not being passed - https://phabricator.wikimedia.org/T390845#10703211 (Nokib_Sarkar) [11:29:11] cloud-services-team, Toolforge: Environment variables are not being passed - https://phabricator.wikimedia.org/T390845#10703221 (Nokib_Sarkar) [11:33:14] cloud-services-team, Toolforge: Environment variables are not being passed - https://phabricator.wikimedia.org/T390845#10703230 (aborrero) the procfile may not support setting envvars like that. If that's true, I see two options: * use the toolforge envvars, they work fine as you discovered * wrap the ca... [11:35:37] (update) aborrero: tofu-infra: refactor flavors into newer shape [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/163 (https://phabricator.wikimedia.org/T375283) [11:42:35] (open) aborrero: flavors: move eqiad1 flavors [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/166 (https://phabricator.wikimedia.org/T375283) [11:43:07] (update) aborrero: flavors: move eqiad1 flavors [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/166 (https://phabricator.wikimedia.org/T375283) [11:43:36] cloud-services-team, Cloud-VPS, serviceops: OOM livelock stalls - https://phabricator.wikimedia.org/T358634#10703281 (jijiki) [11:45:54] (update) aborrero: flavors: move eqiad1 flavors [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/166 (https://phabricator.wikimedia.org/T375283) [11:46:22] cloud-services-team, Cloud-VPS, serviceops: OOM livelock stalls - 
https://phabricator.wikimedia.org/T358634#10703290 (jijiki) Open→Stalled [11:48:15] (update) aborrero: flavors: move eqiad1 flavors [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/166 (https://phabricator.wikimedia.org/T375283) [12:02:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [12:17:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [12:37:30] !log fnegri@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-48 [12:42:52] !log fnegri@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-48 [12:53:17] cloud-services-team, Toolforge: Support hosting Rust tools on Toolforge - https://phabricator.wikimedia.org/T194953#10703518 (fnegri) Open→Resolved a:fnegri > Anything left to do here? I'll take "zero comments in one year" as a no :) If additional features are needed, please open more s... 
[12:57:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-48 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [12:57:22] cloud-services-team, Toolforge: Environment variables are not being passed - https://phabricator.wikimedia.org/T390845#10703526 (fnegri) Our [Rust tutorial](https://wikitech.wikimedia.org/wiki/Help:Toolforge/Building_container_images/My_first_Buildpack_Rust_tool#Step_2:_Create_a_basic_Rocket_webservice)... [13:54:28] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudweb.set_maintenance (T381499) [13:54:34] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [13:56:29] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudweb.set_maintenance (exit_code=0) (T381499) [13:57:12] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:00:32] (update) raymond-ndibe: [jobs-api] custom resource definition deployment templates [repos/cloud/toolforge/jobs-api] (split_logic_from_api) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/101 (https://phabricator.wikimedia.org/T359650) [14:04:13] PROBLEM - Host cloudservices1005 is DOWN: PING CRITICAL - Packet loss = 100% [14:04:49] cloud-services-team, Toolforge, Epic: [jobs-api] add -l|--last to toolforge jobs logs ... 
- https://phabricator.wikimedia.org/T388088#10703925 (aborrero) p:Triage→Medium [14:04:57] cloud-services-team, Cloud-VPS, serviceops: OOM livelock stalls - https://phabricator.wikimedia.org/T358634#10703926 (Andrew) p:Triage→Medium [14:05:03] cloud-services-team, Toolforge, Epic: [jobs-api] jobs-api should be able to read webservices started with toolforge webservice - https://phabricator.wikimedia.org/T388090#10703927 (aborrero) p:Triage→Medium [14:05:15] cloud-services-team, Toolforge, Epic: [jobs-api] add toolforge jobs shell feature - https://phabricator.wikimedia.org/T388091#10703928 (joanna_borun) p:Triage→Medium [14:05:23] cloud-services-team, Toolforge, Epic: [jobs-api] expose jobs-api continuous jobs to the internet via `toolname.toolforge.org`, just like webservice - https://phabricator.wikimedia.org/T388092#10703929 (aborrero) p:Triage→Medium [14:05:35] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:05:41] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [14:05:51] RECOVERY - Host cloudservices1005 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [14:06:04] cloud-services-team: PuppetDisabled Puppet disabled on cloudbackup1004:9100 - https://phabricator.wikimedia.org/T388310#10703934 (Andrew) Open→Resolved a:Andrew This was me, and was re-enabled (and is now disabled again for the moment while I upgrade) [14:06:36] cloud-services-team, Toolforge: Refactor wmcs-k8s-metrics component - https://phabricator.wikimedia.org/T388382#10703938 (joanna_borun) p:Triage→Medium [14:06:37] cloud-services-team: NovafullstackSustainedFailures Novafullstack tests have been failing for more than 5hours in eqiad - https://phabricator.wikimedia.org/T389701#10703939 (aborrero) Open→Resolved a:aborrero [14:06:51] 
!log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1006.eqiad.wmnet' (T381499) [14:06:59] PROBLEM - Check DNS auth via TCP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:06:59] PROBLEM - Check DNS auth via TCP of tools-puppetserver-01.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:07:00] PROBLEM - Check DNS auth via UDP of tools-puppetserver-01.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:07:07] cloud-services-team: PuppetFailure Puppet has failed on cloudcontrol2004-dev:9100 - https://phabricator.wikimedia.org/T388676#10703944 (aborrero) Open→Resolved a:aborrero [14:07:19] cloud-services-team: PuppetFailure Puppet has failed on cloudcontrol1005:9100 - https://phabricator.wikimedia.org/T389793#10703946 (aborrero) Open→Resolved a:aborrero [14:07:49] PROBLEM - Check DNS auth via UDP of www.wmcloud.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:07:59] PROBLEM - Bird Internet Routing Daemon on cloudservices1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:08:19] PROBLEM - Check if anycast-healthchecker and all 
configured threads are running on cloudservices1005 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [14:08:29] PROBLEM - Check DNS auth via UDP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:08:29] PROBLEM - Check DNS auth via UDP of login.toolforge.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:08:40] RECOVERY - Check DNS auth via UDP of www.wmcloud.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.024 seconds response time (www.wmcloud.org. 3600 IN CNAME wmcloud.org.) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:08:48] RECOVERY - Check DNS auth via TCP of tools-puppetserver-01.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.029 seconds response time (tools-puppetserver-01.tools.eqiad1.wikimedia.cloud. 60 IN A 172.16.3.13) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:08:48] RECOVERY - Check DNS auth via TCP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.032 seconds response time (k8s.svc.tools.eqiad1.wikimedia.cloud. 
300 IN A 172.16.6.113) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:08:50] RECOVERY - Check DNS auth via UDP of tools-puppetserver-01.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.033 seconds response time (tools-puppetserver-01.tools.eqiad1.wikimedia.cloud. 60 IN A 172.16.3.13) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:08:58] RECOVERY - Bird Internet Routing Daemon on cloudservices1005 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:09:18] RECOVERY - Check if anycast-healthchecker and all configured threads are running on cloudservices1005 is OK: OK: UP (pid=3874) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [14:09:18] RECOVERY - Check DNS auth via UDP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.065 seconds response time (k8s.svc.tools.eqiad1.wikimedia.cloud. 300 IN A 172.16.6.113) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:09:20] RECOVERY - Check DNS auth via UDP of login.toolforge.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.043 seconds response time (login.toolforge.org. 3600 IN CNAME bastion.toolforge.org.) 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:09:45] cloud-services-team (FY2024/2025-Q3-Q4), Data-Services: [wikireplicas] Create views for new wiki tlwikisource - https://phabricator.wikimedia.org/T388657#10703972 (fnegri) p:Triage→Medium a:fnegri [14:09:58] cloud-services-team, Toolforge: [jobs-api] Split the `*Job` API models into three - https://phabricator.wikimedia.org/T390136#10703979 (joanna_borun) p:Triage→Medium [14:10:20] cloud-services-team, Toolforge: [jobs-api] Introduce deprecation metrics - https://phabricator.wikimedia.org/T390137#10703982 (joanna_borun) p:Triage→Medium [14:11:05] cloud-services-team: TooManyCloudvirtsDown # page Reduced availability for CloudVPS eqiad - https://phabricator.wikimedia.org/T390183#10703983 (Andrew) Open→Resolved a:Andrew [14:11:07] cloud-services-team: CephClusterInUnknown # page Ceph cluster in eqiad is in unknown status - https://phabricator.wikimedia.org/T390184#10703985 (Andrew) Open→Resolved a:Andrew [14:11:20] cloud-services-team: HAProxyServiceUnavailable - https://phabricator.wikimedia.org/T390185#10704003 (Andrew) Open→Resolved a:Andrew fired during network outage [14:11:41] cloud-services-team, Toolforge: [toolforge] increase worker sizes in tools - https://phabricator.wikimedia.org/T390228#10704006 (joanna_borun) p:Triage→Medium [14:11:42] cloud-services-team: KernelErrors Server cloudcephmon1006 logged kernel errors - https://phabricator.wikimedia.org/T390198#10704008 (Andrew) Open→Resolved a:Andrew fired during network outage [14:11:45] cloud-services-team: RabbitmqNetworkPartition A Rabbitmq Network partition has been detected. 1 hosts marked as partitioned. 
- https://phabricator.wikimedia.org/T390190#10704012 (Andrew) Open→Resolved a:Andrew fired during network outage [14:11:51] cloud-services-team: CephSlowOps Ceph cluster in eqiad has 166621 slow ops - https://phabricator.wikimedia.org/T390188#10704016 (Andrew) Open→Resolved a:Andrew fired during network outage [14:11:56] cloud-services-team: MetricsinfraAlertmanagerDown Metricsinfra alertmanager is unreachable # page - https://phabricator.wikimedia.org/T390187#10704020 (Andrew) Open→Resolved a:Andrew fired during network outage [14:12:15] cloud-services-team: CloudVirtDown - https://phabricator.wikimedia.org/T390182#10704027 (Andrew) Open→Resolved a:Andrew fired during network outage [14:12:29] (open) aborrero: eqiad1: create zuul3 project [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/167 (https://phabricator.wikimedia.org/T390081) [14:12:45] cloud-services-team, Data-Services: [wikireplicas] Create views for new wiki nupwiki - https://phabricator.wikimedia.org/T390714#10704033 (joanna_borun) p:Triage→Low [14:13:01] cloud-services-team (FY2024/2025-Q3-Q4), Data-Services: [wikireplicas] Create views for new wiki nupwiki - https://phabricator.wikimedia.org/T390714#10704034 (fnegri) a:fnegri [14:13:09] cloud-services-team, PAWS: Duplicate eqiad1 PAWS setup in codfw1dev - https://phabricator.wikimedia.org/T390726#10704036 (Andrew) p:Triage→Medium [14:13:24] cloud-services-team, Tool-suggestbotbn, Toolforge: Ssh login to `login.toolforge.org` failing for uid=shohag - https://phabricator.wikimedia.org/T390614#10704039 (fnegri) @ShohagS can you confirm if you can now log in successfully? 
[14:13:45] cloud-services-team, Tool-suggestbotbn, Toolforge: Ssh login to `login.toolforge.org` failing for uid=shohag - https://phabricator.wikimedia.org/T390614#10704040 (fnegri) p:Triage→Low [14:13:56] PROBLEM - Host cloudservices1006 is DOWN: PING CRITICAL - Packet loss = 100% [14:14:59] cloud-services-team, Cloud-VPS: Enable use of web proxy for wikipeoplestats.org domain - https://phabricator.wikimedia.org/T390800#10704042 (aborrero) p:Triage→Medium a:aborrero [14:15:06] cloud-services-team, Cloud-VPS: Enable use of web proxy for wikipeoplestats.org domain - https://phabricator.wikimedia.org/T390800#10704046 (fnegri) +1 from me [14:15:09] cloud-services-team, Toolforge: Environment variables are not being passed - https://phabricator.wikimedia.org/T390845#10704047 (joanna_borun) p:Triage→Low [14:15:28] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudservices1006.eqiad.wmnet' (T381499) [14:15:32] RECOVERY - Host cloudservices1006 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [14:15:34] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [14:16:34] PROBLEM - Check DNS auth via TCP of www.wmcloud.org on server ns1.openstack.eqiad1.wikimediacloud.org on cloudservices1006 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:16:34] PROBLEM - Check DNS auth via UDP of www.wmcloud.org on server ns1.openstack.eqiad1.wikimediacloud.org on cloudservices1006 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:16:34] PROBLEM - Check DNS auth via TCP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns1.openstack.eqiad1.wikimediacloud.org on cloudservices1006 is CRITICAL: CRITICAL - Plugin timed out while 
executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:17:21] RECOVERY - Check DNS auth via TCP of www.wmcloud.org on server ns1.openstack.eqiad1.wikimediacloud.org on cloudservices1006 is OK: DNS OK - 0.090 seconds response time (www.wmcloud.org. 3600 IN CNAME wmcloud.org.) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:17:23] RECOVERY - Check DNS auth via UDP of www.wmcloud.org on server ns1.openstack.eqiad1.wikimediacloud.org on cloudservices1006 is OK: DNS OK - 0.060 seconds response time (www.wmcloud.org. 3600 IN CNAME wmcloud.org.) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:17:25] RECOVERY - Check DNS auth via TCP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns1.openstack.eqiad1.wikimediacloud.org on cloudservices1006 is OK: DNS OK - 0.138 seconds response time (k8s.svc.tools.eqiad1.wikimedia.cloud. 300 IN A 172.16.6.113) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:18:17] (approved) fnegri: eqiad1: create zuul3 project [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/167 (https://phabricator.wikimedia.org/T390081) (owner: aborrero) [14:18:33] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudcontrol1005.eqiad.wmnet' (T381499) [14:18:40] (merge) aborrero: eqiad1: create zuul3 project [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/167 (https://phabricator.wikimedia.org/T390081) [14:18:48] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch [14:20:06] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan+apply for main branch [14:20:34] !log aborrero@cloudcumin1001 zuul3 START - Cookbook 
wmcs.vps.add_user_to_project for user 'bd808' in role 'member' [14:20:40] !log aborrero@cloudcumin1001 zuul3 END (PASS) - Cookbook wmcs.vps.add_user_to_project (exit_code=0) for user 'bd808' in role 'member' [14:21:09] !log aborrero@cloudcumin1001 zuul3 START - Cookbook wmcs.vps.add_user_to_project for user 'hashar' in role 'member' [14:21:15] !log aborrero@cloudcumin1001 zuul3 END (PASS) - Cookbook wmcs.vps.add_user_to_project (exit_code=0) for user 'hashar' in role 'member' [14:21:23] !log aborrero@cloudcumin1001 zuul3 START - Cookbook wmcs.vps.add_user_to_project for user 'dduvall' in role 'member' [14:21:29] !log aborrero@cloudcumin1001 zuul3 END (PASS) - Cookbook wmcs.vps.add_user_to_project (exit_code=0) for user 'dduvall' in role 'member' [14:21:35] !log aborrero@cloudcumin1001 zuul3 START - Cookbook wmcs.vps.add_user_to_project for user 'thcipriani' in role 'member' [14:21:40] !log aborrero@cloudcumin1001 zuul3 END (PASS) - Cookbook wmcs.vps.add_user_to_project (exit_code=0) for user 'thcipriani' in role 'member' [14:22:24] cloud-services-team, Cloud-VPS (Project-requests), Release-Engineering-Team, Continuous-Integration-Infrastructure (Zuul upgrade), Patch-For-Review: Request creation of zuul3 VPS project - https://phabricator.wikimedia.org/T390081#10704124 (aborrero) Open→Resolved a:aborrero [14:23:17] (CR) MVernon: "Hi," [labs/private] - https://gerrit.wikimedia.org/r/1132643 (https://phabricator.wikimedia.org/T378922) (owner: Jelto) [14:26:44] cloud-services-team, Toolforge: [jobs-api] add -l|--last to toolforge jobs logs ... 
- https://phabricator.wikimedia.org/T388088#10704167 (taavi) [14:26:55] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=97) on host 'cloudcontrol1005.eqiad.wmnet' (T381499) [14:26:59] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudcontrol1005.eqiad.wmnet' (T381499) [14:27:01] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudcontrol1005.eqiad.wmnet' (T381499) [14:27:01] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [14:27:12] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudcontrol1005.eqiad.wmnet' (T381499) [14:27:21] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudcontrol1005.eqiad.wmnet' (T381499) [14:27:33] cloud-services-team, Toolforge, Epic: [jobs-api] add toolforge jobs shell feature - https://phabricator.wikimedia.org/T388091#10704175 (taavi) →Duplicate dup:T311917 [14:27:40] cloud-services-team, Toolforge: [webservice,toolforge-cli] Make `webservice shell` a standalone tool - https://phabricator.wikimedia.org/T311917#10704177 (taavi) [14:27:47] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1006.eqiad.wmnet' (T381499) [14:27:49] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1006.eqiad.wmnet' (T381499) [14:28:01] cloud-services-team, Toolforge: [jobs-api] add toolforge jobs shell feature - https://phabricator.wikimedia.org/T388091#10704179 (taavi) [14:28:19] !log andrew@cloudcumin1001 admin START - Cookbook 
wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:28:21] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:28:36] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:28:56] FIRING: SystemdUnitDown: The service unit nova-fullstack.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:29:01] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:29:36] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:30:01] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:30:37] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:31:18] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:31:55] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:32:10] FIRING: GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [14:32:22] FIRING: HAProxyBackendUnavailable: HAProxy service mysql backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [14:32:35] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:32:41] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [14:34:46] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:34:50] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:35:04] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:35:06] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:35:24] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:35:36] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:35:42] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 
'cloudservices1005.eqiad.wmnet' (T381499) [14:35:46] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:35:56] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:36:00] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:36:06] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1007.eqiad.wmnet' (T381499) [14:36:10] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1007.eqiad.wmnet' (T381499) [14:36:18] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:36:19] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:36:22] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:36:34] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:36:37] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:36:38] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:36:41] !log andrew@cloudcumin1001 admin START - 
Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:36:42] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:36:45] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:36:46] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:36:57] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:36:58] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:37:10] RESOLVED: [2x] GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [14:37:22] FIRING: [14x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [14:38:30] cloud-services-team, Cloud-VPS: Enable use of web proxy for wikipeoplestats.org domain - https://phabricator.wikimedia.org/T390800#10704262 (aborrero) this is what I did: Using horizon, in the `project-proxy` project, in the hiera prefix `project-proxy-acme-chief`: `lang=yaml profile::acme_chief::certi...
[14:41:43] cloud-services-team, Cloud-VPS: Enable use of web proxy for wikipeoplestats.org domain - https://phabricator.wikimedia.org/T390800#10704263 (aborrero) Open→Resolved I think this is done, but most likely you need your part of the configuration, as mentioned here: https://wikitech.wikimedia.org... [14:42:22] RESOLVED: [14x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [14:47:25] FIRING: [2x] GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [14:47:52] FIRING: [14x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [14:48:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-55 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [14:49:00] RESOLVED: HAProxyBackendUnavailable: HAProxy service heat-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [14:50:05] !log andrew@cloudcumin1001 admin START - Cookbook
wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:50:07] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:50:11] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [14:52:52] FIRING: [14x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [14:54:00] FIRING: [14x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [14:54:21] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:54:40] FIRING: [2x] GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [14:54:47] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:54:58] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:55:02] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:55:07] 
!log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:55:10] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:55:16] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [14:56:14] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:56:17] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:56:20] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:56:43] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:56:50] cloud-services-team, Cloud-VPS: Enable use of web proxy for wikipeoplestats.org domain - https://phabricator.wikimedia.org/T390800#10704331 (taavi) Resolved→Open Re-opening because this did not work: ` Apr 02 14:48:38 project-proxy-acme-chief-02 acme-chief-backend[583]: Handling new certificate e...
[14:57:12] RESOLVED: GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [14:57:12] RESOLVED: HAProxyBackendUnavailable: HAProxy service glance-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [14:57:35] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:57:40] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:57:43] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:57:47] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:57:52] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:57:56] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:58:22] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:58:47] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) 
[14:59:05] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:59:07] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:59:27] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:59:29] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:59:55] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [14:59:57] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:00:08] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:00:10] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:00:17] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:00:19] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [15:00:23] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:00:27] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) 
[15:00:31] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:00:48] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:00:50] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:00:52] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:00:56] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:01:02] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:01:27] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:01:43] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:01:47] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:01:49] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:02:12] FIRING: [18x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable 
[15:02:15] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:02:18] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:02:19] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:02:22] FIRING: [4x] HAProxyServiceUnavailable: HAProxy service nova-api_backend has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable [15:02:22] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:02:27] cloud-services-team: HAProxyServiceUnavailable - https://phabricator.wikimedia.org/T390877 (phaultfinder) NEW [15:02:47] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:02:50] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:02:54] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:03:01] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:03:05] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:04:04] (PS1) Andrew Bogott: upgrade_openstack_node:
don't lock tables when backing up [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1133432 [15:04:08] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:04:10] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:04:49] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:07:12] FIRING: [18x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [15:07:22] RESOLVED: [4x] HAProxyServiceUnavailable: HAProxy service nova-api_backend has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable [15:07:38] (CR) CI reject: [V:-1] upgrade_openstack_node: don't lock tables when backing up [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1133432 (owner: Andrew Bogott) [15:11:03] PROBLEM - Host cloudservices1005 is DOWN: PING CRITICAL - Packet loss = 100% [15:11:27] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudcontrol1005.eqiad.wmnet' (T381499) [15:11:33] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [15:11:46] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:11:53] RECOVERY - Host cloudservices1005 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [15:12:47] (PS2) Andrew
Bogott: upgrade_openstack_node: don't lock tables when backing up [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1133432 [15:13:47] PROBLEM - Check DNS auth via TCP of login.toolforge.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:13:59] PROBLEM - Bird Internet Routing Daemon on cloudservices1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:13:59] PROBLEM - Check DNS auth via TCP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:14:00] PROBLEM - Check DNS auth via TCP of tools-puppetserver-01.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:14:01] PROBLEM - Check DNS auth via UDP of tools-puppetserver-01.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:14:03] PROBLEM - Check DNS auth via TCP of www.wmcloud.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:14:23] PROBLEM - Check if anycast-healthchecker and all configured threads are running on cloudservices1005 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid
file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [15:14:33] PROBLEM - Check DNS auth via UDP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:14:33] PROBLEM - Check DNS auth via UDP of login.toolforge.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:14:51] PROBLEM - Check DNS auth via UDP of www.wmcloud.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:14:51] RECOVERY - Check DNS auth via UDP of tools-puppetserver-01.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.033 seconds response time (tools-puppetserver-01.tools.eqiad1.wikimedia.cloud. 60 IN A 172.16.3.13) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:14:53] RECOVERY - Check DNS auth via TCP of www.wmcloud.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.076 seconds response time (www.wmcloud.org. 3600 IN CNAME wmcloud.org.) 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:14:59] RECOVERY - Bird Internet Routing Daemon on cloudservices1005 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:15:23] RECOVERY - Check if anycast-healthchecker and all configured threads are running on cloudservices1005 is OK: OK: UP (pid=3882) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [15:15:23] RECOVERY - Check DNS auth via UDP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.043 seconds response time (k8s.svc.tools.eqiad1.wikimedia.cloud. 300 IN A 172.16.6.113) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:15:24] RECOVERY - Check DNS auth via UDP of login.toolforge.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.077 seconds response time (login.toolforge.org. 3600 IN CNAME bastion.toolforge.org.) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:15:37] RECOVERY - Check DNS auth via TCP of login.toolforge.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.026 seconds response time (login.toolforge.org. 3600 IN CNAME bastion.toolforge.org.) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:15:41] RECOVERY - Check DNS auth via UDP of www.wmcloud.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.029 seconds response time (www.wmcloud.org. 3600 IN CNAME wmcloud.org.) 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:15:49] RECOVERY - Check DNS auth via TCP of tools-puppetserver-01.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.026 seconds response time (tools-puppetserver-01.tools.eqiad1.wikimedia.cloud. 60 IN A 172.16.3.13) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:15:49] RECOVERY - Check DNS auth via TCP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.031 seconds response time (k8s.svc.tools.eqiad1.wikimedia.cloud. 300 IN A 172.16.6.113) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:16:44] (CR) CI reject: [V:-1] upgrade_openstack_node: don't lock tables when backing up [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1133432 (owner: Andrew Bogott) [15:16:45] cloud-services-team, Tool-suggestbotbn, Toolforge: Ssh login to `login.toolforge.org` failing for uid=shohag - https://phabricator.wikimedia.org/T390614#10704528 (ShohagS) >>! In T390614#10704039, @fnegri wrote: > @ShohagS can you confirm if you can now log in successfully? Oh I am sorry and thank...
[15:20:08] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudcontrol1005.eqiad.wmnet' (T381499) [15:20:15] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [15:25:10] FIRING: ProjectProxyMainProxyDown: Proxy on proxy-03 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyDown [15:27:55] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown [15:27:58] FIRING: [2x] MetricsinfraAlertmanagerDown: Metricsinfra alertmanager is unreachable #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/MetricsinfraAlertmanagerDown - TODO - https://alerts.wikimedia.org/?q=alertname%3DMetricsinfraAlertmanagerDown [15:29:00] FIRING: SystemdUnitDown: The service unit labs-ip-alias-dump.service is in failed status on host cloudservices1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudservices1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:30:10] RESOLVED: ProjectProxyMainProxyDown: Proxy on proxy-03 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyDown [15:32:55] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown [15:32:58] RESOLVED: [2x] MetricsinfraAlertmanagerDown: Metricsinfra alertmanager is unreachable #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/MetricsinfraAlertmanagerDown - TODO - https://alerts.wikimedia.org/?q=alertname%3DMetricsinfraAlertmanagerDown [15:37:01] 06cloud-services-team, 10Cloud-VPS: Enable use of web proxy for wikipeoplestats.org domain - https://phabricator.wikimedia.org/T390800#10704649 (10aborrero) >>! In T390800#10704331, @taavi wrote: > [..] > [[ https://wikitech.wikimedia.org/wiki/Help:Using_a_web_proxy_to_reach_Cloud_VPS_servers_from_the_internet... [15:37:39] 06cloud-services-team, 10Tool-suggestbotbn, 10Toolforge: Ssh login to `login.toolforge.org` failing for uid=shohag - https://phabricator.wikimedia.org/T390614#10704651 (10fnegri) 05Open→03Resolved a:03fnegri [15:38:19] FIRING: HighIOWaitStalling: High iowait detected on clouddumps1002:9100. - https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage#Dumps - https://grafana.wikimedia.org/d/000000568/wmcs-dumps-general-view - https://alerts.wikimedia.org/?q=alertname%3DHighIOWaitStalling [15:38:28] FIRING: WidespreadPuppetAgentFailure: Widespread puppet agent failures in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure [15:43:19] RESOLVED: HighIOWaitStalling: High iowait detected on clouddumps1002:9100. 
- https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage#Dumps - https://grafana.wikimedia.org/d/000000568/wmcs-dumps-general-view - https://alerts.wikimedia.org/?q=alertname%3DHighIOWaitStalling [15:45:04] Tool-documentation, WMA-Hackathon-2025: [Translate / traduire ou creer documentation en FR] Translat-a-thon outil - https://phabricator.wikimedia.org/T390393#10704663 (TBurmeister) [15:52:12] FIRING: [2x] SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [15:52:37] Tool-documentation, WMA-Hackathon-2025: WMAHack25: Wikimedia Tool Documentation - https://phabricator.wikimedia.org/T390349#10704688 (TBurmeister) [15:52:39] Tool-documentation, WMA-Hackathon-2025: [Documentation] Write user docs for WikiBooster tool - https://phabricator.wikimedia.org/T390353#10704689 (TBurmeister) [15:53:28] RESOLVED: WidespreadPuppetAgentFailure: Widespread puppet agent failures in project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadPuppetAgentFailure [15:53:51] Tool-documentation, WMA-Hackathon-2025: WMAHack25: Wikimedia Tool Documentation - https://phabricator.wikimedia.org/T390349#10704694 (TBurmeister) [15:53:52] Tool-documentation, WMA-Hackathon-2025: [Documentation] Write user docs for WikiBooster tool - https://phabricator.wikimedia.org/T390353#10704693 (TBurmeister) [15:54:29] Tool-documentation, WMA-Hackathon-2025: WMAHack25: Wikimedia Tool Documentation - https://phabricator.wikimedia.org/T390349#10704695 (TBurmeister) [15:54:31] Tool-documentation, WMA-Hackathon-2025: [Documentation] Write user docs for Translat-a-thon tool - https://phabricator.wikimedia.org/T390387#10704696 (TBurmeister) [15:55:22] !log andrew@cloudcumin1001 admin START - Cookbook
wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1005.eqiad.wmnet' (T381499) [15:55:31] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [16:01:26] cloud-services-team, Toolforge: Check for non-libre vscode-server installs/processes on Toolforge bastions - https://phabricator.wikimedia.org/T390885 (bd808) NEW [16:01:33] PROBLEM - Host cloudservices1005 is DOWN: PING CRITICAL - Packet loss = 100% [16:02:15] RECOVERY - Host cloudservices1005 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [16:02:20] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudservices1005.eqiad.wmnet' (T381499) [16:02:27] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [16:02:30] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudcontrol1006.eqiad.wmnet' (T381499) [16:03:16] cloud-services-team, Tool-suggestbotbn, Toolforge: Ssh login to `login.toolforge.org` failing for uid=shohag - https://phabricator.wikimedia.org/T390614#10704760 (bd808) a:fnegri→taavi [16:03:59] PROBLEM - Check DNS auth via TCP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:03:59] PROBLEM - Check DNS auth via TCP of tools-puppetserver-01.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:03:59] PROBLEM - Bird Internet Routing Daemon on cloudservices1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird
https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [16:04:01] PROBLEM - Check DNS auth via UDP of tools-puppetserver-01.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:04:03] PROBLEM - Check DNS auth via TCP of www.wmcloud.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:04:23] PROBLEM - Check if anycast-healthchecker and all configured threads are running on cloudservices1005 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [16:04:33] PROBLEM - Check DNS auth via UDP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:04:33] PROBLEM - Check DNS auth via UDP of login.toolforge.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:04:47] PROBLEM - Check DNS auth via TCP of login.toolforge.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:04:49] PROBLEM - Check DNS auth via UDP of www.wmcloud.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is 
CRITICAL: DNS CRITICAL - 7.034 seconds response time (www.wmcloud.org. 3600 IN CNAME wmcloud.org.) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:04:49] RECOVERY - Check DNS auth via TCP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.030 seconds response time (k8s.svc.tools.eqiad1.wikimedia.cloud. 300 IN A 172.16.6.113) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:04:50] RECOVERY - Check DNS auth via TCP of tools-puppetserver-01.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.034 seconds response time (tools-puppetserver-01.tools.eqiad1.wikimedia.cloud. 60 IN A 172.16.3.13) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:04:51] RECOVERY - Check DNS auth via UDP of tools-puppetserver-01.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.044 seconds response time (tools-puppetserver-01.tools.eqiad1.wikimedia.cloud. 60 IN A 172.16.3.13) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:04:53] RECOVERY - Check DNS auth via TCP of www.wmcloud.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.044 seconds response time (www.wmcloud.org. 3600 IN CNAME wmcloud.org.) 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:04:59] RECOVERY - Bird Internet Routing Daemon on cloudservices1005 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [16:05:23] RECOVERY - Check if anycast-healthchecker and all configured threads are running on cloudservices1005 is OK: OK: UP (pid=3805) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [16:05:23] RECOVERY - Check DNS auth via UDP of k8s.svc.tools.eqiad1.wikimedia.cloud on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.039 seconds response time (k8s.svc.tools.eqiad1.wikimedia.cloud. 300 IN A 172.16.6.113) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:05:23] RECOVERY - Check DNS auth via UDP of login.toolforge.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.072 seconds response time (login.toolforge.org. 3600 IN CNAME bastion.toolforge.org.) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:05:37] RECOVERY - Check DNS auth via TCP of login.toolforge.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.037 seconds response time (login.toolforge.org. 3600 IN CNAME bastion.toolforge.org.) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:05:41] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1006.eqiad.wmnet' (T381499) [16:05:41] RECOVERY - Check DNS auth via UDP of www.wmcloud.org on server ns0.openstack.eqiad1.wikimediacloud.org on cloudservices1005 is OK: DNS OK - 0.035 seconds response time (www.wmcloud.org. 3600 IN CNAME wmcloud.org.) 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:05:43] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=97) on host 'cloudservices1006.eqiad.wmnet' (T381499) [16:05:52] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices1006.eqiad.wmnet' (T381499) [16:07:12] RESOLVED: SystemdUnitDown: The service unit labs-ip-alias-dump.service is in failed status on host cloudservices1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudservices1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:11:45] PROBLEM - Host cloudservices1006 is DOWN: PING CRITICAL - Packet loss = 100% [16:12:12] FIRING: KernelErrors: Server cloudcontrol1005 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors [16:12:24] cloud-services-team: KernelErrors Server cloudcontrol1005 logged kernel errors - https://phabricator.wikimedia.org/T390886 (phaultfinder) NEW [16:13:09] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudservices1006.eqiad.wmnet' (T381499) [16:13:15] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [16:13:47] RECOVERY - Host cloudservices1006 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [16:15:07] PROBLEM - Bird Internet Routing Daemon on cloudservices1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [16:15:11]
PROBLEM - Check DNS auth via UDP of tools-puppetserver-01.tools.eqiad1.wikimedia.cloud on server ns1.openstack.eqiad1.wikimediacloud.org on cloudservices1006 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:15:21] PROBLEM - Check DNS auth via TCP of tools-puppetserver-01.tools.eqiad1.wikimedia.cloud on server ns1.openstack.eqiad1.wikimediacloud.org on cloudservices1006 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:16:01] RECOVERY - Check DNS auth via UDP of tools-puppetserver-01.tools.eqiad1.wikimedia.cloud on server ns1.openstack.eqiad1.wikimediacloud.org on cloudservices1006 is OK: DNS OK - 0.079 seconds response time (tools-puppetserver-01.tools.eqiad1.wikimedia.cloud. 60 IN A 172.16.3.13) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:16:07] RECOVERY - Bird Internet Routing Daemon on cloudservices1006 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [16:16:11] RECOVERY - Check DNS auth via TCP of tools-puppetserver-01.tools.eqiad1.wikimedia.cloud on server ns1.openstack.eqiad1.wikimediacloud.org on cloudservices1006 is OK: DNS OK - 0.027 seconds response time (tools-puppetserver-01.tools.eqiad1.wikimedia.cloud. 
60 IN A 172.16.3.13) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:18:10] FIRING: GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [16:19:00] FIRING: [12x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [16:21:22] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudcontrol1006.eqiad.wmnet' (T381499) [16:21:23] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch [16:21:28] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [16:22:10] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.openstack.tofu (exit_code=97) running tofu plan+apply for main branch [16:22:12] FIRING: NodeDown: Node cloudcontrol1006 is down. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown [16:22:18] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudcontrol1005.eqiad.wmnet' (T381499) [16:23:10] RESOLVED: GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [16:24:00] FIRING: [12x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [16:24:00] RESOLVED: NodeDown: Node cloudcontrol1006 is down. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NodeDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DNodeDown [16:24:00] FIRING: KernelErrors: Server cloudcontrol1006 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors [16:24:06] cloud-services-team: KernelErrors Server cloudcontrol1006 logged kernel errors - https://phabricator.wikimedia.org/T390889 (phaultfinder) NEW [16:27:12] RESOLVED: [12x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [16:30:40] FIRING: GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [16:32:12] FIRING: [26x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [16:32:16] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudcontrol1005.eqiad.wmnet' (T381499) [16:32:22] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [16:32:27] !log andrew@cloudcumin1001 admin START - Cookbook
wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudcontrol1007.eqiad.wmnet' (T381499) [16:34:00] FIRING: [2x] KernelErrors: Server cloudcontrol1005 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors [16:34:05] FIRING: [26x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [16:34:16] cloud-services-team: KernelErrors - https://phabricator.wikimedia.org/T390893 (phaultfinder) NEW [16:35:40] RESOLVED: [2x] GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [16:36:39] Tool-documentation, WMA-Hackathon-2025: WMAHack25: Wikimedia Tool Documentation - https://phabricator.wikimedia.org/T390349#10704990 (TBurmeister) Summary of activities: Participants created draft technical documentation using the [[ https://www.mediawiki.org/wiki/Documentation/Tool_doc_template | Tool...
[16:37:11] Tool-documentation, WMA-Hackathon-2025: WMAHack25: Wikimedia Tool Documentation - https://phabricator.wikimedia.org/T390349#10704994 (TBurmeister) Open→In progress a:TBurmeister [16:37:12] RESOLVED: [26x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [16:47:22] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1031.eqiad.wmnet' (T381499) [16:47:28] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [16:47:40] FIRING: [2x] GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [16:49:00] FIRING: [28x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [16:52:31] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudcontrol1007.eqiad.wmnet' (T381499) [16:52:37] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [16:52:40] FIRING: [2x] GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary -
https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [16:52:55] FIRING: [3x] GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [16:53:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-55 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [16:53:29] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudweb.unset_maintenance (1133432) [16:54:01] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1031.eqiad.wmnet' (T381499) [16:54:03] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1032.eqiad.wmnet' (T381499) [16:55:31] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudweb.unset_maintenance (exit_code=0) (1133432) [16:56:52] RESOLVED: [14x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [16:57:12] FIRING: [3x] KernelErrors: Server cloudcontrol1005 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors [16:57:17] cloud-services-team: KernelErrors -
https://phabricator.wikimedia.org/T390893#10705108 (phaultfinder) [16:57:40] RESOLVED: [2x] GaleraClusterSizeMismatch: Galera in eqiad1 has 2 nodes - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/GaleraClusterSizeMismatch - https://grafana.wikimedia.org/d/galera-cluster-summary/wmcs-openstack-eqiad-galera-cluster-summary - https://alerts.wikimedia.org/?q=alertname%3DGaleraClusterSizeMismatch [16:58:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-55 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [16:59:47] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1032.eqiad.wmnet' (T381499) [16:59:49] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1033.eqiad.wmnet' (T381499) [16:59:54] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [17:05:47] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1033.eqiad.wmnet' (T381499) [17:05:49] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1034.eqiad.wmnet' (T381499) [17:05:54] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [17:11:37] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1034.eqiad.wmnet' (T381499) [17:11:39] !log andrew@cloudcumin1001 admin START - Cookbook
wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1035.eqiad.wmnet' (T381499) [17:11:44] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [17:17:18] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1035.eqiad.wmnet' (T381499) [17:17:19] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1036.eqiad.wmnet' (T381499) [17:17:24] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [17:23:28] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1036.eqiad.wmnet' (T381499) [17:23:29] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1037.eqiad.wmnet' (T381499) [17:23:36] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [17:30:06] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1037.eqiad.wmnet' (T381499) [17:30:08] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1038.eqiad.wmnet' (T381499) [17:30:12] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [17:33:28] Tool-documentation, WMA-Hackathon-2025: [Documentation] Write user docs for A search engine for translations from the English Wiktionary - https://phabricator.wikimedia.org/T390456#10705255 (Erutuon) Unfortunately this tool is out of date. It is showing translations from the versions of English Wikti...
[17:36:34] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1038.eqiad.wmnet' (T381499) [17:36:36] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1039.eqiad.wmnet' (T381499) [17:36:40] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [17:42:25] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1039.eqiad.wmnet' (T381499) [17:42:27] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1040.eqiad.wmnet' (T381499) [17:42:32] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [17:46:30] Grid-Engine-to-K8s-Migration: Migrate enwikt-translations from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319724#10705317 (Erutuon) > Yes, currently every build happens from scratch, so it takes a few minutes sometimes to build, we will add some caching and such in... [17:48:48] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1040.eqiad.wmnet' (T381499) [17:48:49] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1041.eqiad.wmnet' (T381499) [17:48:54] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [17:49:17] Tool-documentation, WMA-Hackathon-2025: [Documentation] Write user docs for A search engine for translations from the English Wiktionary - https://phabricator.wikimedia.org/T390456#10705341 (Accuratecy051) Oh Though it was stated not actively maintained, but what can be done to revitalize the tool...
[17:50:23] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudbackup1003.eqiad.wmnet' (T381499) [17:55:15] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1041.eqiad.wmnet' (T381499) [17:55:18] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1042.eqiad.wmnet' (T381499) [17:55:21] cloud-services-team, Toolforge: Check for non-libre vscode-server installs/processes on Toolforge bastions - https://phabricator.wikimedia.org/T390885#10705377 (bd808) I don't want to put any of these folks on blast as I expect most if not all of them never thought about the crayon license that Micro$oft... [17:55:21] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [18:00:28] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudbackup1003.eqiad.wmnet' (T381499) [18:00:37] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [18:00:43] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudbackup1004.eqiad.wmnet' (T381499) [18:02:43] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1042.eqiad.wmnet' (T381499) [18:02:45] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1043.eqiad.wmnet' (T381499) [18:09:18] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1043.eqiad.wmnet' (T381499) [18:09:19] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on
host 'cloudvirt1044.eqiad.wmnet' (T381499) [18:11:01] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudbackup1004.eqiad.wmnet' (T381499) [18:11:16] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudbackup2003.eqiad.wmnet' (T381499) [18:11:16] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudbackup2003.eqiad.wmnet' (T381499) [18:14:46] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudbackup2003.codfw.wmnet' (T381499) [18:15:54] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1044.eqiad.wmnet' (T381499) [18:15:56] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1045.eqiad.wmnet' (T381499) [18:19:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:21:14] PROBLEM - Host cloudbackup2003 is DOWN: PING CRITICAL - Packet loss = 100% [18:22:04] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1045.eqiad.wmnet' (T381499) [18:22:07] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1046.eqiad.wmnet' (T381499) [18:24:33] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudbackup2003.codfw.wmnet' (T381499) [18:24:44] RECOVERY - Host cloudbackup2003 is UP: PING OK - Packet loss = 0%, RTA = 30.38 ms [18:28:48] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1046.eqiad.wmnet' (T381499) [18:28:49] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1047.eqiad.wmnet' (T381499) [18:29:01] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudbackup2004.codfw.wmnet' (T381499) [18:29:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:33:01] (open) naorleizer: fixed searching endpoint [toolforge-repos/miss-search] - https://gitlab.wikimedia.org/toolforge-repos/miss-search/-/merge_requests/2 [18:33:09] (merge) naorleizer: fixed searching endpoint [toolforge-repos/miss-search] - https://gitlab.wikimedia.org/toolforge-repos/miss-search/-/merge_requests/2 [18:34:56] RESOLVED: SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:34:58] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1047.eqiad.wmnet' (T381499) [18:34:59] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1048.eqiad.wmnet' (T381499) [18:35:04] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [18:35:10] PROBLEM - Host cloudbackup2004 is DOWN: PING CRITICAL - Packet loss = 100% [18:36:22] FIRING: HAProxyBackendUnavailable: HAProxy service designate-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [18:38:36] RECOVERY - Host cloudbackup2004 is UP: PING OK - Packet loss = 0%, RTA = 30.38 ms [18:38:39] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudbackup2004.codfw.wmnet' (T381499) [18:41:22] RESOLVED: [3x]
HAProxyBackendUnavailable: HAProxy service designate-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [18:42:17] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1048.eqiad.wmnet' (T381499) [18:42:19] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1049.eqiad.wmnet' (T381499) [18:42:24] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [18:43:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [18:44:56] FIRING: [2x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:49:06] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1049.eqiad.wmnet' (T381499) [18:49:08] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1050.eqiad.wmnet' (T381499) [18:49:13] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [18:49:56] FIRING: [3x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:56:40] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1050.eqiad.wmnet' (T381499) [18:56:42] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1051.eqiad.wmnet' (T381499) [18:56:48] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [18:59:56] RESOLVED: [3x] SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1005. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:03:14] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1051.eqiad.wmnet' (T381499) [19:03:15] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1052.eqiad.wmnet' (T381499) [19:03:20] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [19:03:41] FIRING: CloudVPSDesignateLeaks: Detected 9 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [19:03:59] Tool-fault-tolerance: Low priority: new elastic hosts not showing in web UI - https://phabricator.wikimedia.org/T390902 (bking) NEW [19:08:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [19:10:22] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1052.eqiad.wmnet' (T381499) [19:10:24] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1053.eqiad.wmnet' (T381499) [19:10:29] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [19:17:20] !log andrew@cloudcumin1001 admin END (PASS) -
Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1053.eqiad.wmnet' (T381499) [19:17:21] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1054.eqiad.wmnet' (T381499) [19:17:28] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [19:24:09] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1054.eqiad.wmnet' (T381499) [19:24:11] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1055.eqiad.wmnet' (T381499) [19:24:17] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [19:28:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [19:30:32] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1055.eqiad.wmnet' (T381499) [19:30:33] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1056.eqiad.wmnet' (T381499) [19:30:38] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [19:33:03] RESOLVED: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - 
https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProce [19:36:58] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudnet1005.eqiad.wmnet' (T381499) [19:37:04] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [19:37:42] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1056.eqiad.wmnet' (T381499) [19:37:44] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1057.eqiad.wmnet' (T381499) [19:39:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [19:39:20] cloud-services-team: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T390906 (phaultfinder) NEW [19:43:03] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [19:43:34] FIRING: DiskSpace: Disk space cloudcontrol1005:9100:/ 3.073% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [19:44:11]
RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [19:44:19] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1057.eqiad.wmnet' (T381499) [19:44:21] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1058.eqiad.wmnet' (T381499) [19:44:25] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [19:45:34] PROBLEM - Host cloudnet1005 is DOWN: PING CRITICAL - Packet loss = 100% [19:46:24] RECOVERY - Host cloudnet1005 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [19:46:35] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudnet1005.eqiad.wmnet' (T381499) [19:48:09] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudnet1006.eqiad.wmnet' (T381499) [19:48:34] RESOLVED: DiskSpace: Disk space cloudcontrol1005:9100:/ 3.162% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:12:49] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1061.eqiad.wmnet' (T381499) [20:12:51] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1062.eqiad.wmnet' (T381499) [20:12:55] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - 
https://phabricator.wikimedia.org/T381499 [20:13:41] RESOLVED: CloudVPSDesignateLeaks: Detected 9 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [20:17:31] cloud-services-team, Toolforge: [webservice,toolforge-cli] Make `webservice shell` a standalone tool - https://phabricator.wikimedia.org/T311917#10705931 (bd808) >>! In T388091#10611511, @bd808 wrote: > One of the features of `webservice shell` that I worked pretty hard to establish is that you can creat... [20:17:49] RESOLVED: [4x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudnet1006 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [20:19:58] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-68, tools-k8s-worker-nfs-55 [20:20:07] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1062.eqiad.wmnet' (T381499) [20:20:09] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1063.eqiad.wmnet' (T381499) [20:20:13] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [20:26:37] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1063.eqiad.wmnet' (T381499) [20:26:40] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1064.eqiad.wmnet' (T381499) [20:26:44] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' -
https://phabricator.wikimedia.org/T381499 [20:30:26] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-68, tools-k8s-worker-nfs-55 [20:33:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [20:33:38] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1064.eqiad.wmnet' (T381499) [20:33:39] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1065.eqiad.wmnet' (T381499) [20:33:44] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [20:34:01] Tool-ldap: Display WMF cluster permissions granted to a Developer account - https://phabricator.wikimedia.org/T389885#10705953 (bd808) >>! In T389885#10702223, @Legoktm wrote: > Also, is "puppet groups" a good label for this? Is there a better term? >>! In T389885#10702404, @taavi wrote: > I've often used...
[20:41:12] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1065.eqiad.wmnet' (T381499) [20:41:13] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1066.eqiad.wmnet' (T381499) [20:41:19] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [20:43:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [20:44:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [20:48:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [20:48:26] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [20:49:36] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 
'cloudvirt1066.eqiad.wmnet' (T381499) [20:49:38] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1067.eqiad.wmnet' (T381499) [20:49:42] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [20:56:50] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1067.eqiad.wmnet' (T381499) [20:56:57] T381499: Upgrade cloud-vps openstack to version 'Dalmatian' - https://phabricator.wikimedia.org/T381499 [20:58:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [21:03:18] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [21:18:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [21:28:03] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node 
tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [21:39:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [21:44:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [21:45:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [21:47:41] cloud-services-team, Cloud-VPS: Upgrade cloud-vps openstack to version 'Epoxy' - https://phabricator.wikimedia.org/T390914 (Andrew) NEW [21:47:46] cloud-services-team, Cloud-VPS: Upgrade cloud-vps openstack to version 'Epoxy' - https://phabricator.wikimedia.org/T390914#10706158 (Andrew) p:Triage→Medium [21:47:53] cloud-services-team, Cloud-VPS: Upgrade cloud-vps openstack to version 'Epoxy' - https://phabricator.wikimedia.org/T390914#10706159 (Andrew) Open→Stalled [21:48:31] cloud-services-team, Cloud-VPS: Upgrade cloud-vps openstack to version 'Dalmatian' -
https://phabricator.wikimedia.org/T381499#10706161 (Andrew) Open→Resolved [21:49:26] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [21:50:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [21:52:49] Openstack-Magnum: Magnum UI should offer full kube config - https://phabricator.wikimedia.org/T343362#10706173 (Andrew) p:Triage→Medium [21:54:26] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [21:58:01] cloud-services-team, Cloud-VPS: Options/thoughts for faster VM provisioning - https://phabricator.wikimedia.org/T390822#10706184 (Andrew) @fgiunchedi are you already using a puppetless base image? If not, would you like to? I'm mixed on the idea of deferring package updates. Most of the time this will b... [22:18:49] cloud-services-team, Cloud-VPS: Options/thoughts for faster VM provisioning - https://phabricator.wikimedia.org/T390822#10706223 (Andrew) Wow! I thought the difference was going to be negligible and that puppet runs were somehow included in that 100 seconds, but nope! With package_update: true/package_u...
[22:21:04] cloud-services-team, Cloud-VPS: Options/thoughts for faster VM provisioning - https://phabricator.wikimedia.org/T390822#10706227 (taavi) One option to speed up the upgrade part would be to refresh the base image more frequently, like after every Debian point release or so. [22:29:45] cloud-services-team, Cloud-VPS: Options/thoughts for faster VM provisioning - https://phabricator.wikimedia.org/T390822#10706233 (Andrew) With package_update: true/package_upgrade: false: ` 34.71100s (init-local/search-OpenStackLocal) 30.67500s (modules-config/config-apt-configure) 30.040... [22:32:27] cloud-services-team, Cloud-VPS: Options/thoughts for faster VM provisioning - https://phabricator.wikimedia.org/T390822#10706239 (Andrew) >>! In T390822#10706227, @taavi wrote: > One option to speed up the upgrade part would be to refresh the base image more frequently, like after every Debian point rele... [22:43:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [22:46:50] cloud-services-team, PAWS: Duplicate eqiad1 PAWS setup in codfw1dev - https://phabricator.wikimedia.org/T390726#10706276 (Andrew) Open→Resolved [23:00:03] (PS1) Bovimacoco: T386326 Remove duplicate routes copy file bug=T386326 [labs/tools/wdaudiolex-be] - https://gerrit.wikimedia.org/r/1133572 [23:04:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-13 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses -
https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [23:31:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [23:36:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [23:39:22] (PS1) Bovimacoco: T386331 Remove unrequired packages from requirements Bug=T386331 [labs/tools/wdaudiolex-be] - https://gerrit.wikimedia.org/r/1133577