[01:03:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-61 has some processes stuck on NFS - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[01:13:05] Toolforge-standards-committee, video2commons: Write-access to Video2Commons GitHub repo - https://phabricator.wikimedia.org/T394802#11227570 (Soda) @Pintoch, I do think that makes sense.
[02:33:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-16 has some processes stuck on NFS - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[02:43:15] (update) samwilson: Migrate from GitHub to GitLab CI [toolforge-repos/svgtranslate] - https://gitlab.wikimedia.org/toolforge-repos/svgtranslate/-/merge_requests/1 (https://phabricator.wikimedia.org/T402505)
[02:49:15] (update) samwilson: Migrate from GitHub to GitLab CI [toolforge-repos/svgtranslate] - https://gitlab.wikimedia.org/toolforge-repos/svgtranslate/-/merge_requests/1 (https://phabricator.wikimedia.org/T402505)
[02:53:27] (merge) samwilson: Migrate from GitHub to GitLab CI [toolforge-repos/svgtranslate] - https://gitlab.wikimedia.org/toolforge-repos/svgtranslate/-/merge_requests/1 (https://phabricator.wikimedia.org/T402505)
[03:11:14] cloud-services-team, Toolforge: Dotnet bots failing with no logs - https://phabricator.wikimedia.org/T403927#11227721 (Hawkeye7) (1) Yes, I removed the output and error logging to get the jobs working again. It does indeed seem to have been the problem. Possibly the path is not defined? Does it work for...
[03:23:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-16 has some processes stuck on NFS - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[04:18:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-16 has some processes stuck on NFS - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[05:33:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-16 has some processes stuck on NFS - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[06:38:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-16 has some processes stuck on NFS - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[06:47:17] !log tools.cluebotng-trainer Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/18121221077 (https://github.com/cluebotng/component-configs/commits/11eb3da0c88160bca0ec60e55dadad7e928e40f3)
[06:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-trainer/SAL
[07:56:51] cloud-services-team, Toolforge: Jobs failing with no logs - https://phabricator.wikimedia.org/T403927#11227970 (dcaro)
[07:57:05] cloud-services-team, Toolforge: Jobs failing with no logs - https://phabricator.wikimedia.org/T403927#11227972 (dcaro) > (1) Yes, I removed the output and error logging to get the jobs working again. It does indeed seem to have been the problem. Possibly the path is not defined? Probably it was an issu...
[07:58:28] cloud-services-team, Toolforge: [jobs-cli] `job_type` not handled in `dump`, generates warning per job - https://phabricator.wikimedia.org/T405786#11227975 (dcaro) Yep, we are introducing a new field `job_type` to explicitly state which type of job you are defining and it's not yet integrated in the cli...
[08:02:52] cloud-services-team, Toolforge (Toolforge iteration 24): toolforge logs appears to suffer from intermittent latency - https://phabricator.wikimedia.org/T402736#11227978 (dcaro) I see two issues: * the rate limiting prevents us from sending more than 100 logs at a time, and the current policy in alloy co...
[08:06:26] !log dcaro@acme tools START - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers for tools-k8s-worker-nfs-16, tools-k8s-worker-nfs-67
[08:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:08:31] (update) dcaro: [status] make job status an enum, with clearly defined states [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/208 (https://phabricator.wikimedia.org/T401172) (owner: raymond-ndibe)
[08:18:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-16 has some processes stuck on NFS - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[08:19:25] !log dcaro@acme tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers (exit_code=0) for tools-k8s-worker-nfs-16, tools-k8s-worker-nfs-67
[08:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[08:23:03] RESOLVED: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-16 has some processes stuck on NFS - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[08:42:52] Toolforge (Quota-requests): Request increased build quota for cluebotng Toolforge tool - https://phabricator.wikimedia.org/T405645#11228044 (dcaro) +1
[08:42:56] Toolforge (Quota-requests): Request increased build quota for cluebotng-monitoring Toolforge tool - https://phabricator.wikimedia.org/T405644#11228046 (dcaro) +1
[08:43:00] Toolforge (Quota-requests): Request increased build quota for cluebotng-review Toolforge tool - https://phabricator.wikimedia.org/T405643#11228048 (dcaro) +1
[08:47:38] (update) taavi: maintain-harbor: Bump quotas for Cluebot related tools [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/982 (https://phabricator.wikimedia.org/T405643 https://phabricator.wikimedia.org/T405644 https://phabricator.wikimedia.org/T405645)
[08:47:41] (open) taavi: maintain-harbor: Bump quotas for Cluebot related tools [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/982 (https://phabricator.wikimedia.org/T405643 https://phabricator.wikimedia.org/T405644 https://phabricator.wikimedia.org/T405645)
[08:49:40] (approved) dcaro: maintain-harbor: Bump quotas for Cluebot related tools [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/982 (https://phabricator.wikimedia.org/T405643 https://phabricator.wikimedia.org/T405644 https://phabricator.wikimedia.org/T405645) (owner: taavi)
[08:50:54] (merge) taavi: maintain-harbor: Bump quotas for Cluebot related tools [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/982 (https://phabricator.wikimedia.org/T405643 https://phabricator.wikimedia.org/T405644 https://phabricator.wikimedia.org/T405645)
[08:51:13] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component maintain-harbor
[08:51:25] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component maintain-harbor
[08:53:33] (open) taavi: maintain-harbor: Fix name of cluebotng tool [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/983 (https://phabricator.wikimedia.org/T405645)
[08:53:34] (update) taavi: maintain-harbor: Fix name of cluebotng tool [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/983 (https://phabricator.wikimedia.org/T405645)
[08:54:12] (merge) taavi: maintain-harbor: Fix name of cluebotng tool [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/983 (https://phabricator.wikimedia.org/T405645)
[08:54:12] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component maintain-harbor
[08:54:23] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component maintain-harbor
[08:55:29] Toolforge (Quota-requests), Patch-For-Review: Request increased build quota for cluebotng Toolforge tool - https://phabricator.wikimedia.org/T405645#11228081 (taavi) Open→Resolved a:taavi
[08:55:29] Toolforge (Quota-requests), Patch-For-Review: Request increased build quota for cluebotng-monitoring Toolforge tool - https://phabricator.wikimedia.org/T405644#11228084 (taavi) Open→Resolved a:taavi
[08:55:36] Toolforge (Quota-requests), Patch-For-Review: Request increased build quota for cluebotng-review Toolforge tool - https://phabricator.wikimedia.org/T405643#11228087 (taavi) Open→Resolved a:taavi
[09:26:34] Tool-global-search: Add filter by wiki(s) - https://phabricator.wikimedia.org/T406007 (StanProg) NEW
[09:41:02] cloud-services-team, Toolforge: Rate limiting errors should not trigger ToolforgeWebHighErrorRate - https://phabricator.wikimedia.org/T406010 (fnegri) NEW
[09:41:55] cloud-services-team (FY2025/26-Q1), Toolforge (Toolforge iteration 24): 2025-09-28 ToolforgeWebHighErrorRate: High 5xx rate on Toolforge web services - https://phabricator.wikimedia.org/T405850#11228408 (fnegri) In progress→Resolved a:fnegri I will mark this task as Resolved as the alert...
[10:27:47] cloud-services-team (FY2025/26-Q1), Toolforge (Toolforge iteration 24): [tools,nfs,infra] Address tools NFS getting stuck with processes in D state - https://phabricator.wikimedia.org/T404584#11228634 (dcaro) A note on the behavior of workers stuck on NFS, some of them might get out of it by themselves w...
[10:59:28] VPS-project-Phabricator, collaboration-services: Phabricator test project requires email verification but can't send email - https://phabricator.wikimedia.org/T388022#11228760 (A_smart_kitten) Random thought: if fixing the email-sending issue is proving to be a challenge, maybe we could mitigate this by...
[11:38:17] cloud-services-team, Toolforge (Toolforge iteration 24): [builds-api] does not correctly resolve `ref` - builds random things - https://phabricator.wikimedia.org/T405829#11228956 (DamianZaremba) > we should also be checking for an empty string as the sign for the ref not existing That is essentially wha...
[12:02:11] (approved) fnegri: build: return error when the given ref is not resolvable [repos/cloud/toolforge/builds-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/145 (owner: dcaro)
[13:23:42] wikitech.wikimedia.org, serviceops-radar, SRE, Patch-For-Review, SRE-Unowned: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#11229375 (hashar) //I am pasting comments I have made on a Slack thread:// We had [[ https://www.mediawiki.org/wiki/Extension:DumpHTML | Extensio...
[14:00:26] (open) ahecht: Merge from main branch [toolforge-repos/afdstats] (testing) - https://gitlab.wikimedia.org/toolforge-repos/afdstats/-/merge_requests/4
[14:06:25] (update) ahecht: Draft: Cache database queries [toolforge-repos/afdstats] - https://gitlab.wikimedia.org/toolforge-repos/afdstats/-/merge_requests/3
[14:08:07] (update) ahecht: Draft: Cache database queries [toolforge-repos/afdstats] - https://gitlab.wikimedia.org/toolforge-repos/afdstats/-/merge_requests/3
[16:24:45] (update) dcaro: [jobs-api] save business models in a DB [repos/cloud/toolforge/jobs-api] (save_business_models_to_db) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/114 (https://phabricator.wikimedia.org/T359650) (owner: raymond-ndibe)
[16:25:30] (update) dcaro: [jobs-api] save business models in a DB [repos/cloud/toolforge/jobs-api] (save_business_models_to_db) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/114 (owner: raymond-ndibe)
[16:26:27] VPS-project-Phabricator, collaboration-services: Phabricator test project requires email verification but can't send email - https://phabricator.wikimedia.org/T388022#11230175 (Dzahn) > a task to request account activation is a bit of a hurdle to get into the test instance [and requires SRE time To be...
[16:26:43] cloud-services-team, Cloud-VPS: unable to "apt install helmfile" on CloudVPS debian 13 vm - https://phabricator.wikimedia.org/T405970#11230179 (Andrew) ` root@apt1002:~# reprepro ls helm helm | 2.17.0-1 | buster-wikimedia | amd64, source helm | 3.17.2-1 | bookworm-wikimedia | amd64 helm | 3.18.6-1 | bo...
[16:28:08] (update) dcaro: [jobs-api] save business models in a DB [repos/cloud/toolforge/jobs-api] (save_business_models_to_db) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/114 (owner: raymond-ndibe)
[16:31:22] (update) dcaro: [jobs-api] save business models in a DB [repos/cloud/toolforge/jobs-api] (save_business_models_to_db) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/114 (owner: raymond-ndibe)
[16:31:50] cloud-services-team, Cloud-VPS: unable to "apt install helmfile" on CloudVPS debian 13 vm - https://phabricator.wikimedia.org/T405970#11230205 (Andrew) Well, that mystery aside, the actual issue is resolved with ` root@apt1002:~# reprepro copy trixie-wikimedia bookworm-wikimedia helm3 Exporting indic...
[16:32:59] VPS-project-Phabricator, collaboration-services: Phabricator test project requires email verification but can't send email - https://phabricator.wikimedia.org/T388022#11230217 (Pppery) There are two separate issues being conflated here: - Issue 1: Should some human have to manually approve accounts on...
[16:41:47] (update) dcaro: [jobs-api] save business models in a DB [repos/cloud/toolforge/jobs-api] (save_business_models_to_db) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/114 (owner: raymond-ndibe)
[16:45:07] (update) dcaro: [jobs-api] save business models in a DB [repos/cloud/toolforge/jobs-api] (save_business_models_to_db) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/114 (owner: raymond-ndibe)
[16:55:57] (update) dcaro: [jobs-api] save business models in a DB [repos/cloud/toolforge/jobs-api] (save_business_models_to_db) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/114 (owner: raymond-ndibe)
[17:17:34] (update) dcaro: [jobs-api] save business models in a DB [repos/cloud/toolforge/jobs-api] (save_business_models_to_db) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/114 (owner: raymond-ndibe)
[17:36:19] VPS-project-Phabricator, collaboration-services: Phabricator test project requires email verification but can't send email - https://phabricator.wikimedia.org/T388022#11230514 (Dzahn) The production instance is connected to other systems handling the user sign-up. MediaWiki/SUL and LDAP/developer account...
[19:23:54] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1063.eqiad.wmnet}'
[19:45:00] PROBLEM - Host cloudvirt1063 is DOWN: PING CRITICAL - Packet loss = 100%
[19:46:35] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1063.eqiad.wmnet}'
[19:46:36] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1064.eqiad.wmnet}'
[19:46:40] RECOVERY - Host cloudvirt1063 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[20:05:10] PROBLEM - Host cloudvirt1064 is DOWN: PING CRITICAL - Packet loss = 100%
[20:05:51] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1064.eqiad.wmnet}'
[20:05:52] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1065.eqiad.wmnet}'
[20:06:08] RECOVERY - Host cloudvirt1064 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[20:25:33] PROBLEM - Host cloudvirt1065 is DOWN: PING CRITICAL - Packet loss = 100%
[20:26:59] VPS-project-Phabricator, collaboration-services: Phabricator test project requires email verification but can't send email - https://phabricator.wikimedia.org/T388022#11231111 (A_smart_kitten) >>! In T388022#11230217, @Pppery wrote: > There are two separate issues being conflated here: > > - Issue 1: S...
[20:27:06] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1065.eqiad.wmnet}'
[20:27:08] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1066.eqiad.wmnet}'
[20:27:10] RECOVERY - Host cloudvirt1065 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[20:31:57] VPS-project-Phabricator, collaboration-services: Phabricator test project requires email verification but can't send email - https://phabricator.wikimedia.org/T388022#11231131 (A_smart_kitten) >>! In T388022#11230175, @Dzahn wrote: > To be honest the SRE time required to manually activate some phab users...
[20:42:40] PROBLEM - Host cloudvirt1066 is DOWN: PING CRITICAL - Packet loss = 100%
[20:43:58] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1066.eqiad.wmnet}'
[20:43:59] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1067.eqiad.wmnet}'
[20:44:08] RECOVERY - Host cloudvirt1066 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[20:45:49] FIRING: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1066 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[20:50:49] RESOLVED: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1066 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[20:59:47] VPS-project-Phabricator, collaboration-services, Release-Engineering-Team (Radar): 'Fulltext' searches fail on test Phab instance due to ElasticSearch default config (PhutilAggregateException: All Fulltext Search hosts failed / CURLE_COULDNT_CONNECT) - https://phabricator.wikimedia.org/T403948#11231245 (...
[21:00:29] VPS-project-Phabricator, collaboration-services, Release-Engineering-Team (Radar): 'Fulltext' searches fail on test Phab instance due to ElasticSearch default config (PhutilAggregateException: All Fulltext Search hosts failed / CURLE_COULDNT_CONNECT) - https://phabricator.wikimedia.org/T403948#11231246 (...
[21:01:06] VPS-project-Phabricator, collaboration-services, Release-Engineering-Team (Radar): 'Fulltext' searches fail on test Phab instance due to ElasticSearch default config (PhutilAggregateException: All Fulltext Search hosts failed / CURLE_COULDNT_CONNECT) - https://phabricator.wikimedia.org/T403948#11231249 (...
[21:03:34] PROBLEM - Host cloudvirt1067 is DOWN: PING CRITICAL - Packet loss = 100%
[21:04:41] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1067.eqiad.wmnet}'
[21:04:42] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1068.eqiad.wmnet}'
[21:05:10] RECOVERY - Host cloudvirt1067 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms
[21:05:50] FIRING: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1067 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[21:10:49] RESOLVED: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1067 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[21:17:28] VPS-project-Phabricator, collaboration-services: Phabricator test project requires email verification but can't send email - https://phabricator.wikimedia.org/T388022#11231303 (Dzahn) I don't mind if you do that. That would give us an idea how common it is (or will become). Taking a step back though: I...
[21:20:00] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1068.eqiad.wmnet}'
[21:20:01] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1069.eqiad.wmnet}'
[21:20:06] VPS-project-Phabricator, collaboration-services: Phabricator test project requires email verification but can't send email - https://phabricator.wikimedia.org/T388022#11231310 (Dzahn) Seems like this is only in the phabricator config itself after all.. or something was fixed meanwhile since this ticket w...
[21:23:19] FIRING: [2x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1067 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[21:25:34] RESOLVED: [2x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1067 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[21:39:40] PROBLEM - Host cloudvirt1069 is DOWN: PING CRITICAL - Packet loss = 100%
[21:40:48] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1069.eqiad.wmnet}'
[21:40:50] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1070.eqiad.wmnet}'
[21:41:08] RECOVERY - Host cloudvirt1069 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[21:42:04] FIRING: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1069 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[21:43:19] FIRING: [2x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1068 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[21:44:00] VPS-project-Phabricator, collaboration-services: Phabricator test project requires email verification but can't send email - https://phabricator.wikimedia.org/T388022#11231344 (Pppery) It hasn't been fixed. I asked phabricator.wmcloud.org to resend me a verification email and it didn't arrive.
[21:47:04] RESOLVED: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1069 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[21:57:54] PROBLEM - Host cloudvirt1070 is DOWN: PING CRITICAL - Packet loss = 100%
[21:58:46] RECOVERY - Host cloudvirt1070 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[21:59:00] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1070.eqiad.wmnet}'
[21:59:02] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1071.eqiad.wmnet}'
[22:01:42] PROBLEM - Host cloudvirt1071 is DOWN: PING CRITICAL - Packet loss = 100%
[22:02:56] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1071.eqiad.wmnet}'
[22:02:58] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1072.eqiad.wmnet}'
[22:03:10] RECOVERY - Host cloudvirt1071 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms
[22:19:41] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt2006-dev.eqiad.wmnet}'
[22:20:34] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) on hosts matched by 'D{cloudvirt2006-dev.eqiad.wmnet}'
[22:20:56] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt2006-dev.codfw.wmnet}'
[22:21:58] PROBLEM - Host cloudvirt1072 is DOWN: PING CRITICAL - Packet loss = 100%
[22:22:28] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1072.eqiad.wmnet}'
[22:22:29] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1073.eqiad.wmnet}'
[22:23:08] RECOVERY - Host cloudvirt1072 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[22:25:40] PROBLEM - Host cloudvirt1073 is DOWN: PING CRITICAL - Packet loss = 100%
[22:26:08] RECOVERY - Host cloudvirt1073 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[22:26:08] PROBLEM - ensure kvm processes are running on cloudvirt1073 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[22:26:15] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1073.eqiad.wmnet}'
[22:26:16] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1074.eqiad.wmnet}'
[22:27:08] RECOVERY - ensure kvm processes are running on cloudvirt1073 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[22:29:48] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt2006-dev.codfw.wmnet}'
[22:29:53] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt2005-dev.codfw.wmnet}'
[22:41:16] PROBLEM - Host cloudvirt1074 is DOWN: PING CRITICAL - Packet loss = 100%
[22:43:28] RECOVERY - Host cloudvirt1074 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms
[22:43:34] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1074.eqiad.wmnet}'
[22:43:35] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1075.eqiad.wmnet}'
[22:45:49] FIRING: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1074 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[22:50:49] RESOLVED: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1074 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[22:54:40] andrew@cloudcumin1001 safe_reboot (PID 485336) is awaiting input
[23:00:38] PROBLEM - Host cloudvirt1075 is DOWN: PING CRITICAL - Packet loss = 100%
[23:00:40] Tool-nlwikibots: Dutch AfD reporter (tbpmelder) sometimes uses incorrect anchor tag - https://phabricator.wikimedia.org/T406081 (FrankGeerlings) NEW
[23:01:08] RECOVERY - Host cloudvirt1075 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[23:01:08] PROBLEM - ensure kvm processes are running on cloudvirt1075 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[23:01:15] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1075.eqiad.wmnet}'
[23:01:16] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1076.eqiad.wmnet}'
[23:01:49] FIRING: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1075 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[23:02:08] RECOVERY - ensure kvm processes are running on cloudvirt1075 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[23:03:37] Tool-nlwikibots: Dutch AfD reporter (tbpmelder) sometimes uses incorrect anchor tag - https://phabricator.wikimedia.org/T406081#11231518 (FrankGeerlings) a:FrankGeerlings
[23:06:49] RESOLVED: [2x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1074 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[23:10:14] Tool-nlwikibots: Dutch AfD reporter (tbpmelder) sometimes uses incorrect anchor tag - https://phabricator.wikimedia.org/T406081#11231526 (FrankGeerlings) p:Triage→Low This could be a duplicate of T224622 but that needs to be investigated.
[23:19:38] PROBLEM - Host cloudvirt1076 is DOWN: PING CRITICAL - Packet loss = 100%
[23:20:29] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=0) on hosts matched by 'D{cloudvirt1076.eqiad.wmnet}'
[23:20:40] RECOVERY - Host cloudvirt1076 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[23:21:49] FIRING: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1076 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[23:26:49] RESOLVED: NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudvirt1076 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown
[23:52:36] andrew@cloudcumin1001 safe_reboot (PID 485336) is awaiting input