[03:58:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-2 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[04:48:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-2 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesse
[07:38:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-2 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesse
[09:26:54] Cloud-VPS (Project-requests): Request creation of antivandalismai VPS project - https://phabricator.wikimedia.org/T388151#10666730 (dcaro) Hi! Before creating the project, a couple notes/questions: * Are you going to use already existing models? If so, note that many are not open source, so they will not b...
[09:32:03] Cloud-VPS (Project-requests): Request creation of futureaudiences VPS project - https://phabricator.wikimedia.org/T389158#10666762 (dcaro) >>! In T389158#10650049, @derenrich wrote: > Yes I had read that before making this application but it sounded those were not necessarily hard rules. > > Given FA's vel...
[09:37:26] Cloud Services Proposals, cloud-services-team, Toolforge: Build a dev/testing environment for webservice that would make it easier to get people involved in fixes - https://phabricator.wikimedia.org/T220053#10666814 (dcaro) Open→Resolved a:dcaro I think we can close this, as it's most...
[09:40:09] cloud-services-team, Cloud-VPS: So many wmcs eqiad alerts - https://phabricator.wikimedia.org/T389672#10666837 (aborrero) Open→Resolved
[09:40:26] cloud-services-team, Toolforge: WMCS FY22/23 Q3: next steps in grid engine deprecation - https://phabricator.wikimedia.org/T327254#10666839 (dcaro) It's been a couple years since the Q3 of 2023/24, keeping the leftover children task open in case we want to still pick it up at some point.
[09:59:29] cloud-services-team, Toolforge: dev.toolforge.org unreachable - https://phabricator.wikimedia.org/T389717#10666899 (dcaro) As far as I can see, there's been some flakiness with LDAP (this are the `sssd_wikimedia.org.log*` files): ` root@tools-bastion-12:~# grep --no-filename "Can't contact LDAP server" /...
[09:59:37] !log dcaro@acme tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-72
[09:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:05:23] !log dcaro@acme tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-72
[10:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:52:33] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-72 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[10:54:36] Tool-refill: Stuck on "Waiting for an available worker" - https://phabricator.wikimedia.org/T389775 (TheresNoTime) NEW
[10:55:14] Tool-refill: Stuck on "Waiting for an available worker" - https://phabricator.wikimedia.org/T389775#10667177 (TheresNoTime) [What I've looked at so far] Starting on `[samtar@tools-bastion-13 ~ (main|u=)]$ become refill-api` ` tools.refill-api@tools-bastion-13:~$ kubectl get pods NAME...
[10:59:49] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan+apply for main branch
[11:00:34] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan+apply for main branch
[11:00:42] !log aborrero@cloudcumin1001 admin START - Cookbook wmcs.openstack.tofu running tofu plan for main branch
[11:01:17] !log aborrero@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.tofu (exit_code=0) running tofu plan for main branch
[11:54:13] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Patch-For-Review: tofu-infra: refactor repo structure - https://phabricator.wikimedia.org/T375283#10667494 (aborrero) Stalled→Resolved
[12:06:51] Tool-refill: Stuck on "Waiting for an available worker" - https://phabricator.wikimedia.org/T389775#10667526 (dcaro) Hey @TheresNoTime! We restricted lately the setting of the nfs mounts to only allow the controller we have to do it itself, so I see that you are manually deploying a custom deployment k8s ob...
[12:21:40] Tool-refill: Stuck on "Waiting for an available worker" - https://phabricator.wikimedia.org/T389775#10667609 (TheresNoTime) >>! In T389775#10667526, @dcaro wrote: > Hey @TheresNoTime! > > We restricted lately the setting of the nfs mounts to only allow the controller we have to do it itself, so I see that y...
[12:26:48] FIRING: PuppetFailure: Puppet has failed on cloudcontrol1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:27:00] cloud-services-team: PuppetFailure Puppet has failed on cloudcontrol1005:9100 - https://phabricator.wikimedia.org/T389793 (phaultfinder) NEW
[12:38:41] (PS2) Chuckonwumelu: Add Chuck key [labs/private] - https://gerrit.wikimedia.org/r/1129595
[12:39:38] (CR) Chuckonwumelu: "Expanded on commit message" [labs/private] - https://gerrit.wikimedia.org/r/1129595 (owner: Chuckonwumelu)
[13:12:36] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge, Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned: 2025-03-15 Tools NFS hiccup - https://phabricator.wikimedia.org/T388965#10667843 (dcaro) Open→Resolved >>! In T388965#10656172, @taavi wrote: > Anything left to do here?...
[13:14:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-2 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[13:14:05] Toolforge (Toolforge iteration 19), Patch-For-Review: [jobs-api] Save business models in a DB - https://phabricator.wikimedia.org/T359650#10667847 (dcaro)
[13:14:07] Toolforge (Toolforge iteration 19), Patch-For-Review: [jobs-api] Split the API, business, and k8s models - https://phabricator.wikimedia.org/T359808#10667848 (dcaro)
[13:14:14] cloud-services-team, Toolforge, Patch-For-Review: [jobs-api] Refactor before webservice support - https://phabricator.wikimedia.org/T359804#10667849 (dcaro)
[13:14:47] cloud-services-team, Toolforge: [jobs-api] separate jobs-framework k8s object templates from code - https://phabricator.wikimedia.org/T358815#10667853 (dcaro)
[13:14:56] Toolforge (Toolforge iteration 19), Patch-For-Review: [jobs-api] Split the API, business, and k8s models - https://phabricator.wikimedia.org/T359808#10667854 (dcaro)
[13:15:05] cloud-services-team, Toolforge, Patch-For-Review: [jobs-api] Refactor before webservice support - https://phabricator.wikimedia.org/T359804#10667855 (dcaro)
[13:15:42] Toolforge (Toolforge iteration 19), Patch-For-Review: [jobs-api] Split the API, business, and k8s models - https://phabricator.wikimedia.org/T359808#10667860 (dcaro) It's ongoing and on the top of the list
[13:24:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-2 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[13:26:52] cloud-services-team, Toolforge: Analyze Toolforge and Toolsbeta for Virtual Resource Usage - https://phabricator.wikimedia.org/T389081#10667915 (Chuckonwumelu)
[13:32:25] Tool-refill: Stuck on "Waiting for an available worker" - https://phabricator.wikimedia.org/T389775#10667951 (Curb_Safe_Charmer) Open→Resolved a:Curb_Safe_Charmer I can confirm that is working now, thanks TNT. Anything I can use in my procedure if / when it reoccurs?
[13:33:05] (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM." [labs/private] - https://gerrit.wikimedia.org/r/1129595 (owner: Chuckonwumelu)
[13:34:04] (CR) Arturo Borrero Gonzalez: [C:+2] Add Chuck key [labs/private] - https://gerrit.wikimedia.org/r/1129595 (owner: Chuckonwumelu)
[13:35:09] (CR) Arturo Borrero Gonzalez: [V:+2 C:+2] Add Chuck key [labs/private] - https://gerrit.wikimedia.org/r/1129595 (owner: Chuckonwumelu)
[13:47:30] cloud-services-team: Onboard Chuck Onwumelu - https://phabricator.wikimedia.org/T386715#10668073 (aborrero)
[13:48:18] RESOLVED: PuppetFailure: Puppet has failed on cloudcontrol1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:57:37] cloud-services-team: Onboard Chuck Onwumelu - https://phabricator.wikimedia.org/T386715#10668145 (aborrero)
[13:59:15] cloud-services-team: Onboard Chuck Onwumelu - https://phabricator.wikimedia.org/T386715#10668160 (Chuckonwumelu)
[14:00:52] cloud-services-team: Onboard Chuck Onwumelu - https://phabricator.wikimedia.org/T386715#10668172 (aborrero)
[14:28:03] Tool-refill: Stuck on "Waiting for an available worker" - https://phabricator.wikimedia.org/T389775#10668441 (dcaro) The best way of avoiding this issues is to use the toolforge clis/apis instead of k8s directly, as we will maintain the backwards compatibility for those (we change it, but always announce...
[14:31:15] cloud-services-team, Toolforge: [builds-api,builds-service,builds-cli] toolforge build --envvar does not accept values containing equals character - https://phabricator.wikimedia.org/T389694#10668455 (dcaro)
[14:31:32] cloud-services-team, Toolforge: [builds-api,builds-service,builds-cli] toolforge build --envvar does not accept values containing equals character - https://phabricator.wikimedia.org/T389694#10668458 (dcaro) p:Triage→Medium
[15:10:10] Tools, Wikidata, Security: Blocked Wikidata user sockpuppets are doing automated misconduct with QuickStatements - https://phabricator.wikimedia.org/T386978#10668792 (Wustenspringmaus) As admins on WD, we still can't stop the batches.
[15:35:39] cloud-services-team: Onboard Chuck Onwumelu - https://phabricator.wikimedia.org/T386715#10669027 (Chuckonwumelu)
[15:36:59] cloud-services-team: Onboard Chuck Onwumelu - https://phabricator.wikimedia.org/T386715#10669041 (Chuckonwumelu)
[15:37:36] cloud-services-team: Onboard Chuck Onwumelu - https://phabricator.wikimedia.org/T386715#10669056 (Chuckonwumelu)
[17:08:34] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[17:13:19] !log raymond-ndibe@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-api
[17:17:19] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[17:18:00] Cloud Services Proposals, cloud-services-team, Cloud-VPS: Decision Request - How openstack projects relate to tofu-infra - https://phabricator.wikimedia.org/T385604#10669992 (dcaro) It seems this is waiting for my vote, so though I still consider the pros and cons of both options incomplete, **I lea...
[17:27:08] cloud-services-team, Toolforge (Toolforge iteration 19), Patch-For-Review: [components-api] Add "runs" section to the deployment structure - https://phabricator.wikimedia.org/T389339#10670105 (dcaro) Open→In progress
[17:27:14] Toolforge (Toolforge iteration 19): [components-api] restrict running deplpoyments to 1 - https://phabricator.wikimedia.org/T388643#10670108 (dcaro) a:dcaro
[17:27:18] Toolforge (Toolforge iteration 19): [components-api] restrict running deplpoyments to 1 - https://phabricator.wikimedia.org/T388643#10670110 (dcaro) Open→In progress
[17:28:20] !log raymond-ndibe@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api
[17:28:47] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[17:33:25] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 19): Intermittent redis connection timeouts in Toolforge - https://phabricator.wikimedia.org/T318479#10670161 (fnegri) This one is just "stalled" but not blocked. I'm a bit out of ideas on how to debug further. I've just rechecked t...
[17:35:05] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-api
[17:35:24] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[17:40:26] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-api
[17:45:18] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[17:57:50] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api
[17:59:39] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component builds-builder
[18:10:56] !log raymond-ndibe@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component builds-builder
[18:11:25] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component builds-builder
[18:16:32] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component builds-builder
[18:19:32] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component builds-builder
[18:24:16] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component builds-builder
[18:40:14] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component builds-builder
[18:52:14] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component builds-builder
[18:57:24] Toolforge (Toolforge iteration 19): [builds-api] Limit the amount of running builds - https://phabricator.wikimedia.org/T388706#10670853 (dcaro) a:dcaro
[18:57:27] Toolforge (Toolforge iteration 19): [builds-api] Limit the amount of running builds - https://phabricator.wikimedia.org/T388706#10670855 (dcaro) Open→In progress
[21:09:26] Tool-paulina: Filter: is author in public domain - https://phabricator.wikimedia.org/T388575#10671473 (Piracalamina) Usage of property P7763 https://w.wiki/DZWA
[21:11:39] Tool-paulina: Filter: is author in public domain - https://phabricator.wikimedia.org/T388575#10671480 (Piracalamina)
[21:55:16] Tool-ldap: Display WMF cluster permissions granted to a Developer account - https://phabricator.wikimedia.org/T389885 (bd808) NEW
[22:00:52] Tool-ldap: Display WMF cluster permissions granted to a Developer account - https://phabricator.wikimedia.org/T389885#10671667 (bd808)
[22:03:49] Tool-ldap: Display WMF cluster permissions granted to a Developer account - https://phabricator.wikimedia.org/T389885#10671668 (bd808) https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/admin/data/matrix.py might be inspiration for how to work with the data.yaml...
[22:18:39] cloud-services-team, Cloud-VPS, LDAP: novaadmin LDAP user is a member of nonexistent LDAP groups - https://phabricator.wikimedia.org/T378847#10671719 (bd808) The wikinewsie Cloud VPS project was deleted on or around 2023-05-22: * {T281600} * https://wikitech.wikimedia.org/w/index.php?title=News/2022_...