[00:08:52] cloud-services-team, TaxonBot, Trust-and-Safety, LDAP: Reset developer account password and email address for "taxonbot" user - https://phabricator.wikimedia.org/T398220#11076292 (doctaxon)
[00:34:50] cloud-services-team, TaxonBot, Trust-and-Safety, LDAP: Reset developer account password and email address for "taxonbot" user - https://phabricator.wikimedia.org/T398220#11076324 (bd808) >>! In T398220#11076282, @doctaxon wrote: > @bd808 : The new email address is dr.taxon[at]gmail.com Perfect,...
[00:35:21] cloud-services-team, TaxonBot, Trust-and-Safety, LDAP: Reset email address for "taxonbot" Developer account - https://phabricator.wikimedia.org/T398220#11076325 (bd808)
[01:01:04] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-103 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[01:08:08] cloud-services-team, TaxonBot, Trust-and-Safety, LDAP: Reset email address for "taxonbot" Developer account - https://phabricator.wikimedia.org/T398220#11076349 (doctaxon) @bd808 : Let's try the second confirming method. These stewards know me in real life and can verify my account: @Johannnes89...
[01:17:10] cloud-services-team, TaxonBot, Trust-and-Safety, LDAP: Reset email address for "taxonbot" Developer account - https://phabricator.wikimedia.org/T398220#11076363 (doctaxon) I'm not sure but I think I didn't lose the ssh key, I only lost my password.
[01:20:35] (open) raymond-ndibe: [cli] add tool config to deployment object [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/58 (https://phabricator.wikimedia.org/T400064)
[01:35:11] Toolforge (Toolforge iteration 23): [components-api] exclude defaults when getting deployment - https://phabricator.wikimedia.org/T401648 (Raymond_Ndibe) NEW
[01:39:00] cloud-services-team, TaxonBot, Trust-and-Safety, LDAP: Reset email address for "taxonbot" Developer account - https://phabricator.wikimedia.org/T398220#11076402 (doctaxon) @bd808 : I tried the first verification method and got the verification file output: `tools-bastion-12.tools.eqiad1.wikimedi...
[01:46:04] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-103 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[01:46:44] (open) raymond-ndibe: [tool_router.py] exclude default values from deployment [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/117 (https://phabricator.wikimedia.org/T401648)
[01:47:23] (update) raymond-ndibe: [cli] add tool config to deployment object [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/58 (https://phabricator.wikimedia.org/T400064)
[02:07:39] cloud-services-team, Toolforge (Toolforge iteration 23): [jobs-api] Periodically refresh image-config data - https://phabricator.wikimedia.org/T357112#11076441 (Raymond_Ndibe) In progress→Resolved
[02:32:50] Toolforge (Toolforge iteration 23), Patch-For-Review: 
[components-api] exclude defaults when getting deployment - https://phabricator.wikimedia.org/T401648#11076450 (Raymond_Ndibe) Open→In progress
[02:34:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-103 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[02:35:01] Toolforge (Toolforge iteration 23): [toolforge-deploy] detect deployed cli versions in lima-kilo - https://phabricator.wikimedia.org/T401649 (Raymond_Ndibe) NEW
[02:35:11] Toolforge (Toolforge iteration 23): [toolforge-deploy] detect deployed cli versions in lima-kilo - https://phabricator.wikimedia.org/T401649#11076463 (Raymond_Ndibe) a:Raymond_Ndibe→None
[02:35:58] Toolforge (Toolforge iteration 23): [toolforge-deploy] detect deployed cli versions in lima-kilo - https://phabricator.wikimedia.org/T401649#11076464 (Raymond_Ndibe)
[03:16:51] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component builds-cli
[03:18:51] !log raymond-ndibe@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component builds-cli
[03:19:24] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component builds-cli
[03:19:39] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component envvars-cli
[03:20:04] !log raymond-ndibe@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component envvars-cli
[03:21:15] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component builds-cli
[03:24:02] (approved) 
raymond-ndibe: d/changelog: bump to 0.0.23 [repos/cloud/toolforge/builds-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/117
[03:24:03] (update) raymond-ndibe: d/changelog: bump to 0.0.23 [repos/cloud/toolforge/builds-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/117
[03:24:12] (merge) raymond-ndibe: d/changelog: bump to 0.0.23 [repos/cloud/toolforge/builds-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/117
[03:24:38] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component toolforge-cli
[03:24:58] !log raymond-ndibe@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component toolforge-cli
[03:30:35] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component toolforge-cli
[03:34:23] (update) raymond-ndibe: d/changelog: bump to 0.0.14 [repos/cloud/toolforge/envvars-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/86 (https://phabricator.wikimedia.org/T363544 https://phabricator.wikimedia.org/T400616)
[03:42:54] !log raymond-ndibe@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component toolforge-cli
[03:44:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-103 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[03:44:51] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component toolforge-cli
[03:45:04] !log raymond-ndibe@cloudcumin1001 
toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component envvars-cli
[03:45:08] !log raymond-ndibe@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component envvars-cli
[03:45:33] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-103 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[03:48:38] (update) raymond-ndibe: d/changelog: bump to 0.0.14 [repos/cloud/toolforge/envvars-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/86 (https://phabricator.wikimedia.org/T363544 https://phabricator.wikimedia.org/T400616)
[03:50:33] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-103 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[03:51:33] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-103 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[03:56:02] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component toolforge-cli
[03:56:18] !log 
raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component envvars-cli
[03:56:23] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component envvars-cli
[03:56:33] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-103 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[03:56:35] (approved) raymond-ndibe: d/changelog: bump to 0.0.14 [repos/cloud/toolforge/envvars-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/86 (https://phabricator.wikimedia.org/T363544 https://phabricator.wikimedia.org/T400616)
[03:56:41] (merge) raymond-ndibe: d/changelog: bump to 0.0.14 [repos/cloud/toolforge/envvars-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/86 (https://phabricator.wikimedia.org/T363544 https://phabricator.wikimedia.org/T400616)
[03:56:54] (update) raymond-ndibe: d/changelog: bump to 0.0.14 [repos/cloud/toolforge/envvars-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/86 (https://phabricator.wikimedia.org/T363544 https://phabricator.wikimedia.org/T400616)
[03:58:48] (update) raymond-ndibe: d/changelog: bump to 0.3.7 [repos/cloud/toolforge/toolforge-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-cli/-/merge_requests/50
[03:58:49] (approved) raymond-ndibe: d/changelog: bump to 0.3.7 [repos/cloud/toolforge/toolforge-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-cli/-/merge_requests/50
[03:58:52] (merge) raymond-ndibe: d/changelog: bump to 0.3.7 
[repos/cloud/toolforge/toolforge-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-cli/-/merge_requests/50
[05:21:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-103 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[07:14:46] (CR) Majavah: Add Trixie images (2 comments) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/1177345 (https://phabricator.wikimedia.org/T400255) (owner: Majavah)
[07:16:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-103 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[07:21:44] cloud-services-team, Toolforge: [toolforge-deploy] detect deployed cli versions in lima-kilo - https://phabricator.wikimedia.org/T401649#11076826 (taavi)
[08:07:29] (update) dcaro: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136) (owner: raymond-ndibe)
[08:18:44] (update) dcaro: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136) 
(owner: raymond-ndibe)
[08:21:23] cloud-services-team, Toolforge: [toolforge-deploy] detect deployed cli versions in lima-kilo - https://phabricator.wikimedia.org/T401649#11077000 (dcaro) This still works for me: ` 10:20 AM /home/dcaro/Work/wikimedia/lima-kilo (fix_webservice_tools_cli_components|✔) dcaro@toolslocal$ toolforge_get_versio...
[08:22:04] cloud-services-team, Toolforge: [toolforge-deploy] detect deployed cli versions in lima-kilo - https://phabricator.wikimedia.org/T401649#11077002 (dcaro) ` dcaro@toolslocal$ git log -1 commit 01ccee51d7438dfda7d53bc2a0f90c351bf98eba (HEAD -> main, origin/main, origin/fix_webservice_tools_cli_components,...
[08:26:14] cloud-services-team, Toolforge (Toolforge iteration 19), Patch-For-Review: [envvars-cli] Add option to not show envvar values when listing - https://phabricator.wikimedia.org/T363544#11077014 (dcaro) Envvars list now shows asterisks: ` local.tf-test@toolslocal:~$ toolforge envvars list name va...
[08:29:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-103 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[08:29:09] (update) dcaro: toolforge: create an 'admin' tool account, with a fake human user [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/243 (https://phabricator.wikimedia.org/T394786) (owner: aborrero)
[08:32:48] cloud-services-team, Cloud-VPS (Project-requests): Trove for cluebotng-review? - https://phabricator.wikimedia.org/T401347#11077066 (fnegri) > I can't take credit, I just subscribed as it's a ClueBot issue... 
the thanks goes to @DamianZaremba @RichSmith My fault, I don't know how I failed to read the `A...
[08:45:17] Toolforge (Toolforge iteration 23), Patch-For-Review: [components-api] exclude defaults when getting deployment - https://phabricator.wikimedia.org/T401648#11077105 (dcaro) I'm not sure if we should be excluding defaults, or just the ones that are `None`, for example `status`, `force_build` and `force_ru...
[09:09:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-103 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[09:14:33] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-103 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[09:19:56] Cloud-VPS (Project-requests): Request creation of VPS project - https://phabricator.wikimedia.org/T401619#11077221 (Aklapper) > Got some errors Please see https://wikitech.wikimedia.org/wiki/Help:Cloud_Services_communication how to get help with errors
[09:50:35] !log dcaro@acme toolsbeta START - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers for
[09:50:35] !log dcaro@acme toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers (exit_code=0) for
[09:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[09:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[09:52:27] !log 
dcaro@acme toolsbeta START - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers for
[09:52:27] !log dcaro@acme toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers (exit_code=0) for
[09:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[09:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[09:52:41] !log dcaro@acme tools START - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers for tools-k8s-worker-103, tools-k8s-worker-nfs-36
[09:52:41] !log dcaro@acme tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers (exit_code=0) for tools-k8s-worker-103, tools-k8s-worker-nfs-36
[09:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:53:28] !log dcaro@acme tools START - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers for tools-k8s-worker-103, tools-k8s-worker-nfs-36
[09:53:29] !log dcaro@acme tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers (exit_code=0) for tools-k8s-worker-103, tools-k8s-worker-nfs-36
[09:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:53:33] !log dcaro@acme toolsbeta START - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers , no stuck workers found
[09:53:34] !log dcaro@acme toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers (exit_code=0) , no stuck workers found
[09:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[09:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[09:54:03] !log dcaro@acme tools START - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers for tools-k8s-worker-103, tools-k8s-worker-nfs-36
[09:54:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:58:43] (PS1) David Caro: reboot_stuck_workers: add net cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1177960
[10:00:47] !log dcaro@acme tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers (exit_code=0) for tools-k8s-worker-103, tools-k8s-worker-nfs-36
[10:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:02:20] (CR) CI reject: [V:-1] reboot_stuck_workers: add net cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1177960 (owner: David Caro)
[10:02:43] (PS2) David Caro: reboot_stuck_workers: add net cookbook [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1177960
[10:13:36] FIRING: Toolforge Kyverno low policy resources: Toolforge Kyverno has low amount of policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_low_policy_resources - https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+low+policy+resources
[10:17:52] (open) taavi: Fix alerts using grafana-rw instead of grafana [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/33
[10:18:09] cloud-services-team, Toolforge (Toolforge iteration 23): [kyverno] Upgrade to `3.3.9` chart (`1.13` app) for k8s 1.30 support - https://phabricator.wikimedia.org/T394787#11077403 (dcaro) First try to upgrade on tools failed, error message: ` root@tools-k8s-control-9:~/toolforge-deploy# ./deploy.sh kyvern...
[10:27:56] Cloud-VPS (Project-requests): Request creation of VPS project - https://phabricator.wikimedia.org/T401619#11077421 (AlvinDulle) Thanks @Aklapper
[10:44:35] (approved) dcaro: Fix alerts using grafana-rw instead of grafana [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/33 (owner: taavi)
[10:44:43] (approved) fnegri: Fix alerts using grafana-rw instead of grafana [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/33 (owner: taavi)
[10:44:58] (merge) taavi: Fix alerts using grafana-rw instead of grafana [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/33
[10:59:33] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-36 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[11:03:11] cloud-services-team, Toolforge (Toolforge iteration 23): [kyverno] Upgrade to `3.3.9` chart (`1.13` app) for k8s 1.30 support - https://phabricator.wikimedia.org/T394787#11077525 (dcaro) All tests and policies are correctly in place, the issue seems to have affected the migration to the newer CRD version...
[11:04:32] (update) fnegri: Create .deb package [repos/cloud/wikireplicas-utils] - https://gitlab.wikimedia.org/repos/cloud/wikireplicas-utils/-/merge_requests/1 (https://phabricator.wikimedia.org/T395266)
[11:04:33] (update) fnegri: Create .deb package [repos/cloud/wikireplicas-utils] - https://gitlab.wikimedia.org/repos/cloud/wikireplicas-utils/-/merge_requests/1 (https://phabricator.wikimedia.org/T395266)
[11:13:31] (update) fnegri: Create .deb package [repos/cloud/wikireplicas-utils] - https://gitlab.wikimedia.org/repos/cloud/wikireplicas-utils/-/merge_requests/1 (https://phabricator.wikimedia.org/T395266)
[11:17:41] cloud-services-team, Toolforge (Toolforge iteration 19), Patch-For-Review: [envvars-cli] Add option to not show envvar values when listing - https://phabricator.wikimedia.org/T363544#11077559 (LucasWerkmeister) Nice, thanks!
[11:47:16] (approved) taavi: Create .deb package [repos/cloud/wikireplicas-utils] - https://gitlab.wikimedia.org/repos/cloud/wikireplicas-utils/-/merge_requests/1 (https://phabricator.wikimedia.org/T395266) (owner: fnegri)
[12:13:33] cloud-services-team, TaxonBot, Trust-and-Safety, LDAP: Reset email address for "taxonbot" Developer account - https://phabricator.wikimedia.org/T398220#11077713 (DerHexer) >>! In T398220#11076348, @doctaxon wrote: > @bd808 : Let's try the second confirming method. These stewards know me in real l...
[12:23:35] cloud-services-team, Toolforge: Investigate daily disconnections of IRC bots hosted in Toolforge - https://phabricator.wikimedia.org/T400223#11077777 (fgiunchedi) Thank you @Danilo for the data! I have set up a capture of lease files on cloudnet1005 and cloudnet1006 every five minutes yesterday, today I...
[12:34:39] cloud-services-team, Toolforge (Toolforge iteration 23): [kyverno] policy countres stopped showing correctly in grafana - https://phabricator.wikimedia.org/T401681 (dcaro) NEW
[12:35:18] cloud-services-team, Toolforge (Toolforge iteration 23): [kyverno] policy countres stopped showing correctly in grafana - https://phabricator.wikimedia.org/T401681#11077867 (dcaro)
[12:37:41] cloud-services-team, Toolforge (Toolforge iteration 23): [kyverno] upgrade to 3.3.9 in tools failed leaving a half-upgraded system - https://phabricator.wikimedia.org/T401684 (dcaro) NEW
[12:39:15] cloud-services-team, Toolforge (Toolforge iteration 23): [kyverno] upgrade to 3.3.9 in tools failed leaving a half-upgraded system - https://phabricator.wikimedia.org/T401684#11077906 (dcaro)
[12:44:05] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-107 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[12:46:10] (CR) Jforrester: [C:+2] build: Updating mediawiki/mediawiki-phan-config to 0.17.0 [labs/tools/coverme] - https://gerrit.wikimedia.org/r/1176870 (owner: Libraryupgrader)
[12:50:26] cloud-services-team, Toolforge (Toolforge iteration 23): [kyverno] upgrade to 3.3.9 in tools failed leaving a half-upgraded system - https://phabricator.wikimedia.org/T401684#11077941 (dcaro) Ran the upgrade manually and it worked: ` root@tools-k8s-control-9:~# ./kyverno migrate --resource cleanuppolicie...
[12:51:48] cloud-services-team, Toolforge (Toolforge iteration 23): [kyverno] Upgrade to `3.3.9` chart (`1.13` app) for k8s 1.30 support - https://phabricator.wikimedia.org/T394787#11077947 (dcaro) I ran: ` root@tools-k8s-control-9:~# wget https://github.com/kyverno/kyverno/releases/download/v1.13.6/kyverno-cli_v1....
[12:52:49] cloud-services-team, Toolforge (Toolforge iteration 23): [kyverno] upgrade to 3.3.9 in tools failed leaving a half-upgraded system - https://phabricator.wikimedia.org/T401684#11077950 (dcaro) Missed one ` Error: UPGRADE FAILED: cannot patch "policyexceptions.kyverno.io" with kind CustomResourceDefinitio...
[12:54:38] cloud-services-team, Toolforge (Toolforge iteration 23): [kyverno] upgrade to 3.3.9 in tools failed leaving a half-upgraded system - https://phabricator.wikimedia.org/T401684#11077954 (dcaro) ` root@tools-k8s-control-9:~/toolforge-deploy# ../kyverno migrate --resource policyexceptions.kyverno.io migratin...
[12:59:09] cloud-services-team, Toolforge (Toolforge iteration 23): [kyverno] upgrade to 3.3.9 in tools failed leaving a half-upgraded system - https://phabricator.wikimedia.org/T401684#11077966 (dcaro) Redeploy timed out, just like the previous upgrade: ` ARGS: 0: helm (4 bytes) 1: upgrade (7 bytes) 2: --ins...
[13:00:52] cloud-services-team, Toolforge (Toolforge iteration 23): [kyverno] upgrade to 3.3.9 in tools failed leaving a half-upgraded system - https://phabricator.wikimedia.org/T401684#11077968 (dcaro) This seems to be the hook that timed out: https://github.com/kyverno/kyverno/blob/main/charts/kyverno/templates/h...
[13:03:37] cloud-services-team, Toolforge (Toolforge iteration 23): [kyverno] upgrade to 3.3.9 in tools failed leaving a half-upgraded system - https://phabricator.wikimedia.org/T401684#11077973 (dcaro) Might also be https://github.com/kyverno/kyverno/blob/release-1.13/charts/kyverno/templates/hooks/post-upgrade-cl...
[13:14:39] cloud-services-team, Toolforge (Toolforge iteration 23): [kyverno] upgrade to 3.3.9 in tools failed leaving a half-upgraded system - https://phabricator.wikimedia.org/T401684#11078006 (dcaro) Yep, the culprit seems to be that post-upgrade cleanup of reports, it iterates through all namespaces, checks if...
[13:41:46] cloud-services-team, Toolforge (Toolforge iteration 23): [kyverno] upgrade to 3.3.9 in tools failed leaving a half-upgraded system - https://phabricator.wikimedia.org/T401684#11078097 (dcaro) I added the value `helmDefaults.timeout: 1800` to the helmfile.yaml, and now it was able to get to it all: ` root...
[13:41:51] (update) dcaro: kyverno: upgrade to 3.3.9 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/889
[13:46:11] cloud-services-team, Toolforge (Toolforge iteration 23): [kyverno] upgrade to 3.3.9 in tools failed leaving a half-upgraded system - https://phabricator.wikimedia.org/T401684#11078107 (dcaro) Open→Resolved I'll resolve this for now, the patch is linked to the parent task.
[14:14:18] cloud-services-team, Toolforge (Toolforge iteration 23): [kyverno] policy countres stopped showing correctly in grafana - https://phabricator.wikimedia.org/T401681#11078240 (dcaro) Updated the config to not delete the namespace related labels from that metric, and they are showing up now on grafana: {F...
[14:16:42] cloud-services-team, Toolforge (Toolforge iteration 23): [kyverno] policy countres stopped showing correctly in grafana - https://phabricator.wikimedia.org/T401681#11078245 (dcaro) Hmm... the alert is summing by pod, without caring if there's any old pods: {F65741762} So it will take some time to go aw...
[14:17:43] cloud-services-team, DC-Ops, ops-eqiad, SRE: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693 (Andrew) NEW
[14:18:07] (approved) dcaro: kyverno: upgrade to 3.3.9 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/889
[14:18:11] (update) dcaro: kyverno: upgrade to 3.3.9 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/889
[14:18:29] (merge) dcaro: kyverno: upgrade to 3.3.9 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/889
[14:26:02] cloud-services-team, PAWS: Grant membership in cloud-vps project 'PAWS' to vivian rook for volunteer work - https://phabricator.wikimedia.org/T400733#11078292 (Andrew) Open→Stalled @Corvid4444 and legal are at an impasse re: NDAs so we may not be able to grant this membership. We do need to keep...
[14:26:27] (open) dcaro: kyverno: check for existance of policies, not absence [repos/cloud/toolforge/alerts] - https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/34
[14:51:37] cloud-services-team, TaxonBot, Trust-and-Safety, LDAP: Reset email address for "taxonbot" Developer account - https://phabricator.wikimedia.org/T398220#11078392 (Johannnes89) >>! In T398220#11077713, @DerHexer wrote: >>>! In T398220#11076348, @doctaxon wrote: >> @bd808 : Let's try the second conf...
[14:56:45] cloud-services-team, Cloud-VPS (Project-requests): Trove for cluebotng-review? 
- https://phabricator.wikimedia.org/T401347#11078430 (Andrew) +1 sgtm
[14:57:08] (merge) fnegri: Create .deb package [repos/cloud/wikireplicas-utils] - https://gitlab.wikimedia.org/repos/cloud/wikireplicas-utils/-/merge_requests/1 (https://phabricator.wikimedia.org/T395266)
[15:01:56] cloud-services-team, DC-Ops, ops-eqiad, SRE: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11078448 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye
[15:03:48] cloud-services-team, Toolforge: Investigate daily disconnections of IRC bots hosted in Toolforge - https://phabricator.wikimedia.org/T400223#11078450 (fgiunchedi) I have started a tcpdump to capture dhcp traffic on both cloudnet1005 and cloudnet1006; a `pkill tcpdump` is what it takes to interrupt the pr...
[15:05:52] cloud-services-team, Toolforge, Patch-For-Review: Missing Perl packages on dev.toolforge.org for anomiebot workflows - https://phabricator.wikimedia.org/T360488#11078467 (taavi) My understanding is that David's `webservice` changes above allow using a custom-built image with the `shell` subcommand, w...
[15:11:53] (03update) 10dcaro: kyverno: check for existance of policies, not absence [repos/cloud/toolforge/alerts] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/34 [15:12:12] (03update) 10dcaro: kyverno: check for existance of policies, not absence [repos/cloud/toolforge/alerts] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/34 [15:14:26] 06cloud-services-team, 10Toolforge (Toolforge iteration 23): [kyverno] policy countres stopped showing correctly in grafana - https://phabricator.wikimedia.org/T401681#11078521 (10dcaro) p:05Triage→03Medium [15:21:08] (03CR) 10David Caro: [C:03+1] "LGTM did not test all, just a couple + some manually installed packages and such in the trixie container" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1177345 (https://phabricator.wikimedia.org/T400255) (owner: 10Majavah) [15:24:16] (03update) 10dcaro: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] (support-port-protocol-selection) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136) (owner: 10raymond-ndibe) [15:25:13] (03CR) 10Majavah: [C:03+2] Add Trixie images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1177345 (https://phabricator.wikimedia.org/T400255) (owner: 10Majavah) [15:25:48] (03Merged) 10jenkins-bot: Add Trixie images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1177345 (https://phabricator.wikimedia.org/T400255) (owner: 10Majavah) [15:28:08] 06cloud-services-team, 10Toolforge, 13Patch-For-Review: Missing Perl packages on dev.toolforge.org for anomiebot workflows - https://phabricator.wikimedia.org/T360488#11078617 (10bd808) >>! In T360488#11078467, @taavi wrote: > My understanding is that David's `webservice` changes above allow using a custom-b... 
[15:30:09] (03PS1) 10Majavah: Replace Docker registry URL [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1178015 (https://phabricator.wikimedia.org/T394902) [15:30:30] (03CR) 10Majavah: [C:03+2] Replace Docker registry URL [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1178015 (https://phabricator.wikimedia.org/T394902) (owner: 10Majavah) [15:31:02] (03Merged) 10jenkins-bot: Replace Docker registry URL [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1178015 (https://phabricator.wikimedia.org/T394902) (owner: 10Majavah) [15:31:12] 06cloud-services-team, 10Toolforge: Investigate daily disconnections of IRC bots hosted in Toolforge - https://phabricator.wikimedia.org/T400223#11078664 (10fgiunchedi) Ok I noticed the following on cloudnet1006 after ircservserv disconnected earlier today: ` Date: Tue Aug 12 15:18:37 2025 +0000 -1755097921... [15:38:06] (03open) 10dcaro: harbor: explicitly use http2 for curl [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/917 [15:38:19] 06cloud-services-team, 10Toolforge: Investigate daily disconnections of IRC bots hosted in Toolforge - https://phabricator.wikimedia.org/T400223#11078744 (10fgiunchedi) As found by @taavi we did run into this problem as part of {T351507}, quite likely some workers pre-date that and thus were never fixed to hav... 
[15:40:59] (03update) 10dcaro: harbor: explicitly use http2 for curl [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/917 [15:43:45] (03update) 10dcaro: harbor: explicitly use http2 for curl [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/917 [15:47:17] (03update) 10dcaro: harbor: explicitly use http2 for curl [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/917 [15:48:49] (03PS1) 10Majavah: Add mono612-sssd/base [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1178019 (https://phabricator.wikimedia.org/T400255) [15:51:11] (03open) 10taavi: data: Add Trixie based images [repos/cloud/toolforge/image-config] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config/-/merge_requests/14 (https://phabricator.wikimedia.org/T400255) [15:51:27] (03CR) 10Majavah: [C:03+2] Add mono612-sssd/base [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1178019 (https://phabricator.wikimedia.org/T400255) (owner: 10Majavah) [15:52:09] (03Merged) 10jenkins-bot: Add mono612-sssd/base [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1178019 (https://phabricator.wikimedia.org/T400255) (owner: 10Majavah) [15:53:02] 06cloud-services-team, 10Toolforge: Add support for Python 3.13 - https://phabricator.wikimedia.org/T381899#11078829 (10taavi) 05Stalled→03Open a:03taavi [15:53:13] (03approved) 10dcaro: data: Add Trixie based images [repos/cloud/toolforge/image-config] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config/-/merge_requests/14 (https://phabricator.wikimedia.org/T400255) (owner: 10taavi) [15:54:34] (03merge) 10taavi: data: Add Trixie based images [repos/cloud/toolforge/image-config] - 
10https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config/-/merge_requests/14 (https://phabricator.wikimedia.org/T400255) [15:56:07] (03update) 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: image-config: bump to 0.0.21-20250812155445-067fec45 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/918 (https://phabricator.wikimedia.org/T400255) [15:56:14] (03open) 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: image-config: bump to 0.0.21-20250812155445-067fec45 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/918 (https://phabricator.wikimedia.org/T400255) [15:58:03] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component image-config [15:59:05] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-107 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [15:59:35] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-107 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [16:03:59] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component image-config [16:04:34] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-107 has at least 12 procs in D state, and 
may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [16:04:35] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component image-config [16:04:47] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component image-config [16:05:34] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-107 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [16:07:27] (03update) 10dcaro: harbor: explicitly use http2 for curl [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/917 [16:08:35] (03update) 10dcaro: kyverno: use the number of namespaces as the policy count [repos/cloud/toolforge/alerts] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/34 [16:11:24] 06cloud-services-team, 10Toolforge: Add support for Python 3.13 - https://phabricator.wikimedia.org/T381899#11078967 (10taavi) 05Open→03Resolved [16:11:35] 06cloud-services-team, 10Toolforge, 13Patch-For-Review: Build Trixie based Toolforge pre-built images - https://phabricator.wikimedia.org/T400255#11078968 (10taavi) 05Open→03Resolved [16:13:13] (03update) 10dcaro: harbor: explicitly use http2 for curl [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/917 [16:14:12] (03approved) 10taavi: kyverno: 
use the number of namespaces as the policy count [repos/cloud/toolforge/alerts] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/34 (owner: 10dcaro) [16:15:25] (03update) 10dcaro: harbor: explicitly use http2 for curl [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/917 [16:15:38] (03merge) 10dcaro: kyverno: use the number of namespaces as the policy count [repos/cloud/toolforge/alerts] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/34 [16:15:48] (03merge) 10taavi: image-config: bump to 0.0.21-20250812155445-067fec45 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/918 (https://phabricator.wikimedia.org/T400255) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [16:16:02] (03update) 10dcaro: kyverno: use the number of namespaces as the policy count [repos/cloud/toolforge/alerts] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/34 [16:29:45] 06cloud-services-team, 10Toolforge: Remove or replace "default" web service image - https://phabricator.wikimedia.org/T401715 (10taavi) 03NEW [16:30:40] 06cloud-services-team, 10Toolforge: Remove or replace "default" web service image - https://phabricator.wikimedia.org/T401715#11079114 (10taavi) [16:30:41] 06cloud-services-team, 10Toolforge: Stop building Bullseye based Toolforge prebuilt images - https://phabricator.wikimedia.org/T400258#11079115 (10taavi) [16:33:53] 10Cloud Services Proposals, 06cloud-services-team, 10Toolforge (Toolforge iteration 23): Decision request - Reuse toolforge user tools central logging for toolforge infrastructure logging - https://phabricator.wikimedia.org/T398285#11079137 (10fnegri) I don't have a strong opinion on this, I would probably g... 
[16:34:46] (03open) 10dcaro: kyverno: enable in toolsbeta [repos/cloud/toolforge/alerts] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/35 [16:35:14] (03approved) 10dcaro: kyverno: enable in toolsbeta [repos/cloud/toolforge/alerts] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/35 [16:35:40] (03merge) 10dcaro: kyverno: enable in toolsbeta [repos/cloud/toolforge/alerts] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/35 [16:36:54] 06cloud-services-team, 10Toolforge (Toolforge iteration 23): [kyverno] policy countres stopped showing correctly in grafana - https://phabricator.wikimedia.org/T401681#11079178 (10dcaro) 05Open→03Resolved Enabled on toolsbeta too [16:40:06] RESOLVED: Toolforge Kyverno low policy resources: Toolforge Kyverno has low amount of policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_low_policy_resources - https://grafana.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+low+policy+resources [16:41:31] (03open) 10dcaro: kyverno: use the number of tools, not namespaces [repos/cloud/toolforge/alerts] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/36 [16:41:45] FIRING: Toolforge Kyverno low policy resources: Toolforge Kyverno has low amount of policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_low_policy_resources - https://grafana.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+low+policy+resources [16:42:29] (03approved) 10dcaro: kyverno: use the number of tools, not namespaces [repos/cloud/toolforge/alerts] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/36 [16:42:31] (03merge) 
10dcaro: kyverno: use the number of tools, not namespaces [repos/cloud/toolforge/alerts] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/36 [16:43:10] 06cloud-services-team, 10Toolforge: Remove or replace "default" web service image - https://phabricator.wikimedia.org/T401715#11079206 (10bd808) If anything makes sense as a default backend in the current era I think it would be `buildservice` as we seem to generally want to move folks towards their own images... [16:45:12] (03approved) 10dcaro: harbor: support http1 and 2 for the tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/917 [16:45:14] (03merge) 10dcaro: harbor: support http1 and 2 for the tests [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/917 [16:46:07] (03update) 10dcaro: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] (support-port-protocol-selection) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136) (owner: 10raymond-ndibe) [16:46:44] 06cloud-services-team, 10Toolforge: Remove or replace "default" web service image - https://phabricator.wikimedia.org/T401715#11079221 (10taavi) Using `buildservice` as a default could work as well if {T363065} is fixed first. [16:48:09] (03PS3) 10David Caro: reboot_stuck_workers: add new cookbook [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1177960 [16:48:52] !log fnegri@cloudcumin1001 cluebotng-review START - Cookbook wmcs.vps.create_project for trove-only project cluebotng-review in eqiad1 (T401347) [16:48:54] fnegri@cloudcumin1001: Unknown project "cluebotng-review" [16:48:54] fnegri@cloudcumin1001: Did you mean to say "tools.cluebotng-review" instead? 
[16:48:55] T401347: Trove for cluebotng-review? - https://phabricator.wikimedia.org/T401347 [16:49:31] (03open) 10group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49: projects: added project cluebotng-review [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/260 (https://phabricator.wikimedia.org/T401347) [16:53:02] fnegri@cloudcumin1001 create_project (PID 2254156) is awaiting input [16:54:31] (03approved) 10andrew: projects: added project cluebotng-review [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/260 (https://phabricator.wikimedia.org/T401347) (owner: 10group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49) [16:55:49] (03merge) 10fnegri: projects: added project cluebotng-review [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/260 (https://phabricator.wikimedia.org/T401347) (owner: 10group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49) [16:57:15] RESOLVED: Toolforge Kyverno low policy resources: Toolforge Kyverno has low amount of policy resources - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/Toolforge_Kyverno_low_policy_resources - https://grafana.wmcloud.org/d/kyverno/kyverno?orgId=1&var-DS_PROMETHEUS_KYVERNO=prometheus-tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforge+Kyverno+low+policy+resources [17:00:50] fnegri@cloudcumin1001 create_project (PID 2254156) is awaiting input [17:05:35] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-107 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [17:05:37] !log 
fnegri@cloudcumin1001 cluebotng-review END (PASS) - Cookbook wmcs.vps.create_project (exit_code=0) for trove-only project cluebotng-review in eqiad1 (T401347) [17:05:41] T401347: Trove for cluebotng-review? - https://phabricator.wikimedia.org/T401347 [17:15:23] 06cloud-services-team, 10Cloud-VPS (Project-requests), 13Patch-For-Review: Trove for cluebotng-review? - https://phabricator.wikimedia.org/T401347#11079371 (10fnegri) 05Open→03Resolved @DamianZaremba you should have access to the new Cloud VPS project `cluebotng-review`, where you should be able to c... [17:50:28] 06cloud-services-team, 10Cloud-VPS (Project-requests): Trove for cluebotng-review? - https://phabricator.wikimedia.org/T401347#11079554 (10fnegri) //Note for admins: I added a new section [Creating Trove-only projects](https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Trove#Creating_Trove-only_pro... [18:24:16] 06cloud-services-team, 06DC-Ops, 10ops-eqiad, 06SRE: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11079633 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye executed with errors:... [18:28:56] FIRING: SystemdUnitDown: The service unit prometheus-ethtool-exporter.service is in failed status on host cloudcontrol1011. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1011 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [18:31:28] FIRING: PuppetAgentFailure: Puppet agent failure detected on instance tools-sgebastion-10 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [18:34:21] 06cloud-services-team, 06DC-Ops, 10ops-eqiad, 06SRE: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11079702 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye [18:44:48] FIRING: PuppetFailure: Puppet has failed on cloudcontrol1011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:44:55] 06cloud-services-team: PuppetFailure Puppet has failed on cloudcontrol1011:9100 - https://phabricator.wikimedia.org/T401735 (10phaultfinder) 03NEW [18:45:11] 06cloud-services-team, 10Cloud-VPS, 13Patch-For-Review: Neutron metadata service failing for all VMs - https://phabricator.wikimedia.org/T395742#11079758 (10Andrew) @dcaro's patch (https://gerrit.wikimedia.org/r/1173352) seems to have resolved this! 
[18:46:28] RESOLVED: PuppetAgentFailure: Puppet agent failure detected on instance tools-sgebastion-10 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure [18:49:48] FIRING: [2x] PuppetFailure: Puppet has failed on cloudcontrol1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:49:54] 06cloud-services-team: PuppetFailure - https://phabricator.wikimedia.org/T401736 (10phaultfinder) 03NEW [18:53:56] RESOLVED: SystemdUnitDown: The service unit prometheus-ethtool-exporter.service is in failed status on host cloudcontrol1011. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1011 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:41:22] 06cloud-services-team, 06DC-Ops, 10ops-eqiad, 06SRE: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11079993 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye executed with errors:... 
[19:51:34] 06cloud-services-team, 06DC-Ops, 10ops-eqiad, 06SRE: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11080059 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye [19:55:06] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment codfw1dev for all services [19:58:06] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment codfw1dev for all services [20:06:13] 06cloud-services-team, 10TaxonBot, 06Trust-and-Safety, 07LDAP: Reset email address for "taxonbot" Developer account - https://phabricator.wikimedia.org/T398220#11080123 (10bd808) 05Open→03Resolved p:05Triage→03Medium a:03bd808 >>! In T398220#11076402, @doctaxon wrote: > @bd808 : I tried the f... [20:24:48] FIRING: [2x] PuppetFailure: Puppet has failed on cloudcontrol1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:29:48] RESOLVED: [2x] PuppetFailure: Puppet has failed on cloudcontrol1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:41:13] 06cloud-services-team, 10PAWS: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T401076#11080266 (10LibUp-bot) A new upstream version of Pywikibot is now available: 10.3.2. * https://gerrit.wikimedia.org/g/pywikibot/core/+/refs/tags/10.3.2 * https://doc.wikimedia.org/pywikibot/stable/chan... [20:41:17] 06cloud-services-team, 10Toolforge: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T401077#11080267 (10LibUp-bot) A new upstream version of Pywikibot is now available: 10.3.2. 
* https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Pywikibot_image * https://gerrit.wikimedia.org/g/p... [20:59:39] 06cloud-services-team, 06DC-Ops, 10ops-eqiad, 06SRE: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11080335 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye [21:11:04] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-107 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [21:14:21] 06cloud-services-team, 06DC-Ops, 10ops-eqiad, 06SRE: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11080438 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye executed with errors:... 
[21:59:28] FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance tools-harbor-2 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [22:06:04] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-107 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [22:06:34] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-107 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [22:11:34] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-107 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [22:16:47] 06cloud-services-team, 10Striker, 10GitLab (Auth & Access), 10Release-Engineering-Team (Radar): Automatically approve GitLab accounts created by Striker integration - https://phabricator.wikimedia.org/T344667#11080590 (10bd808) 05Open→03Resolved a:03bd808 >>! In T344667#9431444, @bd808 wrote: > T... 
[22:46:48] 10Tool-archive-externa-links, 10Wikidata, 10Wikidata-Gadgets: The Internet Archive service fails to archive all external links via ArchiveExternaLinks - https://phabricator.wikimedia.org/T401760 (10poro26) 03NEW [23:43:55] 10Tool-archive-externa-links, 10Wikidata, 10Wikidata-Gadgets: Replace Internet Archive with Wikiwix Archive in ArchiveExternaLinks - https://phabricator.wikimedia.org/T401764 (10poro26) 03NEW [23:45:15] 10Tool-archive-externa-links, 10Wikidata, 10Wikidata-Gadgets: Replace Internet Archive with Wikiwix Archive in ArchiveExternaLinks - https://phabricator.wikimedia.org/T401764#11080821 (10poro26) a:03poro26 [23:48:52] 10Tool-archive-externa-links, 10Wikidata, 10Wikidata-Gadgets: [Documentation] The Internet Archive service fails to archive all external links via ArchiveExternaLinks - https://phabricator.wikimedia.org/T401760#11080824 (10poro26) [23:49:13] 10Tool-archive-externa-links, 10Wikidata, 10Wikidata-Gadgets: [Documentation] The Internet Archive service fails to archive all external links via ArchiveExternaLinks - https://phabricator.wikimedia.org/T401760#11080825 (10poro26) a:03poro26 [23:49:35] 10Tool-archive-externa-links, 10Wikidata, 10Wikidata-Gadgets: [Documentation] Replace Internet Archive with Wikiwix Archive in ArchiveExternaLinks - https://phabricator.wikimedia.org/T401764#11080826 (10poro26)