[00:03:35] (update) raymond-ndibe: [helm image publish]: publish to reggie repo if PR owner not repo owner [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/61 (https://phabricator.wikimedia.org/T394595)
[00:03:51] (update) raymond-ndibe: [DO NOT MERGE] testing gitlab ci changes [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/213
[00:08:03] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[00:11:33] (update) raymond-ndibe: [config] support port protocol [repos/cloud/toolforge/components-api] (handle_unset_and_default_arguments_consistently) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/119 (https://phabricator.wikimedia.org/T401994)
[00:14:42] (update) raymond-ndibe: [helm image publish]: publish to reggie repo if PR owner not repo owner [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/61 (https://phabricator.wikimedia.org/T394595)
[00:15:45] (update) raymond-ndibe: loki.alloy: decrease frequency for fetching logs [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/962 (owner: dcaro)
[00:18:03] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[00:31:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[00:58:03] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[03:01:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[03:08:03] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[03:31:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[03:56:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[05:47:39] VPS-project-Phabricator, collaboration-services, Release-Engineering-Team (Radar): 'Fulltext' searches fail on test Phab instance due to ElasticSearch default config (PhutilAggregateException: All Fulltext Search hosts failed / CURLE_COULDNT_CONNECT) - https://phabricator.wikimedia.org/T403948#11188066 (...
[05:53:03] FIRING: [5x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[06:53:03] FIRING: [5x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[07:43:03] FIRING: [5x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[07:43:21] RESOLVED: MaintainKubeusersHang: maintain-kubeusers last finished run is 29.3M minutes old - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersHang
[08:08:30] !log filippo@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-17, tools-k8s-worker-nfs-66, tools-k8s-worker-nfs-82, tools-k8s-worker-nfs-47, tools-k8s-worker-nfs-10
[08:19:56] cloud-services-team, Toolforge, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance: [wmcs-cookbooks,toolforge,nfs] automate cleanup of D state webservices by deleting the stuck pod - https://phabricator.wikimedia.org/T348662#11188336 (dcaro) Open→Declined Last week I trie...
[08:23:55] !log filippo@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-17, tools-k8s-worker-nfs-66, tools-k8s-worker-nfs-82, tools-k8s-worker-nfs-47, tools-k8s-worker-nfs-10
[08:42:21] cloud-services-team, Toolforge (Toolforge iteration 24): Address tools NFS getting stuck with processes in D state - https://phabricator.wikimedia.org/T404584#11188379 (dcaro) Today we got a few more nodes stuck, a quick look into `tools-k8s-worker-nfs-17.tools.eqiad1.wikimedia.cloud` showed that there w...
[08:49:53] cloud-services-team, Toolforge (Toolforge iteration 24): Address tools NFS getting stuck with processes in D state - https://phabricator.wikimedia.org/T404584#11188394 (dcaro) From today also, on `tools-k8s-worker-nfs-17`: ` root@tools-k8s-worker-nfs-17:~# journalctl --boot -1 | grep tools-nfs Aug 23 13...
[09:00:28] FIRING: InstanceDown: Project tools instance tools-prometheus-9 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:02:28] FIRING: PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-17 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[09:24:33] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[09:34:47] !log dcaro@acme tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[09:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:35:02] !log dcaro@acme tools END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[09:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:35:10] !log dcaro@acme tools START - Cookbook wmcs.vps.instance.force_reboot vm tools-prometheus-9 (cluster eqiad1, project tools)
[09:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:35:13] !log dcaro@acme tools END (PASS) - Cookbook wmcs.vps.instance.force_reboot (exit_code=0) vm tools-prometheus-9 (cluster eqiad1, project tools)
[09:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:35:20] !log dcaro@acme tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[09:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:35:33] !log dcaro@acme tools END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[09:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:40:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-9 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:42:48] Toolforge (Toolforge iteration 24): [prometheus,infra] 2025-09-10 tools-prometheus-9 down - https://phabricator.wikimedia.org/T404199#11188523 (dcaro) This happened again today, had to force restart the vm.
[10:13:56] Toolforge (Toolforge iteration 24): [infra,pupppet,nfs] 2025-09-17 tools-k8s-worker-nfs-17 failing to run puppet - https://phabricator.wikimedia.org/T404833 (dcaro) NEW
[10:30:16] Toolforge (Toolforge iteration 24): [infra,pupppet,nfs] 2025-09-17 tools-k8s-worker-nfs-17 failing to run puppet - https://phabricator.wikimedia.org/T404833#11188812 (dcaro) No idea yet why this is not working, both clouddumps1001 and 1002 have the same exact `/etc/exports` file, and mounting from 1002 works...
[10:31:08] Toolforge (Toolforge iteration 24): [infra,pupppet,nfs] 2025-09-17 tools-k8s-worker-nfs-17 failing to run puppet - https://phabricator.wikimedia.org/T404833#11188818 (dcaro) the `/etc/fstab` of this worker is the same as others (ex. nfs-47) too :/ ` root@tools-k8s-worker-nfs-17:~# cat /etc/fstab # HEADER: T...
[11:20:51] (PS1) Stevemunene: Add a dummy Ceph user keys for the cephcsi plugin to use [labs/private] - https://gerrit.wikimedia.org/r/1189133 (https://phabricator.wikimedia.org/T404576)
[11:35:31] Toolforge (Toolforge iteration 24): [infra,puppet,nfs] 2025-09-17 tools-k8s-worker-nfs-17 failing to run puppet - https://phabricator.wikimedia.org/T404833#11189018 (Aklapper)
[11:52:14] Toolforge (Toolforge iteration 24): [infra,puppet,nfs] 2025-09-17 tools-k8s-worker-nfs-17 failing to run puppet - https://phabricator.wikimedia.org/T404833#11189082 (fgiunchedi) Looking at `tshark -i any 'host clouddumps1001.wikimedia.org'` on -17 it shows an NFS client ID already in use, still investigating...
[11:55:38] Toolforge (Toolforge iteration 24): [prometheus,infra] 2025-09-10 tools-prometheus-9 down - https://phabricator.wikimedia.org/T404199#11189098 (dcaro) On the system side, I can see in journal the logs: ` Sep 17 08:58:37 tools-prometheus-9 prometheus@tools[1623330]: ts=2025-09-17T08:58:37.496Z caller=scrape.g...
[11:57:34] Toolforge (Toolforge iteration 24): [prometheus,infra] 2025-09-10 tools-prometheus-9 down - https://phabricator.wikimedia.org/T404199#11189101 (dcaro) The last one specifically: ` root@tools-prometheus-9:~# cat /var/log/prometheus/query.log | grep 'T08:5.:..' | tail -n1 |jq { "params": { "end": "2025-...
[12:03:54] Toolforge (Toolforge iteration 24): [infra,puppet,nfs] 2025-09-17 tools-k8s-worker-nfs-17 failing to run puppet - https://phabricator.wikimedia.org/T404833#11189112 (fgiunchedi) The nfs server on 1001 still has the -17 client in its records, though in status `courtesy` ` root@clouddumps1001:~# cat /proc/fs/...
[12:06:12] Toolforge (Toolforge iteration 24): [infra,puppet,nfs] 2025-09-17 tools-k8s-worker-nfs-17 failing to run puppet - https://phabricator.wikimedia.org/T404833#11189123 (fgiunchedi) Not a root cause analysis, though manually kicking the client did the trick: ` root@clouddumps1001:~# echo expire > /proc/fs/nfsd/...
[12:09:00] Toolforge (Toolforge iteration 24): [infra,puppet,nfs] 2025-09-17 tools-k8s-worker-nfs-17 failing to run puppet - https://phabricator.wikimedia.org/T404833#11189130 (fgiunchedi) FTR this is the client record after it came back: ` root@clouddumps1001:~# cat /proc/fs/nfsd/clients/2162/info clientid: 0xfbab99f...
[12:10:02] (close) dcaro: volumes: mount /etc/openstack/clouds.yaml [repos/cloud/toolforge/volume-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/volume-admission/-/merge_requests/24 (https://phabricator.wikimedia.org/T379030) (owner: aborrero)
[12:12:40] (open) dcaro: toolforge: add new needed clouds.yaml file [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/275
[12:14:00] Toolforge (Toolforge iteration 24): [infra,puppet,nfs] 2025-09-17 tools-k8s-worker-nfs-17 failing to run puppet - https://phabricator.wikimedia.org/T404833#11189139 (fgiunchedi) Open→Resolved a:fgiunchedi I'm optimistically resolving, while it isn't clear to me what caused the client to get i...
[12:16:55] cloud-services-team, Toolforge: Mount /etc/openstack/clouds.yaml in mount-enabled containers - https://phabricator.wikimedia.org/T404438#11189155 (dcaro) This broke lima-kilo, fixing in https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/275
[12:23:01] cloud-services-team, Toolforge: [lima-kilo] test foxtrot-ldap potential race condition - https://phabricator.wikimedia.org/T404849 (dcaro) NEW
[12:23:58] RESOLVED: PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-17 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[12:24:36] cloud-services-team, Toolforge (Toolforge iteration 24): [lima-kilo] test foxtrot-ldap potential race condition - https://phabricator.wikimedia.org/T404849#11189192 (dcaro)
[12:25:21] cloud-services-team, Toolforge (Toolforge iteration 24): [lima-kilo] test foxtrot-ldap potential race condition - https://phabricator.wikimedia.org/T404849#11189193 (dcaro) A second run did not find this issue, lowering prio: ` TASK [ldap_users : Inject tool accounts into foxtrot-ldap] ******************...
[12:25:27] cloud-services-team, Toolforge (Toolforge iteration 24): [lima-kilo] test foxtrot-ldap potential race condition - https://phabricator.wikimedia.org/T404849#11189195 (dcaro) p:Triage→Low
[12:26:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[12:36:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[12:38:38] cloud-services-team, Toolforge: [toolforge-api] non-existant job not complying to kubernete's naming returns 400 error - https://phabricator.wikimedia.org/T404852 (DamianZaremba) NEW
[12:41:02] (CR) Brouberol: Add a dummy Ceph user keys for the cephcsi plugin to use (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/1189133 (https://phabricator.wikimedia.org/T404576) (owner: Stevemunene)
[12:53:35] !log filippo@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-43, tools-k8s-worker-nfs-32
[12:56:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[13:02:18] Cloud Services Proposals, cloud-services-team, Toolforge: DRAFT Decision request - Improving lima-kilo developer experience - https://phabricator.wikimedia.org/T403051#11189305 (dcaro)
[13:03:35] Cloud Services Proposals, cloud-services-team, Toolforge: DRAFT Decision request - Focus for improving lima-kilo developer experience - https://phabricator.wikimedia.org/T403051#11189317 (dcaro)
[13:03:58] (PS2) Stevemunene: Add a dummy Ceph user keys for the cephcsi plugin to use [labs/private] - https://gerrit.wikimedia.org/r/1189133 (https://phabricator.wikimedia.org/T404576)
[13:05:17] !log filippo@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-43, tools-k8s-worker-nfs-32
[13:07:16] (approved) dcaro: toolforge: add new needed clouds.yaml file [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/275
[13:07:21] (merge) dcaro: toolforge: add new needed clouds.yaml file [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/275
[13:10:05] (approved) dcaro: loki: fix local networkpolicy [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/963
[13:10:08] (update) dcaro: loki: fix local networkpolicy [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/963
[13:10:20] (approved) dcaro: functional-tests: fix log checking tests [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/961
[13:10:24] (update) dcaro: functional-tests: fix log checking tests [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/961
[13:11:15] (merge) dcaro: loki: fix local networkpolicy [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/963
[13:11:18] RESOLVED: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProce
[13:15:58] cloud-services-team, Toolforge (Toolforge iteration 24): [lima-kilo] test foxtrot-ldap potential race condition - https://phabricator.wikimedia.org/T404849#11189361 (dcaro) Open→Declined I reran it again without issues, if it happens anew from a clean install I'll reopen and investigate.
[13:15:59] cloud-services-team, Toolforge (Toolforge iteration 24): [lima-kilo] test foxtrot-ldap potential race condition - https://phabricator.wikimedia.org/T404849#11189364 (dcaro) Declined→Resolved
[13:22:09] cloud-services-team, Toolforge: Job not restarting despite liveness probe failures - https://phabricator.wikimedia.org/T400957#11189385 (dcaro) It got unstuck? ` tools.itwiki@tools-bastion-15:~$ tail /data/project/itwiki/draftbot/logs/job-cont.err [2025-09-17T13:19:54Z] b8bcbd30 2025-09-17T13:19:54.308Z...
[13:26:05] cloud-services-team, Toolforge: Job not restarting despite liveness probe failures - https://phabricator.wikimedia.org/T400957#11189393 (dcaro) There were a few workers having NFS issues though, and they were restarted, so that might have forced it to restart somewhere else (see {T404584} if you are inte...
[13:26:16] (update) dcaro: functional-tests: fix log checking tests [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/961
[13:30:39] (merge) dcaro: functional-tests: fix log checking tests [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/961
[13:31:26] (approved) dcaro: build: Upgrade Poetry dependencies [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/60 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[13:31:29] (merge) dcaro: build: Upgrade Poetry dependencies [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/60 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[13:47:36] cloud-services-team, Cloud-VPS: Allow novaobserver to read Octavia data - https://phabricator.wikimedia.org/T404862 (taavi) NEW
[13:47:47] cloud-services-team, Cloud-VPS: Allow novaobserver to read Octavia data - https://phabricator.wikimedia.org/T404862#11189558 (taavi)
[13:47:48] cloud-services-team, Tool-openstack-browser: openstack-browser: Display Octavia load balancers - https://phabricator.wikimedia.org/T404419#11189557 (taavi)
[13:52:28] FIRING: PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-32 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[13:59:45] Toolforge (Toolforge iteration 24): [logging,lima-kilo] loki setup fails to start on linux - https://phabricator.wikimedia.org/T404226#11189632 (dcaro) In progress→Resolved
[14:08:03] Toolforge (Toolforge iteration 24): [infra,puppet,nfs] 2025-09-17 tools-k8s-worker-nfs-17 failing to run puppet - https://phabricator.wikimedia.org/T404833#11189664 (fgiunchedi) FTR tools-k8s-worker-nfs-32, which I rebooted earlier today, suffered the same fate (unable to mount, client status `courtesy`...
[14:13:49] (PS1) Jean-Frédéric: Bump PHP dependencies in composer.json [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1189203
[14:17:28] RESOLVED: PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-32 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[14:20:01] (update) don-vip: Draft: DVIDS: incremental update service [toolforge-repos/spacemedia] - https://gitlab.wikimedia.org/toolforge-repos/spacemedia/-/merge_requests/4
[14:52:28] Toolforge (Toolforge iteration 24): [prometheus,infra] 2025-09-10 tools-prometheus-9 down - https://phabricator.wikimedia.org/T404199#11189949 (fgiunchedi) I see the host memory used spiking up just before the metrics cutoff, presumably due to prometheus exploding in memory {F66027968} I recommend setting...
[16:19:15] VPS-project-Phabricator, collaboration-services, Release-Engineering-Team (Radar): 'Fulltext' searches fail on test Phab instance due to ElasticSearch default config (PhutilAggregateException: All Fulltext Search hosts failed / CURLE_COULDNT_CONNECT) - https://phabricator.wikimedia.org/T403948#11190630 (...
[16:20:15] VPS-project-Phabricator, collaboration-services, Release-Engineering-Team (Radar): 'Fulltext' searches fail on test Phab instance due to ElasticSearch default config (PhutilAggregateException: All Fulltext Search hosts failed / CURLE_COULDNT_CONNECT) - https://phabricator.wikimedia.org/T403948#11190631 (...
[16:27:32] VPS-project-Phabricator, collaboration-services, Release-Engineering-Team (Radar): 'Fulltext' searches fail on test Phab instance due to ElasticSearch default config (PhutilAggregateException: All Fulltext Search hosts failed / CURLE_COULDNT_CONNECT) - https://phabricator.wikimedia.org/T403948#11190675 (...
[18:18:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-78 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[18:40:56] FIRING: SystemdUnitDown: The service unit drain_rabbitmq_notification_error.service is in failed status on host cloudrabbit1001. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudrabbit1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[18:48:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-78 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[18:49:33] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-78 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[18:54:33] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-78 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[19:00:56] RESOLVED: SystemdUnitDown: The service unit drain_rabbitmq_notification_error.service is in failed status on host cloudrabbit1001. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudrabbit1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[19:07:00] (PS1) Andrew Bogott: Fix .gitreview after a merge mishap [openstack/horizon/horizon] (rebuild) - https://gerrit.wikimedia.org/r/1189291
[19:11:08] (PS1) Andrew Bogott: Review access change [openstack/horizon/horizon] (refs/meta/config) - https://gerrit.wikimedia.org/r/1189292
[19:11:25] (Abandoned) Andrew Bogott: Review access change [openstack/horizon/horizon] (refs/meta/config) - https://gerrit.wikimedia.org/r/1189292 (owner: Andrew Bogott)
[19:33:25] (PS1) Andrew Bogott: Review access change [openstack/horizon/horizon] (refs/meta/config) - https://gerrit.wikimedia.org/r/1189297
[20:08:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[20:08:55] (CR) Andrew Bogott: [C:+1] views: Don't crash when encountering a proxy using IPv6 backends [openstack/horizon/wmf-proxy-dashboard] - https://gerrit.wikimedia.org/r/1187393 (https://phabricator.wikimedia.org/T404302) (owner: Majavah)
[20:10:20] (CR) Andrew Bogott: [C:+1] "LGTM! Be warned that I've just done a bunch of horizon updates and the current latest image build is only deployed in codfw1dev; let's mak" [openstack/horizon/wmf-proxy-dashboard] - https://gerrit.wikimedia.org/r/1187394 (https://phabricator.wikimedia.org/T404302) (owner: Majavah)
[20:40:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-78 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[21:21:17] (CR) Lokal Profil: [C:+2] Bump PHP dependencies in composer.json [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1189203 (owner: Jean-Frédéric)
[21:23:37] (CR) CI reject: [V:-1] Bump PHP dependencies in composer.json [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1189203 (owner: Jean-Frédéric)
[22:04:15] (PS2) Jean-Frédéric: Switch to Python3.9 and Debian Bullesye as base image [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1189111
[22:05:05] (PS3) Jean-Frédéric: Switch to Python3.9 and Debian Bullesye as base image [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1189111
[22:07:19] (CR) CI reject: [V:-1] Switch to Python3.9 and Debian Bullesye as base image [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1189111 (owner: Jean-Frédéric)
[22:20:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-48 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[22:25:18] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-48 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[22:35:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-48 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[22:50:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-48 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[22:55:12] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[23:10:24] VPS-project-Phabricator, collaboration-services, Release-Engineering-Team (Radar): 'Fulltext' searches fail on test Phab instance due to ElasticSearch default config (PhutilAggregateException: All Fulltext Search hosts failed / CURLE_COULDNT_CONNECT) - https://phabricator.wikimedia.org/T403948#11192160 (...
[23:17:54] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning