[00:03:35] (update) raymond-ndibe: [helm image publish]: publish to reggie repo if PR owner not repo owner [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/61 (https://phabricator.wikimedia.org/T394595)
[00:03:51] (update) raymond-ndibe: [DO NOT MERGE] testing gitlab ci changes [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/213
[00:08:03] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[00:11:33] (update) raymond-ndibe: [config] support port protocol [repos/cloud/toolforge/components-api] (handle_unset_and_default_arguments_consistently) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/119 (https://phabricator.wikimedia.org/T401994)
[00:14:42] (update) raymond-ndibe: [helm image publish]: publish to reggie repo if PR owner not repo owner [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/61 (https://phabricator.wikimedia.org/T394595)
[00:15:45] (update) raymond-ndibe: loki.alloy: decrease frequency for fetching logs [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/962 (owner: dcaro)
[00:18:03] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[00:31:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[00:58:03] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[03:01:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[03:08:03] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[03:31:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[03:56:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[05:47:39] VPS-project-Phabricator, collaboration-services, Release-Engineering-Team (Radar): 'Fulltext' searches fail on test Phab instance due to ElasticSearch default config (PhutilAggregateException: All Fulltext Search hosts failed / CURLE_COULDNT_CONNECT) - https://phabricator.wikimedia.org/T403948#11188066 (...
[05:53:03] FIRING: [5x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[06:53:03] FIRING: [5x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[07:43:03] FIRING: [5x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[07:43:21] RESOLVED: MaintainKubeusersHang: maintain-kubeusers last finished run is 29.3M minutes old - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersHang
[08:08:30] !log filippo@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-17, tools-k8s-worker-nfs-66, tools-k8s-worker-nfs-82, tools-k8s-worker-nfs-47, tools-k8s-worker-nfs-10
[08:19:56] cloud-services-team, Toolforge, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance: [wmcs-cookbooks,toolforge,nfs] automate cleanup of D state webservices by deleting the stuck pod - https://phabricator.wikimedia.org/T348662#11188336 (dcaro) Open→Declined Last week I trie...
[08:23:55] !log filippo@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-17, tools-k8s-worker-nfs-66, tools-k8s-worker-nfs-82, tools-k8s-worker-nfs-47, tools-k8s-worker-nfs-10
[08:42:21] cloud-services-team, Toolforge (Toolforge iteration 24): Address tools NFS getting stuck with processes in D state - https://phabricator.wikimedia.org/T404584#11188379 (dcaro) Today we got a few more nodes stuck, a quick look into `tools-k8s-worker-nfs-17.tools.eqiad1.wikimedia.cloud` showed that there w...
[08:49:53] cloud-services-team, Toolforge (Toolforge iteration 24): Address tools NFS getting stuck with processes in D state - https://phabricator.wikimedia.org/T404584#11188394 (dcaro) From today also, on `tools-k8s-worker-nfs-17`: ` root@tools-k8s-worker-nfs-17:~# journalctl --boot -1 | grep tools-nfs Aug 23 13...
[09:00:28] FIRING: InstanceDown: Project tools instance tools-prometheus-9 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:02:28] FIRING: PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-17 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[09:24:33] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[09:34:47] !log dcaro@acme tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[09:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:35:02] !log dcaro@acme tools END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[09:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:35:10] !log dcaro@acme tools START - Cookbook wmcs.vps.instance.force_reboot vm tools-prometheus-9 (cluster eqiad1, project tools)
[09:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:35:13] !log dcaro@acme tools END (PASS) - Cookbook wmcs.vps.instance.force_reboot (exit_code=0) vm tools-prometheus-9 (cluster eqiad1, project tools)
[09:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:35:20] !log dcaro@acme tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[09:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:35:33] !log dcaro@acme tools END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[09:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:40:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-9 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[09:42:48] Toolforge (Toolforge iteration 24): [prometheus,infra] 2025-09-10 tools-prometheus-9 down - https://phabricator.wikimedia.org/T404199#11188523 (dcaro) This happened again today, had to force restart the vm.
[10:13:56] Toolforge (Toolforge iteration 24): [infra,pupppet,nfs] 2025-09-17 tools-k8s-worker-nfs-17 failing to run puppet - https://phabricator.wikimedia.org/T404833 (dcaro) NEW
[10:30:16] Toolforge (Toolforge iteration 24): [infra,pupppet,nfs] 2025-09-17 tools-k8s-worker-nfs-17 failing to run puppet - https://phabricator.wikimedia.org/T404833#11188812 (dcaro) No idea yet why this is not working, both clouddumps1001 and 1002 have the same exact `/etc/exports` file, and mounting from 1002 works...
[10:31:08] Toolforge (Toolforge iteration 24): [infra,pupppet,nfs] 2025-09-17 tools-k8s-worker-nfs-17 failing to run puppet - https://phabricator.wikimedia.org/T404833#11188818 (dcaro) the `/etc/fstab` of this worker is the same as others (ex. nfs-47) too :/ ` root@tools-k8s-worker-nfs-17:~# cat /etc/fstab # HEADER: T...
[11:20:51] (PS1) Stevemunene: Add a dummy Ceph user keys for the cephcsi plugin to use [labs/private] - https://gerrit.wikimedia.org/r/1189133 (https://phabricator.wikimedia.org/T404576)
[11:35:31] Toolforge (Toolforge iteration 24): [infra,puppet,nfs] 2025-09-17 tools-k8s-worker-nfs-17 failing to run puppet - https://phabricator.wikimedia.org/T404833#11189018 (Aklapper)
[11:52:14] Toolforge (Toolforge iteration 24): [infra,puppet,nfs] 2025-09-17 tools-k8s-worker-nfs-17 failing to run puppet - https://phabricator.wikimedia.org/T404833#11189082 (fgiunchedi) Looking at `tshark -i any 'host clouddumps1001.wikimedia.org'` on -17 it shows an NFS client ID already in use, still investigating...
[11:55:38] Toolforge (Toolforge iteration 24): [prometheus,infra] 2025-09-10 tools-prometheus-9 down - https://phabricator.wikimedia.org/T404199#11189098 (dcaro) On the system side, I can see in journal the logs: ` Sep 17 08:58:37 tools-prometheus-9 prometheus@tools[1623330]: ts=2025-09-17T08:58:37.496Z caller=scrape.g...
[11:57:34] Toolforge (Toolforge iteration 24): [prometheus,infra] 2025-09-10 tools-prometheus-9 down - https://phabricator.wikimedia.org/T404199#11189101 (dcaro) The last one specifically: ` root@tools-prometheus-9:~# cat /var/log/prometheus/query.log | grep 'T08:5.:..' | tail -n1 |jq { "params": { "end": "2025-...
[12:03:54] Toolforge (Toolforge iteration 24): [infra,puppet,nfs] 2025-09-17 tools-k8s-worker-nfs-17 failing to run puppet - https://phabricator.wikimedia.org/T404833#11189112 (fgiunchedi) The nfs server on 1001 still has the -17 client in its records, though in status `courtesy` ` root@clouddumps1001:~# cat /proc/fs/...
[12:06:12] Toolforge (Toolforge iteration 24): [infra,puppet,nfs] 2025-09-17 tools-k8s-worker-nfs-17 failing to run puppet - https://phabricator.wikimedia.org/T404833#11189123 (fgiunchedi) Not a root cause analysis, though manually kicking the client did the trick: ` root@clouddumps1001:~# echo expire > /proc/fs/nfsd/...
[12:09:00] Toolforge (Toolforge iteration 24): [infra,puppet,nfs] 2025-09-17 tools-k8s-worker-nfs-17 failing to run puppet - https://phabricator.wikimedia.org/T404833#11189130 (fgiunchedi) FTR this is the client record after it came back: ` root@clouddumps1001:~# cat /proc/fs/nfsd/clients/2162/info clientid: 0xfbab99f...
[12:10:02] (close) dcaro: volumes: mount /etc/openstack/clouds.yaml [repos/cloud/toolforge/volume-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/volume-admission/-/merge_requests/24 (https://phabricator.wikimedia.org/T379030) (owner: aborrero)
[12:12:40] (open) dcaro: toolforge: add new needed clouds.yaml file [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/275
[12:14:00] Toolforge (Toolforge iteration 24): [infra,puppet,nfs] 2025-09-17 tools-k8s-worker-nfs-17 failing to run puppet - https://phabricator.wikimedia.org/T404833#11189139 (fgiunchedi) Open→Resolved a:fgiunchedi I'm optimistically resolving, while it isn't clear to me what caused the client to get i...
[12:16:55] cloud-services-team, Toolforge: Mount /etc/openstack/clouds.yaml in mount-enabled containers - https://phabricator.wikimedia.org/T404438#11189155 (dcaro) This broke lima-kilo, fixing in https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/275
[12:23:01] cloud-services-team, Toolforge: [lima-kilo] test foxtrot-ldap potential race condition - https://phabricator.wikimedia.org/T404849 (dcaro) NEW
[12:23:58] RESOLVED: PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-17 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[12:24:36] cloud-services-team, Toolforge (Toolforge iteration 24): [lima-kilo] test foxtrot-ldap potential race condition - https://phabricator.wikimedia.org/T404849#11189192 (dcaro)
[12:25:21] cloud-services-team, Toolforge (Toolforge iteration 24): [lima-kilo] test foxtrot-ldap potential race condition - https://phabricator.wikimedia.org/T404849#11189193 (dcaro) A second run did not find this issue, lowering prio: ` TASK [ldap_users : Inject tool accounts into foxtrot-ldap] ******************...
[12:25:27] cloud-services-team, Toolforge (Toolforge iteration 24): [lima-kilo] test foxtrot-ldap potential race condition - https://phabricator.wikimedia.org/T404849#11189195 (dcaro) p:Triage→Low
[12:26:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[12:36:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[12:38:38] cloud-services-team, Toolforge: [toolforge-api] non-existant job not complying to kubernete's naming returns 400 error - https://phabricator.wikimedia.org/T404852 (DamianZaremba) NEW
[12:41:02] (CR) Brouberol: Add a dummy Ceph user keys for the cephcsi plugin to use (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/1189133 (https://phabricator.wikimedia.org/T404576) (owner: Stevemunene)
[12:53:35] !log filippo@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-43, tools-k8s-worker-nfs-32
[12:56:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[13:02:18] Cloud Services Proposals, cloud-services-team, Toolforge: DRAFT Decision request - Improving lima-kilo developer experience - https://phabricator.wikimedia.org/T403051#11189305 (dcaro)
[13:03:35] Cloud Services Proposals, cloud-services-team, Toolforge: DRAFT Decision request - Focus for improving lima-kilo developer experience - https://phabricator.wikimedia.org/T403051#11189317 (dcaro)
[13:03:58] (PS2) Stevemunene: Add a dummy Ceph user keys for the cephcsi plugin to use [labs/private] - https://gerrit.wikimedia.org/r/1189133 (https://phabricator.wikimedia.org/T404576)
[13:05:17] !log filippo@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-43, tools-k8s-worker-nfs-32
[13:07:16] (approved) dcaro: toolforge: add new needed clouds.yaml file [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/275
[13:07:21] (merge) dcaro: toolforge: add new needed clouds.yaml file [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/275
[13:10:05] (approved) dcaro: loki: fix local networkpolicy [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/963
[13:10:08] (update) dcaro: loki: fix local networkpolicy [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/963
[13:10:20] (approved) dcaro: functional-tests: fix log checking tests [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/961
[13:10:24] (update) dcaro: functional-tests: fix log checking tests [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/961
[13:11:15] (merge) dcaro: loki: fix local networkpolicy [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/963
[13:11:18] RESOLVED: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProce
[13:15:58] cloud-services-team, Toolforge (Toolforge iteration 24): [lima-kilo] test foxtrot-ldap potential race condition - https://phabricator.wikimedia.org/T404849#11189361 (dcaro) Open→Declined I reran it again without issues, if it happens anew from a clean install I'll reopen and investigate.
[13:15:59] cloud-services-team, Toolforge (Toolforge iteration 24): [lima-kilo] test foxtrot-ldap potential race condition - https://phabricator.wikimedia.org/T404849#11189364 (dcaro) Declined→Resolved
[13:22:09] cloud-services-team, Toolforge: Job not restarting despite liveness probe failures - https://phabricator.wikimedia.org/T400957#11189385 (dcaro) It got unstuck? ` tools.itwiki@tools-bastion-15:~$ tail /data/project/itwiki/draftbot/logs/job-cont.err [2025-09-17T13:19:54Z] b8bcbd30 2025-09-17T13:19:54.308Z...
[13:26:05] cloud-services-team, Toolforge: Job not restarting despite liveness probe failures - https://phabricator.wikimedia.org/T400957#11189393 (dcaro) There were a few workers having NFS issues though, and they were restarted, so that might have forced it to restart somewhere else (see {T404584} if you are inte...
[13:26:16] (update) dcaro: functional-tests: fix log checking tests [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/961
[13:30:39] (merge) dcaro: functional-tests: fix log checking tests [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/961
[13:31:26] (approved) dcaro: build: Upgrade Poetry dependencies [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/60 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[13:31:29] (merge) dcaro: build: Upgrade Poetry dependencies [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/60 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[13:47:36] cloud-services-team, Cloud-VPS: Allow novaobserver to read Octavia data - https://phabricator.wikimedia.org/T404862 (taavi) NEW
[13:47:47] cloud-services-team, Cloud-VPS: Allow novaobserver to read Octavia data - https://phabricator.wikimedia.org/T404862#11189558 (taavi)
[13:47:48] cloud-services-team, Tool-openstack-browser: openstack-browser: Display Octavia load balancers - https://phabricator.wikimedia.org/T404419#11189557 (taavi)
[13:52:28] FIRING: PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-32 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[13:59:45] Toolforge (Toolforge iteration 24): [logging,lima-kilo] loki setup fails to start on linux - https://phabricator.wikimedia.org/T404226#11189632 (dcaro) In progress→Resolved
[14:08:03] Toolforge (Toolforge iteration 24): [infra,puppet,nfs] 2025-09-17 tools-k8s-worker-nfs-17 failing to run puppet - https://phabricator.wikimedia.org/T404833#11189664 (fgiunchedi) FTR tools-k8s-worker-nfs-32, which I rebooted earlier today, suffered the same fate (unable to mount, client status `courtesy`...
[14:13:49] (PS1) Jean-Frédéric: Bump PHP dependencies in composer.json [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1189203
[14:17:28] RESOLVED: PuppetAgentFailure: Puppet agent failure detected on instance tools-k8s-worker-nfs-32 in project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[14:20:01] (update) don-vip: Draft: DVIDS: incremental update service [toolforge-repos/spacemedia] - https://gitlab.wikimedia.org/toolforge-repos/spacemedia/-/merge_requests/4
[14:52:28] Toolforge (Toolforge iteration 24): [prometheus,infra] 2025-09-10 tools-prometheus-9 down - https://phabricator.wikimedia.org/T404199#11189949 (fgiunchedi) I see the host memory used spiking up just before the metrics cutoff, presumably due to prometheus exploding in memory {F66027968} I recommend setting...
[16:19:15] VPS-project-Phabricator, collaboration-services, Release-Engineering-Team (Radar): 'Fulltext' searches fail on test Phab instance due to ElasticSearch default config (PhutilAggregateException: All Fulltext Search hosts failed / CURLE_COULDNT_CONNECT) - https://phabricator.wikimedia.org/T403948#11190630 (...
[16:20:15] VPS-project-Phabricator, collaboration-services, Release-Engineering-Team (Radar): 'Fulltext' searches fail on test Phab instance due to ElasticSearch default config (PhutilAggregateException: All Fulltext Search hosts failed / CURLE_COULDNT_CONNECT) - https://phabricator.wikimedia.org/T403948#11190631 (...
[16:27:32] VPS-project-Phabricator, collaboration-services, Release-Engineering-Team (Radar): 'Fulltext' searches fail on test Phab instance due to ElasticSearch default config (PhutilAggregateException: All Fulltext Search hosts failed / CURLE_COULDNT_CONNECT) - https://phabricator.wikimedia.org/T403948#11190675 (...
[18:18:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-78 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[18:40:56] FIRING: SystemdUnitDown: The service unit drain_rabbitmq_notification_error.service is in failed status on host cloudrabbit1001. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudrabbit1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[18:48:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-78 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[18:49:33] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-78 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[18:54:33] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-78 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[19:00:56] RESOLVED: SystemdUnitDown: The service unit drain_rabbitmq_notification_error.service is in failed status on host cloudrabbit1001. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudrabbit1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown
[19:07:00] (PS1) Andrew Bogott: Fix .gitreview after a merge mishap [openstack/horizon/horizon] (rebuild) - https://gerrit.wikimedia.org/r/1189291
[19:11:08] (PS1) Andrew Bogott: Review access change [openstack/horizon/horizon] (refs/meta/config) - https://gerrit.wikimedia.org/r/1189292
[19:11:25] (Abandoned) Andrew Bogott: Review access change [openstack/horizon/horizon] (refs/meta/config) - https://gerrit.wikimedia.org/r/1189292 (owner: Andrew Bogott)
[19:33:25] (PS1) Andrew Bogott: Review access change [openstack/horizon/horizon] (refs/meta/config) - https://gerrit.wikimedia.org/r/1189297
[20:08:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[20:08:55] (CR) Andrew Bogott: [C:+1] views: Don't crash when encountering a proxy using IPv6 backends [openstack/horizon/wmf-proxy-dashboard] - https://gerrit.wikimedia.org/r/1187393 (https://phabricator.wikimedia.org/T404302) (owner: Majavah)
[20:10:20] (CR) Andrew Bogott: [C:+1] "LGTM! Be warned that I've just done a bunch of horizon updates and the current latest image build is only deployed in codfw1dev; let's mak" [openstack/horizon/wmf-proxy-dashboard] - https://gerrit.wikimedia.org/r/1187394 (https://phabricator.wikimedia.org/T404302) (owner: Majavah)
[20:40:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-78 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[21:21:17] (CR) Lokal Profil: [C:+2] Bump PHP dependencies in composer.json [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1189203 (owner: Jean-Frédéric)
[21:23:37] (CR) CI reject: [V:-1] Bump PHP dependencies in composer.json [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1189203 (owner: Jean-Frédéric)
[22:04:15] (PS2) Jean-Frédéric: Switch to Python3.9 and Debian Bullesye as base image [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1189111
[22:05:05] (PS3) Jean-Frédéric: Switch to Python3.9 and Debian Bullesye as base image [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1189111
[22:07:19] (CR) CI reject: [V:-1] Switch to Python3.9 and Debian Bullesye as base image [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1189111 (owner: Jean-Frédéric)
[22:20:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-48 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[22:25:18] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-48 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[22:35:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-48 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[22:50:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-48 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[22:55:12] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[23:10:24] VPS-project-Phabricator, collaboration-services, Release-Engineering-Team (Radar): 'Fulltext' searches fail on test Phab instance due to ElasticSearch default config (PhutilAggregateException: All Fulltext Search hosts failed / CURLE_COULDNT_CONNECT) - https://phabricator.wikimedia.org/T403948#11192160 (...
[23:17:54] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning