[00:44:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-19 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[01:44:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-19 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[04:21:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-19 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[05:06:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-19 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[09:52:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-19 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[10:32:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-19 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[11:30:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-19 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[11:55:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-19 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[12:37:50] FIRING: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[12:38:19] PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 324 bytes in 60.006 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[12:44:05] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned: 2025-03-15 Tools NFS hiccup - https://phabricator.wikimedia.org/T388965 (dcaro) NEW p:Triage→High
[12:46:02] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned: 2025-03-15 Tools NFS hiccup - https://phabricator.wikimedia.org/T388965#10638604 (dcaro) I can see some increase of stuck processes in a bunch of workers: https://grafana.wmcloud.org/d/3jhWxB8Vk/...
[12:49:51] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned: 2025-03-15 Tools NFS hiccup - https://phabricator.wikimedia.org/T388965#10638605 (dcaro) Logging in to the bastion as my user and wm-lol tool, looks ok, so not general breakdown: ` tools.wm-lol@t...
[12:49:53] RECOVERY - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 10.073 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[12:50:29] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned: 2025-03-15 Tools NFS hiccup - https://phabricator.wikimedia.org/T388965#10638607 (dcaro) Some of the workers recovered: {F58834978} And the page got resolved by itself
[12:54:01] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned: 2025-03-15 Tools NFS hiccup - https://phabricator.wikimedia.org/T388965#10638608 (dcaro) tools-static-15 got stuck also, probably due to the nfs hiccup: ` root@tools-static-15:~# ps aux | grep D U...
[12:54:22] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned: 2025-03-15 Tools NFS hiccup - https://phabricator.wikimedia.org/T388965#10638609 (dcaro) It got unstuck by itself: ` root@tools-static-15:~# ps aux | grep D USER PID %CPU %MEM VSZ RSS...
[12:55:12] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned: 2025-03-15 Tools NFS hiccup - https://phabricator.wikimedia.org/T388965#10638610 (dcaro) And the alert went away :), so I'll keep this task open until monday, but the incident seems resolved (by t...
[13:10:20] RESOLVED: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[14:11:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-16 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[14:36:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-16 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[15:06:04] Tools: ipcheck 504 Gateway Time-out - https://phabricator.wikimedia.org/T387947#10638787 (taavi)
[15:14:09] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-16,tools-k8s-worker-nfs-34,tools-k8s-worker-nfs-77 (T388965)
[15:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:14:14] T388965: 2025-03-15 Tools NFS hiccup - https://phabricator.wikimedia.org/T388965
[15:14:23] !log dcaro@urcuchillay tools END (ERROR) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=97) for tools-k8s-worker-nfs-16,tools-k8s-worker-nfs-34,tools-k8s-worker-nfs-77 (T388965)
[15:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:14:31] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-16, tools-k8s-worker-nfs-34, tools-k8s-worker-nfs-77 (T388965)
[15:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:16:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-16 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[15:21:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-16 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[15:26:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-72 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[15:31:55] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-16, tools-k8s-worker-nfs-34, tools-k8s-worker-nfs-77 (T388965)
[15:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:32:00] T388965: 2025-03-15 Tools NFS hiccup - https://phabricator.wikimedia.org/T388965
[16:22:41] Tools: https://linksearch.toolforge.org/ leads to a 404 page - https://phabricator.wikimedia.org/T265381#10638830 (Superyetkin) Resolved→Open The pages gives HTTP 403 error now.
[17:06:50] cloud-services-team, Toolforge: [webservice] should have more easily understandable error messages when run as a non-tool user - https://phabricator.wikimedia.org/T360478#10638862 (Bugreporter) Currently the error reads: ` Traceback (most recent call last): File "/usr/lib/python3/dist-packages/toolsws/...
[17:37:32] (open) lucaswerkmeister: Add rudimentary index page [toolforge-repos/ls-long-sparql] - https://gitlab.wikimedia.org/toolforge-repos/ls-long-sparql/-/merge_requests/1
[17:45:40] (update) lucaswerkmeister: Add rudimentary index page [toolforge-repos/ls-long-sparql] - https://gitlab.wikimedia.org/toolforge-repos/ls-long-sparql/-/merge_requests/1
[20:24:16] cloud-services-team, Data-Services: Errors in 'show table status' in dewiki_p - https://phabricator.wikimedia.org/T388982#10639037 (Jdforrester-WMF)
[20:24:46] cloud-services-team, Data-Services: Drop views of module_deps tables - https://phabricator.wikimedia.org/T388982#10639038 (Jdforrester-WMF)