[00:13:51] FIRING: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[01:03:34] andrew@cloudcumin1001 safe_reboot (PID 3795026) is awaiting input
[01:15:04] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-7 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[01:18:22] FIRING: [2x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[01:20:04] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-7 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[01:22:01] andrew@cloudcumin1001 safe_reboot (PID 3795026) is awaiting input
[01:38:52] RESOLVED: [2x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[01:52:00] FIRING: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures
[01:52:10] cloud-services-team: NovafullstackSustainedFailures Novafullstack tests have been failing for more than 5hours in eqiad - https://phabricator.wikimedia.org/T397557 (phaultfinder) NEW
[03:08:33] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for all services
[03:16:59] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment eqiad1 for all services
[03:23:51] FIRING: [3x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[03:28:51] FIRING: [3x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[03:57:00] RESOLVED: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures
[04:22:04] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-37 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[04:27:04] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-12 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[04:28:05] cloud-services-team, Data-Services, Data-Engineering, Data-Engineering-Icebox, Privacy Engineering: Increased visibility in wiki-replicas for volunteers fighting vandals - https://phabricator.wikimedia.org/T284944#10935973 (Aklapper) a:lbowmaker→None Removing inactive task assignee wh...
[04:32:04] RESOLVED: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-12 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProce
[07:11:28] FIRING: InstanceDown: Project tools instance tools-prometheus-8 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[07:50:58] FIRING: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[07:50:58] FIRING: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[07:55:58] RESOLVED: HarborProbeUnknown: Harbor might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborProbeUnknown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborProbeUnknown
[07:55:58] RESOLVED: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[07:56:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-8 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[07:56:33] cloud-services-team, Cloud-VPS: tools-static.wmflabs.org is down - https://phabricator.wikimedia.org/T397560 (GPSLeo) NEW
[07:58:28] FIRING: InstanceDown: Project tools instance tools-prometheus-8 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[08:18:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-8 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[08:28:51] RESOLVED: ProbeDown: Service tools-static-15:80 has failed probes (http_tools_static_wmflabs_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-static-15:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:26:54] !log dcaro@acme tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[09:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:27:01] !log dcaro@acme tools END (FAIL) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=99)
[09:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:27:27] !log dcaro@acme tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[09:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:34:33] Toolforge (Toolforge iteration 21): [infra] 2025-06-21 tools-prometheus-8 stopped responding for a bit - https://phabricator.wikimedia.org/T397563 (dcaro) NEW
[09:43:36] Toolforge (Toolforge iteration 21): [infra] 2025-06-21 tools-prometheus-8 stopped responding for a bit - https://phabricator.wikimedia.org/T397563#10936164 (dcaro) Weird... for ssh to tools-prometheus-8 it seems it accepts my key, but then it fails it: ` Jun 21 09:37:55 tools-prometheus-8 sshd[146489]: Accep...
[09:55:05] Toolforge (Toolforge iteration 21): [infra] 2025-06-21 tools-prometheus-8 stopped responding for a bit - https://phabricator.wikimedia.org/T397563#10936179 (dcaro) Something weird happened though when the memory + D processes spiked, in the logs at some point it starts failing to connect to the network, and...
[10:00:50] cloud-services-team: NovafullstackSustainedFailures Novafullstack tests have been failing for more than 5hours in eqiad - https://phabricator.wikimedia.org/T397557#10936188 (dcaro) Previous instance {T396934}
[10:09:31] !log dcaro@acme tools END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[10:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:14:46] Toolforge (Toolforge iteration 21): [infra] 2025-06-21 Several correlated poetntially network issues during the night - https://phabricator.wikimedia.org/T397566 (dcaro) NEW
[11:49:04] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-12 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[11:59:04] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-12 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[12:19:04] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-12 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[12:35:16] (open) tkarcher: Bugfix (specify JSON return format when requesting title) [toolforge-repos/erinnermich] - https://gitlab.wikimedia.org/toolforge-repos/erinnermich/-/merge_requests/5
[12:36:38] (merge) tkarcher: Bugfix (specify JSON return format when requesting title) [toolforge-repos/erinnermich] - https://gitlab.wikimedia.org/toolforge-repos/erinnermich/-/merge_requests/5
[13:14:04] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-12 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[14:16:55] (CR) Eugene233: "recheck" [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1154831 (https://phabricator.wikimedia.org/T396357) (owner: Eugene233)
[15:04:29] (PS2) Eugene233: Add support for ci tasks on tool [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1154831 (https://phabricator.wikimedia.org/T396357)
[15:04:43] (CR) CI reject: [V:-1] Add support for ci tasks on tool [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1154831 (https://phabricator.wikimedia.org/T396357) (owner: Eugene233)
[15:11:11] (PS3) Eugene233: Add support for ci tasks on tool [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1154831 (https://phabricator.wikimedia.org/T396357)
[15:11:25] (CR) CI reject: [V:-1] Add support for ci tasks on tool [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1154831 (https://phabricator.wikimedia.org/T396357) (owner: Eugene233)
[15:14:43] (PS1) Eugene233: Add support for ci tasks on tool [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1162370 (https://phabricator.wikimedia.org/T396357)
[15:14:57] (CR) CI reject: [V:-1] Add support for ci tasks on tool [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1162370 (https://phabricator.wikimedia.org/T396357) (owner: Eugene233)
[15:18:44] (Abandoned) Eugene233: Add support for ci tasks on tool [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1154831 (https://phabricator.wikimedia.org/T396357) (owner: Eugene233)
[15:22:25] (PS2) Eugene233: Add support for ci tasks on tool [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1162370 (https://phabricator.wikimedia.org/T396357)
[15:22:50] (CR) CI reject: [V:-1] Add support for ci tasks on tool [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1162370 (https://phabricator.wikimedia.org/T396357) (owner: Eugene233)
[15:25:57] (PS3) Eugene233: Add support for ci tasks on tool [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1162370 (https://phabricator.wikimedia.org/T396357)
[15:27:39] (CR) Eugene233: [C:+2] "Basic checks work" [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1162370 (https://phabricator.wikimedia.org/T396357) (owner: Eugene233)
[15:28:21] (Merged) jenkins-bot: Add support for ci tasks on tool [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1162370 (https://phabricator.wikimedia.org/T396357) (owner: Eugene233)
[15:29:28] (CR) Eugene233: "recheck" [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1157840 (https://phabricator.wikimedia.org/T390402) (owner: Bovimacoco)
[15:29:40] (CR) Eugene233: "recheck" [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1157764 (owner: Bovimacoco)
[15:29:47] (CR) CI reject: [V:-1] T390397 Enforce Strict Typing. Bug=T390397 [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1157764 (owner: Bovimacoco)
[15:29:48] (CR) Eugene233: "recheck" [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1157586 (owner: Essa237)
[15:29:55] (CR) Eugene233: "recheck" [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1155646 (owner: NkwadaNora)
[15:29:57] (CR) CI reject: [V:-1] [Fix] added a landing page [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1157586 (owner: Essa237)
[15:30:01] (CR) CI reject: [V:-1] rearrange the location of some files [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1155646 (owner: NkwadaNora)
[15:30:07] (CR) Eugene233: "recheck" [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1152117 (owner: NkwadaNora)
[15:56:57] Toolforge (Toolforge iteration 21): [infra] 2025-06-21 Several correlated poetntially network issues during the night - https://phabricator.wikimedia.org/T397566#10936303 (Andrew) I was draining (well, trying to drain) cloudvirts last night -- that doesn't usually cause network interruptions but it should be...
[15:57:30] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-54, tools-k8s-worker-nfs-12
[16:09:04] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-12 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[16:09:18] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-54, tools-k8s-worker-nfs-12
[16:10:01] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment codfw1dev for all services
[16:15:22] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment codfw1dev for all services
[16:24:04] RESOLVED: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-12 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProce
[16:24:34] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-12 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[16:29:34] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-12 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[16:37:47] cloud-services-team, Toolforge, Wikidata-Query-Service: Problem with SPARQL endpoint response and crawling on Toolforge - https://phabricator.wikimedia.org/T397570 (Fnielsen) NEW
[18:23:11] Tool-gitlab-content: Support CORS in gitlab-content tool - https://phabricator.wikimedia.org/T397571 (Msz2001) NEW
[18:40:14] cloud-services-team, Cloud-VPS: tools-static.wmflabs.org is down - https://phabricator.wikimedia.org/T397560#10936351 (JJMC89) Open→Resolved
[20:25:29] FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance cvn-app10 in project cvn - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun