[00:13:34] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [00:32:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity [01:41:55] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown [01:42:28] FIRING: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [01:46:56] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown [01:47:28] RESOLVED: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [02:00:45] PAWS: [bug]  - https://phabricator.wikimedia.org/T400580#11037998 (Pppery) →Duplicate dup:T400542 [02:00:47] PAWS: [Bug] PAWS server not starting - https://phabricator.wikimedia.org/T400542#11038000 (Pppery) [03:02:25] cloud-services-team, Toolforge: Investigate daily disconnections of IRC bots hosted in Toolforge - https://phabricator.wikimedia.org/T400223#11038010 (Danilo) I found something interesting. 
I was seeing the pods in tools-k8s-worker-nfs-12, the node where wmopbot and stashbot are, and I noted the pods [[... [03:33:34] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [04:18:34] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [05:44:15] PAWS: [Bug] PAWS server not starting - https://phabricator.wikimedia.org/T400542#11038076 (A_smart_kitten) [05:49:40] PAWS: [Bug] PAWS server not starting - https://phabricator.wikimedia.org/T400542#11038087 (Octahedron80) I also face this very same problem though. [05:53:09] PAWS: [Bug] PAWS server not starting - https://phabricator.wikimedia.org/T400542#11038089 (Octahedron80) p:Triage→Medium [05:59:46] PAWS: [Bug] PAWS server not starting - https://phabricator.wikimedia.org/T400542#11038098 (Pppery) p:Medium→Triage Priority should generally not be set unless you or your team plan to work on fixing the task. 
See https://www.mediawiki.org/wiki/Phabricator/Project_management#Setting_task_priorities [06:02:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity [06:23:34] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [06:28:34] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [06:32:56] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity [06:42:56] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - 
https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity [07:03:34] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [07:18:56] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [07:23:56] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [07:28:56] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [08:05:15] cloud-services-team (FY2025/26-Q1), PAWS: [Bug] PAWS server not starting - https://phabricator.wikimedia.org/T400542#11038262 (taavi) a:taavi It seems like there is an individual PAWS worker (`paws-127b-rpchztfjt2jb-node-1`) that was having trouble talking to the NFS server hosting user home director... 
[08:11:24] cloud-services-team (FY2025/26-Q1), PAWS: [Bug] PAWS server not starting - https://phabricator.wikimedia.org/T400542#11038268 (taavi) Open→Resolved Worker fixed by a reboot. Will investigate more if this happens again. [08:33:52] Change on wikitech.wikimedia.org a page Help:Toolforge/My first Django OAuth tool was modified, changed by Zache link https://wikitech.wikimedia.org/w/index.php?diff=2326960 edit summary: changed OAuth "callback" URL from http://127.0.0.1:8080/ to //127.0.0.1:8080/ as there is now url validation [08:43:34] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [08:45:28] FIRING: InstanceDown: Project tools instance tools-prometheus-9 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [08:52:13] (CR) Majavah: [C:+2] build: Updating npm dependencies [labs/striker] - https://gerrit.wikimedia.org/r/1173004 (owner: Libraryupgrader) [09:15:41] FIRING: EnvvarsAdmissionDown: EnvvarsAdmission is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/EnvvarsAdmissionDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DEnvvarsAdmissionDown [09:15:41] FIRING: ToolforgeKubernetesHAproxyUnknown: Toolforge HAproxy has unknown state. 
HAproxy might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyUnknown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyUnknown [09:20:41] RESOLVED: EnvvarsAdmissionDown: EnvvarsAdmission is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/EnvvarsAdmissionDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DEnvvarsAdmissionDown [09:20:41] RESOLVED: ToolforgeKubernetesHAproxyUnknown: Toolforge HAproxy has unknown state. HAproxy might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyUnknown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyUnknown [09:21:00] FIRING: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures [09:25:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-9 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [09:30:56] cloud-services-team, Toolforge (Toolforge iteration 22), Kubernetes: Unable to load Toolforge job: ERROR: TjfCliError: Unknown error (403 Client Error: Forbidden for url - https://phabricator.wikimedia.org/T399417#11038398 (Multichill) Thanks! Looking good: ` tools.multichill@tools-bastion-12:~$ tool... 
[09:33:33] !log dcaro@acme admin-monitoring START - Cookbook wmcs.openstack.cloudvirt.vm_console [09:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin-monitoring/SAL [09:33:50] !log dcaro@acme admin-monitoring END (FAIL) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=99) [09:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin-monitoring/SAL [09:34:01] !log dcaro@acme admin-monitoring START - Cookbook wmcs.openstack.cloudvirt.vm_console [09:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin-monitoring/SAL [09:37:41] cloud-services-team, Toolforge (Toolforge iteration 22), Kubernetes: Unable to load Toolforge job: ERROR: TjfCliError: Unknown error (403 Client Error: Forbidden for url - https://phabricator.wikimedia.org/T399417#11038443 (Multichill) Did just get this error, but second try gave normal output. Might... [09:39:03] !log dcaro@acme admin-monitoring END (ERROR) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=255) [09:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin-monitoring/SAL [09:41:15] !log dcaro@acme admin-monitoring START - Cookbook wmcs.openstack.cloudvirt.vm_console [09:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin-monitoring/SAL [09:41:23] !log dcaro@acme admin-monitoring END (FAIL) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=99) [09:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin-monitoring/SAL [09:41:27] !log dcaro@acme admin-monitoring START - Cookbook wmcs.openstack.cloudvirt.vm_console [09:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin-monitoring/SAL [09:44:02] !log dcaro@acme admin-monitoring END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0) [09:44:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin-monitoring/SAL [09:51:39] !log dcaro@acme admin-monitoring START - Cookbook wmcs.openstack.cloudvirt.vm_console [09:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin-monitoring/SAL [09:51:48] !log dcaro@acme admin-monitoring END (FAIL) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=99) [09:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin-monitoring/SAL [09:51:51] !log dcaro@acme admin-monitoring START - Cookbook wmcs.openstack.cloudvirt.vm_console [09:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin-monitoring/SAL [10:00:01] cloud-services-team, Toolforge (Toolforge iteration 22), Kubernetes: Unable to load Toolforge job: ERROR: TjfCliError: Unknown error (403 Client Error: Forbidden for url - https://phabricator.wikimedia.org/T399417#11038542 (dcaro) @Multichill yep, quite likely, can you open a new task for the quota b... [10:01:09] cloud-services-team, Toolforge (Toolforge iteration 22), Kubernetes: Unable to load Toolforge job: ERROR: TjfCliError: Unknown error (403 Client Error: Forbidden for url - https://phabricator.wikimedia.org/T399417#11038544 (dcaro) You can use the project https://phabricator.wikimedia.org/project/mana... [11:10:37] cloud-services-team, wikitech.wikimedia.org, Data-Platform-SRE (2025.07.05 - 2025.07.25): wikitech-static: resume daily dumps - https://phabricator.wikimedia.org/T398968#11038694 (BTullis) This is now working. 
{F65685849,width=50%} {F65685854,width=50%} [11:40:28] FIRING: NfsAlmostFull: The NFS drive is over 85% capacity (currently 85.06%) at host paws-nfs-1 in project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DNfsAlmostFull [11:53:58] cloud-services-team, Toolforge: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T396933#11038854 (taavi) a:taavi [11:59:08] cloud-services-team, Toolforge: New upstream release for Pywikibot - https://phabricator.wikimedia.org/T396933#11038864 (taavi) Open→Resolved [12:26:09] (update) l10n-bot: Localisation updates from https://translatewiki.net. [toolforge-repos/wd-image-positions] - https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/42 [12:52:00] !log dcaro@acme admin-monitoring END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0) [12:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin-monitoring/SAL [13:00:22] cloud-services-team, wikitech.wikimedia.org, Data-Platform-SRE (2025.07.05 - 2025.07.25): wikitech-static: resume daily dumps - https://phabricator.wikimedia.org/T398968#11039089 (BTullis) Open→Resolved [13:00:31] (CR) Jforrester: "check experimental" [labs/countervandalism/stillalive] - https://gerrit.wikimedia.org/r/1151809 (https://phabricator.wikimedia.org/T395164) (owner: Krinkle) [13:02:26] (CR) Jforrester: "check experimental" [labs/tools/guc] - https://gerrit.wikimedia.org/r/1171198 (owner: L10n-bot) [13:12:22] cloud-services-team, Cloud-VPS, Patch-For-Review: Neutron metadata service failing for all VMs - https://phabricator.wikimedia.org/T395742#11039163 (dcaro) Extended the open files limit for neutron-metadata-agent: ` root@cloudnet1006:~# systemctl show neutron-metadata-agent.service | grep LimitNOFIL... 
[13:14:25] (PS1) Krinkle: vendor: Fix return type deprecation warning on PHP 8.4 [labs/countervandalism/stillalive] - https://gerrit.wikimedia.org/r/1173380 (https://phabricator.wikimedia.org/T395164) [13:14:46] (CR) CI reject: [V:-1] vendor: Fix return type deprecation warning on PHP 8.4 [labs/countervandalism/stillalive] - https://gerrit.wikimedia.org/r/1173380 (https://phabricator.wikimedia.org/T395164) (owner: Krinkle) [13:17:15] (PS1) Krinkle: build: Update php-parallel-lint to fix PHP 8.4 deprecation warning [labs/countervandalism/stillalive] - https://gerrit.wikimedia.org/r/1173384 (https://phabricator.wikimedia.org/T395164) [13:17:35] (CR) CI reject: [V:-1] build: Update php-parallel-lint to fix PHP 8.4 deprecation warning [labs/countervandalism/stillalive] - https://gerrit.wikimedia.org/r/1173384 (https://phabricator.wikimedia.org/T395164) (owner: Krinkle) [13:18:27] (PS2) Krinkle: vendor: Fix return type deprecation warning on PHP 8.4 [labs/countervandalism/stillalive] - https://gerrit.wikimedia.org/r/1173380 (https://phabricator.wikimedia.org/T395164) [13:18:33] (PS3) Krinkle: vendor: Fix return type deprecation warning on PHP 8.4 [labs/countervandalism/stillalive] - https://gerrit.wikimedia.org/r/1173380 (https://phabricator.wikimedia.org/T395164) [13:18:34] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [13:18:54] (PS2) Krinkle: build: Update php-parallel-lint to fix PHP 8.4 deprecation warning [labs/countervandalism/stillalive] - https://gerrit.wikimedia.org/r/1173384 (https://phabricator.wikimedia.org/T395164) [13:19:18] 
(PS4) Krinkle: vendor: Update ulrichsg/getopt-php to fix PHP 8.4 deprecation warning [labs/countervandalism/stillalive] - https://gerrit.wikimedia.org/r/1173380 (https://phabricator.wikimedia.org/T395164) [13:19:22] (PS3) Krinkle: build: Update php-parallel-lint to fix PHP 8.4 deprecation warning [labs/countervandalism/stillalive] - https://gerrit.wikimedia.org/r/1173384 (https://phabricator.wikimedia.org/T395164) [13:19:36] (CR) Krinkle: [C:+2] vendor: Update ulrichsg/getopt-php to fix PHP 8.4 deprecation warning [labs/countervandalism/stillalive] - https://gerrit.wikimedia.org/r/1173380 (https://phabricator.wikimedia.org/T395164) (owner: Krinkle) [13:19:41] (CR) Krinkle: [C:+2] build: Update php-parallel-lint to fix PHP 8.4 deprecation warning [labs/countervandalism/stillalive] - https://gerrit.wikimedia.org/r/1173384 (https://phabricator.wikimedia.org/T395164) (owner: Krinkle) [13:20:14] (Merged) jenkins-bot: vendor: Update ulrichsg/getopt-php to fix PHP 8.4 deprecation warning [labs/countervandalism/stillalive] - https://gerrit.wikimedia.org/r/1173380 (https://phabricator.wikimedia.org/T395164) (owner: Krinkle) [13:20:15] (Merged) jenkins-bot: build: Update php-parallel-lint to fix PHP 8.4 deprecation warning [labs/countervandalism/stillalive] - https://gerrit.wikimedia.org/r/1173384 (https://phabricator.wikimedia.org/T395164) (owner: Krinkle) [13:24:56] (PS1) Jforrester: build: Upgrade phan to latest, move code out of root for easier testing, actually pass it [labs/tools/wikiinfo] - https://gerrit.wikimedia.org/r/1173389 [13:24:56] (PS1) Jforrester: build: Install MediaWiki codesniffer and make pass [labs/tools/wikiinfo] - https://gerrit.wikimedia.org/r/1173390 [13:26:30] RESOLVED: NovafullstackSustainedFailures: Novafullstack tests have been failing for more than 5hours in eqiad - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/NovafullstackSustainedFailures - 
https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-nova-fullstack?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DNovafullstackSustainedFailures [13:26:40] (PS1) Krinkle: build: Update phan to fix crash on PHP 8.4 [labs/countervandalism/stillalive] - https://gerrit.wikimedia.org/r/1173391 (https://phabricator.wikimedia.org/T395164) [13:27:29] (CR) Krinkle: "check experimental" [labs/countervandalism/stillalive] - https://gerrit.wikimedia.org/r/1173391 (https://phabricator.wikimedia.org/T395164) (owner: Krinkle) [13:27:40] (CR) Krinkle: [C:+2] "Fixed php84 crash at https://gerrit.wikimedia.org/r/c/labs/countervandalism/stillalive/+/1173391" [labs/countervandalism/stillalive] - https://gerrit.wikimedia.org/r/1151809 (https://phabricator.wikimedia.org/T395164) (owner: Krinkle) [13:28:27] (CR) Krinkle: [C:+2] build: Update phan to fix crash on PHP 8.4 [labs/countervandalism/stillalive] - https://gerrit.wikimedia.org/r/1173391 (https://phabricator.wikimedia.org/T395164) (owner: Krinkle) [13:28:51] (Merged) jenkins-bot: build: Update phan to fix crash on PHP 8.4 [labs/countervandalism/stillalive] - https://gerrit.wikimedia.org/r/1173391 (https://phabricator.wikimedia.org/T395164) (owner: Krinkle) [13:30:03] (CR) Jforrester: "Thanks! Note that labs/countervandalism/cvn-api also has this issue." 
[labs/countervandalism/stillalive] - https://gerrit.wikimedia.org/r/1173391 (https://phabricator.wikimedia.org/T395164) (owner: Krinkle) [13:30:25] (PS2) Krinkle: build: Upgrade phan to latest, move code out of root for easier testing, actually pass it [labs/tools/wikiinfo] - https://gerrit.wikimedia.org/r/1173389 (owner: Jforrester) [13:30:30] (CR) Krinkle: [C:+2] build: Upgrade phan to latest, move code out of root for easier testing, actually pass it [labs/tools/wikiinfo] - https://gerrit.wikimedia.org/r/1173389 (owner: Jforrester) [13:30:34] (PS2) Jforrester: build: Install MediaWiki codesniffer and make pass [labs/tools/wikiinfo] - https://gerrit.wikimedia.org/r/1173390 [13:31:05] (Merged) jenkins-bot: build: Upgrade phan to latest, move code out of root for easier testing, actually pass it [labs/tools/wikiinfo] - https://gerrit.wikimedia.org/r/1173389 (owner: Jforrester) [13:32:27] (CR) Jforrester: "check experimental" [labs/tools/wikiinfo] - https://gerrit.wikimedia.org/r/1173389 (owner: Jforrester) [13:41:47] Toolforge (Toolforge iteration 22): [jobs-cli,builds-cli,toolforge-cli,components-cli,envvars-cli] move the packaging scripts to bookworm - https://phabricator.wikimedia.org/T400616 (dcaro) NEW [13:53:40] (open) raymond-ndibe: [cicd] replace bullseye with bookworm [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/116 (https://phabricator.wikimedia.org/T400616) [13:54:00] (update) dcaro: api: fix default probe [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/185 [14:11:16] cloud-services-team (Hardware), DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install clouddb102[2-5] - https://phabricator.wikimedia.org/T393733#11039370 (fnegri) I think the partman recipe is incompatible with the new servers, I'll look into it. 
[14:17:57] (approved) dcaro: [cicd] replace bullseye with bookworm [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/116 (https://phabricator.wikimedia.org/T400616) (owner: raymond-ndibe) [14:18:03] (update) dcaro: [cicd] replace bullseye with bookworm [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/116 (https://phabricator.wikimedia.org/T400616) (owner: raymond-ndibe) [14:29:43] (approved) raymond-ndibe: [cicd] replace bullseye with bookworm [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/116 (https://phabricator.wikimedia.org/T400616) [14:31:28] FIRING: PuppetStaleCertificates: Found non-revoked Puppet certificates for 1 deleted instances on toolsbeta-puppetserver-1 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [14:32:58] RESOLVED: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance toolsbeta-harbor-2 in project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [15:02:14] Toolforge (Toolforge iteration 22), Patch-For-Review: [jobs-cli,builds-cli,toolforge-cli,components-cli,envvars-cli] move the packaging scripts to bookworm - https://phabricator.wikimedia.org/T400616#11039633 (Lucas_Werkmeister_WMDE) [15:15:28] RESOLVED: NfsAlmostFull: The NFS drive is over 85% capacity (currently 85.34%) at host paws-nfs-1 in project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DNfsAlmostFull [15:16:28] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-docker-imagebuilder-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:21:28] FIRING: [5x] PuppetAgentNoResources: No Puppet 
resources found on instance toolsbeta-cumin-1 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:26:28] FIRING: [9x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-cumin-1 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:31:28] FIRING: [19x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-acme-chief-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:33:34] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [15:36:28] FIRING: [25x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-acme-chief-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [15:39:03] cloud-services-team, Toolforge: tools-login intermittently has broken networking? - https://phabricator.wikimedia.org/T400502#11039798 (dcaro) I see some sessions being closed, and a bunch of tries from random ips with random users. A quick check does not show those to be specially high (the brute-force... 
[15:45:41] Change on wikitech.wikimedia.org a page Help:Toolforge/My first Django OAuth tool was modified, changed by Zache link https://wikitech.wikimedia.org/w/index.php?diff=2327184 edit summary: Undo revision [[Special:Diff/2326960|2326960]] by [[Special:Contributions/Zache|Zache]] ([[User talk:Zache|talk]]) rv, incorrect fix [15:46:18] (approved) dcaro: Use logging multi-pod fix moved to toolforge-weld [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/179 (https://phabricator.wikimedia.org/T398647) (owner: taavi) [15:47:33] cloud-services-team, Toolforge: tools-login intermittently has broken networking? - https://phabricator.wikimedia.org/T400502#11039829 (DamianZaremba) Currently I'm not experiencing issues so this was likely transitive. I do use `fabric` for deploying configs and am currently actively doing some work, s... [15:47:57] cloud-services-team, Toolforge: tools-login intermittently has broken networking? - https://phabricator.wikimedia.org/T400502#11039832 (DamianZaremba) Open→Resolved a:DamianZaremba [16:10:30] cloud-services-team, DC-Ops, Infrastructure-Foundations: kernel message: SGX disabled by BIOS - https://phabricator.wikimedia.org/T379351#11039899 (fnegri) Open→Resolved a:fnegri Resolving, the message itself is harmless, and it is no longer triggering alerts after it was added to the... 
[16:11:58] RESOLVED: [12x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-docker-imagebuilder-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [16:24:12] david-caro closed https://github.com/toolforge/paws/pull/489 [16:27:28] FIRING: NfsAlmostFull: The NFS drive is over 85% capacity (currently 85.13%) at host paws-nfs-1 in project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DNfsAlmostFull [16:28:34] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [16:37:28] RESOLVED: NfsAlmostFull: The NFS drive is over 85% capacity (currently 85.13%) at host paws-nfs-1 in project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DNfsAlmostFull [16:46:46] cloud-services-team, Data-Services, Data-Engineering, Data-Persistence, Data-Platform-SRE (2025.07.26 - 2025.08.15): Create wiki replicas views for globaljsonlinks tables - https://phabricator.wikimedia.org/T387419#11040008 (BTullis) [16:46:54] Cloud Services Proposals, cloud-services-team, Data-Services, Data-Persistence, Data-Platform-SRE (2025.07.26 - 2025.08.15): Decision request - Who runs wikireplicas cookbooks - https://phabricator.wikimedia.org/T382607#11040018 (BTullis) [16:55:26] cloud-services-team (FY2025/26-Q1), Cloud-VPS: MaxConntrack Max conntrack at 95.11% on cloudvirt1067:9100 - https://phabricator.wikimedia.org/T399050#11040212 (fnegri) [16:56:43] cloud-services-team (FY2025/26-Q1): Create WMCS offboarding checklist - https://phabricator.wikimedia.org/T398972#11040231 (fnegri) [17:00:22] 
cloud-services-team (FY2025/26-Q1), Cloud-VPS: wp1-db-server trove DB instance in error - https://phabricator.wikimedia.org/T399464#11040248 (fnegri) [17:02:25] cloud-services-team (FY2025/26-Q1), Cloud-VPS, Moderator-Tools-Team: Swift container endpoints are unavailable - https://phabricator.wikimedia.org/T399481#11040321 (fnegri) [17:03:28] FIRING: NfsAlmostFull: The NFS drive is over 85% capacity (currently 85.09%) at host paws-nfs-1 in project paws - https://prometheus-alerts.wmcloud.org/?q=alertname%3DNfsAlmostFull [17:03:34] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [17:35:29] (update) dcaro: api: allow protocol to be specified for ports [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/186 [17:35:46] (update) dcaro: Query logs from Loki [repos/cloud/toolforge/jobs-api] (taavi/logging) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/180 (https://phabricator.wikimedia.org/T398645) (owner: taavi) [17:35:47] (update) dcaro: Query logs from Loki [repos/cloud/toolforge/jobs-api] (taavi/logging) - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/180 (https://phabricator.wikimedia.org/T398645) (owner: taavi) [17:36:02] (update) dcaro: Use logging multi-pod fix moved to toolforge-weld [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/179 (https://phabricator.wikimedia.org/T398647) (owner: taavi) [17:36:04] (update) dcaro: Use logging multi-pod fix moved to 
toolforge-weld [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/179 (https://phabricator.wikimedia.org/T398647) (owner: 10taavi) [17:38:51] (03update) 10dcaro: cloudinfra: Cleanup Puppetserver security group [repos/cloud/cloud-vps/tofu-infra] - 10https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/253 (owner: 10taavi) [17:43:34] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [17:43:50] (03CR) 10David Caro: [C:03+1] "LGTM" [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1172350 (owner: 10FNegri) [18:36:34] (03update) 10vriaa: Draft: Basic banner implementation [toolforge-repos/centralnotice-banner-editor] - 10https://gitlab.wikimedia.org/toolforge-repos/centralnotice-banner-editor/-/merge_requests/1 [18:43:34] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [18:48:34] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [18:56:58] (03merge) 10raymond-ndibe: [cicd] replace bullseye with bookworm [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/116 (https://phabricator.wikimedia.org/T400616) [19:03:34] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [19:08:35] FIRING: DiskSpace: Disk space cloudbackup1002-dev:9100:/srv/cinder-backups 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1002-dev - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [19:08:56] FIRING: SystemdUnitDown: The service unit backup_cinder_volumes.service is in failed status on host cloudbackup1002-dev. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1002-dev - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [19:09:09] (03update) 10vriaa: Draft: Basic banner implementation [toolforge-repos/centralnotice-banner-editor] - 10https://gitlab.wikimedia.org/toolforge-repos/centralnotice-banner-editor/-/merge_requests/1 [19:11:35] (03CR) 10Krinkle: [C:03+2] "This fails locally for me, but passes in CI." 
[labs/tools/wikiinfo] - 10https://gerrit.wikimedia.org/r/1173389 (owner: 10Jforrester) [19:16:48] (03PS1) 10Krinkle: build: Commit missing .phan/config.php file [labs/tools/wikiinfo] - 10https://gerrit.wikimedia.org/r/1173454 [19:16:57] (03CR) 10Krinkle: [C:03+2] "Fixed https://gerrit.wikimedia.org/r/c/labs/tools/wikiinfo/+/1173454." [labs/tools/wikiinfo] - 10https://gerrit.wikimedia.org/r/1173389 (owner: 10Jforrester) [19:17:13] (03CR) 10Krinkle: [C:03+2] build: Commit missing .phan/config.php file [labs/tools/wikiinfo] - 10https://gerrit.wikimedia.org/r/1173454 (owner: 10Krinkle) [19:17:54] (03Merged) 10jenkins-bot: build: Commit missing .phan/config.php file [labs/tools/wikiinfo] - 10https://gerrit.wikimedia.org/r/1173454 (owner: 10Krinkle) [19:18:14] PROBLEM - Disk space on cloudbackup1002-dev is CRITICAL: DISK CRITICAL - free space: /srv/cinder-backups 0MiB (0% inode=96%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cloudbackup1002-dev&var-datasource=eqiad+prometheus/ops [19:21:30] (03PS3) 10Krinkle: build: Install MediaWiki codesniffer and make pass [labs/tools/wikiinfo] - 10https://gerrit.wikimedia.org/r/1173390 (owner: 10Jforrester) [19:25:08] (03open) 10raymond-ndibe: d/changelog: bump to 16.1.16 [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/117 (https://phabricator.wikimedia.org/T400616) [19:25:52] (03open) 10raymond-ndibe: [cicd] replace bullseye with bookworm [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/113 (https://phabricator.wikimedia.org/T400616) [19:31:12] (03CR) 10Jforrester: "Oh, blah, yes. Thank you!" 
[labs/tools/wikiinfo] - 10https://gerrit.wikimedia.org/r/1173454 (owner: 10Krinkle) [19:31:32] (03CR) 10Jforrester: build: Commit missing .phan/config.php file (031 comment) [labs/tools/wikiinfo] - 10https://gerrit.wikimedia.org/r/1173454 (owner: 10Krinkle) [19:33:34] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [19:37:25] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli [19:43:34] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [19:43:47] !log raymond-ndibe@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-cli [19:44:04] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli [19:44:20] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-cli [19:49:33] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli [19:49:49] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-cli [19:56:27] !log raymond-ndibe@cloudcumin1001 tools 
START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli [19:56:41] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-cli [20:03:44] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli [20:09:56] !log raymond-ndibe@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-cli [20:10:04] (03update) 10raymond-ndibe: [cicd] replace bullseye with bookworm [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/113 (https://phabricator.wikimedia.org/T400616) [20:13:10] (03update) 10raymond-ndibe: [cicd] replace bullseye with bookworm [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/113 (https://phabricator.wikimedia.org/T400616) [20:13:34] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [20:15:50] (03update) 10raymond-ndibe: [cicd] replace bullseye with bookworm [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/113 (https://phabricator.wikimedia.org/T400616) [20:23:41] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli [20:23:56] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-cli [20:24:49] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook 
wmcs.toolforge.component.deploy for component jobs-cli [20:24:50] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-cli [20:25:02] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli [20:25:16] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-cli [20:28:26] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli [20:28:41] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-cli [20:34:48] FIRING: PuppetFailure: Puppet has failed on cloudbackup1002-dev:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:34:55] 06cloud-services-team: PuppetFailure Puppet has failed on cloudbackup1002-dev:9100 - https://phabricator.wikimedia.org/T400650 (10phaultfinder) 03NEW [20:43:11] (03open) 10raymond-ndibe: [cicd] replace bullseye with bookworm [repos/cloud/toolforge/components-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/52 (https://phabricator.wikimedia.org/T400616) [21:01:21] (03open) 10raymond-ndibe: [cicd] replace bullseye with bookworm [repos/cloud/toolforge/envvars-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-cli/-/merge_requests/85 (https://phabricator.wikimedia.org/T400616) [21:03:34] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [21:03:56] FIRING: SystemdUnitDown: The systemd unit backup_cinder_volumes.service on node cloudbackup1002-dev has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1002-dev - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [21:04:07] 06cloud-services-team: SystemdUnitDown The systemd unit backup_cinder_volumes.service on node cloudbackup1002-dev has been failing for more than two hours. - https://phabricator.wikimedia.org/T400655 (10phaultfinder) 03NEW [21:58:34] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [22:53:34] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [23:08:34] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [23:08:49] FIRING: DiskSpace: Disk space cloudbackup1002-dev:9100:/srv/cinder-backups 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1002-dev - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [23:18:34] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [23:28:34] FIRING: [5x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [23:43:34] FIRING: [5x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [23:48:34] FIRING: [5x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-32 has at least 12 procs in D state, and may be having NFS/IO issues - 
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess