[00:31:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[01:56:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[02:30:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-78 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[02:31:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[02:34:35] Tool-refill: 404 Not Found - https://phabricator.wikimedia.org/T404936 (GoingBatty) NEW
[02:38:01] Tool-refill: refill.toolforge.org is 404 Not Found - https://phabricator.wikimedia.org/T404936#11192351 (Bugreporter2)
[02:41:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity
[02:46:27] Tool-refill: refill.toolforge.org is 404 Not Found - https://phabricator.wikimedia.org/T404936#11192365 (Novem_Linguae) I'm able to reproduce. Smells like maybe the webserver went down. Let me see if I can SSH in and restart it.
[02:55:38] Tool-refill: refill.toolforge.org is 404 Not Found - https://phabricator.wikimedia.org/T404936#11192366 (Novem_Linguae) Open→Resolved a:Novem_Linguae It's working for me now after a restart. Marking as resolved. Thanks for reporting. {F66029456}
[03:04:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-14 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[03:34:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-14 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[03:49:03] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-14 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[03:49:18] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-14 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[03:59:03] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-14 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[04:48:02] (PS1) Harsh_Kushwaha07: Fix typo in success messages for sending labels and descriptions [labs/tools/weapon-of-mass-description] - https://gerrit.wikimedia.org/r/1189385 (https://phabricator.wikimedia.org/T201491)
[05:19:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-26 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[06:04:03] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-26 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[06:24:03] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-26 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[06:38:04] (merge) taavi: Retire login-buster address [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/80 (https://phabricator.wikimedia.org/T314665)
[06:38:37] (CR) Majavah: [C:+2] views: Don't crash when encountering a proxy using IPv6 backends [openstack/horizon/wmf-proxy-dashboard] - https://gerrit.wikimedia.org/r/1187393 (https://phabricator.wikimedia.org/T404302) (owner: Majavah)
[06:38:39] (CR) Majavah: [C:+2] views: Support constructing URLs with v6 addresses [openstack/horizon/wmf-proxy-dashboard] - https://gerrit.wikimedia.org/r/1187394 (https://phabricator.wikimedia.org/T404302) (owner: Majavah)
[06:39:14] (Merged) jenkins-bot: views: Don't crash when encountering a proxy using IPv6 backends [openstack/horizon/wmf-proxy-dashboard] - https://gerrit.wikimedia.org/r/1187393 (https://phabricator.wikimedia.org/T404302) (owner: Majavah)
[06:39:15] (Merged) jenkins-bot: views: Support constructing URLs with v6 addresses [openstack/horizon/wmf-proxy-dashboard] - https://gerrit.wikimedia.org/r/1187394 (https://phabricator.wikimedia.org/T404302) (owner: Majavah)
[06:47:24] cloud-services-team (FY2025/26-Q1), Toolforge (Toolforge iteration 24), Patch-For-Review: Toolforge: Replace all bastion with grid-less bookworm based bastion hosts - https://phabricator.wikimedia.org/T314665#11192625 (taavi) In progress→Resolved
[06:47:36] cloud-services-team (FY2025/26-Q1), Cloud-VPS (Debian Buster Deprecation), Toolforge (Toolforge iteration 24), Epic, Goal: [infra] Toolforge: migrate to Debian Bullseye or later - https://phabricator.wikimedia.org/T311897#11192628 (taavi) Open→Resolved
[06:48:07] cloud-services-team (FY2025/26-Q1), Toolforge (Toolforge iteration 24), Goal, Patch-For-Review: [infra] Decommission the Grid Engine infrastructure - https://phabricator.wikimedia.org/T314664#11192630 (taavi)
[06:50:06] cloud-services-team (FY2025/26-Q1), Toolforge (Toolforge iteration 24), Goal, Patch-For-Review: [infra] Decommission the Grid Engine infrastructure - https://phabricator.wikimedia.org/T314664#11192632 (taavi)
[06:54:28] FIRING: PuppetStaleCertificates: Found non-revoked Puppet certificates for 1 deleted instances on tools-puppetserver-01 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates
[07:04:28] RESOLVED: PuppetStaleCertificates: Found non-revoked Puppet certificates for 1 deleted instances on tools-puppetserver-01 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates
[07:13:49] (CR) Agamyasamuel: "recheck" [labs/tools/weapon-of-mass-description] - https://gerrit.wikimedia.org/r/1189385 (https://phabricator.wikimedia.org/T201491) (owner: Harsh_Kushwaha07)
[07:14:03] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-26 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[07:16:30] cloud-services-team, Cloud-VPS (Debian Bullseye Deprecation): Replace all codfw1dev Bullseye VMs - https://phabricator.wikimedia.org/T401810#11192675 (taavi) ` | 21b56c08-b463-4267-9bd2-44f17d4c9d22 | tools-codfw1dev-k8s-worker-2 | ACTIVE | VLAN/legacy=172.16.128.16...
[07:18:55] (CR) Stevemunene: Add a dummy Ceph user keys for the cephcsi plugin to use (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/1189133 (https://phabricator.wikimedia.org/T404576) (owner: Stevemunene)
[07:38:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[07:41:06] (PS2) Jean-Frédéric: Bump PHP dependencies in composer.json for PHP 8 compatibility [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1189203
[07:56:06] (CR) Jean-Frédéric: [C:+2] "Self +2ing, as consistent with the previously approved my @lokal.profil@gmail.com" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1189203 (owner: Jean-Frédéric)
[07:57:51] (Merged) jenkins-bot: Bump PHP dependencies in composer.json for PHP 8 compatibility [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1189203 (owner: Jean-Frédéric)
[07:58:53] (PS4) Jean-Frédéric: Switch to Python3.9 and Debian Bullesye as base image [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1189111
[08:08:43] (update) dcaro: Expand reuse_from validation [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/126 (https://phabricator.wikimedia.org/T403287) (owner: damian)
[08:09:03] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-26 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[08:09:49] (update) dcaro: Add `source` support to ToolConfig [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/127 (https://phabricator.wikimedia.org/T402764) (owner: damian)
[08:10:05] (update) dcaro: _resolve_ref - use GitPython [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/128 (owner: damian)
[08:11:57] (approved) dcaro: package: upgrade dependencies [repos/cloud/toolforge/ingress-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-admission/-/merge_requests/25
[08:12:00] (merge) dcaro: package: upgrade dependencies [repos/cloud/toolforge/ingress-admission] - https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-admission/-/merge_requests/25
[08:16:07] (open) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: ingress-admission: bump to 0.0.65-20250918081214-08a03872 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/973 (https://phabricator.wikimedia.org/T362869)
[08:34:03] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-26 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[08:34:53] !log filippo@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-38, tools-k8s-worker-nfs-26, tools-k8s-worker-nfs-3
[08:39:03] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-26 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[08:43:06] (PS1) Filippo Giunchedi: README.md: update install instructions [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1189438
[08:45:04] (CR) David Caro: [C:+1] "LGTM, we might want to move eventually to something else like poetry or such" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1189438 (owner: Filippo Giunchedi)
[08:45:54] (CR) David Caro: [C:+1] "I think Francesco was playing with uv, might be another option too if that went well, in any case, future things to keep in mind :)" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1189438 (owner: Filippo Giunchedi)
[08:47:00] (PS2) Filippo Giunchedi: README.md: update install instructions [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1189438
[08:47:09] (CR) Filippo Giunchedi: [C:+2] README.md: update install instructions (1 comment) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1189438 (owner: Filippo Giunchedi)
[08:47:24] (CR) Filippo Giunchedi: [V:+2 C:+2] README.md: update install instructions [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1189438 (owner: Filippo Giunchedi)
[08:50:30] (update) dcaro: Ensure reuse_from components are re-run [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/125 (https://phabricator.wikimedia.org/T403285) (owner: damian)
[08:52:43] !log filippo@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-38, tools-k8s-worker-nfs-26, tools-k8s-worker-nfs-3
[08:54:03] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-26 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[09:07:36] Toolforge (Toolforge iteration 24): [prometheus,infra] 2025-09-10 tools-prometheus-9 down - https://phabricator.wikimedia.org/T404199#11192882 (dcaro) Do you know of a way to find out what caused the memory spike? I suspect that it was not able to write in the query log the query that made it explode :/
[09:09:03] RESOLVED: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-26 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProce
[09:33:59] !log dcaro@acme tools START - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers no stuck workers found
[09:34:01] !log dcaro@acme tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot_stuck_workers (exit_code=0) no stuck workers found
[09:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[09:36:17] !log dcaro@acme tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-55
[09:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
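The ToolforgeKubernetesWorkerTooManyDProcesses alerts above fire when a worker has at least 12 processes in state D (uninterruptible sleep, typically blocked on I/O), which per the alert text may indicate NFS/IO issues. The sketch below shows how such a count could be taken by hand on a single Linux host by reading `/proc/<pid>/stat`. It is only an illustration under that assumption. It is not how the Prometheus alert is actually computed. The function names are hypothetical, and only the threshold of 12 comes from the alert text.

```python
import os

# Threshold taken from the alert text ("at least 12 procs in D state").
D_STATE_THRESHOLD = 12


def parse_state(stat_line: str) -> str:
    """Return the one-letter process state from a /proc/<pid>/stat line.

    Field 2 (the command name) is wrapped in parentheses and may itself
    contain spaces or ')', so split on the LAST ')' and read the field
    that follows it.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def count_d_state(stat_lines) -> int:
    """Count processes in uninterruptible sleep (state 'D')."""
    return sum(1 for line in stat_lines if parse_state(line) == "D")


def read_local_stat_lines() -> list:
    """Read /proc/<pid>/stat for every process (not thread) on this host."""
    lines = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(f"/proc/{entry}/stat", errors="replace") as fh:
                lines.append(fh.read())
        except OSError:
            # The process exited between listdir() and open(); skip it.
            continue
    return lines


if __name__ == "__main__" and os.path.isdir("/proc"):
    n = count_d_state(read_local_stat_lines())
    status = "would fire" if n >= D_STATE_THRESHOLD else "ok"
    print(f"{n} processes in D state ({status})")
```

Splitting on the last `)` matters because a process can name itself almost anything, and a naive whitespace split of the stat line would then read the wrong field.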
[09:42:40] !log dcaro@acme tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-55
[09:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:05:15] Toolforge (Toolforge iteration 24), Patch-For-Review: [prometheus,infra] 2025-09-10 tools-prometheus-9 down - https://phabricator.wikimedia.org/T404199#11193011 (dcaro) Added a memory limit as @fgiunchedi suggested to palliate the crashes.
[10:05:34] Toolforge (Toolforge iteration 24), Patch-For-Review: [prometheus,infra] 2025-09-10 tools-prometheus-9 down - https://phabricator.wikimedia.org/T404199#11193013 (dcaro) p:Triage→Medium
[10:07:51] Toolforge (Toolforge iteration 24): [tools,infra,k8s] scale up the cluster, specifically CPU - https://phabricator.wikimedia.org/T404726#11193034 (dcaro) I added a couple new graphs to the toolforge global overview dashboard: https://grafana-rw.wmcloud.org/d/8GiwHDL4k/infra-kubernetes-cluster-overview {F660...
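The ToolforgeKubernetesCapacity alerts earlier in this log ("in risk of running out of cpu") and task T404726 both concern how much of the cluster's CPU is already spoken for. The sketch below is a minimal illustration that assumes capacity is measured as summed pod CPU requests divided by allocatable node CPU. The real alert rule is not shown in this log. `HYPOTHETICAL_THRESHOLD` and all function names are placeholders. The only established fact used is Kubernetes' CPU quantity notation, where `250m` means 250 millicores.

```python
from decimal import Decimal

# HYPOTHETICAL threshold: the actual ToolforgeKubernetesCapacity rule is not in this log.
HYPOTHETICAL_THRESHOLD = Decimal("0.9")


def parse_cpu(quantity: str) -> Decimal:
    """Convert a Kubernetes CPU quantity to cores.

    "250m" is 250 millicores (0.25 cores); plain values such as "2"
    or "1.5" are already expressed in cores.
    """
    if quantity.endswith("m"):
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)


def cpu_request_ratio(nodes) -> Decimal:
    """Return cluster-wide requested CPU divided by allocatable CPU.

    `nodes` is an iterable of (allocatable_cpu, [cpu_request_per_pod]).
    """
    allocatable = Decimal(0)
    requested = Decimal(0)
    for node_allocatable, pod_requests in nodes:
        allocatable += parse_cpu(node_allocatable)
        requested += sum((parse_cpu(r) for r in pod_requests), Decimal(0))
    if allocatable == 0:
        raise ValueError("cluster reports no allocatable CPU")
    return requested / allocatable


def at_risk(ratio: Decimal, threshold: Decimal = HYPOTHETICAL_THRESHOLD) -> bool:
    """True when requests reach the (hypothetical) capacity threshold."""
    return ratio >= threshold


if __name__ == "__main__":
    # Illustrative numbers only, not real Toolforge data.
    cluster = [("4", ["1", "500m"]), ("4", ["2", "1500m"])]
    ratio = cpu_request_ratio(cluster)
    print(ratio, at_risk(ratio))
```

`Decimal` is used so that millicore arithmetic such as `500m + 1500m` stays exact rather than accumulating float rounding error.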
[10:11:19] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component ingress-admission
[10:12:28] FIRING: InstanceDown: Project tools instance tools-prometheus-9 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:20:04] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component ingress-admission
[10:49:55] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[10:50:28] FIRING: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[10:56:08] PAWS, OpenRefine, Wikidata: OpenRefine on PAWS, cannot login to Wikidata - https://phabricator.wikimedia.org/T401092#11193123 (DaxServer) This is due to an enforcement of sending a valid User-Agent header, which has since been resolved with the latest version: 3.9.5
[10:57:48] PAWS, Commons, OpenRefine: New upstream release for Wikimedia Commons Extension for OpenRefine - https://phabricator.wikimedia.org/T403780#11193149 (DaxServer)
[10:58:18] PAWS, Commons, OpenRefine: New upstream release for Wikimedia Commons Extension for OpenRefine - https://phabricator.wikimedia.org/T403780#11193150 (DaxServer) p:Triage→Unbreak!
[10:58:40] PAWS, OpenRefine, Wikidata: OpenRefine on PAWS, cannot login to Wikidata - https://phabricator.wikimedia.org/T401092#11193151 (DaxServer) p:Triage→Unbreak!
[11:00:40] PAWS, OpenRefine, Wikidata: OpenRefine on PAWS, cannot login to Wikidata - https://phabricator.wikimedia.org/T401092#11193156 (DaxServer)
[11:00:42] PAWS, Commons, OpenRefine: New upstream release for Wikimedia Commons Extension for OpenRefine - https://phabricator.wikimedia.org/T403780#11193157 (DaxServer)
[11:04:56] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown
[11:05:28] RESOLVED: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
[11:07:51] PAWS, OpenRefine: New upstream release for OpenRefine - https://phabricator.wikimedia.org/T388928#11193167 (DaxServer) p:Triage→Unbreak!
[11:08:13] PAWS, OpenRefine: New upstream release for OpenRefine - https://phabricator.wikimedia.org/T388928#11193171 (DaxServer)
[11:08:20] PAWS, OpenRefine, Wikidata: OpenRefine on PAWS, cannot login to Wikidata - https://phabricator.wikimedia.org/T401092#11193174 (DaxServer) →Duplicate dup:T388928
[11:08:36] PAWS, Commons, OpenRefine: New upstream release for Wikimedia Commons Extension for OpenRefine - https://phabricator.wikimedia.org/T403780#11193176 (DaxServer)
[11:08:37] PAWS, OpenRefine, Wikidata: OpenRefine on PAWS, cannot login to Wikidata - https://phabricator.wikimedia.org/T401092#11193175 (DaxServer)
[11:08:59] PAWS, OpenRefine: New upstream release for OpenRefine - https://phabricator.wikimedia.org/T388928#11193177 (DaxServer)
[11:09:00] PAWS, Commons, OpenRefine: New upstream release for Wikimedia Commons Extension for OpenRefine - https://phabricator.wikimedia.org/T403780#11193178 (DaxServer)
[11:27:26] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component ingress-admission
[11:27:44] (approved) dcaro: Ensure reuse_from components are re-run [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/125 (https://phabricator.wikimedia.org/T403285) (owner: damian)
[11:27:48] (merge) dcaro: Ensure reuse_from components are re-run [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/125 (https://phabricator.wikimedia.org/T403285) (owner: damian)
[11:29:01] !log dcaro@acme tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[11:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:29:12] !log dcaro@acme tools END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[11:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:29:41] !log dcaro@acme tools START - Cookbook wmcs.vps.instance.force_reboot vm tools-prometheus-9 (cluster eqiad1, project tools)
[11:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:29:45] !log dcaro@acme tools END (PASS) - Cookbook wmcs.vps.instance.force_reboot (exit_code=0) vm tools-prometheus-9 (cluster eqiad1, project tools)
[11:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:30:41] (open) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: components-api: bump to 0.0.158-20250918112802-f7efa728 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/974 (https://phabricator.wikimedia.org/T403285)
[11:32:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-9 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:35:54] !log dcaro@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component ingress-admission
[11:37:05] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component ingress-admission
[11:38:56] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[11:45:54] !log dcaro@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component ingress-admission
[11:47:09] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component ingress-admission
[11:56:28] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component ingress-admission
[12:31:29] (update) l10n-bot: Localisation updates from https://translatewiki.net. [toolforge-repos/wd-image-positions] - https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/44
[12:59:49] (update) don-vip: Draft: DVIDS: incremental update service [toolforge-repos/spacemedia] - https://gitlab.wikimedia.org/toolforge-repos/spacemedia/-/merge_requests/4
[13:07:29] Toolforge (Toolforge iteration 24): [prometheus,infra] 2025-09-10 tools-prometheus-9 down - https://phabricator.wikimedia.org/T404199#11193540 (fgiunchedi) >>! In T404199#11192882, @dcaro wrote: > Do you know of a way to find out what caused the memory spike? I suspect that it was not able to write in the qu...
[13:34:33] Toolforge (Toolforge iteration 24): [tools,infra,k8s] scale up the cluster, specifically CPU - https://phabricator.wikimedia.org/T404726#11193640 (dcaro) p:Triage→High
[13:35:10] (approved) dcaro: ingress-admission: bump to 0.0.65-20250918081214-08a03872 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/973 (https://phabricator.wikimedia.org/T362869) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[13:35:14] (merge) dcaro: ingress-admission: bump to 0.0.65-20250918081214-08a03872 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/973 (https://phabricator.wikimedia.org/T362869) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[13:35:20] (update) dcaro: components-api: bump to 0.0.158-20250918112802-f7efa728 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/974 (https://phabricator.wikimedia.org/T403285) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[13:36:45] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component components-api
[13:41:25] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-api
[13:42:00] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component components-api
[13:42:28] (update) don-vip: Draft: DVIDS: incremental update service [toolforge-repos/spacemedia] - https://gitlab.wikimedia.org/toolforge-repos/spacemedia/-/merge_requests/4
[13:44:00] cloud-services-team (FY2025/26-Q1), Toolforge (Toolforge iteration 24), Patch-For-Review: [k8s,infra] Upgrade Toolforge to Uwubernetes (1.30) - https://phabricator.wikimedia.org/T362869#11193687 (dcaro)
[13:44:20] (update) don-vip: DVIDS: incremental update service [toolforge-repos/spacemedia] - https://gitlab.wikimedia.org/toolforge-repos/spacemedia/-/merge_requests/4
[13:45:07] cloud-services-team (FY2025/26-Q1), Toolforge (Toolforge iteration 24), Patch-For-Review: [k8s,infra] Upgrade Toolforge to Uwubernetes (1.30) - https://phabricator.wikimedia.org/T362869#11193697 (dcaro) In progress→Resolved
[13:46:53] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-api
[13:51:55] VPS-project-Wikistats: Add mswikiquote to wikistats - https://phabricator.wikimedia.org/T404705#11193719 (Dzahn) a:Dzahn
[13:56:56] VPS-project-Wikistats: Add mswikiquote to wikistats - https://phabricator.wikimedia.org/T404705#11193728 (Dzahn) ` MariaDB [wikistats]> insert into wikiquotes (prefix, lang, loclang, method) select prefix,lang,loclang,method from wikipedias where prefix="ms"; root@wikistats-bookworm:/home/dzahn# /usr/bi...
[13:57:13] VPS-project-Wikistats: Add mswikiquote to wikistats - https://phabricator.wikimedia.org/T404705#11193730 (Dzahn) Open→Resolved
[14:02:32] Toolforge (Toolforge iteration 24): [tools,infra,k8s] scale up the cluster, specifically CPU - https://phabricator.wikimedia.org/T404726#11193748 (dcaro) Open→In progress
[14:04:06] Toolforge (Toolforge iteration 24): [jobs-api] handle non-passed arguments and defaults consistently - https://phabricator.wikimedia.org/T402569#11193751 (dcaro) In progress→Resolved
[14:05:15] (approved) dcaro: components-api: bump to 0.0.158-20250918112802-f7efa728 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/974 (https://phabricator.wikimedia.org/T403285) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[14:05:19] (merge) dcaro: components-api: bump to 0.0.158-20250918112802-f7efa728 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/974 (https://phabricator.wikimedia.org/T403285) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[14:05:39] (update) dcaro: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136) (owner: raymond-ndibe)
[14:08:16] cloud-services-team, Toolforge (Toolforge iteration 24), Patch-For-Review: [components-api] reuse_from components are not explicitly re-created in jobs-api - https://phabricator.wikimedia.org/T403285#11193759 (dcaro) a:DamianZaremba
[14:08:27] cloud-services-team, Toolforge (Toolforge iteration 24), Patch-For-Review: [components-api] reuse_from components are not explicitly re-created in jobs-api - https://phabricator.wikimedia.org/T403285#11193762 (dcaro) Open→Resolved
[14:08:49] cloud-services-team, Toolforge (Toolforge iteration 24), Patch-For-Review: [components-api] reuse_from components are not explicitly re-created in jobs-api - https://phabricator.wikimedia.org/T403285#11193767 (dcaro) Up and running in prod :)
[14:11:40] cloud-services-team, Toolforge (Toolforge iteration 24), Patch-For-Review: [components-api] Validate `SourceReference` components point to `SourceBuild` components - https://phabricator.wikimedia.org/T403287#11193775 (dcaro) a:DamianZaremba
[14:11:47] cloud-services-team, Toolforge (Toolforge iteration 24), Patch-For-Review: [components-api] Validate `SourceReference` components point to `SourceBuild` components - https://phabricator.wikimedia.org/T403287#11193778 (dcaro) Open→In progress
[14:11:52] (update) dcaro: Expand reuse_from validation [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/126 (https://phabricator.wikimedia.org/T403287) (owner: damian)
[14:19:57] Cloud Services Proposals, cloud-services-team, Toolforge: DRAFT Decision request - Focus for improving lima-kilo developer experience - https://phabricator.wikimedia.org/T403051#11193852 (dcaro) >>! In T403051#11173083, @fnegri wrote: > I like option 4, I think it has an additional Pro: > * if we set...
[14:26:05] cloud-services-team, Toolforge: [infra,k8s] Move to kubernetes VAPs and drop kyverno - https://phabricator.wikimedia.org/T364293#11193897 (dcaro) For mutating policies we might have to wait a few versions (in 1.34 is in beta): https://kubernetes.io/docs/reference/access-authn-authz/mutating-admission-pol...
[14:27:57] (merge) don-vip: DVIDS: incremental update service [toolforge-repos/spacemedia] - https://gitlab.wikimedia.org/toolforge-repos/spacemedia/-/merge_requests/4
[14:30:54] cloud-services-team, Toolforge: [infra,k8s] Move to kubernetes VAPs and drop kyverno - https://phabricator.wikimedia.org/T364293#11193935 (dcaro) This is useful to test the expressions
[14:31:37] cloud-services-team, Toolforge: [infra,k8s] Move to kubernetes VAPs and drop kyverno - https://phabricator.wikimedia.org/T364293#11193941 (dcaro) I'll start adding it to jobs-api, as they don't need to be created on each namespace, instead they are cluster-wide and matching namespaces by labels.
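The T404705 comment at 13:56:56 adds mswikiquote by copying the existing `ms` row from `wikipedias` into `wikiquotes` with a single `INSERT ... SELECT`. The sketch below reproduces that pattern. It swaps in Python's built-in `sqlite3` for MariaDB so the example is self-contained. The four column names come from the logged statement. The table types, the helper names and the sample row values are hypothetical placeholders, not data from the real wikistats database.

```python
import sqlite3


def copy_wiki_row(conn: sqlite3.Connection, prefix: str) -> int:
    """Copy one wiki's row from `wikipedias` into `wikiquotes`.

    The INSERT column list must line up, in order, with the SELECT list.
    The prefix is bound as a parameter: MariaDB accepts "ms" as a string
    literal by default, but SQLite treats double quotes as identifiers.
    Returns the number of rows inserted.
    """
    cur = conn.execute(
        "INSERT INTO wikiquotes (prefix, lang, loclang, method) "
        "SELECT prefix, lang, loclang, method FROM wikipedias WHERE prefix = ?",
        (prefix,),
    )
    conn.commit()
    return cur.rowcount


def demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with a HYPOTHETICAL schema."""
    conn = sqlite3.connect(":memory:")
    for table in ("wikipedias", "wikiquotes"):
        conn.execute(
            f"CREATE TABLE {table} "
            "(prefix TEXT, lang TEXT, loclang TEXT, method INTEGER)"
        )
    # Placeholder values, not taken from the real wikistats database.
    conn.execute(
        "INSERT INTO wikipedias VALUES (?, ?, ?, ?)",
        ("ms", "Malay", "Bahasa Melayu", 8),
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = demo_db()
    print(copy_wiki_row(conn, "ms"))
    print(conn.execute("SELECT * FROM wikiquotes").fetchall())
```

If no source row matches the prefix, the `INSERT ... SELECT` simply inserts nothing. A returned row count of 0 is therefore a cheap check that the prefix was typed correctly.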
[14:33:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[14:42:34] (CR) Brouberol: [C:+1] Add a dummy Ceph user keys for the cephcsi plugin to use [labs/private] - https://gerrit.wikimedia.org/r/1189133 (https://phabricator.wikimedia.org/T404576) (owner: Stevemunene)
[14:47:57] (approved) dcaro: Expand reuse_from validation [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/126 (https://phabricator.wikimedia.org/T403287) (owner: damian)
[14:54:53] (update) dcaro: Add validated type for git urls [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/121 (owner: damian)
[15:01:06] (update) don-vip: Update to OpenJDK 25 [toolforge-repos/spacemedia] - https://gitlab.wikimedia.org/toolforge-repos/spacemedia/-/merge_requests/5
[15:05:32] cloud-services-team, Toolforge (Toolforge iteration 24): [components-api] reuse_from components are not explicitly re-created in jobs-api - https://phabricator.wikimedia.org/T403285#11194152 (DamianZaremba) Thanks!
[15:07:51] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[15:11:49] PROBLEM - Host cloudcephosd1022 is DOWN: PING CRITICAL - Packet loss = 100%
[15:12:19] RECOVERY - Host cloudcephosd1022 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[15:14:21] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[15:15:13] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[15:15:17] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[15:47:11] (update) damian: Add validated type for git urls [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/121
[15:51:10] Cloud-VPS (Quota-requests): Increase gitlab-runners-staging volumes to 12 - https://phabricator.wikimedia.org/T404668#11194344 (fgiunchedi) LGTM
[15:57:12] Cloud Services Proposals, cloud-services-team, Cloud-VPS: Decision Request - How openstack projects relate to tofu-infra - https://phabricator.wikimedia.org/T385604#11194358 (dcaro) Open→Stalled a:dcaro→None Leaving it open and stalled, until we have capacity to act on whatever is de...
[16:02:35] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy
[16:02:51] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99)
[16:08:12] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.depool_and_destroy
[16:11:58] cloud-services-team, Cloud-VPS: Allow novaobserver to read Octavia data - https://phabricator.wikimedia.org/T404862#11194468 (taavi) a:Andrew
[16:23:11] RESOLVED: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:24:39] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[16:51:24] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99)
[16:52:49] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[16:52:53] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99)
[16:52:56] (open) dcaro: resources: reduce the default cpu k8s request [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/215
[16:53:14] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[16:53:56] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=97)
[16:54:54] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.bootstrap_and_add
[16:55:17] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99)
[17:15:35] (CR) Lokal Profil: [C:+2] Switch to Python3.9 and Debian Bullesye as base image [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1189111 (owner: Jean-Frédéric)
[17:15:38] Toolforge (Toolforge iteration 24): [tools,infra,k8s] scale up the cluster, specifically CPU - https://phabricator.wikimedia.org/T404726#11194764 (dcaro) In the team meeting from today we decided that we should first reduce the default cpu request according to the mean cpu usage per pod in the cluster (patch...
[17:18:06] (Merged) jenkins-bot: Switch to Python3.9 and Debian Bullesye as base image [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1189111 (owner: Jean-Frédéric)
[17:18:18] (update) dcaro: resources: reduce the default cpu k8s request [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/215
[17:37:07] cloud-services-team, Cloud-VPS, Toolforge: Enable SSL in Trove MariaDB - Trixie MariaDB client requires SSL but SSL is not enabled in the Trove server - https://phabricator.wikimedia.org/T401861#11194896 (DamianZaremba) Just for the record, I also hit this when doing some ad-hoc queries as the user d...
[17:52:31] cloud-services-team, Cloud-VPS (Debian Buster Deprecation): Buster VMs in cloud-vps PKI project - https://phabricator.wikimedia.org/T405017 (Andrew) NEW
[17:52:49] cloud-services-team, Toolforge: [envvars] only mask secrets - https://phabricator.wikimedia.org/T405018 (DamianZaremba) NEW
[18:07:55] cloud-services-team, Toolforge: [envvars] scope to jobs/components - https://phabricator.wikimedia.org/T405022 (DamianZaremba) NEW
[18:18:33] cloud-services-team, PAWS: Support ssh to PAWS k8s workers - https://phabricator.wikimedia.org/T405023 (Andrew) NEW
[18:19:30] cloud-services-team, Toolforge: [envvars] ease revealing a secret - https://phabricator.wikimedia.org/T405024#11195233 (DamianZaremba)
[18:25:47] (PS1) Andrew Bogott: Add dummy certs for paws VM access [labs/private] - https://gerrit.wikimedia.org/r/1189547
[18:48:22] cloud-services-team, Cloud-VPS (Quota-requests): Increase cinder quota for PAWS project - https://phabricator.wikimedia.org/T405028 (Andrew) NEW
[18:48:37] !log andrew@cloudcumin1001 paws START - Cookbook wmcs.openstack.quota_increase
[18:48:44] !log andrew@cloudcumin1001 paws END (PASS) - Cookbook wmcs.openstack.quota_increase (exit_code=0)
[18:49:05] !log andrew@cloudcumin1001 paws START - Cookbook wmcs.openstack.quota_increase
[18:49:11] !log andrew@cloudcumin1001 paws END (PASS) - Cookbook wmcs.openstack.quota_increase (exit_code=0)
[19:08:51] (CR) Andrew Bogott: [V:+2 C:+2] Add dummy certs for paws VM access [labs/private] - https://gerrit.wikimedia.org/r/1189547 (owner: Andrew Bogott)
[19:32:17] cloud-services-team, PAWS: Support ssh to PAWS k8s workers - https://phabricator.wikimedia.org/T405023#11195484 (Andrew) Open→Resolved Done, documentation at https://wikitech.wikimedia.org/wiki/PAWS/Admin#K8s_node_access
[19:37:41] cloud-services-team, Cloud-VPS (Quota-requests): Increase cinder quota for PAWS project - https://phabricator.wikimedia.org/T405028#11195503 (Andrew) Open→Resolved
[19:39:31] cloud-services-team, PAWS: Support ssh to PAWS k8s workers - https://phabricator.wikimedia.org/T405023#11195510 (github-toolforge-bot) andrewbogott opened https://github.com/toolforge/paws/pull/498
[19:41:23] andrewbogott opened https://github.com/toolforge/paws/pull/498
[19:44:26] cloud-services-team, PAWS: Support ssh to PAWS k8s workers - https://phabricator.wikimedia.org/T405023#11195536 (github-toolforge-bot) andrewbogott closed https://github.com/toolforge/paws/pull/498
[19:45:36] andrewbogott closed https://github.com/toolforge/paws/pull/498
[20:05:35] (update) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/wd-image-positions] - https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/44 (owner: l10n-bot)
[20:05:38] (approved) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/wd-image-positions] - https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/44 (owner: l10n-bot)
[20:05:42] (merge) lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/wd-image-positions] - https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/44 (owner: l10n-bot)
[20:14:27] cloud-services-team, Cloud-VPS, Patch-For-Review: Allow novaobserver to read Octavia data - https://phabricator.wikimedia.org/T404862#11195581 (Andrew) ` root@cloudcontrol1011:~# openstack loadbalancer list --project testlabs --os-cloud novaobserver +---------------+---------------+------------+-...
[20:14:35] cloud-services-team, Cloud-VPS, Patch-For-Review: Allow novaobserver to read Octavia data - https://phabricator.wikimedia.org/T404862#11195582 (Andrew) Open→Resolved
[20:15:56] cloud-services-team, Toolforge: [jobs-api] - https://phabricator.wikimedia.org/T405036 (DamianZaremba) NEW
[20:20:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-11 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[20:25:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[20:25:12] cloud-services-team, Toolforge: [jobs-api] allow port configuration on non-continuous jobs - https://phabricator.wikimedia.org/T405036#11195648 (DamianZaremba)
[20:40:38] (Abandoned) Andrew Bogott: Review access change [openstack/horizon/horizon] (refs/meta/config) - https://gerrit.wikimedia.org/r/1189297 (owner: Andrew Bogott)
[21:16:56] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[21:17:22] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[21:20:18] (PS1) Lokal Profil: handle missing wd_item values gracefully [labs/tools/heritage] - https://gerrit.wikimedia.org/r/1189582 (https://phabricator.wikimedia.org/T346681)
[22:12:00] (update) don-vip: Update to OpenJDK 25 [toolforge-repos/spacemedia] - https://gitlab.wikimedia.org/toolforge-repos/spacemedia/-/merge_requests/5
[22:28:13] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[22:32:35] PROBLEM - Host cloudcephosd1042 is DOWN: PING CRITICAL - Packet loss = 100%
[22:34:03] RECOVERY - Host cloudcephosd1042 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[22:34:53] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[23:18:08] (update) don-vip: Update to OpenJDK 25 [toolforge-repos/spacemedia] - https://gitlab.wikimedia.org/toolforge-repos/spacemedia/-/merge_requests/5