[00:21:36] (03update) 10samwilson: Use HTTP client object from API, with User-Agent set [toolforge-repos/wsexport] - 10https://gitlab.wikimedia.org/toolforge-repos/wsexport/-/merge_requests/4 (https://phabricator.wikimedia.org/T403435) [00:31:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity [00:48:16] (03update) 10samwilson: Use HTTP client object from API, with User-Agent set [toolforge-repos/wsexport] - 10https://gitlab.wikimedia.org/toolforge-repos/wsexport/-/merge_requests/4 (https://phabricator.wikimedia.org/T403435) [01:16:55] FIRING: [2x] ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 close to running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity [01:21:55] FIRING: [2x] ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 close to running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity [03:08:55] (03PS1) 10Andrew Bogott: Remove some imports that were removed upstream. [openstack/horizon/trove-dashboard] - 10https://gerrit.wikimedia.org/r/1188495 [03:09:14] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] Remove some imports that were removed upstream. 
[openstack/horizon/trove-dashboard] - 10https://gerrit.wikimedia.org/r/1188495 (owner: 10Andrew Bogott) [04:01:41] (03PS1) 10Andrew Bogott: Further attempt to get that merge conflict resolved properly [openstack/horizon/trove-dashboard] - 10https://gerrit.wikimedia.org/r/1188500 [04:02:03] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] Further attempt to get that merge conflict resolved properly [openstack/horizon/trove-dashboard] - 10https://gerrit.wikimedia.org/r/1188500 (owner: 10Andrew Bogott) [05:01:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity [05:31:55] FIRING: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity [06:01:55] RESOLVED: ToolforgeKubernetesCapacity: Kubernetes cluster k8s.tools.eqiad1.wikimedia.cloud:6443 in risk of running out of cpu - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesCapacity - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesCapacity [06:36:03] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-43 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - 
https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [06:51:15] !log filippo@cloudcumin1001 tools START - Cookbook wmcs.openstack.cloudvirt.vm_console [06:56:59] !log filippo@cloudcumin1001 tools END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0) [06:57:24] !log filippo@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-71, tools-k8s-worker-nfs-43, tools-k8s-worker-nfs-75 [07:11:44] !log filippo@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-71, tools-k8s-worker-nfs-43, tools-k8s-worker-nfs-75 [07:16:03] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-43 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [07:41:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-43 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [07:41:18] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-43 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [07:46:03] RESOLVED: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-43 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProce [07:51:33] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-43 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [07:56:18] RESOLVED: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-43 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProce [07:57:05] (03update) 10dcaro: loki.alloy: decrease frequency for fetching logs [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/962 [08:00:47] (03update) 10dcaro: logs: use logs-api for logs [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/121 [08:04:57] (03update) 10dcaro: logs: use logs-api for logs [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/121 [08:13:49] 
06cloud-services-team, 10Toolforge (Toolforge iteration 24): Address tools NFS getting stuck with processes in D state - https://phabricator.wikimedia.org/T404584#11183743 (10fgiunchedi) p:05Triage→03High [08:15:17] 06cloud-services-team, 10Toolforge (Toolforge iteration 24): Address tools NFS getting stuck with processes in D state - https://phabricator.wikimedia.org/T404584#11183744 (10fgiunchedi) re: nfs server update I'm reading https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Runbooks/Create_an_NFS_serv... [08:41:51] 06cloud-services-team, 10Toolforge (Toolforge iteration 24): Address tools NFS getting stuck with processes in D state - https://phabricator.wikimedia.org/T404584#11183806 (10fgiunchedi) Also as pointed out by @taavi we're looking at changing the VIP address, as opposed to VIP failover, because the new servers... [09:08:02] !log filippo@cloudcumin1001 testlabs START - Cookbook wmcs.openstack.cloudvirt.vm_console [09:14:54] !log filippo@cloudcumin1001 testlabs END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0) [09:15:00] !log filippo@cloudcumin1001 testlabs START - Cookbook wmcs.openstack.cloudvirt.vm_console [09:15:44] !log filippo@cloudcumin1001 testlabs END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0) [09:18:04] !log filippo@cloudcumin1001 testlabs START - Cookbook wmcs.openstack.cloudvirt.vm_console [09:30:55] !log filippo@cloudcumin1001 testlabs END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0) [09:35:29] !log filippo@cloudcumin1001 testlabs START - Cookbook wmcs.nfs.add_server [09:45:00] !log filippo@cloudcumin1001 testlabs END (PASS) - Cookbook wmcs.nfs.add_server (exit_code=0) [10:11:53] (03update) 10dcaro: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 
https://phabricator.wikimedia.org/T390136) (owner: 10raymond-ndibe) [10:34:46] 06cloud-services-team, 10Toolforge (Toolforge iteration 24): Address tools NFS getting stuck with processes in D state - https://phabricator.wikimedia.org/T404584#11184317 (10fgiunchedi) I did some tests in `testlabs` today: 1. Created a `nfs-client-2` instance with Trixie for client testing. Mounts are prese... [10:51:02] 10VPS-project-Codesearch, 10m3api: Index m3api repositories in Codesearch - https://phabricator.wikimedia.org/T404517#11184389 (10Ladsgroup) There is a `wmf_gitlab_group_projects` which takes a group and adds all of the projects using https://gitlab.wikimedia.org/groups/{group}/-/children.json (here https://gi... [11:27:37] 06cloud-services-team, 10Toolforge: [components-api] rebuilds un-changed images - https://phabricator.wikimedia.org/T403167#11184588 (10DamianZaremba) Hi @Raymond_Ndibe, Essentially what you describe is how you get into this state. I included it as an example along the lines of perhaps builds-api should be t... 
[11:37:18] (03update) 10dcaro: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136) (owner: 10raymond-ndibe) [11:45:07] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api [11:46:57] !log dcaro@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-api [11:59:37] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api [12:00:56] !log dcaro@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-api [12:15:25] (03update) 10dcaro: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136) (owner: 10raymond-ndibe) [12:24:02] 10Tool-global-search: Export as HTML table - https://phabricator.wikimedia.org/T404713 (10Reedy) 03NEW [12:24:40] 10Tool-global-search: Export as markdown table - https://phabricator.wikimedia.org/T404714 (10Reedy) 03NEW [12:24:56] 10Tool-global-search: Export as HTML table - https://phabricator.wikimedia.org/T404713#11184750 (10Reedy) [12:25:35] 10Tool-global-search: Export as markdown table - https://phabricator.wikimedia.org/T404714#11184753 (10Reedy) 05Open→03Invalid Apparently I can't spot the option (maybe because of the sorting order?)... 
[12:25:36] 10Tool-global-search: Export as HTML table - https://phabricator.wikimedia.org/T404713#11184755 (10Reedy) p:05Triage→03Low [12:34:21] (03update) 10dcaro: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136) (owner: 10raymond-ndibe) [12:44:54] (03update) 10dcaro: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136) (owner: 10raymond-ndibe) [12:50:48] 10Tool-archive-externa-links: Dashboard creation - https://phabricator.wikimedia.org/T399889#11184817 (10poro26) 05Open→03In progress [12:54:23] 10Tool-archive-externa-links: [Documentation] Production of a new video clip for installing the ArchiveExternaLinks user script - https://phabricator.wikimedia.org/T404193#11184849 (10poro26) 05Open→03Resolved Link to the finished video: https://w.wiki/FJK$ [12:56:03] (03update) 10dcaro: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136) (owner: 10raymond-ndibe) [12:57:06] 10Tool-archive-externa-links: Dashboard creation - https://phabricator.wikimedia.org/T399889#11184859 (10poro26) 05In progress→03Resolved [12:57:42] (03merge) 10dcaro: package: upgrade all deps [repos/cloud/toolforge/envvars-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/64 [12:58:09] (03merge) 10dcaro: pre-commit: add check for openapi spec version bump 
[repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/116 [13:02:00] (03open) 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: components-api: bump to 0.0.156-20250916125822-74722783 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/968 (https://phabricator.wikimedia.org/T401388) [13:03:55] (03update) 10dcaro: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136) (owner: 10raymond-ndibe) [13:04:23] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api [13:04:46] (03open) 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: envvars-api: bump to 0.0.75-20250916125754-a88de155 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/969 (https://phabricator.wikimedia.org/T362869) [13:06:22] !log dcaro@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-api [13:08:19] FIRING: TektonDown: Tekton is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonDown [13:08:38] FIRING: ProbeDown: Service toolsbeta-test-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [13:13:38] RESOLVED: ProbeDown: Service toolsbeta-test-k8s-haproxy-5:30000 has failed probes 
(http_this_tool_does_not_exist_beta_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [13:20:52] !log dcaro@acme toolsbeta START - Cookbook wmcs.toolforge.k8s.reboot for tools-test-k8s-worker-nfs-5 [13:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [13:21:06] !log dcaro@acme toolsbeta END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-test-k8s-worker-nfs-5 [13:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [13:21:29] !log dcaro@acme toolsbeta START - Cookbook wmcs.toolforge.k8s.reboot for toolsbeta-test-k8s-worker-nfs-5 [13:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [13:22:38] !log dcaro@acme toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for toolsbeta-test-k8s-worker-nfs-5 [13:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [13:23:19] RESOLVED: TektonDown: Tekton is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonDown [13:25:55] 06cloud-services-team, 10Cloud-VPS, 13Patch-For-Review, 07Security: Move cloud-wide root keys to the main puppet repo - https://phabricator.wikimedia.org/T317362#11184952 (10fgiunchedi) [13:26:22] 06cloud-services-team, 10Cloud-VPS, 13Patch-For-Review, 07Security: Move cloud-wide root keys to the main puppet repo - https://phabricator.wikimedia.org/T317362#11184959 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is done -- root-authorized-keys for cloud vps now lives in puppet.git [13:27:34] (03CR) 10Filippo Giunchedi: [C:03+1] inventory: Remove Bookworm based bastions [cloud/wmcs-cookbooks] 
- 10https://gerrit.wikimedia.org/r/1187762 (https://phabricator.wikimedia.org/T392510) (owner: 10Majavah) [13:27:53] (03CR) 10Majavah: [C:03+2] inventory: Remove Bookworm based bastions [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1187762 (https://phabricator.wikimedia.org/T392510) (owner: 10Majavah) [13:28:14] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.vps.remove_instance for instance tools-bastion-12 [13:29:03] (03update) 10dcaro: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136) (owner: 10raymond-ndibe) [13:29:09] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance tools-bastion-12 [13:31:43] (03Merged) 10jenkins-bot: inventory: Remove Bookworm based bastions [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1187762 (https://phabricator.wikimedia.org/T392510) (owner: 10Majavah) [13:31:52] !log dcaro@acme toolsbeta START - Cookbook wmcs.vps.instance.force_reboot vm toolsbeta-test-k8s-worker-nfs-5 (cluster eqiad1, project toolsbeta) [13:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [13:31:57] !log dcaro@acme toolsbeta END (PASS) - Cookbook wmcs.vps.instance.force_reboot (exit_code=0) vm toolsbeta-test-k8s-worker-nfs-5 (cluster eqiad1, project toolsbeta) [13:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [13:32:10] !log taavi@cloudcumin1001 bastion START - Cookbook wmcs.vps.remove_instance for instance bastion-eqiad1-03 [13:32:25] !log taavi@cloudcumin1001 bastion END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance bastion-eqiad1-03 [13:38:35] !log dcaro@acme toolsbeta START - Cookbook wmcs.toolforge.k8s.worker.depool_and_remove_node for host 
toolsbeta-test-k8s-worker-nfs-5 [13:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [13:40:15] !log dcaro@acme toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.worker.depool_and_remove_node (exit_code=0) for host toolsbeta-test-k8s-worker-nfs-5 [13:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [13:40:51] 10Toolforge (Toolforge iteration 24): [infra,k8s,toolsbeta] k8s worker node toolsbeta-test-k8s-worker-nfs-5 is failing to tail pods - https://phabricator.wikimedia.org/T404721 (10dcaro) 03NEW [13:40:55] 10Toolforge (Toolforge iteration 24): [infra,k8s,toolsbeta] k8s worker node toolsbeta-test-k8s-worker-nfs-5 is failing to tail pods - https://phabricator.wikimedia.org/T404721#11185071 (10dcaro) p:05Triage→03High [13:41:01] 10Toolforge (Toolforge iteration 24): [infra,k8s,toolsbeta] k8s worker node toolsbeta-test-k8s-worker-nfs-5 is failing to tail pods - https://phabricator.wikimedia.org/T404721#11185073 (10dcaro) 05Open→03In progress [13:41:42] !log dcaro@acme toolsbeta START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the toolsbeta cluster (T404721) [13:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [13:41:46] T404721: [infra,k8s,toolsbeta] k8s worker node toolsbeta-test-k8s-worker-nfs-5 is failing to tail pods - https://phabricator.wikimedia.org/T404721 [13:42:16] 10Toolforge (Toolforge iteration 24): [infra,k8s,toolsbeta] k8s worker node toolsbeta-test-k8s-worker-nfs-5 is failing to tail pods - https://phabricator.wikimedia.org/T404721#11185076 (10dcaro) Deleted with: ` dcaro@acme$ wmcs-cookbooks wmcs.toolforge.k8s.worker.depool_and_remove_node --hostname-to-remove tools... 
[13:44:48] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api [13:46:28] FIRING: InstanceDown: Project toolsbeta instance toolsbeta-test-k8s-worker-nfs-5 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [13:48:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [13:50:29] 06cloud-services-team, 10Cloud-VPS: wmf-auto-restart can get wedged on nfs4 mounts even when the filesystem is excluded - https://phabricator.wikimedia.org/T404322#11185126 (10fgiunchedi) 05Open→03Invalid Will address as part of {T404584} [13:51:28] RESOLVED: InstanceDown: Project toolsbeta instance toolsbeta-test-k8s-worker-nfs-5 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [13:53:19] !log dcaro@acme toolsbeta END (FAIL) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=99) for a worker-nfs role in the toolsbeta cluster [13:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [13:55:11] 10Toolforge (Toolforge iteration 24): [infra,k8s,toolsbeta] k8s worker node toolsbeta-test-k8s-worker-nfs-5 is failing to tail pods - https://phabricator.wikimedia.org/T404721#11185162 (10dcaro) It failed adding the new node with prefilght checks: ` ----- OUTPUT of 'sudo -i kubeadm ...16f541ca6dd18704' -----... 
[13:55:18] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api [13:59:47] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api [14:00:16] 06cloud-services-team, 10Toolforge (Toolforge iteration 24): Address tools NFS getting stuck with processes in D state - https://phabricator.wikimedia.org/T404584#11185182 (10Andrew) That plan looks good to me. I haven't tested the add_server cookbook in a long time so I'm glad it's still working. This is def... [14:00:53] !log dcaro@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-api [14:01:49] 06cloud-services-team, 10Data-Services, 06Data-Persistence, 06Data-Platform-SRE: Decide how to use the new clouddb hosts (clouddb102[2-5]) - https://phabricator.wikimedia.org/T401295#11185210 (10akosiaris) 05Open→03Stalled Setting to stalled, while we figure out the exact details of this one. 
[14:02:49] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api [14:08:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [14:09:48] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.vps.remove_instance for instance tools-bastion-13 [14:09:58] !log dcaro@acme toolsbeta START - Cookbook wmcs.vps.remove_instance for instance toolsbeta-test-k8s-worker-nfs-11 (T404721) [14:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [14:10:02] T404721: [infra,k8s,toolsbeta] k8s worker node toolsbeta-test-k8s-worker-nfs-5 is failing to tail pods - https://phabricator.wikimedia.org/T404721 [14:10:43] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance tools-bastion-13 [14:10:58] !log taavi@cloudcumin1001 bastion START - Cookbook wmcs.vps.remove_instance for instance bastion-eqiad1-04 [14:11:11] !log dcaro@acme toolsbeta END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance toolsbeta-test-k8s-worker-nfs-11 (T404721) [14:11:13] !log taavi@cloudcumin1001 bastion END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance bastion-eqiad1-04 [14:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [14:11:21] 06cloud-services-team, 10Toolforge, 07IPv6, 13Patch-For-Review: Upgrade Toolforge bastions to Trixie and enable IPv6 - https://phabricator.wikimedia.org/T392510#11185270 (10taavi) 05Open→03Resolved [14:11:33] !log dcaro@acme toolsbeta START - Cookbook wmcs.toolforge.add_k8s_node for a worker-nfs role in the toolsbeta cluster (T404721) [14:11:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [14:11:37] 06cloud-services-team, 10Cloud-VPS (Debian Bullseye Deprecation), 07IPv6, 13Patch-For-Review: Refresh Cloud VPS bastions to run on Trixie and enable IPv6 - https://phabricator.wikimedia.org/T392689#11185274 (10taavi) 05Open→03Resolved [14:12:12] (03merge) 10taavi: volume-admission: bump to 0.0.72-20250915164649-3238fa82 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/966 (https://phabricator.wikimedia.org/T404438) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [14:12:28] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component volume-admission [14:15:27] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component volume-admission [14:15:34] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component volume-admission [14:15:51] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api [14:16:54] 06cloud-services-team, 10Toolforge, 13Patch-For-Review: Mount /etc/openstack/clouds.yaml in mount-enabled containers - https://phabricator.wikimedia.org/T404438#11185293 (10taavi) 05Open→03Resolved [14:18:33] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component volume-admission [14:18:41] RESOLVED: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [14:20:19] (03approved) 10dcaro: jobs-api: bump to 0.0.414-20250915172125-3b82d2c2 [repos/cloud/toolforge/toolforge-deploy] - 
10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/967 (https://phabricator.wikimedia.org/T404176) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [14:20:24] (03update) 10dcaro: jobs-api: bump to 0.0.414-20250915172125-3b82d2c2 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/967 (https://phabricator.wikimedia.org/T404176) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [14:21:15] (03merge) 10dcaro: jobs-api: bump to 0.0.414-20250915172125-3b82d2c2 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/967 (https://phabricator.wikimedia.org/T404176) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [14:21:20] (03merge) 10dcaro: package: upgrade deps [repos/cloud/toolforge/volume-admission] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/volume-admission/-/merge_requests/35 [14:21:29] (03merge) 10dcaro: package: upgrade dependencies [repos/cloud/toolforge/registry-admission] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/registry-admission/-/merge_requests/29 [14:21:33] (03merge) 10dcaro: pacakage: bump dependencies [repos/cloud/toolforge/builds-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/141 [14:21:39] (03approved) 10dcaro: toolforge_deploy_mr: also wait when pipeline is creating [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/272 [14:21:45] (03merge) 10dcaro: toolforge_deploy_mr: also wait when pipeline is creating [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/272 [14:24:41] !log dcaro@acme toolsbeta Added a new k8s worker-nfs toolsbeta-test-k8s-worker-nfs-11.toolsbeta.eqiad1.wikimedia.cloud to the cluster [14:24:42] !log dcaro@acme toolsbeta END (PASS) - 
Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a worker-nfs role in the toolsbeta cluster [14:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [14:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [14:25:11] (03open) 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: volume-admission: bump to 0.0.73-20250916142135-79fa734c [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/970 (https://phabricator.wikimedia.org/T362869) [14:26:41] 06cloud-services-team, 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance: cloud: review lldp setup on hypervisors and VMs - https://phabricator.wikimedia.org/T304504#11185320 (10fgiunchedi) p:05High→03Low [14:27:14] (03open) 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: registry-admission: bump to 0.0.66-20250916142141-810024bf [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/971 (https://phabricator.wikimedia.org/T362869) [14:27:36] 10Toolforge (Toolforge iteration 24): [infra,k8s,toolsbeta] k8s worker node toolsbeta-test-k8s-worker-nfs-5 is failing to tail pods - https://phabricator.wikimedia.org/T404721#11185324 (10dcaro) 05In progress→03Resolved [14:29:44] 10Toolforge (Toolforge iteration 24): [tools,infra,k8s] scale up the cluster, specifically CPU - https://phabricator.wikimedia.org/T404726 (10dcaro) 03NEW [14:32:14] (03open) 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: builds-api: bump to 0.0.199-20250916142147-5e8adc0f [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/972 (https://phabricator.wikimedia.org/T362869) [14:32:52] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate [14:32:53] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook 
wmcs.ceph.osd.reactivate (exit_code=99) [14:33:08] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate [14:33:09] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99) [14:33:24] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate [14:36:48] PROBLEM - Host cloudcephosd1017 is DOWN: PING CRITICAL - Packet loss = 100% [14:38:26] RECOVERY - Host cloudcephosd1017 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [14:38:51] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component builds-api [14:41:56] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99) [14:42:49] !log dcaro@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component builds-api [14:45:53] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate [14:47:00] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0) [15:07:17] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [15:08:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [15:10:33] (03PS1) 10Brouberol: kubernetes: add service secrets for dse-k8s-eqiad [labs/private] - 10https://gerrit.wikimedia.org/r/1188811 [15:10:49] (03CR) 10Brouberol: [C:03+2] kubernetes: add service secrets for dse-k8s-eqiad [labs/private] - 
10https://gerrit.wikimedia.org/r/1188811 (owner: 10Brouberol) [15:10:51] (03CR) 10Brouberol: [V:03+2 C:03+2] kubernetes: add service secrets for dse-k8s-eqiad [labs/private] - 10https://gerrit.wikimedia.org/r/1188811 (owner: 10Brouberol) [15:13:44] 06cloud-services-team, 10Toolforge: [components-api] Intermittent internal API failures / retry internal requests - https://phabricator.wikimedia.org/T403175#11185536 (10DamianZaremba) Another example in production ` { "deploy_id": "20250916-145825-hmaalsrpe6", "creation_time": "20250916-145825", "... [15:20:40] (03PS1) 10Brouberol: kubernetes: add service secrets for airflow-dev/dse-k8s-eqiad [labs/private] - 10https://gerrit.wikimedia.org/r/1188816 [15:20:54] (03CR) 10Brouberol: [C:03+2] kubernetes: add service secrets for airflow-dev/dse-k8s-eqiad [labs/private] - 10https://gerrit.wikimedia.org/r/1188816 (owner: 10Brouberol) [15:21:00] (03CR) 10Brouberol: [V:03+2 C:03+2] kubernetes: add service secrets for airflow-dev/dse-k8s-eqiad [labs/private] - 10https://gerrit.wikimedia.org/r/1188816 (owner: 10Brouberol) [15:23:58] !log dcaro@acme toolsbeta START - Cookbook wmcs.toolforge.add_k8s_node for a ingress role in the toolsbeta cluster (T404721) [15:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [15:24:03] T404721: [infra,k8s,toolsbeta] k8s worker node toolsbeta-test-k8s-worker-nfs-5 is failing to tail pods - https://phabricator.wikimedia.org/T404721 [15:24:09] 10cloud-services-team (FY2025/26-Q1), 10Toolforge (Toolforge iteration 24), 13Patch-For-Review: Toolforge: Replace all bastion with grid-less bookworm based bastion hosts - https://phabricator.wikimedia.org/T314665#11185565 (10taavi) a:05dcaro→03taavi [15:24:11] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS (Debian Buster Deprecation), 10Toolforge (Toolforge iteration 24), 07Epic, 05Goal: [infra] Toolforge: migrate to Debian Bullseye or later - https://phabricator.wikimedia.org/T311897#11185567 (10taavi) 
a:05dcaro→03taavi [15:25:28] 10cloud-services-team (FY2025/26-Q1), 10Toolforge (Toolforge iteration 24), 05Goal, 13Patch-For-Review: [infra] Decommission the Grid Engine infrastructure - https://phabricator.wikimedia.org/T314664#11185576 (10taavi) a:05dcaro→03taavi [15:27:23] 06cloud-services-team, 10Toolforge, 13Patch-For-Review: Missing Perl packages on dev.toolforge.org for anomiebot workflows - https://phabricator.wikimedia.org/T360488#11185597 (10taavi) 05Open→03Resolved Thanks. In that case I'm moving forward with retiring the ancient grid bastion VM. [15:27:28] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component builds-api [15:28:25] 10cloud-services-team (FY2025/26-Q1), 10Toolforge (Toolforge iteration 24), 13Patch-For-Review: Toolforge: Replace all bastion with grid-less bookworm based bastion hosts - https://phabricator.wikimedia.org/T314665#11185603 (10taavi) I've shut down the bastion, will delete in a few days unless anything urgen... 
[15:29:58] !log dcaro@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component builds-api [15:35:44] !log dcaro@acme toolsbeta Added a new k8s ingress toolsbeta-test-k8s-ingress-12.toolsbeta.eqiad1.wikimedia.cloud to the cluster [15:35:45] !log dcaro@acme toolsbeta END (PASS) - Cookbook wmcs.toolforge.add_k8s_node (exit_code=0) for a ingress role in the toolsbeta cluster [15:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [15:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [15:42:08] !log dcaro@acme toolsbeta START - Cookbook wmcs.toolforge.k8s.worker.depool_and_remove_node for host toolsbeta-test-k8s-ingress-10 (T404721) [15:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [15:42:15] T404721: [infra,k8s,toolsbeta] k8s worker node toolsbeta-test-k8s-worker-nfs-5 is failing to tail pods - https://phabricator.wikimedia.org/T404721 [15:42:21] 10Toolforge (Toolforge iteration 24): [infra,k8s,toolsbeta] k8s worker node toolsbeta-test-k8s-worker-nfs-5 is failing to tail pods - https://phabricator.wikimedia.org/T404721#11185692 (10dcaro) 05Resolved→03In progress [15:43:29] !log dcaro@acme toolsbeta END (PASS) - Cookbook wmcs.toolforge.k8s.worker.depool_and_remove_node (exit_code=0) for host toolsbeta-test-k8s-ingress-10 (T404721) [15:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [15:43:54] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component builds-api [15:44:25] 10cloud-services-team (FY2025/26-Q1), 10Toolforge (Toolforge iteration 24), 05Goal, 13Patch-For-Review: [infra] Decommission the Grid Engine infrastructure - https://phabricator.wikimedia.org/T314664#11185700 (10taavi) 05Stalled→03Open [15:44:29] (03PS1) 10Majavah: inventory: Remove tools-sgebastion-10 [cloud/wmcs-cookbooks] - 
10https://gerrit.wikimedia.org/r/1188831 (https://phabricator.wikimedia.org/T314665) [15:45:14] FIRING: [4x] ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: toolsbeta-test-k8s-ingress-10.toolsbeta.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown [15:46:30] (03CR) 10David Caro: [C:03+1] "🎉" [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1188831 (https://phabricator.wikimedia.org/T314665) (owner: 10Majavah) [15:46:44] (03CR) 10Majavah: [C:03+2] inventory: Remove tools-sgebastion-10 [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1188831 (https://phabricator.wikimedia.org/T314665) (owner: 10Majavah) [15:48:06] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component builds-api [15:48:24] 06cloud-services-team, 10Toolforge: Update Toolforge client packages to build on Trixie only - https://phabricator.wikimedia.org/T404733 (10taavi) 03NEW [15:49:16] (03PS1) 10Majavah: aptly: Stop updating pre-Trixie repositories [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1188832 (https://phabricator.wikimedia.org/T404733) [15:49:56] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component builds-api [15:50:13] 10Toolforge (Toolforge iteration 24): [infra,k8s,toolsbeta] k8s worker node toolsbeta-test-k8s-worker-nfs-5 is failing to tail pods - https://phabricator.wikimedia.org/T404721#11185766 (10dcaro) 05In progress→03Resolved [15:50:35] 10Toolforge (Toolforge iteration 24): [infra,k8s,toolsbeta] k8s worker node toolsbeta-test-k8s-worker-nfs-5 is failing to tail pods - https://phabricator.wikimedia.org/T404721#11185768 (10dcaro) ended up also scrubbing toolsbeta-test-k8s-ingress-10 
[15:50:46] (03Merged) 10jenkins-bot: inventory: Remove tools-sgebastion-10 [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1188831 (https://phabricator.wikimedia.org/T314665) (owner: 10Majavah) [15:54:21] (03CR) 10CI reject: [V:04-1] aptly: Stop updating pre-Trixie repositories [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1188832 (https://phabricator.wikimedia.org/T404733) (owner: 10Majavah) [15:54:34] 10cloud-services-team (FY2025/26-Q1), 10Toolforge (Toolforge iteration 24), 05Goal, 13Patch-For-Review: [infra] Decommission the Grid Engine infrastructure - https://phabricator.wikimedia.org/T314664#11185793 (10taavi) [15:55:10] (03PS2) 10Majavah: aptly: Stop updating pre-Trixie repositories [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1188832 (https://phabricator.wikimedia.org/T404733) [15:55:50] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component builds-api [15:59:21] (03CR) 10CI reject: [V:04-1] aptly: Stop updating pre-Trixie repositories [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1188832 (https://phabricator.wikimedia.org/T404733) (owner: 10Majavah) [15:59:26] (03approved) 10dcaro: builds-api: bump to 0.0.199-20250916142147-5e8adc0f [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/972 (https://phabricator.wikimedia.org/T362869) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [15:59:29] (03merge) 10dcaro: builds-api: bump to 0.0.199-20250916142147-5e8adc0f [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/972 (https://phabricator.wikimedia.org/T362869) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [15:59:36] 10cloud-services-team (FY2025/26-Q1), 10Toolforge (Toolforge iteration 24), 05Goal, 13Patch-For-Review: [infra] Decommission the Grid Engine infrastructure - 
https://phabricator.wikimedia.org/T314664#11185806 (10taavi) [15:59:39] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component registry-admission [16:00:14] RESOLVED: [2x] ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: toolsbeta-test-k8s-ingress-10.toolsbeta.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown [16:03:21] (03open) 10taavi: Retire login-buster address [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/80 (https://phabricator.wikimedia.org/T314665) [16:03:26] (03update) 10taavi: Retire login-buster address [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/80 (https://phabricator.wikimedia.org/T314665) [16:04:06] 10Cloud-VPS (Debian Bullseye Deprecation), 06Moderator-Tools-Team, 06The-Wikipedia-Library: wikilink: Replace deprecated Bullseye VM in Cloud VPS - https://phabricator.wikimedia.org/T402055#11185844 (10Samwalton9-WMF) [16:05:14] FIRING: [4x] ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: toolsbeta-test-k8s-ingress-10.toolsbeta.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown [16:08:59] (03update) 10taavi: Retire login-buster address [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/80 
(https://phabricator.wikimedia.org/T314665) [16:09:20] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component registry-admission [16:10:37] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component registry-admission [16:12:47] (03open) 10don-vip: Update to OpenJDK 25 [toolforge-repos/spacemedia] - 10https://gitlab.wikimedia.org/toolforge-repos/spacemedia/-/merge_requests/5 [16:13:01] 10VPS-project-Phabricator, 06collaboration-services, 10Release-Engineering-Team (Doing 😎): 'Fulltext' searches fail on the test Phabricator instance (PhutilAggregateException: All Fulltext Search hosts failed / CURLE_COULDNT_CONNECT) - https://phabricator.wikimedia.org/T403948#11185883 (10Dzahn) reverts do n... [16:14:10] 06cloud-services-team, 10Toolforge: [jobs-api] use `launcher` also for health-check script commands - https://phabricator.wikimedia.org/T403735#11185905 (10DamianZaremba) I started looking at this and the current logic isn't super clear. There are essentially 3 parts; 1. `_get_k8s_podtemplate` - This actuall... [16:15:14] RESOLVED: [4x] ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: toolsbeta-test-k8s-ingress-10.toolsbeta.eqiad1.wikimedia.cloud - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown [16:17:39] 06cloud-services-team, 10Toolforge, 13Patch-For-Review: [components-api] reuse_from components are not explicitly re-created in jobs-api - https://phabricator.wikimedia.org/T403285#11185912 (10DamianZaremba) Any chance of getting https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_reque... 
[16:20:36] (03update) 10dcaro: Ensure reuse_from components are re-run [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/125 (https://phabricator.wikimedia.org/T403285) (owner: 10damian) [16:21:03] !log dcaro@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component registry-admission [16:22:00] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component registry-admission [16:32:13] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component registry-admission [16:51:31] (03update) 10dcaro: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136) (owner: 10raymond-ndibe) [16:51:37] (03approved) 10dcaro: registry-admission: bump to 0.0.66-20250916142141-810024bf [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/971 (https://phabricator.wikimedia.org/T362869) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [16:51:41] (03update) 10dcaro: registry-admission: bump to 0.0.66-20250916142141-810024bf [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/971 (https://phabricator.wikimedia.org/T362869) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [16:52:06] (03merge) 10dcaro: registry-admission: bump to 0.0.66-20250916142141-810024bf [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/971 (https://phabricator.wikimedia.org/T362869) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [16:53:07] 
(03update) 10dcaro: Ensure reuse_from components are re-run [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/125 (https://phabricator.wikimedia.org/T403285) (owner: 10damian) [17:02:45] !log dcaro@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component volume-admission [17:02:47] Guest204: Unknown project "dcaro@cloudcumin1001" [17:03:57] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component volume-admission [17:03:57] Guest204: Unknown project "dcaro@cloudcumin1001" [17:06:10] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [17:06:10] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component volume-admission [17:06:20] Guest204: Unknown project "dcaro@cloudcumin1001" [17:06:43] (03approved) 10dcaro: volume-admission: bump to 0.0.73-20250916142135-79fa734c [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/970 (https://phabricator.wikimedia.org/T362869) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [17:06:47] (03update) 10dcaro: volume-admission: bump to 0.0.73-20250916142135-79fa734c [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/970 (https://phabricator.wikimedia.org/T362869) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [17:07:12] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component envvars-api [17:07:51] (03merge) 10dcaro: 
volume-admission: bump to 0.0.73-20250916142135-79fa734c [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/970 (https://phabricator.wikimedia.org/T362869) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [17:08:08] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component envvars-api [17:12:25] (03update) 10raymond-ndibe: [tool-config] handle unset and default arguments consistently [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/123 (https://phabricator.wikimedia.org/T401648 https://phabricator.wikimedia.org/T402572) [17:14:58] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component envvars-api [17:16:09] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component envvars-api [17:16:11] Guest204: Unknown project "dcaro@cloudcumin1001" [17:18:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-34 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [17:18:36] (03approved) 10dcaro: envvars-api: bump to 0.0.75-20250916125754-a88de155 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/969 (https://phabricator.wikimedia.org/T362869) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [17:18:40] (03update) 10dcaro: envvars-api: bump to 0.0.75-20250916125754-a88de155 [repos/cloud/toolforge/toolforge-deploy] - 
10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/969 (https://phabricator.wikimedia.org/T362869) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [17:18:51] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component components-api [17:18:52] Guest204: Unknown project "dcaro@cloudcumin1001" [17:19:01] (03merge) 10dcaro: envvars-api: bump to 0.0.75-20250916125754-a88de155 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/969 (https://phabricator.wikimedia.org/T362869) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [17:20:30] (03open) 10dcaro: cli: ignore replicas if not sent back from API [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/129 [17:23:44] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-api [17:23:46] Guest204: Unknown project "dcaro@cloudcumin1001" [17:25:28] (03update) 10dcaro: cli: ignore replicas if not sent back from API [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/129 [17:25:58] (03update) 10raymond-ndibe: [tool-config] handle unset and default arguments consistently [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/123 (https://phabricator.wikimedia.org/T401648 https://phabricator.wikimedia.org/T402572) [17:27:11] RESOLVED: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [17:27:33] !log dcaro@cloudcumin1001 tools START - Cookbook 
wmcs.toolforge.component.deploy for component components-api [17:27:33] Guest204: Unknown project "dcaro@cloudcumin1001" [17:29:47] (03approved) 10dcaro: Retire login-buster address [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/80 (https://phabricator.wikimedia.org/T314665) (owner: 10taavi) [17:30:15] (03update) 10dcaro: [tool home dir] revert change in dir permission [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/271 (https://phabricator.wikimedia.org/T403513) (owner: 10raymond-ndibe) [17:30:52] (03update) 10dcaro: build: Upgrade Poetry dependencies [repos/cloud/toolforge/components-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/60 (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [17:32:36] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-api [17:32:37] Guest204: Unknown project "dcaro@cloudcumin1001" [17:37:45] (03approved) 10dcaro: components-api: bump to 0.0.156-20250916125822-74722783 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/968 (https://phabricator.wikimedia.org/T401388) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [17:37:49] (03update) 10dcaro: components-api: bump to 0.0.156-20250916125822-74722783 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/968 (https://phabricator.wikimedia.org/T401388) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [17:38:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - 
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [17:38:42] (03merge) 10dcaro: components-api: bump to 0.0.156-20250916125822-74722783 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/968 (https://phabricator.wikimedia.org/T401388) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [17:49:18] FIRING: KernelErrors: Server cloudcephosd1052 logged kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/KernelErrors - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-errors?orgId=1&var-instance=cloudcephosd1052 - https://alerts.wikimedia.org/?q=alertname%3DKernelErrors [17:49:23] 06cloud-services-team: KernelErrors Server cloudcephosd1052 logged kernel errors - https://phabricator.wikimedia.org/T404745 (10phaultfinder) 03NEW [17:56:08] 10VPS-project-Codesearch: T371191 - https://phabricator.wikimedia.org/T404746 (10ALFAN_SOFARI) 03NEW [17:56:56] 06cloud-services-team, 10Cloud-VPS, 10Ceph: Review RAM allocation for cloudceph OSDs - https://phabricator.wikimedia.org/T404747 (10Andrew) 03NEW [19:08:22] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate [19:08:24] Guest204: Unknown project "andrew@cloudcumin1001" [19:09:45] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0) [19:09:45] Guest204: Unknown project "andrew@cloudcumin1001" [19:32:39] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - 
https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [19:48:56] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown [19:49:28] FIRING: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [19:53:56] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown [19:54:28] RESOLVED: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [20:20:04] 06cloud-services-team, 10Cloud-VPS, 10Ceph: Review RAM allocation for cloudceph OSDs - https://phabricator.wikimedia.org/T404747#11187076 (10Andrew) Before: cloudcephosd2004-dev: total use 16GB, 8 OSDs total, 64GB RAM total After ` ceph config set osd osd_memory_target_autotune true ` cloudcephosd2004-d... [20:23:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [20:31:15] 10VPS-project-Codesearch, 10m3api: Index m3api repositories in Codesearch - https://phabricator.wikimedia.org/T404517#11187130 (10LucasWerkmeister) https://gitlab.wikimedia.org/groups/repos/m3api/-/children.json works (extra `repos/`), I think that would be okay! (I just scheduled the `tmp-*` repositories for... 
[21:14:33] 10VPS-project-Phabricator, 06collaboration-services, 10Release-Engineering-Team (Doing 😎): 'Fulltext' searches fail on test Phab instance due to ElasticSearch default config (PhutilAggregateException: All Fulltext Search hosts failed / CURLE_COULDNT_CONNECT) - https://phabricator.wikimedia.org/T403948#11187376... [21:17:39] 10VPS-project-Phabricator, 06collaboration-services, 10Release-Engineering-Team (Radar): 'Fulltext' searches fail on test Phab instance due to ElasticSearch default config (PhutilAggregateException: All Fulltext Search hosts failed / CURLE_COULDNT_CONNECT) - https://phabricator.wikimedia.org/T403948#11187381 (... [22:03:19] (03close) 10raymond-ndibe: [tool home dir] revert change in dir permission [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/271 (https://phabricator.wikimedia.org/T403513) [22:03:55] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown [22:04:28] FIRING: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [22:04:48] (03update) 10raymond-ndibe: [build] run pipeline cleanup per repo [repos/cloud/toolforge/builds-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/142 (https://phabricator.wikimedia.org/T404157) [22:08:21] FIRING: MaintainKubeusersHang: maintain-kubeusers last finished run is 29.3M minutes old - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersHang [22:08:56] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown [22:09:28] RESOLVED: TargetDown: Job 
jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [22:20:21] 10VPS-project-Codesearch, 10m3api: Index m3api repositories in Codesearch - https://phabricator.wikimedia.org/T404517#11187608 (10Ladsgroup) If you can make the patch to write_config.py I'd appreciate it. Otherwise, I try to do it when I find some free time. [22:22:43] (03update) 10raymond-ndibe: [helm image publish]: publish to reggie repo if PR owner not repo owner [repos/cloud/cicd/gitlab-ci] - 10https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/61 (https://phabricator.wikimedia.org/T394595) [22:24:35] (03update) 10raymond-ndibe: [DO NOT MERGE] testing gitlab ci changes [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/213 [22:31:43] 10VPS-project-Codesearch, 10m3api: Index m3api repositories in Codesearch - https://phabricator.wikimedia.org/T404517#11187666 (10LucasWerkmeister) Hm, I guess we need to pick a group first, I didn’t think about that yet 😅 I guess it could fall under CI & Development? Or a new group, like Pywikibot. But I’ll... 
[22:33:20] (03PS1) 10Lucas Werkmeister: devtools: add repos/m3api group [labs/codesearch] - 10https://gerrit.wikimedia.org/r/1188896 (https://phabricator.wikimedia.org/T404517) [22:36:05] (03update) 10raymond-ndibe: [helm image publish]: publish to reggie repo if PR owner not repo owner [repos/cloud/cicd/gitlab-ci] - 10https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/61 (https://phabricator.wikimedia.org/T394595) [22:36:40] (03update) 10raymond-ndibe: [helm image publish]: publish to reggie repo if PR owner not repo owner [repos/cloud/cicd/gitlab-ci] - 10https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/61 (https://phabricator.wikimedia.org/T394595) [22:37:03] 10VPS-project-Codesearch, 10m3api, 13Patch-For-Review: Index m3api repositories in Codesearch - https://phabricator.wikimedia.org/T404517#11187673 (10LucasWerkmeister) I also moved all the to-be-deleted `tmp-*` repositories to the `lucaswerkmeister/` namespace, to get them out of the `children.json` list imm... [22:37:11] (03CR) 10Lucas Werkmeister: "Disclaimer: I haven’t tested this whatsoever." 
[labs/codesearch] - https://gerrit.wikimedia.org/r/1188896 (https://phabricator.wikimedia.org/T404517) (owner: Lucas Werkmeister) [22:38:03] (update) raymond-ndibe: [DO NOT MERGE] testing gitlab ci changes [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/213 [22:38:36] (CR) Ladsgroup: [C:+2] "I test it in production 😊" [labs/codesearch] - https://gerrit.wikimedia.org/r/1188896 (https://phabricator.wikimedia.org/T404517) (owner: Lucas Werkmeister) [22:39:46] (Merged) jenkins-bot: devtools: add repos/m3api group [labs/codesearch] - https://gerrit.wikimedia.org/r/1188896 (https://phabricator.wikimedia.org/T404517) (owner: Lucas Werkmeister) [22:44:50] (CR) Lucas Werkmeister: "https://bash.toolforge.org/quip/-RazVJkBffdvpiTrlWJk :P" [labs/codesearch] - https://gerrit.wikimedia.org/r/1188896 (https://phabricator.wikimedia.org/T404517) (owner: Lucas Werkmeister) [22:54:43] (update) raymond-ndibe: [helm image publish]: publish to reggie repo if PR owner not repo owner [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/61 (https://phabricator.wikimedia.org/T394595) [22:55:47] (update) raymond-ndibe: [DO NOT MERGE] testing gitlab ci changes [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/213 [23:05:31] (update) don-vip: Update to OpenJDK 25 [toolforge-repos/spacemedia] - https://gitlab.wikimedia.org/toolforge-repos/spacemedia/-/merge_requests/5 [23:09:36] VPS-project-Phabricator, collaboration-services, Release-Engineering-Team (Radar): 'Fulltext' searches fail on test Phab instance due to ElasticSearch default config (PhutilAggregateException: All Fulltext Search hosts failed / CURLE_COULDNT_CONNECT) - https://phabricator.wikimedia.org/T403948#11187713 (... 
[23:12:28] VPS-project-Codesearch, m3api, Patch-For-Review: Index m3api repositories in Codesearch - https://phabricator.wikimedia.org/T404517#11187719 (Ladsgroup) Open→Resolved a:LucasWerkmeister https://codesearch.wmcloud.org/search/?q=m3api&files=&excludeFiles=&repos= [23:13:23] VPS-project-Codesearch, m3api, Patch-For-Review: Index m3api repositories in Codesearch - https://phabricator.wikimedia.org/T404517#11187723 (LucasWerkmeister) \o/ thanks! [23:14:08] VPS-project-Phabricator, collaboration-services, Release-Engineering-Team (Radar): 'Fulltext' searches fail on test Phab instance due to ElasticSearch default config (PhutilAggregateException: All Fulltext Search hosts failed / CURLE_COULDNT_CONNECT) - https://phabricator.wikimedia.org/T403948#11187729 (... [23:18:03] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [23:18:28] VPS-project-Phabricator, collaboration-services, Release-Engineering-Team (Radar): 'Fulltext' searches fail on test Phab instance due to ElasticSearch default config (PhutilAggregateException: All Fulltext Search hosts failed / CURLE_COULDNT_CONNECT) - https://phabricator.wikimedia.org/T403948#11187733 (... [23:26:58] VPS-project-Phabricator, collaboration-services, Release-Engineering-Team (Radar): 'Fulltext' searches fail on test Phab instance due to ElasticSearch default config (PhutilAggregateException: All Fulltext Search hosts failed / CURLE_COULDNT_CONNECT) - https://phabricator.wikimedia.org/T403948#11187746 (... 
[23:31:12] VPS-project-Phabricator, collaboration-services, Release-Engineering-Team (Radar): 'Fulltext' searches fail on test Phab instance due to ElasticSearch default config (PhutilAggregateException: All Fulltext Search hosts failed / CURLE_COULDNT_CONNECT) - https://phabricator.wikimedia.org/T403948#11187755 (... [23:32:27] VPS-project-Phabricator, collaboration-services, Release-Engineering-Team (Radar): 'Fulltext' searches fail on test Phab instance due to ElasticSearch default config (PhutilAggregateException: All Fulltext Search hosts failed / CURLE_COULDNT_CONNECT) - https://phabricator.wikimedia.org/T403948#11187773 (... [23:34:45] VPS-project-Phabricator, collaboration-services, Release-Engineering-Team (Radar): 'Fulltext' searches fail on test Phab instance due to ElasticSearch default config (PhutilAggregateException: All Fulltext Search hosts failed / CURLE_COULDNT_CONNECT) - https://phabricator.wikimedia.org/T403948#11187774 (... [23:38:32] VPS-project-Phabricator, collaboration-services, Release-Engineering-Team (Radar): 'Fulltext' searches fail on test Phab instance due to ElasticSearch default config (PhutilAggregateException: All Fulltext Search hosts failed / CURLE_COULDNT_CONNECT) - https://phabricator.wikimedia.org/T403948#11187791 (... 
[23:42:07] (update) raymond-ndibe: [helm image publish]: publish to reggie repo if PR owner not repo owner [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/61 (https://phabricator.wikimedia.org/T394595) [23:43:03] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [23:43:08] (update) raymond-ndibe: [DO NOT MERGE] testing gitlab ci changes [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/213 [23:46:44] (update) raymond-ndibe: [helm image publish]: publish to reggie repo if PR owner not repo owner [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/61 (https://phabricator.wikimedia.org/T394595) [23:48:11] (update) raymond-ndibe: [DO NOT MERGE] testing gitlab ci changes [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/213 [23:54:44] (update) raymond-ndibe: [helm image publish]: publish to reggie repo if PR owner not repo owner [repos/cloud/cicd/gitlab-ci] - https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/61 (https://phabricator.wikimedia.org/T394595) [23:55:18] (update) raymond-ndibe: [DO NOT MERGE] testing gitlab ci changes [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/213