[00:09:22] FIRING: HAProxyBackendUnavailable: HAProxy service radosgw-api_backend backend cloudcontrol1011.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[00:13:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-12 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[00:14:22] RESOLVED: HAProxyBackendUnavailable: HAProxy service radosgw-api_backend backend cloudcontrol1011.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[00:40:42] Toolforge-standards-committee: Adoption request for fireflytools - https://phabricator.wikimedia.org/T403814#11165842 (JJMC89) This tool was previously adopted by @Adithyak1997 in T209147.
[01:11:46] Toolforge-standards-committee: Adoption request for fireflytools - https://phabricator.wikimedia.org/T403814#11165879 (JJMC89)
[01:13:14] Toolforge-standards-committee: Adoption request for fireflytools - https://phabricator.wikimedia.org/T403814#11165880 (JJMC89) It looks like Adithyak1997 is no longer a maintainer. I've removed the credentials that I found. A Toolforge root may now add @Tenshi_Hinanawi as a maintainer and remove the committee.
[02:04:32] (open) raymond-ndibe: [build] run pipeline cleanup per repo [repos/cloud/toolforge/builds-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/142 (https://phabricator.wikimedia.org/T404157)
[02:08:38] (update) raymond-ndibe: [build] run pipeline cleanup per repo [repos/cloud/toolforge/builds-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/142 (https://phabricator.wikimedia.org/T404157)
[02:23:52] (update) raymond-ndibe: [build] run pipeline cleanup per repo [repos/cloud/toolforge/builds-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/142 (https://phabricator.wikimedia.org/T404157)
[02:26:43] (update) raymond-ndibe: [build] run pipeline cleanup per repo [repos/cloud/toolforge/builds-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-api/-/merge_requests/142 (https://phabricator.wikimedia.org/T404157)
[02:27:33] Toolforge (Toolforge iteration 24), Patch-For-Review: [builds-api, maintain-harbor] fix build/image cleanup - https://phabricator.wikimedia.org/T404157#11165910 (Raymond_Ndibe) Open→In progress
[02:53:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-12 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[03:09:07] (update) vriaa: feat: Make editor responsive [toolforge-repos/centralnotice-banner-editor] - https://gitlab.wikimedia.org/toolforge-repos/centralnotice-banner-editor/-/merge_requests/21
[04:11:56] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[04:16:56] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[04:20:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-12 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[04:48:39] Toolforge-standards-committee: Adoption request for fireflytools - https://phabricator.wikimedia.org/T403814#11165949 (Adithyak1997) @JJMC89 Thanks for adding me to this task. I hereby confirm I am not using the tool anymore.
[06:01:54] VPS-project-Codesearch: Codesearch omits results in Phabricator format - https://phabricator.wikimedia.org/T348921#11165981 (Alejano564) a:07035938089_09054623972
[06:03:57] VPS-project-Codesearch: Opening link with #sectionAnchor in new tab doesn't jump to section - https://phabricator.wikimedia.org/T335839#11165988 (Alejano564) p:Triage→Unbreak! a:-akko
[06:05:42] VPS-project-Codesearch: Opening link with #sectionAnchor in new tab doesn't jump to section - https://phabricator.wikimedia.org/T335839#11165993 (Alejano564) a:-akko→None
[06:07:59] VPS-project-Codesearch: Opening link with #sectionAnchor in new tab doesn't jump to section - https://phabricator.wikimedia.org/T335839#11165996 (Alejano564)
[06:24:07] VPS-project-Codesearch: Opening link with #sectionAnchor in new tab doesn't jump to section - https://phabricator.wikimedia.org/T335839#11165997 (taavi) p:Unbreak!→Triage
[06:30:01] VPS-project-Codesearch: Remove archived GitHub repos from codesearch - https://phabricator.wikimedia.org/T317989#11166002 (taavi)
[06:30:15] VPS-project-Codesearch: Codesearch omits results in Phabricator format - https://phabricator.wikimedia.org/T348921#11166005 (taavi) a:07035938089_09054623972→None
[06:51:36] VPS-project-Codesearch: Opening link with #sectionAnchor in new tab doesn't jump to section - https://phabricator.wikimedia.org/T335839#11166043 (Aklapper)
[08:14:22] cloud-services-team, Infrastructure-Foundations, Puppet-Core, Patch-For-Review: puppet: profile::auto_restarts::service: have a way to don't deploy the systemd timers - https://phabricator.wikimedia.org/T336845#11166242 (fgiunchedi) Declined→Open I stand corrected, `lsof` from `wmf-auto-r...
[08:14:52] !log fnegri@cloudcumin1001 cloudinfra START - Cookbook wmcs.vps.add_user_to_project for user 'volans' in role 'member'
[08:14:58] !log fnegri@cloudcumin1001 cloudinfra END (PASS) - Cookbook wmcs.vps.add_user_to_project (exit_code=0) for user 'volans' in role 'member'
[08:15:36] !log fnegri@cloudcumin1001 toolsbeta START - Cookbook wmcs.vps.add_user_to_project for user 'volans' in role 'member'
[08:15:42] !log fnegri@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.vps.add_user_to_project (exit_code=0) for user 'volans' in role 'member'
[08:16:20] !log fnegri@cloudcumin1001 paws START - Cookbook wmcs.vps.add_user_to_project for user 'volans' in role 'member'
[08:16:26] !log fnegri@cloudcumin1001 paws END (PASS) - Cookbook wmcs.vps.add_user_to_project (exit_code=0) for user 'volans' in role 'member'
[08:26:00] cloud-services-team, Infrastructure-Foundations, Puppet-Core, Patch-For-Review: puppet: profile::auto_restarts::service: have a way to don't deploy the systemd timers - https://phabricator.wikimedia.org/T336845#11166258 (MoritzMuehlenhoff) Yeah, that sounds good, we even have a Hiera option for t...
[08:40:31] VPS-project-Codesearch: CodeSearch is unresponsive - https://phabricator.wikimedia.org/T404163 (hashar) NEW
[08:45:58] VPS-project-Codesearch: CodeSearch is unresponsive - https://phabricator.wikimedia.org/T404163#11166303 (hashar)
[08:46:36] cloud-services-team, Infrastructure-Foundations, Puppet-Core, Patch-For-Review: puppet: profile::auto_restarts::service: have a way to don't deploy the systemd timers - https://phabricator.wikimedia.org/T336845#11166307 (fgiunchedi) >>! In T336845#8870600, @jbond wrote: >> The auto restarts are n...
[08:51:25] VPS-project-Codesearch: CodeSearch is unresponsive - https://phabricator.wikimedia.org/T404163#11166312 (Volans) p:Triage→High a:Volans I'm looking into it
[08:51:42] !log volans@cloudcumin1001 codesearch START - Cookbook wmcs.openstack.cloudvirt.vm_console
[08:52:27] !log volans@cloudcumin1001 codesearch START - Cookbook wmcs.openstack.cloudvirt.vm_console
[08:52:39] !log volans@cloudcumin1001 codesearch END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[08:52:45] VPS-project-Codesearch: CodeSearch is unresponsive (2025-09-10) - https://phabricator.wikimedia.org/T404163#11166315 (A_smart_kitten)
[08:53:40] !log volans@cloudcumin1001 codesearch END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[08:55:01] VPS-project-Codesearch: CodeSearch is unresponsive (2025-09-10) - https://phabricator.wikimedia.org/T404163#11166322 (Volans) Both ssh and the VM console hangs, without giving any prompt. Forcing a VM restart.
[09:02:14] VPS-project-Codesearch: CodeSearch is unresponsive (2025-09-10) - https://phabricator.wikimedia.org/T404163#11166339 (hashar) Open→Resolved It is restarting indeed and the web interface is reachable again. https://codesearch.wmcloud.org/_health/ has: extensions: starting up search: starting up...
[09:13:55] VPS-project-Codesearch: CodeSearch is unresponsive (2025-09-10) - https://phabricator.wikimedia.org/T404163#11166348 (Volans) Restart completed
[09:23:56] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:26:06] cloud-services-team, Infrastructure-Foundations, Puppet-Core, Patch-For-Review: puppet: profile::auto_restarts::service: have a way to don't deploy the systemd timers - https://phabricator.wikimedia.org/T336845#11166359 (fgiunchedi) >>! In T336845#11166258, @MoritzMuehlenhoff wrote: > Yeah, that...
[09:28:56] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:32:28] (open) dcaro: functional-tests: fix log checking tests [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/961
[09:44:22] cloud-services-team (FY2025/26-Q1), PAWS: [paws] 2025-09-09 unexpected downtime - https://phabricator.wikimedia.org/T404076#11166435 (dcaro) Open→Resolved a:dcaro
[09:49:13] cloud-services-team, Cloud-VPS: Add accounts(-dev).wmcloud.org to XFF allowlist - https://phabricator.wikimedia.org/T404172 (stwalkerster) NEW
[09:53:56] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:57:19] (update) dcaro: functional-tests: fix log checking tests [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/961
[09:58:56] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/k8s-haproxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[09:58:57] cloud-services-team, Toolforge: Request to recreate replica.my.cnf for user wmdesiko - https://phabricator.wikimedia.org/T404175 (Siko_WMDE) NEW
[10:00:15] Toolforge (Toolforge iteration 24): [jobs-api] loki logs take really long to appear - https://phabricator.wikimedia.org/T404176 (dcaro) NEW
[10:10:01] Toolforge (Toolforge iteration 24): [jobs-api] loki logs take really long to appear - https://phabricator.wikimedia.org/T404176#11166586 (dcaro) It's actually kinda choppy, my guess is that some workers are failing to send logs: ` tools.automated-toolforge-tests@tools-bastion-13:~$ toolforge jobs logs test-1...
[10:21:55] (update) dcaro: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136) (owner: raymond-ndibe)
[10:24:33] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-12 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[12:10:51] VPS-project-Codesearch: CodeSearch is unresponsive (2025-09-10) - https://phabricator.wikimedia.org/T404163#11167065 (hashar) It looks all good indead, thank you @volans!
[12:30:16] !log fnegri@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers (T403964)
[12:30:21] T403964: Request increased quota for cluebotng-review Toolforge tool - https://phabricator.wikimedia.org/T403964
[12:31:10] !log fnegri@cloudcumin1001 tools END (ERROR) - Cookbook wmcs.toolforge.component.deploy (exit_code=97) for component maintain-kubeusers (T403964)
[12:31:21] !log fnegri@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers (T403964)
[12:45:56] !log fnegri@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component maintain-kubeusers (T403964)
[12:46:00] T403964: Request increased quota for cluebotng-review Toolforge tool - https://phabricator.wikimedia.org/T403964
[13:24:41] (open) dcaro: ansible: allow nsswitch to resolve localhost [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/274
[13:33:17] Tool-archive-externa-links: [Documentation] Design of a new short video for installing the ArchiveExternaLinks user script - https://phabricator.wikimedia.org/T404193 (poro26) NEW
[13:43:02] Tool-archive-externa-links: [Documentation] Design of a new short video for installing the ArchiveExternaLinks user script - https://phabricator.wikimedia.org/T404193#11167409 (poro26)
[13:45:31] (approved) fnegri: ansible: allow nsswitch to resolve localhost [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/274 (owner: dcaro)
[13:47:53] (approved) dcaro: maintain-kubeusers: bump to 0.0.181-20250909080634-8d3f947f [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/959 (https://phabricator.wikimedia.org/T403962) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[13:47:57] (update) dcaro: maintain-kubeusers: bump to 0.0.181-20250909080634-8d3f947f [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/959 (https://phabricator.wikimedia.org/T403962) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[13:48:22] (merge) dcaro: maintain-kubeusers: bump to 0.0.181-20250909080634-8d3f947f [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/959 (https://phabricator.wikimedia.org/T403962) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[13:48:46] (merge) dcaro: ansible: allow nsswitch to resolve localhost [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/274
[13:49:09] (update) dcaro: functional-tests: fix log checking tests [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/961
[13:49:16] (update) dcaro: functional-tests: fix log checking tests [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/961
[13:50:57] Toolforge (Quota-requests), Patch-For-Review: Request increased quota for cluebotng-review Toolforge tool - https://phabricator.wikimedia.org/T403964#11167515 (fnegri) In progress→Resolved Quotas increased: ` tools.cluebotng-review@tools-bastion-13:~$ kubectl describe quota Name:...
[14:01:08] (update) fnegri: Increase quotas for tool cluebotng-review [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/960 (https://phabricator.wikimedia.org/T403964)
[14:08:24] !log fnegri@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers (T403964)
[14:08:29] T403964: Request increased quota for cluebotng-review Toolforge tool - https://phabricator.wikimedia.org/T403964
[14:09:41] !log dcaro@acme admin START - Cookbook wmcs.openstack.cloudvirt.vm_console
[14:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[14:09:55] !log dcaro@acme admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=99)
[14:09:58] !log dcaro@acme admin START - Cookbook wmcs.openstack.cloudvirt.vm_console
[14:10:00] !log dcaro@acme admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=99)
[14:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[14:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[14:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[14:10:50] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[14:24:13] Toolforge (Toolforge iteration 24): [jobs-api] loki logs take really long to appear - https://phabricator.wikimedia.org/T404176#11167723 (dcaro) So, at least one issue I think might be happening is: * alloy only checks every ~10s for new files in the kubelet `/var/log/containers/` directory * if a job take...
[14:25:53] Toolforge (Toolforge iteration 24): [prometheus,infra] 2025-09-10 tools-prometheus-9 down - https://phabricator.wikimedia.org/T404199 (dcaro) NEW
[14:26:18] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.openstack.cloudvirt.vm_console
[14:28:40] !log fnegri@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component maintain-kubeusers (T403964)
[14:28:44] T403964: Request increased quota for cluebotng-review Toolforge tool - https://phabricator.wikimedia.org/T403964
[14:30:06] (merge) fnegri: Increase quotas for tool cluebotng-review [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/960 (https://phabricator.wikimedia.org/T403964)
[14:30:09] Toolforge (Toolforge iteration 24): [prometheus,infra] 2025-09-10 tools-prometheus-9 down - https://phabricator.wikimedia.org/T404199#11167770 (dcaro) It has lost all network config: ` root@tools-prometheus-9:~# ip a 1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000...
[14:31:46] Toolforge (Toolforge iteration 24): [prometheus,infra] 2025-09-10 tools-prometheus-9 down - https://phabricator.wikimedia.org/T404199#11167787 (dcaro) The logs show this: ` Sep 09 03:16:54 tools-prometheus-9 systemd-networkd[1527555]: ens3: Could not set route: Connection timed out ens3: Failed ` And then b...
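Side note on the 14:24:13 comment in T404176, which mentions that alloy only checks roughly every 10s for new files. The comment is truncated in this log, so the exact condition it goes on to describe is unknown. The following is only a minimal Python sketch of the general polling-window effect, assuming a collector that notices a file solely on fixed polling ticks. The function names and all numbers are hypothetical and this is not alloy's implementation.

```python
import math


def first_poll_at_or_after(t: float, poll_interval: float, first_poll: float = 0.0) -> float:
    """Return the time of the first polling tick at or after time t (hypothetical model)."""
    if t <= first_poll:
        return first_poll
    ticks = math.ceil((t - first_poll) / poll_interval)
    return first_poll + ticks * poll_interval


def is_discovered(created_at: float, lifetime: float, poll_interval: float,
                  first_poll: float = 0.0) -> bool:
    """True if some polling tick falls inside [created_at, created_at + lifetime)."""
    next_tick = first_poll_at_or_after(created_at, poll_interval, first_poll)
    return next_tick < created_at + lifetime


def discovery_delay(created_at: float, poll_interval: float, first_poll: float = 0.0) -> float:
    """Seconds between file creation and the next polling tick."""
    return first_poll_at_or_after(created_at, poll_interval, first_poll) - created_at


# Assumed numbers for illustration only: a 10 s polling interval.
print(is_discovered(created_at=1.0, lifetime=30.0, poll_interval=10.0))
print(is_discovered(created_at=1.0, lifetime=5.0, poll_interval=10.0))
print(discovery_delay(created_at=1.0, poll_interval=10.0))
```

In this model a file that exists for less than one polling interval can be missed entirely, and a file that is seen can still wait up to one full interval before it is picked up.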
[14:33:12] cloud-services-team, Toolforge, Patch-For-Review: Missing Perl packages on dev.toolforge.org for anomiebot workflows - https://phabricator.wikimedia.org/T360488#11167795 (fnegri) I think we should really shut down the Buster bastion host, as Buster has been EOL for more than 1 year now. If the curre...
[14:35:28] FIRING: JobsEmailerNoEmails: No emails sent in the last hour - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/JobsEmailerNoEmails - https://prometheus-alerts.wmcloud.org/?q=alertname%3DJobsEmailerNoEmails
[14:37:50] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0)
[14:40:28] RESOLVED: [2x] JobsEmailerNoEmails: No emails sent in the last five hours - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/JobsEmailerNoEmails - https://prometheus-alerts.wmcloud.org/?q=alertname%3DJobsEmailerNoEmails
[14:40:58] RESOLVED: InstanceDown: Project tools instance tools-prometheus-9 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:54:35] cloud-services-team, Infrastructure-Foundations, Puppet-Core, Patch-For-Review: puppet: profile::auto_restarts::service: have a way to don't deploy the systemd timers - https://phabricator.wikimedia.org/T336845#11167931 (fgiunchedi) Open→Resolved a:fgiunchedi Ok change deployed, c...
[15:08:22] FIRING: HAProxyBackendUnavailable: HAProxy service radosgw-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[15:13:22] RESOLVED: [2x] HAProxyBackendUnavailable: HAProxy service radosgw-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[15:27:57] Toolforge (Toolforge iteration 24): [jobs-api] loki logs take really long to appear - https://phabricator.wikimedia.org/T404176#11168058 (dcaro) I think I found the issue, it's the `successfulJobsHistoryLimit` set to `0` in the cronjob itself, as jobs linger for ~30s. Manually setting that to `1` makes the...
[15:32:57] (open) dcaro: scheduledjobs: increase the history to allow log retrieval [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/214
[15:38:18] (open) dcaro: loki.alloy: decrease frequency for fetching logs [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/962
[15:40:32] (update) taavi: Makefile: Add targets for format, tidy, and test [repos/cloud/cloud-vps/tofu-cloudvps] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-cloudvps/-/merge_requests/12 (owner: bd808)
[15:42:17] (approved) taavi: Makefile: Add targets for format, tidy, and test [repos/cloud/cloud-vps/tofu-cloudvps] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-cloudvps/-/merge_requests/12 (owner: bd808)
[15:42:46] (merge) taavi: Makefile: Add targets for format, tidy, and test [repos/cloud/cloud-vps/tofu-cloudvps] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-cloudvps/-/merge_requests/12 (owner: bd808)
[15:47:27] cloud-services-team (FY2025/26-Q1), Cloud-VPS, Patch-For-Review: [tofu-cloudvps] cloudvps_puppet_prefix.hiera settings show dirty diffs based on YAML canonicalization - https://phabricator.wikimedia.org/T398643#11168180 (taavi) {{done}} I just tagged and published a 0.4.0 release.
[15:51:05] (merge) taavi: dns: Reference floating IP module resources [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/71
[15:51:09] (update) taavi: toolsbeta: dns: Add MX records for toolsbeta.org [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/73 (https://phabricator.wikimedia.org/T394997)
[16:01:03] cloud-services-team, Toolforge, Patch-For-Review: Missing Perl packages on dev.toolforge.org for anomiebot workflows - https://phabricator.wikimedia.org/T360488#11168262 (taavi) Repeating my question from the last time: are there any remaining infrastructure blockers remaining to make this happen? I...
[16:13:45] cloud-services-team, Data-Services, BetaFeatures, Data-Engineering, Data-Platform-SRE: Create view for betafeatures_user_counts table in wiki replicas - https://phabricator.wikimedia.org/T402145#11168322 (Ottomata)
[16:22:04] cloud-services-team, Striker, Patch-For-Review: Attaching Phabricator account to a second Developer account via Striker results in a fatal error - https://phabricator.wikimedia.org/T319500#11168388 (Anoop) @bd808 would it be possible to unlink Minato826 developer account from this phabricator account...
[16:23:17] (update) dcaro: loki.alloy: decrease frequency for fetching logs [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/962
[16:50:00] Toolforge (Toolforge iteration 24): [logging,lima-kilo] loki setup failst to start on linux - https://phabricator.wikimedia.org/T404226 (dcaro) NEW
[16:51:29] Toolforge (Toolforge iteration 24): [logging,lima-kilo] loki setup failst to start on linux - https://phabricator.wikimedia.org/T404226#11168677 (dcaro) Coredns is showing the errors: ` │ .:53...
[16:53:15] Toolforge (Toolforge iteration 24): [logging,lima-kilo] loki setup failst to start on linux - https://phabricator.wikimedia.org/T404226#11168686 (dcaro) p:Triage→High
[16:57:12] Toolforge (Toolforge iteration 24): [logging,lima-kilo] loki setup failst to start on linux - https://phabricator.wikimedia.org/T404226#11168710 (dcaro)
[17:04:32] Toolforge-standards-committee: Adoption request for fireflytools - https://phabricator.wikimedia.org/T403814#11168762 (LucasWerkmeister) Open→Resolved Alright, should be done. Thanks everyone!
[17:04:36] (CR) Majavah: [C:+2] docker: Simplify Bitu setup instructions [labs/striker] - https://gerrit.wikimedia.org/r/1186628 (owner: BryanDavis)
[17:06:05] (Merged) jenkins-bot: docker: Simplify Bitu setup instructions [labs/striker] - https://gerrit.wikimedia.org/r/1186628 (owner: BryanDavis)
[17:13:06] (CR) Majavah: [C:+2] phab_attach: Give notice when Phabricator account is already in use [labs/striker] - https://gerrit.wikimedia.org/r/1186629 (https://phabricator.wikimedia.org/T319500) (owner: BryanDavis)
[17:14:43] (Merged) jenkins-bot: phab_attach: Give notice when Phabricator account is already in use [labs/striker] - https://gerrit.wikimedia.org/r/1186629 (https://phabricator.wikimedia.org/T319500) (owner: BryanDavis)
[17:15:36] cloud-services-team, Striker, Patch-For-Review: Attaching Phabricator account to a second Developer account via Striker results in a fatal error - https://phabricator.wikimedia.org/T319500#11168832 (bd808) >>! In T319500#11168388, @Anoop wrote: > @bd808 would it be possible to unlink Minato826 develo...
[17:30:59] (merge) taavi: toolsbeta: dns: Add MX records for toolsbeta.org [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/73 (https://phabricator.wikimedia.org/T394997)
[17:31:00] (update) taavi: dns: Unify emails in SOA records [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/74
[17:33:49] (merge) taavi: dns: Unify emails in SOA records [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/74
[17:33:56] (update) taavi: tools: dns: Drop temporary migration names [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/75
[17:42:33] toolforge_i18n: Add support for {{#FORMAL:}} to toolforge_i18n - https://phabricator.wikimedia.org/T404232 (LucasWerkmeister) NEW
[17:44:07] toolforge_i18n: Add support for {{#FORMAL:}} to toolforge_i18n - https://phabricator.wikimedia.org/T404232#11168965 (LucasWerkmeister) (For the record, I very well might only start working on this whenever it becomes actually necessary, i.e. when either a translator has used the `{{#FORMAL:}}` feature in a t...
[17:45:33] (PS1) Majavah: Fix loading message template file [labs/tools/github-pr-closer] - https://gerrit.wikimedia.org/r/1187050
[17:45:34] (PS1) Majavah: Fix app logging [labs/tools/github-pr-closer] - https://gerrit.wikimedia.org/r/1187051
[17:46:00] (CR) CI reject: [V:-1] Fix app logging [labs/tools/github-pr-closer] - https://gerrit.wikimedia.org/r/1187051 (owner: Majavah)
[17:46:15] (CR) Majavah: [C:+2] Fix loading message template file [labs/tools/github-pr-closer] - https://gerrit.wikimedia.org/r/1187050 (owner: Majavah)
[17:46:40] (PS2) Majavah: Fix app logging [labs/tools/github-pr-closer] - https://gerrit.wikimedia.org/r/1187051
[17:46:44] (Merged) jenkins-bot: Fix loading message template file [labs/tools/github-pr-closer] - https://gerrit.wikimedia.org/r/1187050 (owner: Majavah)
[17:47:09] (CR) Majavah: [C:+2] Fix app logging [labs/tools/github-pr-closer] - https://gerrit.wikimedia.org/r/1187051 (owner: Majavah)
[17:48:13] (Merged) jenkins-bot: Fix app logging [labs/tools/github-pr-closer] - https://gerrit.wikimedia.org/r/1187051 (owner: Majavah)
[17:50:14] toolforge_i18n: Add support for {{#FORMAL:}} to toolforge_i18n - https://phabricator.wikimedia.org/T404232#11169000 (LucasWerkmeister) I’m not completely sure how this should work yet. My guess is we’d still keep separate language codes, for user preferences and/or `?uselang=` purposes, and infer the (in)for...
[18:02:47] (CR) Birusha: [C:+1] fix: page views data not loading for selected project [labs/tools/mostvisitedarticle] - https://gerrit.wikimedia.org/r/1167263 (https://phabricator.wikimedia.org/T398342) (owner: Bovimacoco)
[18:23:27] cloud-services-team, Striker: Cannot connect developer account to Phabricator (Error updating database) - https://phabricator.wikimedia.org/T404239 (Anoop) NEW
[18:24:12] cloud-services-team, Striker: Attaching Phabricator account to a second Developer account via Striker results in a fatal error - https://phabricator.wikimedia.org/T319500#11169214 (Anoop) >>! In T319500#11168832, @bd808 wrote: >>>! In T319500#11168388, @Anoop wrote: >> @bd808 would it be possible to unli...
[18:37:11] (CR) Eugene233: [C:+2] Remove the backend of this project to a new GitLab repo Update and test this repo to work as expected. [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1182211 (https://phabricator.wikimedia.org/T402970) (owner: Wandji collins)
[18:37:59] (Merged) jenkins-bot: Remove the backend of this project to a new GitLab repo Update and test this repo to work as expected. [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1182211 (https://phabricator.wikimedia.org/T402970) (owner: Wandji collins)
[19:01:39] cloud-services-team, Cloud-VPS, Ceph: [ceph,eqiad1] upgrade from quincy->reef (and bookworm) - https://phabricator.wikimedia.org/T404249 (Andrew) NEW
[19:01:53] VPS-project-voterlists: Create a web frontend for triggering list generation - https://phabricator.wikimedia.org/T404250 (SD0001) NEW
[19:02:25] cloud-services-team, Cloud-VPS, Ceph: [ceph,eqiad1] upgrade from pacific->quincy - https://phabricator.wikimedia.org/T402190#11169467 (Andrew) Open→Resolved Everything is now Quincy/Bullseye now, seems good.
[19:02:55] cloud-services-team, Cloud-VPS, Ceph, Patch-For-Review: [ceph,eqiad1] upgrade from quincy->reef (and bookworm) - https://phabricator.wikimedia.org/T404249#11169469 (Andrew) p:Triage→Medium
[19:03:35] cloud-services-team (FY2025/26-Q1), Cloud-VPS, SRE-OnFire, Sustainability (Incident Followup): [ceph,codfw1dev] upgrade the hosts from pacific->quincy - https://phabricator.wikimedia.org/T400334#11169471 (Andrew) Open→Resolved Everything is now on Quincy + Reef.
[19:10:39] Cloud-VPS (Quota-requests): Increase ioops for recommendation-api project - https://phabricator.wikimedia.org/T404254 (fkaelin) NEW
[19:24:10] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[19:24:11] (CR) Shadabgdg: "Please review this patch." [labs/tools/Isa] - https://gerrit.wikimedia.org/r/1181245 (https://phabricator.wikimedia.org/T316197) (owner: Rehan_khan_78)
[19:39:54] (CR) Aklapper: "Please do not ask everyday to review a patch - thanks." [labs/tools/Isa] - https://gerrit.wikimedia.org/r/1181245 (https://phabricator.wikimedia.org/T316197) (owner: Rehan_khan_78)
[19:49:02] cloud-services-team, Striker: unlink Minato826 developer account from @Anoop phabricator account - https://phabricator.wikimedia.org/T404239#11169689 (bd808)
[19:50:02] cloud-services-team, Striker: unlink Minato826 developer account from @Anoop phabricator account - https://phabricator.wikimedia.org/T404239#11169693 (bd808) Open→In progress p:Triage→Medium a:bd808
[19:56:26] cloud-services-team, Striker: unlink Minato826 developer account from @Anoop phabricator account - https://phabricator.wikimedia.org/T404239#11169712 (bd808) In progress→Resolved ` striker_admin@m5-master.eqiad.wmnet(striker)> select id, ldapname from labsauth_labsuser where phabname = 'Anoop...
[20:05:18] Tool-archive-externa-links: [Documentation] Réalisation d'une nouvelle capsule vidéo pour l'installation du script utilisateur ArchiveExternaLinks - https://phabricator.wikimedia.org/T404193#11169732 (poro26)
[20:50:00] Tool-quickcategories, MediaWiki-Core-AuthManager, MediaWiki-extensions-OAuth, MediaWiki-Platform-Team, MW-1.45-notes (1.45.0-wmf.19; 2025-09-16): Several mwapi (Python) based tools are failing to edit: badtoken: Invalid CSRF token. - https://phabricator.wikimedia.org/T403519#11169911 (matm...
[20:50:40] Tool-quickcategories, MediaWiki-Core-AuthManager, MediaWiki-extensions-OAuth, MediaWiki-Platform-Team, MW-1.45-notes (1.45.0-wmf.19; 2025-09-16): Several mwapi (Python) based tools are failing to edit: badtoken: Invalid CSRF token. - https://phabricator.wikimedia.org/T403519#11169915 (Luca...
[21:14:47] cloud-services-team (FY2025/26-Q1), Cloud-VPS, Patch-For-Review: [tofu-cloudvps] cloudvps_puppet_prefix.hiera settings show dirty diffs based on YAML canonicalization - https://phabricator.wikimedia.org/T398643#11169978 (bd808) In progress→Resolved
[21:51:42] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[21:51:43] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99)
[21:52:01] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[21:52:01] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99)
[21:53:58] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[21:57:33] PROBLEM - Host cloudcephosd1016 is DOWN: PING CRITICAL - Packet loss = 100%
[21:59:01] RECOVERY - Host cloudcephosd1016 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[22:02:18] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99)
[22:05:34] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[22:05:43] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99)
[22:06:45] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[22:06:54] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.osd.reactivate (exit_code=99)
[22:09:19] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.reactivate
[22:09:51] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.ceph.osd.reactivate (exit_code=0)
[22:29:39] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning