[02:12:44] (PS1) Dzahn: secrets: add fake SSH private key for zuul [labs/private] - https://gerrit.wikimedia.org/r/1161093 (https://phabricator.wikimedia.org/T395938) [02:15:43] (CR) Dzahn: [V:+2 C:+2] secrets: add fake SSH private key for zuul [labs/private] - https://gerrit.wikimedia.org/r/1161093 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn) [02:15:51] (PS2) Dzahn: secrets: add fake SSH private key for zuul [labs/private] - https://gerrit.wikimedia.org/r/1161093 (https://phabricator.wikimedia.org/T395938) [02:15:55] (CR) Dzahn: [V:+2] secrets: add fake SSH private key for zuul [labs/private] - https://gerrit.wikimedia.org/r/1161093 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn) [03:21:37] (PS1) Andrew Bogott: Comment back in cinder ldap passwords [labs/private] - https://gerrit.wikimedia.org/r/1161116 [03:22:02] (CR) Andrew Bogott: [V:+2 C:+2] Comment back in cinder ldap passwords [labs/private] - https://gerrit.wikimedia.org/r/1161116 (owner: Andrew Bogott) [04:25:22] FIRING: [12x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [04:26:17] PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:26:22] FIRING: [2x] HAProxyServiceUnavailable: HAProxy service neutron-api_backend has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable [04:26:32] cloud-services-team: HAProxyServiceUnavailable - https://phabricator.wikimedia.org/T397390 (phaultfinder) NEW [04:27:17] RECOVERY - nova-compute proc minimum 
on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:28:35] PROBLEM - nova-compute proc minimum on cloudvirt1071 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:29:35] RECOVERY - nova-compute proc minimum on cloudvirt1071 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:30:03] PROBLEM - nova-compute proc minimum on cloudvirt1073 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:30:17] PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:30:56] FIRING: SystemdUnitDown: The service unit nova-fullstack.service is in failed status on host cloudcontrol1007. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [04:31:03] RECOVERY - nova-compute proc minimum on cloudvirt1073 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:31:17] RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:33:37] PROBLEM - nova-compute proc minimum on cloudvirt1069 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:33:37] PROBLEM - nova-compute proc minimum on cloudvirt1067 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:34:03] PROBLEM - nova-compute proc minimum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:34:37] RECOVERY - nova-compute proc minimum on cloudvirt1069 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:34:37] RECOVERY - nova-compute proc minimum on cloudvirt1067 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:35:03] RECOVERY - nova-compute proc minimum on cloudvirt1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* 
/usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:36:23] PROBLEM - nova-compute proc minimum on cloudvirt1056 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:37:23] RECOVERY - nova-compute proc minimum on cloudvirt1056 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:40:17] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for all services [04:45:56] FIRING: SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [04:48:07] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment eqiad1 for all services [04:50:56] RESOLVED: SystemdUnitDown: The service unit designate_floating_ip_ptr_records_updater.service is in failed status on host cloudcontrol1006. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1006 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [04:55:52] RESOLVED: [24x] HAProxyBackendUnavailable: HAProxy service cinder-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [04:57:52] RESOLVED: [9x] HAProxyServiceUnavailable: HAProxy service heat-api_backend has no available backends on cloudlb1001:9900 - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyServiceUnavailable [06:52:49] VPS-project-Phabricator, collaboration-services: Requesting manual activation of phabricator.wmcloud.org accounts - https://phabricator.wikimedia.org/T397280#10930493 (A_smart_kitten) Thank you @dzahn! All seems to work okay :) [07:49:14] cloud-services-team: HAProxyServiceUnavailable - https://phabricator.wikimedia.org/T397390#10930715 (dcaro) @Andrew This seems fixed now, though it happened during your working hours I think and I see maybe it's related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1161148 ? [07:49:37] cloud-services-team: HAProxyServiceUnavailable - https://phabricator.wikimedia.org/T397390#10930716 (dcaro) p:Triage→Low [07:53:26] cloud-services-team: SystemdUnitDown The systemd unit nova-fullstack.service on node cloudcontrol1007 has been failing for more than two hours. - https://phabricator.wikimedia.org/T397357#10930735 (dcaro) This is still happening, it seems to be timing out when waiting for the reverse DNS cleanup: ` Jun 19 0... [07:58:08] cloud-services-team: SystemdUnitDown The systemd unit backup_cinder_volumes.service on node cloudbackup1001-dev has been failing for more than two hours. 
- https://phabricator.wikimedia.org/T397105#10930752 (dcaro) Open→Resolved a:dcaro [07:58:54] cloud-services-team: SystemdUnitDown The systemd unit nova-fullstack.service on node cloudcontrol1007 has been failing for more than two hours. - https://phabricator.wikimedia.org/T397357#10930755 (dcaro) It seems it has been flapping very often lately (https://grafana-rw.wikimedia.org/d/ebJoA6VWz/wmcs-opens... [08:05:43] cloud-services-team: SystemdUnitDown The systemd unit nova-fullstack.service on node cloudcontrol1007 has been failing for more than two hours. - https://phabricator.wikimedia.org/T397357#10930852 (dcaro) Got specially choppy in the last couple of days: {F62387966} [08:05:54] cloud-services-team: SystemdUnitDown The systemd unit nova-fullstack.service on node cloudcontrol1007 has been failing for more than two hours. - https://phabricator.wikimedia.org/T397357#10930864 (dcaro) Cleaned up all the existing VMs, trying to get a clean run [08:16:10] wikitech.wikimedia.org: Wikitech double redirect bot needs new SUL OAuth credentials after Wikitech authn changes - https://phabricator.wikimedia.org/T376224#10930958 (taavi) Open→Resolved a:taavi [08:16:30] cloud-services-team: SystemdUnitDown The systemd unit nova-fullstack.service on node cloudcontrol1007 has been failing for more than two hours. - https://phabricator.wikimedia.org/T397357#10930961 (dcaro) Hmm.... the fullstack logs did successfully remove the VM, but the DNS records are still there for a dif... 
[08:28:12] (open) dcaro: README: add dev notes about authentication [repos/cloud/cloud-vps/horizon/deploy] (support_podman) - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/horizon/deploy/-/merge_requests/6 [08:28:27] (update) dcaro: README: add dev notes about authentication [repos/cloud/cloud-vps/horizon/deploy] (support_podman) - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/horizon/deploy/-/merge_requests/6 [08:29:18] (update) dcaro: makefile: support podman [repos/cloud/cloud-vps/horizon/deploy] (use_markdown) - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/horizon/deploy/-/merge_requests/5 [08:29:22] (update) dcaro: makefile: support podman [repos/cloud/cloud-vps/horizon/deploy] (use_markdown) - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/horizon/deploy/-/merge_requests/5 [08:29:38] (update) dcaro: README: add dev notes about authentication [repos/cloud/cloud-vps/horizon/deploy] (support_podman) - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/horizon/deploy/-/merge_requests/6 [08:29:41] (update) dcaro: README: use makrdown for nice presentation in gitlab [repos/cloud/cloud-vps/horizon/deploy] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/horizon/deploy/-/merge_requests/4 [08:31:28] cloud-services-team: SystemdUnitDown The systemd unit nova-fullstack.service on node cloudcontrol1007 has been failing for more than two hours. - https://phabricator.wikimedia.org/T397357#10931044 (dcaro) Open→Resolved a:dcaro I cleaned up all the VMs, and ran the `wmcs-dnsleaks --delpoyment... 
[08:33:12] cloud-services-team, Horizon, Patch-For-Review: Horizon proxy tab Edit buttons not working - https://phabricator.wikimedia.org/T397272#10931063 (dcaro) Open→In progress p:Triage→Medium a:dcaro [08:33:28] cloud-services-team (FY2024/2025-Q3-Q4), Horizon, Patch-For-Review: Horizon proxy tab Edit buttons not working - https://phabricator.wikimedia.org/T397272#10931068 (dcaro) [08:33:42] cloud-services-team: SystemdUnitDown The systemd unit backup_cinder_volumes.service on node cloudbackup1002-dev has been failing for more than two hours. - https://phabricator.wikimedia.org/T397100#10931082 (dcaro) Open→Resolved a:dcaro This is fixed now [08:35:44] cloud-services-team: KernelErrors Server cloudcephosd1024 logged kernel errors - https://phabricator.wikimedia.org/T396937#10931091 (dcaro) Open→Resolved a:dcaro Expected: ` root@cloudcephosd1024:~# journalctl -k -p err -- Journal begins at Sat 2025-06-14 21:29:07 UTC, ends at Thu 2025-06-19... [08:36:01] cloud-services-team: NovafullstackSustainedFailures Novafullstack tests have been failing for more than 5hours in eqiad - https://phabricator.wikimedia.org/T396934#10931097 (dcaro) Open→Resolved a:dcaro Cleaned up and restarted, and now it's working. [08:37:56] cloud-services-team: KernelErrors Server cloudcephosd1015 logged kernel errors - https://phabricator.wikimedia.org/T396796#10931107 (dcaro) Open→Resolved a:dcaro Expected: ` Jun 17 19:03:08 cloudcephosd1014 kernel: x86/cpu: VMX (outside TXT) disabled by BIOS... [08:38:05] cloud-services-team: KernelErrors Server cloudcephosd1016 logged kernel errors - https://phabricator.wikimedia.org/T396801#10931111 (dcaro) Open→Resolved a:dcaro Expected: ` Jun 17 19:03:08 cloudcephosd1014 kernel: x86/cpu: VMX (outside TXT) disabled by BIOS... 
[08:38:10] cloud-services-team: KernelErrors - https://phabricator.wikimedia.org/T396810#10931115 (dcaro) Open→Resolved a:dcaro Expected: ` Jun 17 19:03:08 cloudcephosd1014 kernel: x86/cpu: VMX (outside TXT) disabled by BIOS... [08:38:16] cloud-services-team: KernelErrors Server cloudcephosd1017 logged kernel errors - https://phabricator.wikimedia.org/T396832#10931120 (dcaro) Open→Resolved a:dcaro Expected: ` Jun 17 19:03:08 cloudcephosd1014 kernel: x86/cpu: VMX (outside TXT) disabled by BIOS... [08:38:25] cloud-services-team: KernelErrors Server cloudcephosd1018 logged kernel errors - https://phabricator.wikimedia.org/T396859#10931124 (dcaro) Open→Resolved a:dcaro Expected: ` Jun 17 19:03:08 cloudcephosd1014 kernel: x86/cpu: VMX (outside TXT) disabled by BIOS... [08:38:33] cloud-services-team: KernelErrors Server cloudcephosd1019 logged kernel errors - https://phabricator.wikimedia.org/T396909#10931128 (dcaro) Open→Resolved a:dcaro Expected: ` Jun 17 19:03:08 cloudcephosd1014 kernel: x86/cpu: VMX (outside TXT) disabled by BIOS... [08:38:39] cloud-services-team: KernelErrors Server cloudcephosd1020 logged kernel errors - https://phabricator.wikimedia.org/T396917#10931132 (dcaro) Open→Resolved a:dcaro Expected: ` Jun 17 19:03:08 cloudcephosd1014 kernel: x86/cpu: VMX (outside TXT) disabled by BIOS... [08:38:45] cloud-services-team: KernelErrors Server cloudcephosd1022 logged kernel errors - https://phabricator.wikimedia.org/T396921#10931136 (dcaro) Open→Resolved a:dcaro Expected: ` Jun 17 19:03:08 cloudcephosd1014 kernel: x86/cpu: VMX (outside TXT) disabled by BIOS... [08:38:53] cloud-services-team: KernelErrors Server cloudcephosd1023 logged kernel errors - https://phabricator.wikimedia.org/T396929#10931151 (dcaro) Open→Resolved a:dcaro Expected: ` Jun 17 19:03:08 cloudcephosd1014 kernel: x86/cpu: VMX (outside TXT) disabled by BIOS... 
[08:41:06] cloud-services-team: PuppetFailure Puppet has failed on cloudcontrol2010-dev:9100 - https://phabricator.wikimedia.org/T396769#10931156 (dcaro) Open→Resolved a:dcaro Not failing anymore. [08:43:20] Tool-query-chest, Wikidata, Wikidata Query UI: Use query-chest for short URLs when the w.wiki shortener fails for long queries - https://phabricator.wikimedia.org/T334893#10931160 (jhsoby) [08:52:56] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, collaboration-services, GitLab (Infrastructure): Volume is stuck to deleted instance in devtools project - https://phabricator.wikimedia.org/T396739#10931174 (dcaro) p:Triage→High [08:53:02] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, collaboration-services, GitLab (Infrastructure): Volume is stuck to deleted instance in devtools project - https://phabricator.wikimedia.org/T396739#10931176 (dcaro) [08:53:13] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, collaboration-services, GitLab (Infrastructure): Volume is stuck to deleted instance in devtools project - https://phabricator.wikimedia.org/T396739#10931177 (dcaro) a:Andrew [08:57:25] !log dcaro@acme toolsbeta-logging START - Cookbook wmcs.vps.create_project for project toolsbeta-logging in eqiad1 (T397339) [08:57:26] wmbot~dcaro@acme: Unknown project "toolsbeta-logging" [08:57:27] T397339: Request creation of toolsbeta-logging VPS project - https://phabricator.wikimedia.org/T397339 [08:57:30] !log dcaro@acme toolsbeta-logging END (FAIL) - Cookbook wmcs.vps.create_project (exit_code=99) for project toolsbeta-logging in eqiad1 (T397339) [08:57:31] wmbot~dcaro@acme: Unknown project "toolsbeta-logging" [08:57:33] !log dcaro@acme toolsbeta-logging START - Cookbook wmcs.vps.create_project for project toolsbeta-logging in eqiad1 (T397339) [08:57:34] wmbot~dcaro@acme: Unknown project "toolsbeta-logging" [08:57:44] !log dcaro@acme toolsbeta-logging END (FAIL) - Cookbook wmcs.vps.create_project 
(exit_code=99) for project toolsbeta-logging in eqiad1 (T397339) [08:57:44] wmbot~dcaro@acme: Unknown project "toolsbeta-logging" [08:59:24] !log dcaro@cloudcumin1001 toolsbeta-logging START - Cookbook wmcs.vps.create_project for project toolsbeta-logging in eqiad1 (T397339) [08:59:24] dcaro@cloudcumin1001: Unknown project "toolsbeta-logging" [09:00:03] (open) group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49: projects: added project toolsbeta-logging [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/249 (https://phabricator.wikimedia.org/T397339) [09:03:28] (approved) taavi: projects: added project toolsbeta-logging [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/249 (https://phabricator.wikimedia.org/T397339) (owner: group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49) [09:03:34] dcaro@cloudcumin1001 create_project (PID 3616594) is awaiting input [09:03:48] (merge) dcaro: projects: added project toolsbeta-logging [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/249 (https://phabricator.wikimedia.org/T397339) (owner: group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49) [09:06:19] !log dcaro@cloudcumin1001 toolsbeta-logging END (FAIL) - Cookbook wmcs.vps.create_project (exit_code=99) for project toolsbeta-logging in eqiad1 (T397339) [09:06:23] T397339: Request creation of toolsbeta-logging VPS project - https://phabricator.wikimedia.org/T397339 [09:08:28] !log dcaro@cloudcumin1001 toolsbeta-logging START - Cookbook wmcs.vps.create_project for project toolsbeta-logging in eqiad1 (T397339) [09:09:29] !log dcaro@cloudcumin1001 toolsbeta-logging END (FAIL) - Cookbook wmcs.vps.create_project (exit_code=99) for project toolsbeta-logging in eqiad1 (T397339) [09:09:48] cloud-services-team (FY2024/2025-Q3-Q4), Data-Services, Data-Persistence: Migrate clouddb* 
hosts to MariaDB 10.11 - https://phabricator.wikimedia.org/T394372#10931248 (Marostegui) Thank you! [09:34:16] !log dcaro@cloudcumin1001 toolsbeta-logging START - Cookbook wmcs.vps.create_project for project toolsbeta-logging in eqiad1 (T397339) [09:34:19] T397339: Request creation of toolsbeta-logging VPS project - https://phabricator.wikimedia.org/T397339 [09:36:14] !log dcaro@cloudcumin1001 toolsbeta-logging END (PASS) - Cookbook wmcs.vps.create_project (exit_code=0) for project toolsbeta-logging in eqiad1 (T397339) [11:18:30] cloud-services-team, Bitu, Infrastructure-Foundations, LDAP: Allocate more available UNIX UIDs for human users - https://phabricator.wikimedia.org/T355663#10931620 (MoritzMuehlenhoff) >>! In T355663#10852835, @jhathaway wrote: > I'm not sure it is much of an issue, but that range overlaps with `s... [11:22:34] cloud-services-team, Bitu, Infrastructure-Foundations, LDAP: Allocate more available UNIX UIDs for human users - https://phabricator.wikimedia.org/T355663#10931634 (SLyngshede-WMF) Sounds good to me, 400.000 should last a pretty long time. I still think that we should stop allocating uidNumbers... 
[11:47:57] (open) taavi: Use separate project for log storage buckets [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/50 (https://phabricator.wikimedia.org/T396574) [11:48:00] (update) taavi: Use separate project for log storage buckets [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/50 (https://phabricator.wikimedia.org/T396574) [11:49:51] (update) taavi: Use separate project for log storage buckets [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/50 (https://phabricator.wikimedia.org/T396574) [11:50:09] (PS8) Slyngshede: Build: Update build system [labs/countervandalism/CVNBot] - https://gerrit.wikimedia.org/r/1143806 [11:50:42] (CR) CI reject: [V:-1] Build: Update build system [labs/countervandalism/CVNBot] - https://gerrit.wikimedia.org/r/1143806 (owner: Slyngshede) [11:52:47] (update) taavi: Use separate project for log storage buckets [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/50 (https://phabricator.wikimedia.org/T396574) [11:55:26] (update) taavi: Use separate project for log storage buckets [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/50 (https://phabricator.wikimedia.org/T396574) [11:57:19] (update) taavi: Use separate project for log storage buckets [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/50 (https://phabricator.wikimedia.org/T396574) [11:58:35] (update) taavi: Use separate project for log storage buckets [repos/cloud/toolforge/tofu-provisioning] - 
https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/50 (https://phabricator.wikimedia.org/T396574) [12:01:07] (update) taavi: Use separate project for log storage buckets [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/50 (https://phabricator.wikimedia.org/T396574) [12:02:14] cloud-services-team, Bitu, Infrastructure-Foundations, LDAP: Allocate more available UNIX UIDs for human users - https://phabricator.wikimedia.org/T355663#10931823 (MoritzMuehlenhoff) >>! In T355663#10931634, @SLyngshede-WMF wrote: > Sounds good to me, 400.000 should last a pretty long time. > >... [12:08:51] Cloud-VPS (Project-requests): Request creation of toolsbeta-logging VPS project - https://phabricator.wikimedia.org/T397339#10931851 (taavi) Open→Resolved a:dcaro [12:10:30] Cloud-VPS (Project-requests): Request creation of VPS project - https://phabricator.wikimedia.org/T397446 (taavi) NEW [12:10:32] Cloud-VPS (Project-requests): Request creation of VPS project - https://phabricator.wikimedia.org/T397446#10931872 (taavi) [12:10:40] cloud-services-team, Toolforge, Patch-For-Review: Provision object storage volumes for Loki - https://phabricator.wikimedia.org/T396574#10931873 (taavi) [12:10:52] Cloud-VPS (Project-requests): Request creation of tools-logging VPS project - https://phabricator.wikimedia.org/T397446#10931874 (taavi) [12:25:17] (CR) CI reject: [V:-1] Localisation updates from https://translatewiki.net. 
[labs/tools/map-of-monuments] - https://gerrit.wikimedia.org/r/1161499 (owner: L10n-bot) [12:32:46] Cloud-VPS (Project-requests): Request creation of tools-logging VPS project - https://phabricator.wikimedia.org/T397446#10931921 (dcaro) +1 [12:33:06] !log dcaro@cloudcumin1001 tools-logging START - Cookbook wmcs.vps.create_project for project tools-logging in eqiad1 (T397446) [12:33:07] dcaro@cloudcumin1001: Unknown project "tools-logging" [12:33:08] T397446: Request creation of tools-logging VPS project - https://phabricator.wikimedia.org/T397446 [12:33:47] (update) group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49: projects: added project tools-logging [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/250 (https://phabricator.wikimedia.org/T397446) [12:33:51] (open) group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49: projects: added project tools-logging [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/250 (https://phabricator.wikimedia.org/T397446) [12:35:09] (approved) dcaro: projects: added project tools-logging [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/250 (https://phabricator.wikimedia.org/T397446) (owner: group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49) [12:35:12] (merge) dcaro: projects: added project tools-logging [repos/cloud/cloud-vps/tofu-infra] - https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/250 (https://phabricator.wikimedia.org/T397446) (owner: group_199_bot_333a6c67971a471aeb1cf0b14ccf9f49) [12:35:16] !log dcaro@cloudcumin1001 tools-logging END (ERROR) - Cookbook wmcs.vps.create_project (exit_code=97) for project tools-logging in eqiad1 (T397446) [12:35:17] dcaro@cloudcumin1001: Unknown project "tools-logging" [12:35:21] !log dcaro@cloudcumin1001 tools-logging START - Cookbook 
wmcs.vps.create_project for project tools-logging in eqiad1 (T397446) [12:35:21] dcaro@cloudcumin1001: Unknown project "tools-logging" [12:38:13] !log dcaro@cloudcumin1001 tools-logging END (FAIL) - Cookbook wmcs.vps.create_project (exit_code=99) for project tools-logging in eqiad1 (T397446) [12:38:16] T397446: Request creation of tools-logging VPS project - https://phabricator.wikimedia.org/T397446 [12:43:09] !log dcaro@cloudcumin1001 tools-logging START - Cookbook wmcs.vps.create_project for project tools-logging in eqiad1 (T397446) [12:46:22] !log dcaro@cloudcumin1001 tools-logging END (PASS) - Cookbook wmcs.vps.create_project (exit_code=0) for project tools-logging in eqiad1 (T397446) [12:46:26] T397446: Request creation of tools-logging VPS project - https://phabricator.wikimedia.org/T397446 [12:47:31] Cloud-VPS (Project-requests), Patch-For-Review: Request creation of tools-logging VPS project - https://phabricator.wikimedia.org/T397446#10931996 (dcaro) Open→Resolved p:Triage→High [12:48:50] (CR) Abijeet Patro: [V:+2] Localisation updates from https://translatewiki.net. 
[labs/tools/map-of-monuments] - https://gerrit.wikimedia.org/r/1161499 (owner: L10n-bot) [12:58:57] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [13:00:12] PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 324 bytes in 60.012 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [13:03:28] FIRING: InstanceDown: Project tools instance tools-nfs-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [13:03:31] FIRING: ToolsNFSDown: No tools nfs services running found - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNFSDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsNFSDown [13:03:57] FIRING: [4x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [13:04:25] cloud-services-team, Toolforge: Cannot log into Toolforge - https://phabricator.wikimedia.org/T397451 (MBH) NEW [13:13:28] RECOVERY - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 50.585 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [13:13:28] RESOLVED: InstanceDown: Project tools instance tools-nfs-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [13:14:58] !log 
taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for all NFS workers [13:18:22] FIRING: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown [13:18:31] RESOLVED: ToolsNFSDown: No tools nfs services running found - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNFSDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsNFSDown [13:18:57] FIRING: [4x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [13:21:31] FIRING: ToolsToolsDBWritableState: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState [13:23:57] RESOLVED: ProbeDown: Service tools-k8s-haproxy-6:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-6:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [13:26:31] RESOLVED: ToolsToolsDBWritableState: There should be exactly one writable MariaDB instance instead of 0 - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsToolsDBWritableState [13:42:38] cloud-services-team, Toolforge: Cannot log into Toolforge - https://phabricator.wikimedia.org/T397451#10932253 (taavi) Open→Resolved a:taavi This was caused by an outage of the Toolforge NFS server 
that we believe is now fixed. [13:47:04] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-36 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [13:47:07] cloud-services-team, Toolforge: [toolsbeta,tofu,infra] There's some discrepancy between the volumes in toolsbeta and tofu - https://phabricator.wikimedia.org/T396276#10932270 (taavi) Open→Resolved a:dcaro Resolved with https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-... [13:50:04] cloud-services-team, Toolforge: `become` command not working properly on login-buster.toolforge.org - https://phabricator.wikimedia.org/T391538#10932282 (dcaro) Open→Resolved @Ykhwong this was caused by a wider outage in toolforge, should be working agan, please reopen if you still face issues. [13:52:24] cloud-services-team, Toolforge: `become` command not working properly on login-buster.toolforge.org - https://phabricator.wikimedia.org/T391538#10932291 (Ykhwong) Resolved→Open Thanks for the update. However, I'm still experiencing the issue. When I run the become command on login-buster.toolforg... 
[13:54:42] cloud-services-team, Toolforge: `become` command not working properly on login-buster.toolforge.org - https://phabricator.wikimedia.org/T391538#10932299 (dcaro) Yep, it seems it's still hanging (note that it does not happens with all tools, `wm-lol` did work, but `tedbot` does not), I'll reboot 👍
[13:55:52] RESOLVED: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[13:56:28] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-74 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[13:57:27] cloud-services-team, Toolforge: `become` command not working properly on login-buster.toolforge.org - https://phabricator.wikimedia.org/T391538#10932327 (Ykhwong) Open→Resolved Thanks for the reboot — the issue seems to be resolved now. become is working properly again on login-buster.toolfor...
[13:58:45] cloud-services-team, Toolforge: `become` command not working properly on login-buster.toolforge.org - https://phabricator.wikimedia.org/T391538#10932340 (dcaro) @Ykhwong awesome :), may I ask why are you using the old buster bastion and not the newer one? (so we can provide whatever is missing for yo...
[14:00:25] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS (Debian Buster Deprecation), Toolforge (Toolforge iteration 21), Epic, Goal: [infra] Toolforge: migrate to Debian Bullseye or later - https://phabricator.wikimedia.org/T311897#10932342 (dcaro)
[14:01:28] RESOLVED: InstanceDown: Project tools instance tools-k8s-worker-nfs-74 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:02:34] cloud-services-team, Toolforge: `become` command not working properly on login-buster.toolforge.org - https://phabricator.wikimedia.org/T391538#10932345 (Ykhwong) Oh, I didn't realize I was still using the Buster bastion. Thanks for letting me know. I'll check out the migration guide and start transi...
[14:06:21] cloud-services-team, Toolforge: Lock down tools-sgebastion-10 (login-buster.toolforge.org) to only members of tools with known dependencies on it - https://phabricator.wikimedia.org/T397459 (taavi) NEW
[14:07:08] cloud-services-team, Toolforge: Lock down tools-sgebastion-10 (login-buster.toolforge.org) to only members of tools with known dependencies on it - https://phabricator.wikimedia.org/T397459#10932360 (taavi)
[14:09:57] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[14:10:20] cloud-services-team, Toolforge: [toolsdb] Revisit WritableState alert - https://phabricator.wikimedia.org/T397460 (fnegri) NEW
[14:10:59] cloud-services-team, Toolforge: [toolsdb] Revisit WritableState alert - https://phabricator.wikimedia.org/T397460#10932377 (fnegri) p:Triage→Medium
[14:19:57] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) -
https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[14:55:32] (update) dcaro: deploy_task: store error when build fails [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/92
[15:25:28] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-5 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:29:21] (update) dcaro: deploy_task: store error when build fails [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/92
[15:30:28] RESOLVED: InstanceDown: Project tools instance tools-k8s-worker-nfs-5 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:50:03] (update) dcaro: deploy_task: store error when build fails [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/92
[15:58:15] (update) dcaro: deploy_task: store error when build fails [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/92
[16:02:57] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[16:04:23] (update) chuckonwumelu: show: Display latest deployment if no deploy_id included [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/36 (https://phabricator.wikimedia.org/T394994)
[16:07:57]
FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[16:10:28] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-39 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[16:12:56] (update) dcaro: deploy_task: store error when build fails [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/92
[16:15:28] RESOLVED: InstanceDown: Project tools instance tools-k8s-worker-nfs-39 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[16:17:57] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[16:18:47] (update) dcaro: deploy_task: store error when build fails [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/92
[16:18:58] FIRING: [2x] InstanceDown: Project tools instance tools-k8s-worker-nfs-38 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[16:21:11] (approved) fnegri: show: Display latest deployment if no deploy_id included [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/36 (https://phabricator.wikimedia.org/T394994) (owner: chuckonwumelu)
[16:23:58] RESOLVED: [2x] InstanceDown: Project tools instance tools-k8s-worker-nfs-38 is down -
https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[16:26:47] Toolforge (Toolforge iteration 21): [components-api] Add all missing options for scheduled components - https://phabricator.wikimedia.org/T395071#10932707 (dcaro) a:dcaro
[16:26:57] Toolforge (Toolforge iteration 21): [components-api] add all the missing options for continuous components - https://phabricator.wikimedia.org/T395070#10932710 (dcaro) a:Raymond_Ndibe→dcaro
[16:39:44] (update) chuckonwumelu: show: Display latest deployment if no deploy_id included [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/36 (https://phabricator.wikimedia.org/T394994)
[16:58:31] (open) dcaro: deploy: add all the missing options for continuous job [repos/cloud/toolforge/components-api] (generate_config) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/93
[16:58:47] Toolforge (Toolforge iteration 21): [components-api] add all the missing options for continuous components - https://phabricator.wikimedia.org/T395070#10932775 (dcaro) Open→In progress
[16:59:01] (update) dcaro: deploy: add all the missing options for continuous job [repos/cloud/toolforge/components-api] (generate_config) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/93 (https://phabricator.wikimedia.org/T395070)
[17:00:57] (update) dcaro: deploy: add all the missing options for continuous job [repos/cloud/toolforge/components-api] (generate_config) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/93 (https://phabricator.wikimedia.org/T395070)
[17:05:34] FIRING: DiskSpace: Disk space cloudbackup1004:9100:/srv 5.995% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 -
https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[17:15:17] (approved) dcaro: health_check: default to 'type' but support 'health_check_type' [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/107 (https://phabricator.wikimedia.org/T396210)
[17:16:30] (update) dcaro: health_check: default to 'type' but support 'health_check_type' [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/107 (https://phabricator.wikimedia.org/T396210)
[17:16:34] (merge) dcaro: health_check: default to 'type' but support 'health_check_type' [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/107 (https://phabricator.wikimedia.org/T396210)
[17:17:13] (open) dcaro: d/changelog: bump to 16.1.14 [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/108 (https://phabricator.wikimedia.org/T396210)
[17:17:38] (update) dcaro: d/changelog: bump to 16.1.14 [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/108 (https://phabricator.wikimedia.org/T396210)
[17:18:12] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli
[17:19:11] cloud-services-team, Striker: Striker should use ID instead of username to identify SUL accounts - https://phabricator.wikimedia.org/T359428#10932856 (Arendpieter) @taavi What’s still left to do on this issue? All the pull requests are merged.
I’m looking for an interesting Python issue to work on 😉
[17:23:15] (update) dcaro: deploy: add all the missing options for continuous job [repos/cloud/toolforge/components-api] (generate_config) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/93 (https://phabricator.wikimedia.org/T395070)
[17:23:30] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-cli
[17:23:37] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-cli
[17:28:13] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-cli
[17:32:57] FIRING: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:33:48] (approved) dcaro: d/changelog: bump to 16.1.14 [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/108 (https://phabricator.wikimedia.org/T396210)
[17:33:55] (merge) dcaro: d/changelog: bump to 16.1.14 [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/108 (https://phabricator.wikimedia.org/T396210)
[17:36:21] (update) dcaro: deploy: add all the missing options for continuous job [repos/cloud/toolforge/components-api] (generate_config) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/93 (https://phabricator.wikimedia.org/T395070)
[17:37:06] (approved) dcaro: health-check: return `type` by default [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/174
(https://phabricator.wikimedia.org/T396210)
[17:37:09] (merge) dcaro: health-check: return `type` by default [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/174 (https://phabricator.wikimedia.org/T396210)
[17:37:57] RESOLVED: ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-k8s-haproxy-5:30000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[17:40:55] (open) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: jobs-api: bump to 0.0.381-20250619173722-eab6c9fe [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/819 (https://phabricator.wikimedia.org/T396210)
[17:41:03] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[17:45:55] (update) dcaro: deploy: add all the missing options for continuous job [repos/cloud/toolforge/components-api] (generate_config) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/93 (https://phabricator.wikimedia.org/T395070)
[17:48:15] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api
[17:49:06] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api
[17:55:28] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-11 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:57:14] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api
[17:57:34] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-27 has at least 12 procs in D state,
and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[17:58:09] (approved) dcaro: jobs-api: bump to 0.0.381-20250619173722-eab6c9fe [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/819 (https://phabricator.wikimedia.org/T396210) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[17:58:12] (merge) dcaro: jobs-api: bump to 0.0.381-20250619173722-eab6c9fe [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/819 (https://phabricator.wikimedia.org/T396210) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[17:59:32] (update) dcaro: deploy: add all the missing options for continuous job [repos/cloud/toolforge/components-api] (generate_config) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/93 (https://phabricator.wikimedia.org/T395070)
[18:00:28] RESOLVED: InstanceDown: Project tools instance tools-k8s-worker-nfs-11 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[18:02:03] (update) dcaro: deploy: add all the missing options for continuous job [repos/cloud/toolforge/components-api] (generate_config) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/93 (https://phabricator.wikimedia.org/T395070)
[18:04:32] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for all NFS workers
[18:15:34] RESOLVED: DiskSpace: Disk space cloudbackup1004:9100:/srv 5.991% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space -
https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudbackup1004 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[18:23:28] (update) dcaro: deploy: add all the missing options for continuous job [repos/cloud/toolforge/components-api] (generate_config) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/93 (https://phabricator.wikimedia.org/T395071)
[18:25:53] (update) dcaro: deploy: add all the missing options for continuous job [repos/cloud/toolforge/components-api] (generate_config) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/93 (https://phabricator.wikimedia.org/T395071)
[18:27:56] (update) chuckonwumelu: show: Display latest deployment if no deploy_id included [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/36 (https://phabricator.wikimedia.org/T394994)
[18:28:56] (merge) chuckonwumelu: GET the latest deployment for a particular tool [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/87 (https://phabricator.wikimedia.org/T394990)
[18:29:23] (merge) chuckonwumelu: show: Display latest deployment if no deploy_id included [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/36 (https://phabricator.wikimedia.org/T394994)
[18:31:20] (open) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: components-api: bump to 0.0.120-20250619182909-09ea62ae [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/820 (https://phabricator.wikimedia.org/T394990)
[18:35:54] (update) dcaro: deploy: add all the missing options for continuous job [repos/cloud/toolforge/components-api] (generate_config) -
https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/93 (https://phabricator.wikimedia.org/T395070)
[18:36:40] (open) dcaro: scheduled: add scheduled component support [repos/cloud/toolforge/components-api] (add_all_continuous_options) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/94 (https://phabricator.wikimedia.org/T395071)
[18:36:57] (update) dcaro: scheduled: add scheduled component support [repos/cloud/toolforge/components-api] (add_all_continuous_options) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/94 (https://phabricator.wikimedia.org/T395071)
[18:43:09] !log chuckonwumelu@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component components-api
[18:46:41] !log chuckonwumelu@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-api
[20:25:28] FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance cvn-app10 in project cvn - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun