[01:34:50] (update) ahecht: Draft: Cache database queries [toolforge-repos/afdstats] - https://gitlab.wikimedia.org/toolforge-repos/afdstats/-/merge_requests/3
[01:44:35] (update) ahecht: Draft: Cache database queries [toolforge-repos/afdstats] - https://gitlab.wikimedia.org/toolforge-repos/afdstats/-/merge_requests/3
[02:01:48] (PS1) Jacob4code: Cannot read properties of undefined fixed and minor changes. [labs/tools/WdTmCollab] - https://gerrit.wikimedia.org/r/1163075
[06:58:53] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, collaboration-services, GitLab (Infrastructure): Volume is stuck to deleted instance in devtools project - https://phabricator.wikimedia.org/T396739#10940830 (Jelto) >>! In T396739#10934468, @Andrew wrote: > I'm still wrestling with gitlab-prod-...
[07:00:28] FIRING: InstanceDown: Project tools instance tools-prometheus-9 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[07:05:30] (update) dcaro: components-api: deploy also on tools [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/785
[07:18:34] (PS1) Muehlenhoff: Add dummy secrets for debmonitor_dev [labs/private] - https://gerrit.wikimedia.org/r/1163211
[07:26:20] FIRING: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[07:26:20] FIRING: ToolforgeKubernetesHAproxyUnknown: Toolforge HAproxy has unknown state.
HAproxy might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyUnknown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyUnknown
[07:26:20] FIRING: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[07:30:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-9 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[07:31:20] RESOLVED: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown
[07:31:20] RESOLVED: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown
[07:31:25] RESOLVED: ToolforgeKubernetesHAproxyUnknown: Toolforge HAproxy has unknown state.
HAproxy might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyUnknown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyUnknown
[07:34:55] (open) dcaro: components: add test for the generate feature [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/822
[07:39:29] (update) dcaro: components: add test for the generate feature [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/822
[07:41:04] (update) dcaro: components: add test for the generate feature [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/822
[07:42:16] (update) dcaro: components: add test for the generate feature [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/822
[07:44:01] cloud-services-team, Domains, Traffic, IPv6: Add IPv6 glue records for WMCS Designate-hosted domains - https://phabricator.wikimedia.org/T397185#10940955 (taavi) Open→Resolved a:ssingh Thanks! Everything looks fine from my end so closing.
[07:45:40] (update) dcaro: components: add test for the generate feature [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/822
[07:46:09] (CR) Elukey: [C:+1] Add dummy secrets for debmonitor_dev [labs/private] - https://gerrit.wikimedia.org/r/1163211 (owner: Muehlenhoff)
[07:56:20] (update) dcaro: components: add test for the generate feature [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/822
[08:35:07] cloud-services-team, Cloud-VPS: Create OpenStack role that allows object storage access only - https://phabricator.wikimedia.org/T396594#10941074 (taavi) Resolved→Open Minor problem: this role doesn't have access to create ec2 creds: `lang=shell-session taavi@cloudcontrol1007 ~ $ export OS_PASSW...
[08:36:05] (update) dcaro: components: add test for the generate feature [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/822
[08:37:51] (CR) Muehlenhoff: [V:+2 C:+2] Add dummy secrets for debmonitor_dev [labs/private] - https://gerrit.wikimedia.org/r/1163211 (owner: Muehlenhoff)
[08:57:32] (update) dcaro: components: add test for the generate feature [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/822
[09:36:17] cloud-services-team, Toolforge, Documentation, Kubernetes: Figure out and document how to call the Kubernetes API as your tool user from inside a pod - https://phabricator.wikimedia.org/T321919#10941318 (Addshore) >>! In T321919#10939110, @dcaro wrote: > Can that be split from the cli? Yup. It's...
[09:45:35] (update) taavi: logging: Add values to deploy to toolsbeta [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/821 (https://phabricator.wikimedia.org/T386480)
[09:56:11] (approved) dcaro: logging: Add values to deploy to toolsbeta [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/821 (https://phabricator.wikimedia.org/T386480) (owner: taavi)
[09:56:45] (merge) taavi: logging: Add values to deploy to toolsbeta [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/821 (https://phabricator.wikimedia.org/T386480)
[09:57:13] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component logging
[09:57:21] !log taavi@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component logging
[09:57:55] Toolforge (Toolforge iteration 21): [infra] 2025-06-21 Several correlated potentially network issues during the night - https://phabricator.wikimedia.org/T397566#10941411 (dcaro) The issue moved to -9, it had a blip this morning triggering a page due to missing data..
[09:58:29] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component logging
[09:58:36] !log taavi@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component logging
[10:00:45] (open) taavi: logging: Fix path to get_secret.sh [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/823 (https://phabricator.wikimedia.org/T386480)
[10:00:48] (update) taavi: logging: Fix path to get_secret.sh [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/823 (https://phabricator.wikimedia.org/T386480)
[10:01:29] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component logging
[10:03:55] !log taavi@cloudcumin1001 toolsbeta END (ERROR) - Cookbook wmcs.toolforge.component.deploy (exit_code=97) for component logging
[10:04:18] (update) taavi: logging: Fix path to get_secret.sh [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/823 (https://phabricator.wikimedia.org/T386480)
[10:04:19] (update) taavi: logging: loki: Add missing emptyDir mounts in toolsbeta [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/824 (https://phabricator.wikimedia.org/T386480)
[10:04:21] (open) taavi: logging: loki: Add missing emptyDir mounts in toolsbeta [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/824 (https://phabricator.wikimedia.org/T386480)
[10:04:25] (update) taavi: logging: loki: Add missing emptyDir mounts in toolsbeta [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/824
(https://phabricator.wikimedia.org/T386480)
[10:04:29] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component logging
[10:04:42] !log taavi@cloudcumin1001 toolsbeta END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component logging
[10:10:38] (update) taavi: logging: loki: Add missing emptyDir mounts in toolsbeta [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/824 (https://phabricator.wikimedia.org/T386480)
[10:16:55] (update) taavi: logging: Fix path to get_secret.sh [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/823 (https://phabricator.wikimedia.org/T386480)
[10:16:55] (update) taavi: logging: loki: Add missing emptyDir mounts in toolsbeta [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/824 (https://phabricator.wikimedia.org/T386480)
[10:16:55] (open) taavi: logging: loki: Set nameOverride [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/825 (https://phabricator.wikimedia.org/T386480)
[10:16:56] (update) taavi: logging: loki: Set nameOverride [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/825 (https://phabricator.wikimedia.org/T386480)
[10:16:57] (open) taavi: logging: alloy: Fix loki write service name [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/826
[10:16:58] (update) taavi: logging: alloy: Fix loki write service name [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/826
[10:17:09] (update)
taavi: logging: loki: Set nameOverride [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/825 (https://phabricator.wikimedia.org/T386480)
[10:17:13] (update) taavi: logging: alloy: Fix loki write service name [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/826
[10:26:51] (update) taavi: logging: Fix path to get_secret.sh [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/823 (https://phabricator.wikimedia.org/T386480)
[10:26:51] (update) taavi: logging: loki: Add missing emptyDir mounts in toolsbeta [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/824 (https://phabricator.wikimedia.org/T386480)
[10:26:52] (update) taavi: logging: loki: Set nameOverride [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/825 (https://phabricator.wikimedia.org/T386480)
[10:26:53] (open) taavi: logging: loki: Add network policy rule for object storage access [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/827 (https://phabricator.wikimedia.org/T386480)
[10:26:54] (update) taavi: logging: loki: Add network policy rule for object storage access [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/827 (https://phabricator.wikimedia.org/T386480)
[10:26:55] (update) taavi: logging: alloy: Fix loki write service name [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/826 (https://phabricator.wikimedia.org/T386480)
[10:27:01] (update)
taavi: logging: loki: Add network policy rule for object storage access [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/827 (https://phabricator.wikimedia.org/T386480)
[10:27:04] (update) taavi: logging: alloy: Fix loki write service name [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/826 (https://phabricator.wikimedia.org/T386480)
[10:30:37] (update) taavi: logging: loki: Add network policy rule for object storage access [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/827 (https://phabricator.wikimedia.org/T386480)
[10:45:17] cloud-services-team: openstack: mirror cloudrabbit setup from eqiad1 to codfw1dev - https://phabricator.wikimedia.org/T377934#10941722 (Aklapper) Setting project tag to #cloud-services-team for reeval as this open task has not other //active// project tags otherwise
[10:49:27] cloud-services-team, Cloud-VPS: openstack: mirror cloudrabbit setup from eqiad1 to codfw1dev - https://phabricator.wikimedia.org/T377934#10941740 (fnegri) Thanks @Aklapper, adding #cloud-vps as well.
[10:51:33] cloud-services-team, Cloud-VPS: openstack: mirror cloudrabbit setup from eqiad1 to codfw1dev - https://phabricator.wikimedia.org/T377934#10941743 (fnegri) @Andrew is this actually completed? If yes, please resolve this task.
[10:53:56] cloud-services-team, Cloud-VPS: tofu-infra: implement some state backup mechanism - https://phabricator.wikimedia.org/T389964#10941746 (fnegri) a:aborrero→None
[10:53:57] cloud-services-team, Cloud-VPS, Epic: Cloud VPS: extend tofu-infra coverage - https://phabricator.wikimedia.org/T370037#10941747 (fnegri) a:aborrero→None
[10:53:58] cloud-services-team, Toolforge: lima-kilo: container image caching - https://phabricator.wikimedia.org/T362967#10941748 (fnegri) a:aborrero→None
[10:54:00] cloud-services-team, Cloud-VPS, Epic: tofu-infra: opentofu-created flavors may be disabled by default - https://phabricator.wikimedia.org/T391252#10941749 (fnegri) a:aborrero→None
[10:54:03] cloud-services-team, Toolforge: [k8s,kyverno]: explore change from per-namespace policy resource to a single ClusterPolicy resource - https://phabricator.wikimedia.org/T368135#10941750 (fnegri) a:aborrero→None
[11:07:13] (update) dcaro: components: add test for the generate feature [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/822
[11:09:36] (PS1) Arendpieter: Remove support for SUL 'realname' field.
[labs/striker] - https://gerrit.wikimedia.org/r/1163331 (https://phabricator.wikimedia.org/T384206)
[11:10:16] (update) dcaro: runtime: create runtime module to handle actions [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/88
[11:15:12] (update) dcaro: runtime: create runtime module to handle actions [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/88
[11:15:24] (update) dcaro: runtime: create runtime module to handle actions [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/88
[11:20:04] cloud-services-team, Striker: Use IDP for authentication in Striker - https://phabricator.wikimedia.org/T359554#10941815 (Arendpieter)
[11:21:26] (update) dcaro: runtime: create runtime module to handle actions [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/88
[11:22:08] (update) dcaro: config: add endpoint to generate sample config [repos/cloud/toolforge/components-api] (create_runtime) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/90 (https://phabricator.wikimedia.org/T394753)
[11:26:48] (update) dcaro: deploy: add all the missing options for continuous job [repos/cloud/toolforge/components-api] (generate_config) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/93 (https://phabricator.wikimedia.org/T395070)
[11:30:29] (update) dcaro: scheduled: add scheduled component support [repos/cloud/toolforge/components-api] (add_all_continuous_options) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/94 (https://phabricator.wikimedia.org/T395071)
[11:31:56] (open) taavi: Stop setting project ID when not needed
[repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/54
[11:32:00] (update) taavi: Stop setting project ID when not needed [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/54
[11:33:26] Tools: Improving the New-Q5 web application - https://phabricator.wikimedia.org/T337005#10941838 (Aklapper) Open→Resolved Closing per last comment
[11:33:45] (update) taavi: Stop setting project ID when not needed [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/54
[11:43:25] (CR) Majavah: "recheck" [labs/striker] - https://gerrit.wikimedia.org/r/1163331 (https://phabricator.wikimedia.org/T384206) (owner: Arendpieter)
[11:45:09] (approved) dcaro: Stop setting project ID when not needed [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/54 (owner: taavi)
[11:45:59] (update) taavi: Stop setting project ID when not needed [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/54
[11:46:05] (merge) taavi: Stop setting project ID when not needed [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/54
[11:53:31] cloud-services-team, Toolforge: [toolforge,infra] Cntralized logging for Toolforge infrastructure logs - https://phabricator.wikimedia.org/T97861#10941876 (taavi) a:taavi
[11:53:36] (open) taavi: logging: Deploy remaining Loki buckets [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/55 (https://phabricator.wikimedia.org/T386480 https://phabricator.wikimedia.org/T97861)
[11:53:39] (update) taavi: logging: Deploy remaining Loki buckets [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/55 (https://phabricator.wikimedia.org/T386480 https://phabricator.wikimedia.org/T97861)
[11:53:57] (update) taavi: logging: Deploy remaining Loki buckets [repos/cloud/toolforge/tofu-provisioning] - https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/55 (https://phabricator.wikimedia.org/T386480 https://phabricator.wikimedia.org/T97861)
[12:03:37] (merge) taavi: toolforge: Install real `become` from misctools [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/248
[12:03:51] cloud-services-team, Cloud-VPS, DC-Ops, ops-eqiad, SRE: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644#10941960 (Aklapper) a:Jclark-ctr→None @Jclark-ctr Removing task assignee as this open task has been assigned for more than two years...
[12:05:37] Tool-refill: Toolforge: refill doesn't work on Wikipedia language versions other than English - https://phabricator.wikimedia.org/T295327#10942015 (Aklapper) a:Curb_Safe_Charmer→None @Curb_Safe_Charmer Removing task assignee as this open task has been assigned for more than two years - See the email...
[12:06:23] cloud-services-team, Toolforge: Store state information for the disable tool process outside NFS - https://phabricator.wikimedia.org/T332514#10942040 (Aklapper) a:Andrew→None @Andrew Removing task assignee as this open task has been assigned for more than two years - See the email sent on 2025-05...
[12:08:02] cloud-services-team, Toolforge: [jobs-cli,jobs-api] make API and CLI key/values coherent - https://phabricator.wikimedia.org/T327280#10942087 (Aklapper) a:Raymond_Ndibe→None @Raymond_Ndibe Removing task assignee as this open task has been assigned for more than two years - See the email sent on 2...
[12:08:52] Toolforge (Toolforge iteration 21), Patch-For-Review: [components-api] Add all missing options for scheduled components - https://phabricator.wikimedia.org/T395071#10942107 (dcaro) Open→In progress
[12:08:58] Toolforge (Toolforge iteration 21): [components-api] Add support for scheduled components - https://phabricator.wikimedia.org/T395065#10942109 (dcaro) a:dcaro
[12:09:01] Toolforge (Toolforge iteration 21): [components-api] Add support for scheduled components - https://phabricator.wikimedia.org/T395065#10942111 (dcaro) Open→In progress
[12:11:25] (update) dcaro: scheduled: add scheduled component support [repos/cloud/toolforge/components-api] (add_all_continuous_options) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/94 (https://phabricator.wikimedia.org/T395071)
[12:11:39] (update) dcaro: components: add test for the generate feature [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/822
[12:20:20] (approved) taavi: components-api: deploy also on tools [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/785 (owner: dcaro)
[12:21:22] (update) dcaro: components-api: deploy also on tools [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/785
[12:21:52] VPS-Projects: Cleanup memberships of maps project - https://phabricator.wikimedia.org/T323412#10942174 (Aklapper) a:TheDJ→None @TheDJ: Removing task assignee as this
open task has been assigned for more than two years - See the email sent on 2025-05-22. Please assign this task to yourself again if yo...
[12:22:37] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component components-api
[12:22:41] !log dcaro@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component components-api
[12:22:57] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component components-api
[12:23:16] Tool-masto-collab: masto-collab: Support embedding Commons media - https://phabricator.wikimedia.org/T336121#10942232 (Aklapper) a:Legoktm→None @Legoktm: Removing task assignee as this open task has been assigned for more than two years - See the email sent on 2025-05-22. Please assign this task to y...
[12:23:26] (merge) dcaro: components-api: deploy also on tools [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/785
[12:25:07] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-api
[12:29:13] cloud-services-team, Toolforge: [components-api] Deployment token should not be a GET param - https://phabricator.wikimedia.org/T397712 (taavi) NEW
[12:32:20] cloud-services-team, Cloud-VPS, DC-Ops, ops-eqiad, SRE: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644#10942550 (Jclark-ctr) @Aklapper @ayounsi I hadn’t commented earlier because we needed to verify onsite that we still had enough available por...
[12:34:16] (open) dcaro: api-gateway: enable components-api [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/828
[12:38:40] (update) dcaro: api-gateway: enable components-api [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/828
[12:44:55] (approved) dcaro: api-gateway: enable components-api [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/828
[12:44:57] (merge) dcaro: api-gateway: enable components-api [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/828
[12:45:32] (open) ladsgroup: Update ES switchover script [toolforge-repos/switchmaster] - https://gitlab.wikimedia.org/toolforge-repos/switchmaster/-/merge_requests/11 (https://phabricator.wikimedia.org/T397628)
[12:45:47] (update) ladsgroup: Update ES switchover script [toolforge-repos/switchmaster] - https://gitlab.wikimedia.org/toolforge-repos/switchmaster/-/merge_requests/11 (https://phabricator.wikimedia.org/T397628)
[12:56:48] (open) dcaro: components-api: use internal api endpoint to talk to toolforge [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/829
[13:01:53] (approved) dcaro: components-api: use internal api endpoint to talk to toolforge [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/829
[13:01:55] (merge) dcaro: components-api: use internal api endpoint to talk to toolforge [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/829
[13:02:37] (open) dcaro: api-gateway: add components-api as superuser
[repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/830
[13:02:50] (approved) dcaro: api-gateway: add components-api as superuser [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/830
[13:02:51] (merge) dcaro: api-gateway: add components-api as superuser [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/830
[13:08:44] (open) dcaro: CI: add deployment to tools [toolforge-repos/sample-complex-app-backend] - https://gitlab.wikimedia.org/toolforge-repos/sample-complex-app-backend/-/merge_requests/5
[13:11:55] (open) dcaro: schemas: delete not needed type-ignore [toolforge-repos/sample-complex-app-backend] - https://gitlab.wikimedia.org/toolforge-repos/sample-complex-app-backend/-/merge_requests/6
[13:13:28] (approved) dcaro: schemas: delete not needed type-ignore [toolforge-repos/sample-complex-app-backend] - https://gitlab.wikimedia.org/toolforge-repos/sample-complex-app-backend/-/merge_requests/6
[13:13:30] (merge) dcaro: schemas: delete not needed type-ignore [toolforge-repos/sample-complex-app-backend] - https://gitlab.wikimedia.org/toolforge-repos/sample-complex-app-backend/-/merge_requests/6
[13:13:44] (update) dcaro: CI: add deployment to tools [toolforge-repos/sample-complex-app-backend] - https://gitlab.wikimedia.org/toolforge-repos/sample-complex-app-backend/-/merge_requests/5
[13:22:19] (approved) dcaro: CI: add deployment to tools [toolforge-repos/sample-complex-app-backend] - https://gitlab.wikimedia.org/toolforge-repos/sample-complex-app-backend/-/merge_requests/5
[13:22:25] (merge) dcaro: CI: add deployment to tools [toolforge-repos/sample-complex-app-backend] - https://gitlab.wikimedia.org/toolforge-repos/sample-complex-app-backend/-/merge_requests/5
[13:23:18]
Cloud-VPS (Quota-requests): Quota increase required - https://phabricator.wikimedia.org/T397716 (jnuche) NEW
[13:27:36] Toolforge (Toolforge iteration 21): [components-api] deploy on tools - https://phabricator.wikimedia.org/T394337#10942829 (dcaro) In progress→Resolved
[13:29:04] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 21): [components-cli] Deploy to tools - https://phabricator.wikimedia.org/T397718 (taavi) NEW
[13:29:24] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 21): [components-cli] Deploy to tools - https://phabricator.wikimedia.org/T397718#10942849 (taavi)
[13:29:29] Toolforge (Toolforge iteration 21): [components-api] deploy on tools - https://phabricator.wikimedia.org/T394337#10942850 (taavi)
[13:50:16] (update) dcaro: runtime: create runtime module to handle actions [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/88
[13:53:18] (update) fnegri: runtime: create runtime module to handle actions [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/88 (owner: dcaro)
[13:53:29] (approved) fnegri: runtime: create runtime module to handle actions [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/88 (owner: dcaro)
[13:53:41] (merge) dcaro: runtime: create runtime module to handle actions [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/88
[13:53:45] (update) dcaro: config: add endpoint to generate sample config [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/90 (https://phabricator.wikimedia.org/T394753)
[13:55:51] (update) dcaro: config: add endpoint to generate sample
config [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/90 (https://phabricator.wikimedia.org/T394753)
[13:56:02] (update) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: components-api: bump to 0.0.121-20250624135356-3eb4ef22 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/831
[13:56:05] (open) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: components-api: bump to 0.0.121-20250624135356-3eb4ef22 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/831
[13:59:22] FIRING: HAProxyBackendUnavailable: HAProxy service neutron-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[14:04:22] RESOLVED: HAProxyBackendUnavailable: HAProxy service neutron-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[14:15:07] Toolforge (Toolforge iteration 21), good first task: [components-cli] bash autocomplete does not autocomplete file name when creating config - https://phabricator.wikimedia.org/T395077#10943069 (Chuckonwumelu) a:Chuckonwumelu
[14:15:11] Toolforge (Toolforge iteration 21), good first task: [components-cli] bash autocomplete does not autocomplete file name when creating config - https://phabricator.wikimedia.org/T395077#10943071 (Chuckonwumelu) Open→In progress
[14:20:20] cloud-services-team, Toolforge: [components-api] Provide a standalone version of tool config schema - https://phabricator.wikimedia.org/T397724 (taavi) NEW
[14:22:43] cloud-services-team, Toolforge: [components-api] Provide a standalone
version of tool config schema - https://phabricator.wikimedia.org/T397724#10943102 (dcaro) [14:22:46] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 21), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, and 2 others: [Hypothesis] WE6.3.10 start a beta for the push-to-deploy features - https://phabricator.wikimedia.org/T393564#10943103 (dcaro) [14:38:59] Toolforge (Toolforge iteration 21): [infra] 2025-06-21 tools-prometheus-8 stopped responding for a bit - https://phabricator.wikimedia.org/T397563#10943132 (fnegri) Related: {T397566} [14:44:17] Toolforge (Toolforge iteration 21): [components-cli,toolforge-cli] add shortcuts to top-level cli for deploy/config - https://phabricator.wikimedia.org/T397725 (dcaro) NEW [14:51:52] cloud-services-team, Cloud-VPS: openstack: mirror cloudrabbit setup from eqiad1 to codfw1dev - https://phabricator.wikimedia.org/T377934#10943197 (Andrew) Open→Resolved a:Andrew This is done! [14:55:44] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 21), Patch-For-Review: [components-cli] Deploy to tools - https://phabricator.wikimedia.org/T397718#10943224 (taavi) Open→Resolved [14:55:57] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 21), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project, and 2 others: [Hypothesis] WE6.3.10 start a beta for the push-to-deploy features - https://phabricator.wikimedia.org/T393564#10943228 (taavi) [14:57:09] (update) dcaro: components-api: bump to 0.0.121-20250624135356-3eb4ef22 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/831 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [14:57:30] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component components-api [15:01:30] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook 
wmcs.toolforge.component.deploy (exit_code=0) for component components-api [15:02:21] !log komla@cloudcumin1001 mwoffliner START - Cookbook wmcs.openstack.quota_increase (T396840) [15:02:24] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component components-api [15:02:25] T396840: Increase RAM quota of mwoffliner project - https://phabricator.wikimedia.org/T396840 [15:02:28] !log komla@cloudcumin1001 mwoffliner END (PASS) - Cookbook wmcs.openstack.quota_increase (exit_code=0) (T396840) [15:03:20] Cloud-VPS (Quota-requests), affects-Kiwix-and-openZIM: Increase RAM quota of mwoffliner project - https://phabricator.wikimedia.org/T396840#10943291 (komla) This has been done: ` 100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'sudo -i wmcs-ope...-cloud novaadmin'. 100.0% (1/1) success... [15:03:22] Cloud-VPS (Quota-requests), affects-Kiwix-and-openZIM: Increase RAM quota of mwoffliner project - https://phabricator.wikimedia.org/T396840#10943292 (komla) Open→Resolved [15:05:10] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-api [15:05:44] (approved) dcaro: components-api: bump to 0.0.121-20250624135356-3eb4ef22 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/831 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [15:05:46] (merge) dcaro: components-api: bump to 0.0.121-20250624135356-3eb4ef22 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/831 (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [15:06:26] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-33 [15:10:05] (open) dcaro: components-api: enable deploy tests is tools [repos/cloud/toolforge/toolforge-deploy] - 
https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/832 [15:10:42] (approved) dcaro: components-api: enable deploy tests is tools [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/832 [15:10:44] (merge) dcaro: components-api: enable deploy tests is tools [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/832 [15:12:29] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-33 [15:14:04] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-33 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [15:15:52] Tool-centralnotice-banner-editor: Learn Vue - https://phabricator.wikimedia.org/T397729 (MHorsey-WMF) NEW [15:23:32] (update) dcaro: deploy: add all the missing options for continuous job [repos/cloud/toolforge/components-api] (generate_config) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/93 (https://phabricator.wikimedia.org/T395070) [15:24:10] Wikibugs: Wikibugs not reporting Phabricator activity to #wikimedia-zuul as hoped - https://phabricator.wikimedia.org/T396387#10943385 (bd808) Working now apparently? https://wm-bot.wmcloud.org/logs/%23wikimedia-zuul/20250620.txt `lang=irc [20:27] < wikibugs> Continuous-Integration-Infrastructure (Zuul upg... 
[15:29:21] (update) dcaro: scheduled: add scheduled component support [repos/cloud/toolforge/components-api] (add_all_continuous_options) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/94 (https://phabricator.wikimedia.org/T395071) [15:30:34] Toolforge (Toolforge iteration 21): [infra] 2025-06-21 tools-prometheus-8 stopped responding for a bit - https://phabricator.wikimedia.org/T397563#10943410 (fnegri) This happened a few times over the past two weeks, always on the active node (the active node was flipped from -8 to -9 yesterday): {F62445238} [15:40:32] (update) dcaro: build: fail if ref failed to resolve [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/96 [15:50:10] (update) dcaro: deploy_task: store error when build fails [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/92 [16:01:37] (approved) fnegri: config: add endpoint to generate sample config [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/90 (https://phabricator.wikimedia.org/T394753) (owner: dcaro) [16:07:11] (update) dcaro: config: add endpoint to generate sample config [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/90 (https://phabricator.wikimedia.org/T394753) [16:07:12] (update) dcaro: config: add endpoint to generate sample config [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/90 (https://phabricator.wikimedia.org/T394753) [16:08:57] (merge) dcaro: config: add endpoint to generate sample config [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/90 
(https://phabricator.wikimedia.org/T394753) [16:08:59] (update) dcaro: deploy: add all the missing options for continuous job [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/93 (https://phabricator.wikimedia.org/T395070) [16:11:17] (update) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: components-api: bump to 0.0.122-20250624160905-00d6b4c5 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/833 (https://phabricator.wikimedia.org/T394753) [16:11:21] (open) group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: components-api: bump to 0.0.122-20250624160905-00d6b4c5 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/833 (https://phabricator.wikimedia.org/T394753) [16:14:56] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component components-api [16:19:14] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-api [16:19:32] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component components-api [16:23:10] cloud-services-team, Cloud-VPS: Un-attachable volume in account-creation-assistance, 'app-www' - https://phabricator.wikimedia.org/T397517#10943702 (Andrew) This is looking like it might be upstream bug https://bugs.launchpad.net/ubuntu/+source/nova/+bug/2020111 aka https://bugs.launchpad.net/charm-nov... 
[16:23:53] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-api [16:25:42] (approved) dcaro: components-api: bump to 0.0.122-20250624160905-00d6b4c5 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/833 (https://phabricator.wikimedia.org/T394753) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [16:25:46] (merge) dcaro: components-api: bump to 0.0.122-20250624160905-00d6b4c5 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/833 (https://phabricator.wikimedia.org/T394753) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [17:00:33] Cloud-VPS (Quota-requests): Quota increase required for Catalyst - https://phabricator.wikimedia.org/T397716#10943949 (Aklapper) [17:01:49] (approved) fnegri: generate: add new subcommand [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/38 (owner: dcaro) [17:02:20] (update) dcaro: generate: add new subcommand [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/38 [19:04:10] cloud-services-team, Cloud-VPS: Un-attachable volume in account-creation-assistance, 'app-www' - https://phabricator.wikimedia.org/T397517#10944408 (Andrew) I've become convinced that this was caused by openstack sometimes failing and leaving an RBD lock that's subequently invisible to openstack. And, in... [19:34:18] Cloud-VPS (Project-requests): Request creation of lemmy VPS project - https://phabricator.wikimedia.org/T396948#10944507 (komla) @Gryllida any updates on this? 
[20:13:03] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for all services [20:13:50] FIRING: [44x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [20:17:56] FIRING: SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudcontrol1007 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [20:19:42] PROBLEM - Host cloudrabbit1002 is DOWN: PING CRITICAL - Packet loss = 100% [20:20:34] RECOVERY - Host cloudrabbit1002 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [20:25:31] FIRING: PuppetAgentStaleLastRun: Last Puppet run was over 24 hours ago on instance cvn-app10 in project cvn - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun [20:27:45] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment eqiad1 for all services [20:28:26] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for all services [20:31:47] FIRING: [3x] ProbeDown: Service api.svc.toolforge.org:443 has failed probes (http_api_svc_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [20:35:28] FIRING: [3x] InstanceDown: Project cvn instance cvn-apache11 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:35:28] FIRING: WidespreadInstanceDown: Widespread 
instances down in project cvn - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [20:35:28] FIRING: InstanceDown: Project tools instance tools-proxy-9 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:35:32] FIRING: InstanceDown: Project toolsbeta instance toolsbeta-proxy-8 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:35:35] FIRING: TargetDown: Job frontproxy-nginx is unreachable in project toolsbeta instance toolsbeta-proxy-8 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [20:37:10] FIRING: ProjectProxyMainProxyInstanceDown: Proxy on proxy-6 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/MainProxyInstanceDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyInstanceDown [20:38:22] FIRING: HAProxyBackendUnavailable: HAProxy service neutron-api_backend backend cloudcontrol1007.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [20:38:50] FIRING: [2x] NeutronAgentDown: Neutron neutron-l3-agent on cloudnet1006 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [20:39:28] FIRING: TargetDown: Job main-nginx is unreachable in project project-proxy instance proxy-6 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [20:40:28] FIRING: InstanceDown: Project project-proxy instance proxy-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:40:28] FIRING: [2x] InstanceDown: Project toolsbeta instance toolsbeta-prometheus-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:40:28] FIRING: [2x] InstanceDown: Project tools instance tools-prometheus-8 is 
down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:41:47] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [20:42:28] FIRING: InstanceDown: Project metricsinfra instance metricsinfra-grafana-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:44:45] FIRING: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown [20:44:45] FIRING: ToolforgeKubernetesHAproxyUnknown: Toolforge HAproxy has unknown state. HAproxy might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyUnknown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyUnknown [20:44:52] FIRING: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown [20:44:53] FIRING: ProbeDown: Service toolsbeta-static-2:80 has failed probes (http_toolsbeta_static_wmcloud_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#toolsbeta-static-2:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [20:45:10] FIRING: JobsEmailerDown: JobsEmailer is down - 
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/JobsEmailerDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DJobsEmailerDown [20:45:10] FIRING: HarborComponentDown: No data about Harbor components found. #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown [20:45:27] FIRING: EnvvarsApiDown: EnvvarsApi is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/EnvvarsApiDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DEnvvarsApiDown [20:45:28] FIRING: TektonDown: Tekton is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonDown [20:45:28] RESOLVED: InstanceDown: Project tools instance tools-proxy-10 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:45:28] RESOLVED: WidespreadInstanceDown: Widespread instances down in project cvn - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [20:45:31] FIRING: [3x] InstanceDown: Project cvn instance cvn-apache11 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:45:35] RESOLVED: InstanceDown: Project toolsbeta instance toolsbeta-proxy-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:45:39] RESOLVED: TargetDown: Job frontproxy-nginx is unreachable in project toolsbeta instance toolsbeta-proxy-7 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [20:45:43] FIRING: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown [20:45:50] FIRING: ToolforgeKubernetesNodeNotReady: Multiple Kubernetes nodes are not ready #page - 
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady [20:45:54] FIRING: BuildsApiDown: BuildsApi is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/BuildsApiDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DBuildsApiDown [20:45:58] FIRING: ComponentsApiDown: ComponentsApi is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ComponentsApiDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DComponentsApiDown [20:46:02] FIRING: EnvvarsAdmissionDown: EnvvarsAdmission is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/EnvvarsAdmissionDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DEnvvarsAdmissionDown [20:46:47] FIRING: [14x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [20:49:45] RESOLVED: ToolforgeKubernetesHAproxyUnknown: Toolforge HAproxy has unknown state. 
HAproxy might be down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyUnknown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyUnknown [20:49:45] RESOLVED: ToolforgeKubernetesHAproxyServerDown: Toolforge HAproxy server down: - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown - https://grafana.wmcloud.org/d/toolforge-k8s-haproxy/toolforge-k8s-haproxy?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesHAproxyServerDown [20:49:52] RESOLVED: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown [20:49:53] RESOLVED: [6x] ProbeDown: Service api.svc.beta.toolforge.org:443 has failed probes (http_api_svc_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [20:50:10] RESOLVED: JobsEmailerDown: JobsEmailer is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/JobsEmailerDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DJobsEmailerDown [20:50:10] RESOLVED: HarborComponentDown: No data about Harbor components found. 
#page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/HarborComponentDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DHarborComponentDown [20:50:27] RESOLVED: EnvvarsApiDown: EnvvarsApi is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/EnvvarsApiDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DEnvvarsApiDown [20:50:28] RESOLVED: TektonDown: Tekton is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/TektonDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTektonDown [20:50:28] RESOLVED: [3x] InstanceDown: Project cvn instance cvn-apache11 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:50:30] RESOLVED: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown [20:50:38] RESOLVED: ToolforgeKubernetesNodeNotReady: Multiple Kubernetes nodes are not ready #page - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady [20:50:42] RESOLVED: BuildsApiDown: BuildsApi is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/BuildsApiDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DBuildsApiDown [20:50:46] RESOLVED: ComponentsApiDown: ComponentsApi is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ComponentsApiDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DComponentsApiDown [20:50:51] RESOLVED: EnvvarsAdmissionDown: EnvvarsAdmission is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/EnvvarsAdmissionDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DEnvvarsAdmissionDown 
[20:50:51] !log andrew@cloudcumin1001 admin END (ERROR) - Cookbook wmcs.openstack.restart_openstack (exit_code=97) on deployment eqiad1 for all services [20:55:26] FIRING: SystemdUnitDown: The service unit rabbitmq_detect_partition.service is in failed status on host cloudrabbit1002. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudrabbit1002 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [20:56:47] FIRING: [18x] ProbeDown: Service api.svc.toolforge.org:443 has failed probes (http_api_svc_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [20:56:58] RESOLVED: InstanceDown: Project project-proxy instance proxy-6 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:58:40] FIRING: [2x] ProjectProxyMainProxyInstanceDown: Proxy on proxy-5 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/MainProxyInstanceDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyInstanceDown [20:59:08] PROBLEM - nova-compute proc minimum on cloudvirt1053 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:59:58] RESOLVED: InstanceDown: Project metricsinfra instance metricsinfra-grafana-2 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:59:58] FIRING: [4x] InstanceDown: Project project-proxy instance maps-proxy-5 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [20:59:58] FIRING: [3x] TargetDown: Job main-nginx is unreachable in project project-proxy instance proxy-6 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [21:00:08] RECOVERY - 
nova-compute proc minimum on cloudvirt1053 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:00:26] FIRING: [3x] SystemdUnitDown: The service unit prometheus-node-textfile-wmcs-dnsleaks.service is in failed status on host cloudcontrol1007. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [21:00:28] FIRING: WidespreadInstanceDown: Widespread instances down in project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [21:00:43] FIRING: [4x] InstanceDown: Project tools instance tools-legacy-redirector-3 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [21:00:59] FIRING: [2x] MetricsinfraAlertmanagerDown: Metricsinfra alertmanager is unreachable #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/MetricsinfraAlertmanagerDown - TODO - https://alerts.wikimedia.org/?q=alertname%3DMetricsinfraAlertmanagerDown [21:01:08] PROBLEM - nova-compute proc minimum on cloudvirt1063 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:08] PROBLEM - nova-compute proc minimum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:09] PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:09] PROBLEM - nova-compute proc minimum on cloudvirt1072 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* 
/usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:18] cloud-services-team: MetricsinfraAlertmanagerDown Metricsinfra alertmanager is unreachable # page - https://phabricator.wikimedia.org/T397782 (phaultfinder) NEW [21:01:18] PROBLEM - nova-compute proc minimum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:18] PROBLEM - nova-compute proc minimum on cloudvirt1069 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:20] PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:20] PROBLEM - nova-compute proc minimum on cloudvirt1071 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:20] PROBLEM - nova-compute proc minimum on cloudvirt1068 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:21] PROBLEM - nova-compute proc minimum on cloudvirt1074 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:24] PROBLEM - nova-compute proc minimum on cloudvirt1066 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:24] PROBLEM - nova-compute proc minimum on 
cloudvirt1064 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:34] PROBLEM - nova-compute proc minimum on cloudvirt1070 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:40] PROBLEM - nova-compute proc minimum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:41] PROBLEM - nova-compute proc minimum on cloudvirt1075 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:41] PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:41] PROBLEM - nova-compute proc minimum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:42] PROBLEM - nova-compute proc minimum on cloudvirt1076 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:43] PROBLEM - nova-compute proc minimum on cloudvirt1073 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:44] PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args 
^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:02:18] RECOVERY - nova-compute proc minimum on cloudvirt1042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:02:20] RECOVERY - nova-compute proc minimum on cloudvirt1074 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:03:20] PROBLEM - nova-compute proc minimum on cloudvirt1074 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:03:40] RESOLVED: [2x] ProjectProxyMainProxyInstanceDown: Proxy on proxy-5 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/MainProxyInstanceDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProjectProxyMainProxyInstanceDown [21:04:20] FIRING: [44x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [21:04:40] RECOVERY - nova-compute proc minimum on cloudvirt1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:04:58] RESOLVED: [3x] InstanceDown: Project project-proxy instance maps-proxy-5 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [21:04:58] RESOLVED: [5x] TargetDown: Job main-nginx is unreachable in project project-proxy instance proxy-5 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [21:05:08] !log 
andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for all services [21:05:08] RECOVERY - nova-compute proc minimum on cloudvirt1063 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:08] RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:09] RECOVERY - nova-compute proc minimum on cloudvirt1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:09] RECOVERY - nova-compute proc minimum on cloudvirt1072 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:18] RECOVERY - nova-compute proc minimum on cloudvirt1069 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:20] RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:20] RECOVERY - nova-compute proc minimum on cloudvirt1071 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:21] RECOVERY - nova-compute proc minimum on cloudvirt1068 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:21] RECOVERY - nova-compute proc minimum on cloudvirt1074 is OK: 
PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:24] RECOVERY - nova-compute proc minimum on cloudvirt1066 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:24] RECOVERY - nova-compute proc minimum on cloudvirt1064 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:26] FIRING: [2x] SystemdUnitDown: The service unit drain_rabbitmq_notification_error.service is in failed status on host cloudrabbit1001. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudrabbit1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [21:05:28] RESOLVED: WidespreadInstanceDown: Widespread instances down in project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown [21:05:34] RECOVERY - nova-compute proc minimum on cloudvirt1070 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:40] RECOVERY - nova-compute proc minimum on cloudvirt1075 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:41] RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:41] RECOVERY - nova-compute proc minimum on cloudvirt1048 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:41] RECOVERY - nova-compute proc minimum on cloudvirt1076 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:42] RECOVERY - nova-compute proc minimum on cloudvirt1073 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:43] RESOLVED: InstanceDown: Project tools instance tools-legacy-redirector-3 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [21:05:44] RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:06:47] RESOLVED: [8x] ProbeDown: Service tools-legacy-redirector-3:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-3:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [21:09:07] RESOLVED: [7x] HAProxyBackendUnavailable: HAProxy service glance-api_backend backend cloudcontrol1006.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [21:09:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [21:13:49] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on 
deployment eqiad1 for all services [21:17:02] FIRING: [9x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [21:17:29] RESOLVED: [2x] MetricsinfraAlertmanagerDown: Metricsinfra alertmanager is unreachable #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/MetricsinfraAlertmanagerDown - TODO - https://alerts.wikimedia.org/?q=alertname%3DMetricsinfraAlertmanagerDown [21:18:10] RESOLVED: [9x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [21:20:26] FIRING: [4x] SystemdUnitDown: The service unit drain_rabbitmq_notification_error.service is in failed status on host cloudrabbit1001. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [21:23:34] cloud-services-team, Cloud-VPS: Rabbitmq, neutron-openvswitch-agent, and network outages - https://phabricator.wikimedia.org/T397783 (Andrew) NEW [21:23:44] cloud-services-team, Cloud-VPS: Rabbitmq, neutron-openvswitch-agent, and network outages - https://phabricator.wikimedia.org/T397783#10945024 (Andrew) p:Triage→High [21:24:20] RESOLVED: [44x] NeutronAgentDown: Neutron neutron-openvswitch-agent on cloudnet1005 is down - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Networking_failures - https://grafana.wikimedia.org/d/wKnDJf97z/wmcs-neutron-eqiad1 - https://alerts.wikimedia.org/?q=alertname%3DNeutronAgentDown [21:25:26] RESOLVED: SystemdUnitDown: The service unit rabbitmq_detect_partition.service is in failed status on host cloudrabbit1003. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudrabbit1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [22:01:11] RESOLVED: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [22:06:27] cloud-services-team, Cloud-VPS: Rabbitmq, neutron-openvswitch-agent, and network outages - https://phabricator.wikimedia.org/T397783#10945164 (bd808) Something I noticed linked from https://wikitech.wikimedia.org/wiki/Incidents/2024-11-26_WMCS_network_problems when I searched Wikitech for notes on neutro...
[22:22:04] FIRING: [3x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-61 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [22:27:04] FIRING: [5x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [22:32:04] FIRING: [5x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [23:12:04] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-14 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [23:20:07] Cloud-VPS (Quota-requests): Pixel project "disk40" flavor, and perhaps a few more cores? 
- https://phabricator.wikimedia.org/T395837#10945343 (Mhurd) [23:20:30] Cloud-VPS (Quota-requests): Increase Pixel project disk quota to 160 GB - https://phabricator.wikimedia.org/T397266#10945345 (Mhurd) [23:29:41] cloud-services-team, Toolforge: [jobs-api] logs internal datetime error - https://phabricator.wikimedia.org/T362521#10945388 (derenrich) I think it's being caused by programs that print in weird ways (e.g. using terminal escapes). I understand the desire to not just blindly ignore these exceptions this b...