[00:05:41] FIRING: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [00:15:41] RESOLVED: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [03:05:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [03:15:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [04:22:42] (03update) 10raymond-ndibe: [components-api] skip build if refs are same [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/77 (https://phabricator.wikimedia.org/T389044) [04:41:40] 06cloud-services-team, 10Data-Services, 10Quarry: Quarry WMCloud (ruwiki_p, section s6) experiencing sustained replication lag (~16 h) - https://phabricator.wikimedia.org/T394859#10846437 (10Marostegui) >>! In T394859#10845998, @Voyagerim wrote: > @Marostegui , expectations regarding replication latency thre... 
[04:42:10] 06cloud-services-team, 10Data-Services, 10Quarry, 06Data-Persistence: Quarry WMCloud (ruwiki_p, section s6) experiencing sustained replication lag (~16 h) - https://phabricator.wikimedia.org/T394859#10846438 (10Marostegui) 05Open→03Resolved a:03Marostegui Lag is back to 0 [04:50:56] (03update) 10raymond-ndibe: [components-api] skip build if refs are same [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/77 (https://phabricator.wikimedia.org/T389044) [05:21:19] (03update) 10raymond-ndibe: [components-api] skip build if refs are same [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/77 (https://phabricator.wikimedia.org/T389044) [05:52:08] 10Tools: https://linksearch.toolforge.org/ leads to a 404 page - https://phabricator.wikimedia.org/T265381#10846606 (10Giftpflanze) 05Open→03Declined [06:00:25] (03update) 10raymond-ndibe: [components-api] skip build if refs are same [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/77 (https://phabricator.wikimedia.org/T389044) [07:27:35] (03merge) 10taavi: Migrate traffic to tools-proxy-9 [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/43 (https://phabricator.wikimedia.org/T211575) [07:27:36] (03update) 10taavi: Add AAAA record for *.toolforge.org [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/44 (https://phabricator.wikimedia.org/T211575) [07:30:33] (03merge) 10taavi: Add AAAA record for *.toolforge.org [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/44 (https://phabricator.wikimedia.org/T211575) [07:40:00] 10cloud-services-team 
(FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, 05Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10846817 (10dcaro) a:05dcaro→03None Wa [07:40:31] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, 05Goal: [ceph] Upgrade hosts to bullseye - https://phabricator.wikimedia.org/T309789#10846821 (10dcaro) a:03dcaro This is still ongoing work :/ [07:40:37] 06cloud-services-team, 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Unplanned: [ceph] Investigate if there's a way to degrade instead of failing when jumbo frames are being dropped in the network - https://phabricator.wikimedia.org/T329778#10846823 (10dcaro) a:05dcaro→03None [07:42:35] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Alert, 07Cloud-Services-Worktype-Maintenance: [cloudceph] Slow operations - tracking task - https://phabricator.wikimedia.org/T334240#10846828 (10dcaro) 05In progress→03Resolved I'll close this for now, it's been a whil... [07:44:30] 06cloud-services-team, 10Toolforge, 07IPv6: Enable IPv6 on toolforge.org - https://phabricator.wikimedia.org/T211575#10846832 (10taavi) 05Open→03Resolved There's an AAAA record that works for me, so optimistically closing. 
[07:54:14] 06cloud-services-team, 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Unplanned: [ceph] Enable disk failure prediciton - https://phabricator.wikimedia.org/T349694#10846843 (10dcaro) [07:54:16] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS, 05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance, 05Goal: [ceph] Upgrade to v16 - https://phabricator.wikimedia.org/T306820#10846844 (10dcaro) [07:57:10] 10Toolforge (Toolforge iteration 20), 13Patch-For-Review: [cicd] create cicd flow for non repo owners - https://phabricator.wikimedia.org/T394595#10846846 (10dcaro) p:05Triage→03Medium [07:58:04] 10Toolforge (Toolforge iteration 20): mypy x509 invalid syntax while running CI tests - https://phabricator.wikimedia.org/T394593#10846848 (10dcaro) 05Open→03Resolved a:03dcaro I'll close this as I don't see any more failures, if there's any feel free to reopen with more details. [07:58:43] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.vps.remove_instance for instance toolsbeta-prometheus-1 [07:58:49] !log taavi@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.vps.remove_instance (exit_code=1) for instance toolsbeta-prometheus-1 [07:59:27] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.vps.remove_instance for instance toolsbeta-prometheus-1 [08:00:15] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance toolsbeta-prometheus-1 [08:00:30] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.vps.remove_instance for instance toolsbeta-proxy-5 [08:00:54] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance toolsbeta-proxy-5 [08:01:11] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.vps.remove_instance for instance toolsbeta-proxy-6 [08:01:37] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance toolsbeta-proxy-6 [08:01:55] !log 
taavi@cloudcumin1001 tools START - Cookbook wmcs.vps.remove_instance for instance tools-proxy-7 [08:02:55] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance tools-proxy-7 [08:03:03] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.vps.remove_instance for instance tools-proxy-8 [08:04:03] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance tools-proxy-8 [08:04:31] FIRING: [2x] ProbeDown: Service toolsbeta-proxy-5:443 has failed probes (http_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [08:06:02] FIRING: ProbeDown: Service tools-proxy-7:443 has failed probes (http_toolforge_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-proxy-7:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [08:06:39] ^ just deleted instances, will clear soon, sorry [08:09:31] RESOLVED: [2x] ProbeDown: Service toolsbeta-proxy-5:443 has failed probes (http_beta_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [08:11:02] FIRING: [3x] ProbeDown: Service api.svc.toolforge.org:443 has failed probes (http_api_svc_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [08:16:02] RESOLVED: [3x] ProbeDown: Service api.svc.toolforge.org:443 has failed probes (http_api_svc_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [08:27:36] 10Toolforge (Toolforge iteration 20), 07good first task: [components-api] add `GET` endpoint `/v1/tool//deployments/latest` - https://phabricator.wikimedia.org/T394990 (10dcaro) 03NEW Thank you for tagging this task with #good_first_task for Wikimedia newcomers! Newcomers often may not be aware of... [08:27:37] 10Toolforge (Toolforge iteration 20), 07good first task: [components-api] add `GET` endpoint `/v1/tool//deployments/latest` - https://phabricator.wikimedia.org/T394990#10846966 (10dcaro) p:05Triage→03Medium [08:35:41] FIRING: CloudVPSDesignateLeaks: Detected 10 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [08:35:56] (03open) 10dcaro: readme: add link to packaging docs [repos/cloud/toolforge/components-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/32 [08:41:45] 10Toolforge (Toolforge iteration 20), 07good first task: [components-cli] make `toolforge components deployment show` show the latest deployment if no id passed - https://phabricator.wikimedia.org/T394994 (10dcaro) 03NEW Thank you for tagging this task with #good_first_task for Wikimedia newcomers! Newcomer... 
[08:42:02] 10Toolforge (Toolforge iteration 20), 07good first task: [components-api] add `GET` endpoint `/v1/tool//deployments/latest` - https://phabricator.wikimedia.org/T394990#10847062 (10dcaro) [08:42:03] 10Toolforge (Toolforge iteration 20), 07good first task: [components-cli] make `toolforge components deployment show` show the latest deployment if no id passed - https://phabricator.wikimedia.org/T394994#10847061 (10dcaro) [08:43:10] 10Toolforge (Toolforge iteration 20), 07good first task: [components-cli] make `toolforge components deployment show` show the latest deployment if no id passed - https://phabricator.wikimedia.org/T394994#10847069 (10dcaro) p:05Triage→03Medium [08:50:31] FIRING: ToolsNfsAlmostFull: Toolforge NFS is 0.8618727276438868/1 full - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNfsAlmostFull - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsNfsAlmostFull [08:58:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-37 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [09:02:44] 10Toolforge (Toolforge iteration 20): [infra] move toolsbeta to `toolsbeta.org` domain - https://phabricator.wikimedia.org/T394997 (10dcaro) 03NEW [09:02:51] 10Toolforge (Toolforge iteration 20): [infra] move toolsbeta to `toolsbeta.org` domain - https://phabricator.wikimedia.org/T394997#10847162 (10dcaro) p:05Triage→03Low [09:08:05] 10Toolforge (Toolforge iteration 20): [components-api] Add basic prometheus metrics - https://phabricator.wikimedia.org/T394276#10847196 (10dcaro) [09:08:10] 06cloud-services-team, 10Toolforge: 2025-05-22 Toolforge NFS cleanup - https://phabricator.wikimedia.org/T395000 (10taavi) 
03NEW [09:09:42] 06cloud-services-team, 10Toolforge: 2025-05-22 Toolforge NFS cleanup - https://phabricator.wikimedia.org/T395000#10847226 (10taavi) a:03taavi [09:10:30] (03PS4) 10Klausman: hiera: Add pseudosecrets for MT Thanos-Swift access [labs/private] - 10https://gerrit.wikimedia.org/r/1148855 [09:10:30] (03CR) 10Klausman: "While I am sure this (or rather it's actual-pricate-repo equivalent) is necessary, I am not sure it is sufficient to make the private file" [labs/private] - 10https://gerrit.wikimedia.org/r/1148855 (owner: 10Klausman) [09:23:12] 10Toolforge (Toolforge iteration 20): [components-api] Add alerts and runbooks for basic service health - https://phabricator.wikimedia.org/T394275#10847271 (10dcaro) [09:33:06] 10Tools: Cleanup disk space of bene tool - https://phabricator.wikimedia.org/T395004 (10taavi) 03NEW [09:35:32] 10Toolforge (Toolforge iteration 20): [functional-tests,builds-builder] create a test suite to run builds for all the sample tools we have - https://phabricator.wikimedia.org/T394927#10847324 (10dcaro) [09:35:39] 06cloud-services-team, 10Toolforge: IPCheck 500 Internal Server Error - https://phabricator.wikimedia.org/T395005 (10Jeff_G) 03NEW [09:36:01] 06cloud-services-team, 10Toolforge (Toolforge iteration 20): [infra] Toolforge bastion sssd/LDAP flakiness (May 2025) - https://phabricator.wikimedia.org/T393732#10847337 (10dcaro) [09:36:16] 10cloud-services-team (FY2024/2025-Q3-Q4), 10Cloud-VPS (Debian Buster Deprecation), 10Toolforge (Toolforge iteration 20), 07Epic, 05Goal: [infra] Toolforge: migrate to Debian Bullseye or later - https://phabricator.wikimedia.org/T311897#10847338 (10dcaro) [09:36:37] 10Toolforge (Toolforge iteration 20), 13Patch-For-Review: [jobs-api] check for diff in services when running diff_with_running_job - https://phabricator.wikimedia.org/T392717#10847340 (10dcaro) [09:36:58] 06cloud-services-team, 10Toolforge (Toolforge iteration 20): [builds-builder] Upgrade python buildpack to v0.17.0 or newer for 
Poetry support - https://phabricator.wikimedia.org/T374056#10847341 (10dcaro) [09:37:27] 10Tools: cluebotng-staging tool uses ~560G of disk space - https://phabricator.wikimedia.org/T395006 (10taavi) 03NEW [09:37:48] 06cloud-services-team, 10Toolforge: [components-api] optionally log deployments to SAL automatically - https://phabricator.wikimedia.org/T393169#10847352 (10dcaro) [09:37:56] 06cloud-services-team, 10Toolforge, 07IPv6: [infra] Enable IPv6 for Toolforge mail server - https://phabricator.wikimedia.org/T392511#10847354 (10dcaro) [09:38:07] 06cloud-services-team, 10Toolforge: [infra] Toolforge: migrate to Debian Bookworm or later - https://phabricator.wikimedia.org/T387005#10847355 (10dcaro) [09:38:18] 10Tools: IPCheck 500 Internal Server Error - https://phabricator.wikimedia.org/T395005#10847357 (10taavi) https://toolsadmin.wikimedia.org/tools/id/ipcheck does not list an issue tracker URL, so cc:ing maintainers @MusikAnimal @SQL here. [09:44:09] 10Tools: cluebotng-staging tool uses ~560G of disk space - https://phabricator.wikimedia.org/T395006#10847374 (10taavi) [09:55:48] 10Tools: ebraminio-dev tool uses an unreasonable amount of disk space - https://phabricator.wikimedia.org/T395007 (10taavi) 03NEW [10:20:01] 10Tools, 06Commons: A Python version of basic pattypan functionality - https://phabricator.wikimedia.org/T337127#10847466 (10TiagoLubiana) This was started, and the very basic functionality works: https://github.com/lubianat/pythonpan Realistically, it is unlikely to be worked on more in the future. 
[10:20:22] 10Tools, 06Commons: A Python version of basic pattypan functionality - https://phabricator.wikimedia.org/T337127#10847467 (10TiagoLubiana) 05Open→03Resolved [10:37:14] (03CR) 10Elukey: hiera: Add pseudosecrets for MT Thanos-Swift access (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/1148855 (owner: 10Klausman) [10:56:27] 10Tools: Improving the New-Q5 web application - https://phabricator.wikimedia.org/T337005#10847588 (101Veertje) [10:59:30] 10Tools: Improving the New-Q5 web application - https://phabricator.wikimedia.org/T337005#10847590 (101Veertje) At the hackathon 2025 I improved on the part that mages an individual's family name to a Q-item. It should now also support family names with affixes, as is common in Dutch names. Last year at the Hac... [11:12:33] 06cloud-services-team, 10Data-Services, 06DBA: Remove sanitarium hosts from codfw - https://phabricator.wikimedia.org/T394884#10847644 (10Marostegui) @Ladsgroup one of these hosts will go to x1 [11:19:11] 06cloud-services-team, 10Data-Services, 06DBA: Remove sanitarium hosts from codfw - https://phabricator.wikimedia.org/T394884#10847674 (10Ladsgroup) >>! In T394884#10847644, @Marostegui wrote: > @Ladsgroup one of these hosts will go to x1 🥳 [12:11:48] 10Tools: zoomviewer uses an unreasonable amount of disk space - https://phabricator.wikimedia.org/T395020 (10taavi) 03NEW [12:19:10] 06cloud-services-team, 10Data-Services, 06DBA: Remove sanitarium hosts from codfw - https://phabricator.wikimedia.org/T394884#10847884 (10FCeratto-WMF) I update the script, can I run a restart on db2186 now? 
[12:23:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-37 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [12:33:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-37 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [12:35:41] FIRING: CloudVPSDesignateLeaks: Detected 10 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [12:42:27] (03PS5) 10Klausman: hiera: Add pseudosecrets for MT Thanos-Swift access [labs/private] - 10https://gerrit.wikimedia.org/r/1148855 [12:42:45] (03CR) 10Klausman: hiera: Add pseudosecrets for MT Thanos-Swift access (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/1148855 (owner: 10Klausman) [12:43:36] 06cloud-services-team, 10Data-Services, 06DBA: Remove sanitarium hosts from codfw - https://phabricator.wikimedia.org/T394884#10847957 (10Marostegui) Yes, as I mentioned it at: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1131954/10#message-851448cc8613956395eb3370100c2f332c45c508 [12:44:09] 06cloud-services-team, 10Data-Services, 06DBA: Remove sanitarium hosts from codfw - https://phabricator.wikimedia.org/T394884#10847961 (10Marostegui) @FCeratto-WMF let's 
not hijack this task - let's continue the testing conversation on the gerrit. This task isn't about testing that cookbook. [12:58:08] (03CR) 10Elukey: [C:03+1] hiera: Add pseudosecrets for MT Thanos-Swift access [labs/private] - 10https://gerrit.wikimedia.org/r/1148855 (owner: 10Klausman) [13:04:32] 10Tool-campwiz-nxt: Implement Reverse proxy and Failover server into campwiz nxt - https://phabricator.wikimedia.org/T394730#10848020 (10dcaro) I did a couple tests, this from my laptop (accross the internet): ` * Response time: 0.300849s 03:02 PM ~ dcaro@acme$ curl -kv -w '\n* Response time: %{time_total}s\n' h... [13:09:43] (03CR) 10Klausman: [V:03+2 C:03+2] hiera: Add pseudosecrets for MT Thanos-Swift access [labs/private] - 10https://gerrit.wikimedia.org/r/1148855 (owner: 10Klausman) [13:26:16] 10Tool-campwiz-nxt: Implement Reverse proxy and Failover server into campwiz nxt - https://phabricator.wikimedia.org/T394730#10848113 (10dcaro) On the other side, the current deployment takes a bit more than 3s to load the main page: {F60378653} and on toolforge it takes around 0.5s less actually: {F60378659} [13:37:15] 10Toolforge (Toolforge iteration 20): [components-api] Add endpoint to get what would be the "current" config - https://phabricator.wikimedia.org/T394753#10848170 (10dcaro) [13:40:05] (03update) 10raymond-ndibe: [components-api] skip build if refs are same [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/77 (https://phabricator.wikimedia.org/T389044) [13:40:43] 10Toolforge (Toolforge iteration 20): [components-api] Add endpoint to get what would be the "current" config - https://phabricator.wikimedia.org/T394753#10848186 (10dcaro) [13:44:49] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-codfw, 06SRE, 13Patch-For-Review: Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10848217 (10Andrew) a:05Andrew→03None [13:45:28] 06cloud-services-team, 
06DC-Ops, 10decommission-hardware, 10ops-eqiad, 06SRE: decommission cloudvirt103[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T394727#10848224 (10Andrew) a:05Andrew→03None [13:50:49] 10Toolforge (Toolforge iteration 20): [builds-api] populate the `image_name` for the builds returned - https://phabricator.wikimedia.org/T395035 (10dcaro) 03NEW [13:57:47] 10Tool-campwiz-nxt: Implement Reverse proxy and Failover server into campwiz nxt - https://phabricator.wikimedia.org/T394730#10848302 (10Nokib_Sarkar) @dcaro Most of the API are behind the login wall. The only public API might be `/api/v2/campaign/`. The frontend is tested manually. Upon checking... [13:58:33] 06cloud-services-team, 10Cloud-VPS, 07IPv6, 13Patch-For-Review: IPv6 for cloud-realm services - https://phabricator.wikimedia.org/T379282#10848313 (10taavi) >>! In T379282#10768594, @cmooney wrote: >>>! In T379282#10767585, @taavi wrote: >> in order to ensure those hosts don't use their cloud-hosts v6 con... [13:58:38] 10Tools: zoomviewer uses an unreasonable amount of disk space - https://phabricator.wikimedia.org/T395020#10848314 (10dschwen) I'll get on it. On a philosophical level I wonder what "reasonable" is. I agree that 1.2T is a lot, but it is by nature a storage hungry tool that processes many images on Commons, and I... [14:01:39] !log test [14:01:40] dcaro: Missing project or message? Expected !log [14:05:26] 06cloud-services-team, 10Toolforge: [components-api] optionally log deployments to SAL automatically - https://phabricator.wikimedia.org/T393169#10848354 (10dcaro) We sholud probably use the #wikimedia-cloud-feed channel for this I think we have connectivity to get the sal message sent directly from the compo... 
[14:05:46] 06cloud-services-team, 10Toolforge (Toolforge iteration 20): [components-api] optionally log deployments to SAL automatically - https://phabricator.wikimedia.org/T393169#10848362 (10dcaro) [14:26:07] 06cloud-services-team, 10Data-Services, 06DBA, 10Wikidata, and 2 others: Set up x3 replication to wikireplicas - https://phabricator.wikimedia.org/T390954#10848494 (10taavi) [14:27:22] 06cloud-services-team, 10Cloud-VPS, 07IPv6, 13Patch-For-Review: IPv6 for cloud-realm services - https://phabricator.wikimedia.org/T379282#10848500 (10cmooney) >>! In T379282#10848313, @taavi wrote: > Yes. The `profile::wmcs::cloud_private_subnet` profile looks up the cloud-private IP addresses of that host... [14:33:42] 06cloud-services-team, 10Cloud-VPS, 07IPv6, 13Patch-For-Review: IPv6 for cloud-realm services - https://phabricator.wikimedia.org/T379282#10848561 (10taavi) >>! In T379282#10848500, @cmooney wrote: >>>! In T379282#10848313, @taavi wrote: >> Yes. The `profile::wmcs::cloud_private_subnet` profile looks up th... [14:35:31] 10Tools: cluebotng-staging tool uses ~560G of disk space - https://phabricator.wikimedia.org/T395006#10848568 (10DamianZaremba) Post https://github.com/cluebotng/botng/commit/9683902f88341a0c4bf60ac3bbd6089d9724ae9f the logs are rotated daily. They are quite large aa currently the code is running with trace lev... [14:37:12] 10Tools: zoomviewer uses an unreasonable amount of disk space - https://phabricator.wikimedia.org/T395020#10848581 (10Andrew) >>! In T395020#10848314, @dschwen wrote: > I'll get on it. On a philosophical level I wonder what "reasonable" is. I agree that 1.2T is a lot, but it is by nature a storage hungry tool th... [14:37:51] 10Tools: cluebotng-staging tool uses ~560G of disk space - https://phabricator.wikimedia.org/T395006#10848582 (10taavi) Alright, I've cleared all logs from March and April. Leaving this task open until you can implement some automatic cleanup, since this will come back to alert us otherwise. 
Thank you for the qu... [14:39:53] 06cloud-services-team, 10Cloud-VPS: new, frequent DNS record leaks in wmcs - https://phabricator.wikimedia.org/T395037 (10Andrew) 03NEW [14:43:10] 06cloud-services-team, 10Cloud-VPS, 07IPv6, 13Patch-For-Review: IPv6 for cloud-realm services - https://phabricator.wikimedia.org/T379282#10848650 (10cmooney) >>! In T379282#10848561, @taavi wrote: > Probably not. We have a bit of a pattern for provisioning firewall rules based on IP addresses looked up on... [14:47:01] (03update) 10raymond-ndibe: [components-api] skip build if refs are same [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/77 (https://phabricator.wikimedia.org/T389044) [14:48:30] 10Toolforge (Toolforge iteration 20): [components-api,components-cli] add `deploy cancel` feature - https://phabricator.wikimedia.org/T395039 (10dcaro) 03NEW [14:49:16] 10Toolforge (Toolforge iteration 20): [components-api,components-cli] add `deploy cancel` feature - https://phabricator.wikimedia.org/T395039#10848702 (10dcaro) p:05Triage→03Medium [14:49:46] 10Toolforge (Toolforge iteration 20): [components-api,components-cli] add `deploy cancel` feature - https://phabricator.wikimedia.org/T395039#10848704 (10Raymond_Ndibe) [14:50:52] 10Toolforge (Toolforge iteration 20): [components-api,components-cli] add `deploy cancel` feature - https://phabricator.wikimedia.org/T395039#10848712 (10dcaro) This is part of the extended beta. [14:52:23] 06cloud-services-team, 10Cloud-VPS: new, frequent DNS record leaks in wmcs - https://phabricator.wikimedia.org/T395037#10848716 (10Andrew) Today: ` 'PTR fdafd939-cc0f-43d5-b960-003649366d22 is linked to missing instance toolsbeta-prometheus-1.toolsbeta.eqiad1.wikimedia.cloud.' 'PTR 931020d7-4a66-4fcf-8fac-0... 
[14:54:35] (03open) 10taavi: Remove toolsbeta-prometheus-1 volume [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/45 [14:54:40] (03update) 10taavi: Remove toolsbeta-prometheus-1 volume [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/45 [14:56:22] 10Tools: https://linksearch.toolforge.org/ leads to a 404 page - https://phabricator.wikimedia.org/T265381#10848719 (10Superyetkin) 05Declined→03Open The tool is still down. [14:58:14] (03update) 10taavi: Remove toolsbeta-prometheus-1 volume [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/45 [15:04:50] 10Tool-bridgebot: Test Matrix to Telegram bridge - https://phabricator.wikimedia.org/T337136#10848817 (10Tgr) a:05Tgr→03None We did the testing for t2bot.io (theoretically mautrix-telegram, but one thing we found out was that, at least at the time, it was using a quite old fork of that software). The notes a... [15:05:26] 10Tools: https://linksearch.toolforge.org/ leads to a 404 page - https://phabricator.wikimedia.org/T265381#10848824 (10Giftpflanze) 05Open→03Declined I was asked to remove myself as assignee/reassign myself/close this task as declined if it shouldn't be worked on and that's what I did. 
[15:12:43] 10Tool-bridgebot, 10Education-Program-Dashboard: Should ignore wikibugs messages on #wikimedia-ed - https://phabricator.wikimedia.org/T330341#10848861 (10bd808) a:05bd808→03None [15:16:23] (03update) 10chuckonwumelu: Temporary: For demo purposes only [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/28 [15:19:24] 10Data-Services, 06Data-Engineering: Create a view for existencelinks table - https://phabricator.wikimedia.org/T394898#10848921 (10taavi) [15:23:01] (03merge) 10chuckonwumelu: Temporary: For demo purposes only [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/28 [15:36:12] 06cloud-services-team, 10Toolforge: Install mariadb-dump on Toolforge bastions - https://phabricator.wikimedia.org/T378882#10849072 (10Nokib_Sarkar) A workaround might be `toolforge jobs run --command "umask o-r; ( mariadb-dump --defaults-file=~/replica.my.cnf --host=tools-readonly.db.svc.wikimedia.cloud crede... [15:41:37] 06cloud-services-team, 10Toolforge, 06WMF-Legal: Is Using sentry for error monitoring against wikimedia cloud privacy policy? - https://phabricator.wikimedia.org/T394577#10849088 (10Andrew) We discussed this in our weekly cloud services meeting. As far as we're concerned everything you're doing is just fine.... [15:42:09] 06cloud-services-team, 10Toolforge, 06WMF-Legal: Is Using sentry for error monitoring against wikimedia cloud privacy policy? 
- https://phabricator.wikimedia.org/T394577#10849091 (10Andrew) 05Stalled→03In progress p:05Triage→03Medium [15:50:56] 10Tool-campwiz-nxt: Migration of CampWiz NXT to toolforge - https://phabricator.wikimedia.org/T394515#10849148 (10Nokib_Sarkar) [15:55:07] (03open) 10chuckonwumelu: Revert "Temporary: For demo purposes only" [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/46 [15:55:08] (03update) 10chuckonwumelu: Revert "Temporary: For demo purposes only" [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/46 [16:02:48] 10Tools: IPCheck 500 Internal Server Error - https://phabricator.wikimedia.org/T395005#10849206 (10MusikAnimal) 05Open→03Resolved a:03MusikAnimal Restarting the webservice seemed to work. I don't see any indication of what actually went wrong. `error.log` has: ` 2025-05-22 15:49:10: http-header-glue.c... [16:16:46] (03PS2) 10Krinkle: Build: Update build system [labs/countervandalism/CVNBot] - 10https://gerrit.wikimedia.org/r/1143806 (https://phabricator.wikimedia.org/T395036) (owner: 10Slyngshede) [16:17:19] (03CR) 10CI reject: [V:04-1] Build: Update build system [labs/countervandalism/CVNBot] - 10https://gerrit.wikimedia.org/r/1143806 (https://phabricator.wikimedia.org/T395036) (owner: 10Slyngshede) [16:17:44] 10Tool-campwiz-nxt: Implement Reverse proxy and Failover server into campwiz nxt - https://phabricator.wikimedia.org/T394730#10849266 (10dcaro) Noting here to not forget, it seems that the issue (at least one of them xd) is that it's loading full images from commons instead of just getting the thumbnails. {F603... 
[16:20:04] (03update) 10dcaro: readme: add link to packaging docs [repos/cloud/toolforge/components-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/32 [16:27:46] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate False, for hosts list: ['cloudvirt1041'] [16:28:13] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate False, for hosts list: ['cloudvirt1041'] [16:28:46] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate True, for hosts list: ['cloudvirt1041'] [16:29:13] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate True, for hosts list: ['cloudvirt1041'] [16:29:20] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE: Q4:rack/setup/install cloudcephosd10[48-51] & relocate cloudcephosd1039 - https://phabricator.wikimedia.org/T394333#10849301 (10dcaro) For easy reading :), here's the racking info [from the parent task](https://phabricator.wikimedia.org/T38985... [16:35:41] FIRING: CloudVPSDesignateLeaks: Detected 10 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [16:41:52] (03merge) 10chuckonwumelu: Revert "Temporary: For demo purposes only" [repos/cloud/toolforge/tofu-provisioning] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/46 [16:52:15] 10Tools: https://linksearch.toolforge.org/ leads to a 404 page - https://phabricator.wikimedia.org/T265381#10849425 (10Superyetkin) 05Declined→03Open May I claim this? 
Could you please share the source code? [16:52:30] 10Tools: https://linksearch.toolforge.org/ leads to a 404 page - https://phabricator.wikimedia.org/T265381#10849427 (10Superyetkin) a:05Giftpflanze→03None [17:13:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-37 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [17:30:47] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate True, for hosts list: ['cloudvirt1040'] [17:31:20] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate True, for hosts list: ['cloudvirt1040'] [17:39:20] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate True, for hosts list: ['cloudvirt1042'] [17:39:52] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate True, for hosts list: ['cloudvirt1042'] [17:40:46] 10Toolforge (Toolforge iteration 20): [components-api] Add support for scheduled components - https://phabricator.wikimedia.org/T395065 (10dcaro) 03NEW [17:50:43] (03open) 10chuckonwumelu: [api] Adding warning message for beta [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/78 (https://phabricator.wikimedia.org/T394277) [18:03:18] 10Toolforge (Toolforge iteration 20): [components-api] Add resource limits options to all component types - https://phabricator.wikimedia.org/T395068 (10dcaro) 03NEW 
[18:06:34] 10Toolforge (Toolforge iteration 20): [components-api] add all the missing options for continuous components - https://phabricator.wikimedia.org/T395070 (10dcaro) 03NEW [18:08:27] 06cloud-services-team, 10Cloud-VPS: new, frequent DNS record leaks in wmcs - https://phabricator.wikimedia.org/T395037#10849925 (10Andrew) I believe the issue is this bit of code in /usr/lib/python3/dist-packages/designate/notification_handler/base.py: ` criterion.update({ 'managed': True... [18:09:05] 10Toolforge (Toolforge iteration 20): [components-api] Add all missing options for scheduled components - https://phabricator.wikimedia.org/T395071 (10dcaro) 03NEW [18:09:44] 10Toolforge (Toolforge iteration 20): [components-api] Add resource limits options to all component types - https://phabricator.wikimedia.org/T395068#10849945 (10dcaro) 05Open→03Invalid split in two per-component including all the other options. [18:10:09] 10Cloud Services Proposals, 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 20), 05Cloud-Services-Origin-Team, and 3 others: [Hypothesis] WE6.3.10 start a beta for the push-to-deploy features - https://phabricator.wikimedia.org/T393564#10849947 (10dcaro) [18:11:03] 06cloud-services-team, 10Cloud-VPS: new, frequent DNS record leaks in wmcs - https://phabricator.wikimedia.org/T395037#10849953 (10Andrew) ` [designate]> update records set managed_plugin_name='nova_fixed' where managed_plugin_name='nova_fixed_multi'; Query OK, 1063 rows affected (0.042 sec) Rows matched: 1063... 
[18:11:51] FIRING: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [18:13:07] PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 324 bytes in 60.006 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [18:13:21] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate True, for hosts list: ['cloudvirt1043'] [18:13:54] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate True, for hosts list: ['cloudvirt1043'] [18:14:12] 10Tools: IPCheck 500 Internal Server Error - https://phabricator.wikimedia.org/T395005#10849972 (10taavi) Honestly, not sure. That might be from some requests getting stuck and never freeing up the worker processes. Migrating to https://wikitech.wikimedia.org/wiki/Help:Toolforge/Building_container_images whi... 
[18:14:23] RECOVERY - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 13.126 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [18:15:41] RESOLVED: CloudVPSDesignateLeaks: Detected 6 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [18:16:00] 06cloud-services-team, 10Cloud-VPS: new, frequent DNS record leaks in wmcs - https://phabricator.wikimedia.org/T395037#10849977 (10Andrew) That does not appear to have helped [18:16:51] FIRING: [3x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [18:19:53] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate True, for hosts list: ['cloudvirt1044'] [18:20:26] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate True, for hosts list: ['cloudvirt1044'] [18:20:34] 10Quarry: Add "wikishared" database to Quarry and to wiki replicas - https://phabricator.wikimedia.org/T395072 (10Novem_Linguae) 03NEW [18:21:51] RESOLVED: [3x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [18:27:44] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate True, 
for hosts list: ['cloudvirt1045'] [18:28:16] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate True, for hosts list: ['cloudvirt1045'] [18:38:23] 06cloud-services-team, 10Data-Services, 10Quarry: Add "wikishared" database to wiki replicas - https://phabricator.wikimedia.org/T395072#10850145 (10JJMC89) [18:46:14] 10Toolforge (Toolforge iteration 20): [components-api] Add support for scheduled components - https://phabricator.wikimedia.org/T395065#10850181 (10dcaro) p:05Triage→03Medium [18:46:44] 10Toolforge (Toolforge iteration 20): [components-api] add all the missing options for continuous components - https://phabricator.wikimedia.org/T395070#10850187 (10dcaro) p:05Triage→03Medium [18:46:48] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate True, for hosts list: ['cloudvirt1046'] [18:47:00] 10Toolforge (Toolforge iteration 20): [components-api] Add all missing options for scheduled components - https://phabricator.wikimedia.org/T395071#10850188 (10dcaro) p:05Triage→03Medium [18:47:19] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate True, for hosts list: ['cloudvirt1046'] [18:49:11] (03update) 10raymond-ndibe: [components-api] skip build if refs are same [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/77 (https://phabricator.wikimedia.org/T389044) [18:50:44] 06cloud-services-team, 10Cloud-VPS: new, frequent DNS record leaks in wmcs - https://phabricator.wikimedia.org/T395037#10850195 (10Andrew) >>! In T395037#10849977, @Andrew wrote: > That does not appear to have helped No, it worked for the forward records but not for the pointers, nova_fixed doesn't know about. 
[18:51:44] 10Toolforge (Toolforge iteration 20), 07good first task: [builds-api] populate the `image_name` for the builds returned - https://phabricator.wikimedia.org/T395035#10850199 (10dcaro) p:05Triage→03High Thank you for tagging this task with #good_first_task for Wikimedia newcomers! Newcomers often may not be... [18:51:48] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api [18:53:22] 06cloud-services-team, 10Cloud-VPS: new, frequent DNS record leaks in wmcs - https://phabricator.wikimedia.org/T395037#10850204 (10Andrew) ` [designate]> update records set managed_plugin_name='wmcs_nova_fixed_ptr' where managed_plugin_name='nova_fixed' and zone_id='6990e13949e6466c942146cf45f05842'; Query OK,... [18:53:32] (03update) 10raymond-ndibe: [components-api] support health-checks and port [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/75 (https://phabricator.wikimedia.org/T362072) [18:54:20] !log andrew@cloudcumin1001 cloudvirt-canary START - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary on eqiad1, with recreate True, for hosts list: ['cloudvirt1047'] [18:54:25] 10Toolforge (Toolforge iteration 20), 07good first task: [components-api] use the `build.prams.image_name` to compare with the `component` - https://phabricator.wikimedia.org/T395076 (10dcaro) 03NEW Thank you for tagging this task with #good_first_task for Wikimedia newcomers! Newcomers often may not be awa... 
[18:54:35] 10Toolforge (Toolforge iteration 20), 07good first task: [components-api] use the `build.prams.image_name` to compare with the `component` - https://phabricator.wikimedia.org/T395076#10850220 (10dcaro) p:05Triage→03Medium [18:54:48] !log andrew@cloudcumin1001 cloudvirt-canary END (PASS) - Cookbook wmcs.openstack.cloudvirt.lib.ensure_canary (exit_code=0) on eqiad1, with recreate True, for hosts list: ['cloudvirt1047'] [18:55:05] 10Toolforge (Toolforge iteration 20), 07good first task: [components-api] use the `build.prams.image_name` to compare with the `component` - https://phabricator.wikimedia.org/T395076#10850222 (10dcaro) [18:55:05] 10Toolforge (Toolforge iteration 20), 07good first task: [builds-api] populate the `image_name` for the builds returned - https://phabricator.wikimedia.org/T395035#10850223 (10dcaro) [18:57:06] 10Toolforge (Toolforge iteration 20), 07good first task: [components-cli] bash autocomplete does not autocomplete file name when creating config - https://phabricator.wikimedia.org/T395077 (10dcaro) 03NEW Thank you for tagging this task with #good_first_task for Wikimedia newcomers! Newcomers often may not... 
[18:57:13] 10Toolforge (Toolforge iteration 20), 07good first task: [components-cli] bash autocomplete does not autocomplete file name when creating config - https://phabricator.wikimedia.org/T395077#10850241 (10dcaro) p:05Triage→03Medium [18:57:20] 10Toolforge (Toolforge iteration 20): [components-api] Add resource limits options to all component types - https://phabricator.wikimedia.org/T395068#10850243 (10dcaro) 05Invalid→03Resolved [18:57:29] 10Toolforge (Toolforge iteration 20): [components-api] Add resource limits options to all component types - https://phabricator.wikimedia.org/T395068#10850245 (10dcaro) 05Resolved→03Invalid [19:00:46] (03approved) 10dcaro: [components.models.api_models] add config version check [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/74 (https://phabricator.wikimedia.org/T394273) (owner: 10raymond-ndibe) [19:01:11] (03update) 10raymond-ndibe: [jobs-api] use pydantic for all models [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/139 (https://phabricator.wikimedia.org/T389118) [19:02:31] 06cloud-services-team, 10Cloud-VPS: new, frequent DNS record leaks in wmcs - https://phabricator.wikimedia.org/T395037#10850252 (10Andrew) 05Open→03Resolved a:03Andrew [19:03:46] !log raymond-ndibe@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api [19:05:41] FIRING: CloudVPSDesignateLeaks: Detected 8 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [19:07:26] 06cloud-services-team, 10Toolforge: tools-webservice repo does not support merge requests from forks properly - https://phabricator.wikimedia.org/T374045#10850275 (10dcaro) 
→14Duplicate dup:03T394595 [19:07:30] 10Toolforge (Toolforge iteration 20), 13Patch-For-Review: [cicd] create cicd flow for non repo owners - https://phabricator.wikimedia.org/T394595#10850278 (10dcaro) [19:08:00] 06cloud-services-team, 10Toolforge: tools-webservice repo does not support merge requests from forks properly - https://phabricator.wikimedia.org/T374045#10850281 (10dcaro) 05Duplicate→03Open No sorry, this is a different task, related to {T394595} but not the same [19:08:31] 06cloud-services-team, 10Toolforge: [lima-kilo] toolforge_deploy_mr.py tdoes not support merge requests from forks properly - https://phabricator.wikimedia.org/T374045#10850286 (10dcaro) [19:08:44] 06cloud-services-team, 10Toolforge: [lima-kilo] toolforge_deploy_mr.py tdoes not support merge requests from forks properly - https://phabricator.wikimedia.org/T374045#10850287 (10dcaro) p:05Triage→03Medium [19:10:22] 06cloud-services-team, 10Toolforge: [components-api] Add webservice support - https://phabricator.wikimedia.org/T362077#10850305 (10dcaro) [19:11:03] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-1 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesse [19:11:42] 10Cloud Services Proposals, 10cloud-services-team (FY2024/2025-Q3-Q4), 10Toolforge (Toolforge iteration 20), 05Cloud-Services-Origin-Team, and 3 others: [Hypothesis] WE6.3.10 start a beta for the push-to-deploy features - https://phabricator.wikimedia.org/T393564#10850309 (10dcaro) [19:13:17] (03update) 10raymond-ndibe: [jobs-api] use pydantic for all models [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/139 
(https://phabricator.wikimedia.org/T389118) [19:13:22] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api [19:13:46] (03approved) 10raymond-ndibe: [jobs-api] use pydantic for all models [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/139 (https://phabricator.wikimedia.org/T389118) [19:13:56] (03update) 10raymond-ndibe: [jobs-api] use pydantic for all models [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/139 (https://phabricator.wikimedia.org/T389118) [19:15:11] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-2, tools-k8s-worker-nfs-53, tools-k8s-worker-nfs-47, tools-k8s-worker-nfs-78, tools-k8s-worker-nfs-14, tools-k8s-worker-nfs-1, tools-k8s-worker-nfs-21 [19:15:41] RESOLVED: CloudVPSDesignateLeaks: Detected 8 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [19:16:03] FIRING: [10x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-1 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [19:21:03] FIRING: [4x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - 
https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [19:26:03] FIRING: [5x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [19:26:05] !log raymond-ndibe@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component jobs-api [19:26:24] 06cloud-services-team, 10Tool-toolviews, 10Toolforge: Provide basic page view metrics for individual tools on toolforge - https://phabricator.wikimedia.org/T87001#10850373 (10MusikAnimal) a:05MusikAnimal→03None [19:26:52] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api [19:28:08] (03update) 10raymond-ndibe: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] (use_pydantic_for_core_job_model) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136) [19:31:03] FIRING: [6x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-10 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [19:36:00] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component 
envvars-api [19:36:03] FIRING: [5x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [19:39:18] vivian-rook closed https://github.com/toolforge/paws/pull/486 [19:41:13] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api [19:41:59] (03update) 10raymond-ndibe: jobs-api: bump to 0.0.377-20250521144143-46deeb2b [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/792 (https://phabricator.wikimedia.org/T390137) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [19:41:59] (03approved) 10raymond-ndibe: jobs-api: bump to 0.0.377-20250521144143-46deeb2b [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/792 (https://phabricator.wikimedia.org/T390137) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [19:42:04] (03merge) 10raymond-ndibe: jobs-api: bump to 0.0.377-20250521144143-46deeb2b [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/792 (https://phabricator.wikimedia.org/T390137) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [19:42:35] (03update) 10raymond-ndibe: envvars-api: bump to 0.0.69-20250520050021-383f7616 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/791 (https://phabricator.wikimedia.org/T391966) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [19:45:37] (03merge) 
10raymond-ndibe: [jobs-api] use pydantic for all models [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/139 (https://phabricator.wikimedia.org/T389118) [19:45:39] (03update) 10raymond-ndibe: [jobs-api] split job models to oneoff, scheduled and continuous [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/154 (https://phabricator.wikimedia.org/T389118 https://phabricator.wikimedia.org/T390136) [19:45:40] (03update) 10raymond-ndibe: [jobs-api] refactor quota models [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/164 (https://phabricator.wikimedia.org/T389118) [19:46:03] FIRING: [5x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [19:47:03] !log raymond-ndibe@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component envvars-api [19:47:18] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-2, tools-k8s-worker-nfs-53, tools-k8s-worker-nfs-47, tools-k8s-worker-nfs-78, tools-k8s-worker-nfs-14, tools-k8s-worker-nfs-1, tools-k8s-worker-nfs-21 [19:47:19] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component envvars-api [19:47:59] (03open) 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: jobs-api: bump to 0.0.378-20250522194547-6776b4db [repos/cloud/toolforge/toolforge-deploy] - 
10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/794 (https://phabricator.wikimedia.org/T389118) [20:01:03] FIRING: [6x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [20:03:18] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component envvars-api [20:11:54] !log raymond-ndibe@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api [20:15:56] (03open) 10raymond-ndibe: [toolforge_models] update toolforge_models.py [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/79 [20:16:03] FIRING: [6x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [20:19:52] (03update) 10raymond-ndibe: [toolforge_models] update toolforge_models.py [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/79 [20:20:21] (03update) 10raymond-ndibe: [components-api] support health-checks and port [repos/cloud/toolforge/components-api] (update_toolforge_models) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/75 
(https://phabricator.wikimedia.org/T362072) [20:20:27] (03update) 10raymond-ndibe: [components-api] support health-checks and port [repos/cloud/toolforge/components-api] (update_toolforge_models) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/75 (https://phabricator.wikimedia.org/T362072) [20:21:02] (03update) 10raymond-ndibe: [components-api] support health-checks and port [repos/cloud/toolforge/components-api] (update_toolforge_models) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/75 (https://phabricator.wikimedia.org/T362072) [20:21:03] FIRING: [7x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [20:22:53] (03update) 10raymond-ndibe: [components-api] support health-checks and port [repos/cloud/toolforge/components-api] (update_toolforge_models) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/75 (https://phabricator.wikimedia.org/T362072) [20:23:45] !log raymond-ndibe@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api [20:38:41] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-22, tools-k8s-worker-nfs-39, tools-k8s-worker-nfs-32, tools-k8s-worker-nfs-17, tools-k8s-worker-nfs-45, tools-k8s-worker-nfs-46, tools-k8s-worker-nfs-55 [20:46:03] FIRING: [8x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - 
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [21:01:03] FIRING: [8x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [21:06:03] FIRING: [8x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [21:11:03] FIRING: [8x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [21:16:03] FIRING: [8x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [21:17:00] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-22, tools-k8s-worker-nfs-39, tools-k8s-worker-nfs-32, tools-k8s-worker-nfs-17, tools-k8s-worker-nfs-45, tools-k8s-worker-nfs-46, tools-k8s-worker-nfs-55 [21:21:03] FIRING: [8x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [21:26:03] FIRING: [8x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [21:31:33] (03update) 10raymond-ndibe: [components-api] support health-checks and port [repos/cloud/toolforge/components-api] (update_toolforge_models) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/75 (https://phabricator.wikimedia.org/T362072) [21:34:55] !log raymond-ndibe@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api [21:35:25] (03approved) 10raymond-ndibe: envvars-api: bump to 0.0.69-20250520050021-383f7616 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/791 (https://phabricator.wikimedia.org/T391966) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) 
[21:35:26] (03update) 10raymond-ndibe: envvars-api: bump to 0.0.69-20250520050021-383f7616 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/791 (https://phabricator.wikimedia.org/T391966) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [21:35:41] (03merge) 10raymond-ndibe: envvars-api: bump to 0.0.69-20250520050021-383f7616 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/791 (https://phabricator.wikimedia.org/T391966) (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [21:35:41] FIRING: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [21:37:14] 06cloud-services-team, 10Toolforge: Toolforge Build Service does not support .python-version - https://phabricator.wikimedia.org/T381923#10850801 (10LucasWerkmeister) 05Resolved→03Open >>! In T381923#10813247, @LucasWerkmeister wrote: > Seems to be working \o/ I spoke too soon :( I just noticed that the t... [21:39:13] 06cloud-services-team, 10Toolforge: Toolforge Build Service does not support .python-version - https://phabricator.wikimedia.org/T381923#10850808 (10bd808) A build from https://gitlab.wikimedia.org/toolforge-repos/containers-bnc/-/commit/a937384ad626eb5902910906e37749348caa3757 worked for me. 
[21:41:03] FIRING: [8x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[21:45:41] RESOLVED: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[21:46:03] FIRING: [5x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[21:49:44] !log raymond-ndibe@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api
[21:51:03] FIRING: [5x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-17 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[21:56:35] cloud-services-team, Toolforge: Toolforge Build Service does not support .python-version - https://phabricator.wikimedia.org/T381923#10850835 (LucasWerkmeister) Open→Resolved False alarm – turns
out this error was caused by an unfortunate confluence of several factors: - Build service contai...
[22:03:32] (update) raymond-ndibe: jobs-api: bump to 0.0.378-20250522194547-6776b4db [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/794 (https://phabricator.wikimedia.org/T389118) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[22:03:35] (approved) raymond-ndibe: jobs-api: bump to 0.0.378-20250522194547-6776b4db [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/794 (https://phabricator.wikimedia.org/T389118) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[22:03:37] (update) raymond-ndibe: jobs-api: bump to 0.0.378-20250522194547-6776b4db [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/794 (https://phabricator.wikimedia.org/T389118) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[22:04:22] (merge) raymond-ndibe: jobs-api: bump to 0.0.378-20250522194547-6776b4db [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/794 (https://phabricator.wikimedia.org/T389118) (owner: group_203_bot_f4d95069bb2675e4ce1fff090c1c1620)
[22:48:26] (update) raymond-ndibe: [components-api] support health-checks and port [repos/cloud/toolforge/components-api] (update_toolforge_models) - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/75 (https://phabricator.wikimedia.org/T362072)
[22:48:51] cloud-services-team, Toolforge: Install mariadb-dump on Toolforge bastions - https://phabricator.wikimedia.org/T378882#10850911 (LucasWerkmeister) >>!
In T378882#10305208, @LucasWerkmeister wrote: > `toolforge jobs` does not support streaming the output, so to actually get an offsite backup I either have...
[22:52:38] Tools: cluebotng-staging tool uses ~560G of disk space - https://phabricator.wikimedia.org/T395006#10850912 (DamianZaremba) https://github.com/cluebotng/botng/commit/89e2e6635beb88a7288f9f94e84335ac166f4395 Before: ` tools.cluebotng-staging@tools-bastion-12:~$ du -shc botng-*.log 11G botng-20250501.log 11G...
[23:06:03] FIRING: [2x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-37 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess
[23:14:14] Tools: zoomviewer uses an unreasonable amount of disk space - https://phabricator.wikimedia.org/T395020#10850927 (tstarling) >>! In T285018#9536578, @tstarling wrote: > I reduced the expiry time to 30 days. > > Also, I fixed a bug causing originals to be deleted less than 1 day after download. > > Previou...