[00:04:34] RESOLVED: DiskSpace: Disk space cloudcontrol1005:9100:/ 5.763% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[00:56:16] FIRING: [3x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[01:01:16] RESOLVED: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[01:11:53] cloud-services-team, Toolforge: Toolforge jobs: increased exit code 137 rate since 2024-12-14 - https://phabricator.wikimedia.org/T382865#10474938 (JJMC89)
[01:12:09] cloud-services-team, Toolforge: Toolforge jobs: increased exit code 137 rate since 2024-12-14 - https://phabricator.wikimedia.org/T382865#10474939 (JJMC89)
[01:40:14] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10474962 (MuhammadShuaib) |**Wikitech account/LDAP:**| محمد شعیب| |**SUL account**| Yethrosh| |**Account linked on [[ https://idm.wikimedia.org/ | IDM ]]** |Y| |**I have visited [[ https://...
[02:00:36] cloud-services-team, Toolforge: [jobs-emailer] duplicate failure emails - https://phabricator.wikimedia.org/T382866#10474963 (JJMC89)
[02:19:20] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10474966 (Ladsgroup) Renamed the wikitech account to Yethrosh and force attached it. You should be able to access it now.
[02:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[02:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[02:59:27] cloud-services-team, Toolforge: [jobs-emailer] duplicate failure emails - https://phabricator.wikimedia.org/T382866#10474972 (JJMC89)
[03:11:43] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10474973 (MuhammadShuaib) >>! In T376267#10474966, @Ladsgroup wrote: > Renamed the wikitech account to Yethrosh and force attached it. You should be able to access it now. Thanks :)
[03:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[04:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[05:38:21] cloud-services-team, Toolforge: Toolforge jobs: increased exit code 137 rate since 2024-12-14 - https://phabricator.wikimedia.org/T382865#10475012 (JJMC89)
[06:03:02] cloud-services-team, Toolforge: [jobs-emailer] duplicate failure emails - https://phabricator.wikimedia.org/T382866#10475017 (JJMC89)
[07:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[08:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:38:06] cloud-services-team, Toolforge: Toolforge jobs: increased exit code 137 rate since 2024-12-14 - https://phabricator.wikimedia.org/T382865#10475410 (dcaro) I have not had time yet to take a look, but usually 137 might happen when the process is killed due to OOM (killed by the system), if you did not chan...
[09:44:55] PAWS: Update wikipedia_family.py at PAWS - https://phabricator.wikimedia.org/T383920#10475442 (Theklan) The issue wasn't at the family declaration, but was interesting: the script couldn't read a wiki I haven't visited before. Just opening each of the neglected wikis solved the issue.
[10:05:17] cloud-services-team, Toolforge: lima-kilo: Directory mount does not work on linux hosts (/bin/bash: line 1: ./lima-vm/install.sh: No such file or directory) - https://phabricator.wikimedia.org/T384142#10475512 (dcaro) I'm using it on fedora 41 without issues (recreated it last week), I'll try again see i...
[10:09:07] cloud-services-team, Toolforge: lima-kilo: Directory mount does not work on linux hosts (/bin/bash: line 1: ./lima-vm/install.sh: No such file or directory) - https://phabricator.wikimedia.org/T384142#10475534 (dcaro) >>! In T384142#10475512, @dcaro wrote: > I'm using it on fedora 41 without issues (recr...
[10:17:02] cloud-services-team, Toolforge: lima-kilo: Directory mount does not work on linux hosts (/bin/bash: line 1: ./lima-vm/install.sh: No such file or directory) - https://phabricator.wikimedia.org/T384142#10475555 (dcaro) It seems that `9p` was made the default when bumping to `1.0.0`: {F58229150} In one of...
[10:28:45] (PS1) Slyngshede: P:idp missing airflow secret [labs/private] - https://gerrit.wikimedia.org/r/1112706
[10:29:35] (CR) Slyngshede: [V:+2 C:+2] P:idp missing airflow secret [labs/private] - https://gerrit.wikimedia.org/r/1112706 (owner: Slyngshede)
[10:39:49] (open) dcaro: limactl: replace LIMA_CIDATA_* with variables [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/219 (https://phabricator.wikimedia.org/T384140)
[10:40:20] (open) dcaro: start-devenv: set the mounttype to reverse-sshfs for linux [repos/cloud/toolforge/lima-kilo] (fix_lima_cidata) - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/220 (https://phabricator.wikimedia.org/T384142)
[10:41:34] (open) dcaro: start-devenv: add a simple warning to upgrade limactl [repos/cloud/toolforge/lima-kilo] (fix_mount_linux) - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/221
[10:42:45] (update) dcaro: start-devenv: add a simple warning to upgrade limactl [repos/cloud/toolforge/lima-kilo] (fix_mount_linux) - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/221
[10:42:46] (update) dcaro: start-devenv: add a simple warning to upgrade limactl [repos/cloud/toolforge/lima-kilo] (fix_mount_linux) - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/221
[10:43:50] cloud-services-team, Toolforge (Toolforge iteration 17), Patch-For-Review: lima-kilo: Directory mount does not work on linux hosts (/bin/bash: line 1: ./lima-vm/install.sh: No such file or directory) - https://phabricator.wikimedia.org/T384142#10475654 (dcaro) a:dcaro
[10:44:01] cloud-services-team, Toolforge (Toolforge iteration 17), Patch-For-Review: lima-kilo: Directory mount does not work on linux hosts (/bin/bash: line 1: ./lima-vm/install.sh: No such file or directory) - https://phabricator.wikimedia.org/T384142#10475657 (dcaro) Open→In progress
[10:44:11] (PS2) Majavah: templates: Use Codex for tool project/repo pages [labs/striker] - https://gerrit.wikimedia.org/r/1112330 (https://phabricator.wikimedia.org/T380114)
[10:44:12] (PS1) Majavah: build: Fix node test pipeline [labs/striker] - https://gerrit.wikimedia.org/r/1112709
[10:44:23] cloud-services-team, Toolforge (Toolforge iteration 17), Patch-For-Review: lima-kilo: provisioning scripts should not reference the LIMA_CIDATA variables - https://phabricator.wikimedia.org/T384140#10475658 (dcaro) a:dcaro
[10:44:33] cloud-services-team, Toolforge (Toolforge iteration 17), Patch-For-Review: lima-kilo: provisioning scripts should not reference the LIMA_CIDATA variables - https://phabricator.wikimedia.org/T384140#10475661 (dcaro) Open→In progress
[10:44:42] cloud-services-team, Toolforge (Toolforge iteration 17), Patch-For-Review: lima-kilo: provisioning scripts should not reference the LIMA_CIDATA variables - https://phabricator.wikimedia.org/T384140#10475662 (dcaro) p:Triage→Low
[10:44:54] cloud-services-team, Toolforge (Toolforge iteration 17), Patch-For-Review: lima-kilo: Directory mount does not work on linux hosts (/bin/bash: line 1: ./lima-vm/install.sh: No such file or directory) - https://phabricator.wikimedia.org/T384142#10475663 (dcaro) p:Triage→High
[10:47:08] (PS2) Majavah: build: Fix node test pipeline [labs/striker] - https://gerrit.wikimedia.org/r/1112709
[10:47:14] (CR) Majavah: [C:+2] templates: Use Codex for tool project/repo pages [labs/striker] - https://gerrit.wikimedia.org/r/1112330 (https://phabricator.wikimedia.org/T380114) (owner: Majavah)
[10:48:29] (Merged) jenkins-bot: templates: Use Codex for tool project/repo pages [labs/striker] - https://gerrit.wikimedia.org/r/1112330 (https://phabricator.wikimedia.org/T380114) (owner: Majavah)
[10:48:51] (CR) Majavah: [C:+2] build: Fix node test pipeline [labs/striker] - https://gerrit.wikimedia.org/r/1112709 (owner: Majavah)
[10:50:12] (Merged) jenkins-bot: build: Fix node test pipeline [labs/striker] - https://gerrit.wikimedia.org/r/1112709 (owner: Majavah)
[10:53:27] cloud-services-team, Toolforge (Toolforge iteration 17), Patch-For-Review: [jobs-api] treat URLs with and without a trailing slash the same - https://phabricator.wikimedia.org/T383798#10475686 (dcaro) a:Slst2020→dcaro
[10:53:33] Toolforge (Toolforge iteration 17), Patch-For-Review: [components-api, components-cli] deploy-token: separate create from update - https://phabricator.wikimedia.org/T380706#10475688 (dcaro) a:Slst2020→dcaro
[10:53:35] Toolforge (Toolforge iteration 17): [components-api] add basic prometheus instrumentation - https://phabricator.wikimedia.org/T381249#10475689 (dcaro) a:Slst2020→dcaro
[10:53:51] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 17), Patch-For-Review: [components-api] Add functional tests for the components api - https://phabricator.wikimedia.org/T379092#10475691 (dcaro) a:Slst2020→dcaro
[10:59:21] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned, Patch-For-Review: [promethus,haproxy] Move to haproxy internal metrics from haproxy_exporter - https://phabricator.wikimedia.org/T343885#10475713 (fnegri) The Grafana dashboard [...
[11:10:18] Tools: CropTool sometimes locks and have to be manually restarted - https://phabricator.wikimedia.org/T198503#10475751 (dcaro) >>! In T198503#10468362, @bd808 wrote: > If this is still an issue that @Danmichaelo sees in Toolforge the modern "fix" might be adding a health check that makes sure that the PHP in...
[11:22:55] (approved) fnegri: start-devenv: add a simple warning to upgrade limactl [repos/cloud/toolforge/lima-kilo] (fix_mount_linux) - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/221 (owner: dcaro)
[11:23:08] (approved) fnegri: start-devenv: set the mounttype to reverse-sshfs for linux [repos/cloud/toolforge/lima-kilo] (fix_lima_cidata) - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/220 (https://phabricator.wikimedia.org/T384142) (owner: dcaro)
[11:24:37] (approved) fnegri: limactl: replace LIMA_CIDATA_* with variables [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/219 (https://phabricator.wikimedia.org/T384140) (owner: dcaro)
[12:12:16] (merge) dcaro: limactl: replace LIMA_CIDATA_* with variables [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/219 (https://phabricator.wikimedia.org/T384140)
[12:12:17] (update) dcaro: start-devenv: set the mounttype to reverse-sshfs for linux [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/220 (https://phabricator.wikimedia.org/T384142)
[12:12:48] (merge) dcaro: start-devenv: set the mounttype to reverse-sshfs for linux [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/220 (https://phabricator.wikimedia.org/T384142)
[12:12:48] (update) dcaro: start-devenv: add a simple warning to upgrade limactl [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/221
[12:13:18] (update) dcaro: start-devenv: add a simple warning to upgrade limactl [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/221
[12:13:23] (merge) dcaro: start-devenv: add a simple warning to upgrade limactl [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/221
[12:14:04] cloud-services-team, Toolforge (Toolforge iteration 17), Patch-For-Review: lima-kilo: provisioning scripts should not reference the LIMA_CIDATA variables - https://phabricator.wikimedia.org/T384140#10475884 (dcaro) In progress→Resolved
[12:14:09] cloud-services-team, Toolforge (Toolforge iteration 17), Patch-For-Review: lima-kilo: Directory mount does not work on linux hosts (/bin/bash: line 1: ./lima-vm/install.sh: No such file or directory) - https://phabricator.wikimedia.org/T384142#10475886 (dcaro) In progress→Resolved
[12:16:06] cloud-services-team, Toolforge: [infra,k8s,o11y] Introduce worker checks - https://phabricator.wikimedia.org/T380985#10475889 (dcaro) This might be a duplicated of the old {T242637}
[12:16:43] (update) dcaro: api: make trailing slash optional [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/137 (https://phabricator.wikimedia.org/T383798) (owner: sstefanova)
[12:20:34] FIRING: DiskSpace: Disk space cloudcontrol1005:9100:/ 5.976% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[12:30:34] RESOLVED: DiskSpace: Disk space cloudcontrol1005:9100:/ 5.953% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[12:42:34] FIRING: DiskSpace: Disk space cloudcontrol1005:9100:/ 5.991% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[12:45:08] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned: [metricsinfra] alerts do not get propagated to prod alertmanager - https://phabricator.wikimedia.org/T384200 (dcaro) NEW p:Triage→High
[12:45:17] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned: [metricsinfra] alerts do not get propagated to prod alertmanager - https://phabricator.wikimedia.org/T384200#10475949 (dcaro) Open→In progress
[12:49:43] cloud-services-team, Toolforge, Epic: Toolforge UI: Investigate integration of Striker functionality - https://phabricator.wikimedia.org/T383146#10475958 (Sarai-WMF)
[12:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:07:56] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-Services-Origin-Team, Cloud-Services-Worktype-Unplanned: [metricsinfra] alerts do not get propagated to prod alertmanager - https://phabricator.wikimedia.org/T384200#10476012 (dcaro) So the connectivity between prometheus and alertmanager was not working...
[13:09:27] (update) dcaro: scheduled job: add timeout parameter [repos/cloud/toolforge/jobs-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/76 (https://phabricator.wikimedia.org/T306391)
[13:20:34] (PS1) David Caro: roll_restart_osd_daemons: allow oking all the rest of osds [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1112754
[13:25:01] (CR) CI reject: [V:-1] roll_restart_osd_daemons: allow oking all the rest of osds [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1112754 (owner: David Caro)
[13:25:21] (PS2) David Caro: roll_restart_osd_daemons: allow oking all the rest of osds [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1112754
[13:26:21] (CR) David Caro: roll_restart_osd_daemons: allow oking all the rest of osds (1 comment) [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1112754 (owner: David Caro)
[13:28:25] !log dcaro@urcuchillay toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component components-cli
[13:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[13:30:11] Toolforge (Toolforge iteration 17): [components-cli,lima-kilo] deploy compontents-cli on lima-kilo by default - https://phabricator.wikimedia.org/T384203 (dcaro) NEW p:Triage→High
[13:30:16] Toolforge (Toolforge iteration 17): [components-cli,lima-kilo] deploy compontents-cli on lima-kilo by default - https://phabricator.wikimedia.org/T384203#10476073 (dcaro) a:dcaro
[13:30:37] Toolforge (Toolforge iteration 17): [components-cli,lima-kilo] deploy compontents-cli on lima-kilo by default - https://phabricator.wikimedia.org/T384203#10476075 (dcaro) Open→In progress
[13:31:59] Toolforge (Toolforge iteration 17): [components-cli,lima-kilo] deploy compontents-cli on lima-kilo by default - https://phabricator.wikimedia.org/T384203#10476082 (dcaro)
[13:32:03] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 17), Patch-For-Review: [components-api] Add functional tests for the components api - https://phabricator.wikimedia.org/T379092#10476083 (dcaro)
[13:33:32] (open) dcaro: packages: install components-cli by default [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/222 (https://phabricator.wikimedia.org/T384203)
[13:36:38] !log dcaro@urcuchillay toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component components-cli
[13:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[13:40:29] (update) dcaro: packages: install components-cli by default [repos/cloud/toolforge/lima-kilo] - https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/222 (https://phabricator.wikimedia.org/T384203)
[13:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[14:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[14:08:39] cloud-services-team, Toolforge, Epic: Toolforge UI: Investigate integration of Striker functionality - https://phabricator.wikimedia.org/T383146#10476249 (taavi)
[14:15:22] FIRING: [2x] HAProxyBackendUnavailable: HAProxy service keystone-admin-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[14:15:46] (update) raymond-ndibe: [jobs-api] refactor validate_kube_quant [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/118 (https://phabricator.wikimedia.org/T361120)
[14:15:59] Tool-schedule-deployment: Allow scheduling for current backport window - https://phabricator.wikimedia.org/T381237#10476265 (Lucas_Werkmeister_WMDE) >>! In T381237#10382626, @kostajh wrote: > IMO if the submission happens at any time during the current deployment window, it's reasonable. Assumption is that t...
[14:16:00] (close) raymond-ndibe: [jobs-api] refactor validate_kube_quant [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/118 (https://phabricator.wikimedia.org/T361120)
[14:16:54] Striker: Stop trying to store MW real name in Striker - https://phabricator.wikimedia.org/T384206 (taavi) NEW
[14:17:34] RESOLVED: DiskSpace: Disk space cloudcontrol1005:9100:/ 5.68% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=cloudcontrol1005 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[14:20:22] RESOLVED: [2x] HAProxyBackendUnavailable: HAProxy service keystone-admin-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable
[15:03:54] wikitech.wikimedia.org, Wikimedia-Site-requests, Patch-For-Review: Drop local 'OAuth administrators' group from Wikitech - https://phabricator.wikimedia.org/T384122#10476666 (taavi) Open→Resolved
[15:04:01] wikitech.wikimedia.org, Patch-For-Review: Remove OATH validators user group in Wikitech - https://phabricator.wikimedia.org/T384123#10476667 (taavi) Open→Resolved
[15:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:02:08] cloud-services-team: PuppetDisabled - https://phabricator.wikimedia.org/T384082#10476972 (fnegri) Puppet was manually disabled for a while, but it's now enabled again: ` fnegri@cloudcontrol2004-dev:~$ sudo zless /var/log/puppet.log.3.gz [...] 2025-01-17T01:10:16.533143+00:00 cloudcontrol2004-dev puppet-agen...
[16:02:57] cloud-services-team: PuppetDisabled - https://phabricator.wikimedia.org/T384082#10476977 (fnegri) Open→Resolved a:fnegri
[16:20:08] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate pontoon-puppet-01.monitoring.eqiad.wmflabs is about to expire in 25d 23h 56m 27s - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetCertificateAboutToExpire - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetCertificateAboutToExpire
[16:23:37] cloud-services-team: PuppetDisabled Puppet disabled on cloudcontrol2004-dev:9100 - https://phabricator.wikimedia.org/T384072#10477105 (fnegri) →Duplicate dup:T384082
[16:23:38] cloud-services-team: PuppetDisabled - https://phabricator.wikimedia.org/T384082#10477107 (fnegri)
[16:23:51] cloud-services-team, Cloud-VPS: prometheus wmcloud alerts stopped sending emails - https://phabricator.wikimedia.org/T380901#10477108 (dcaro) Did a bit of digging during the sync :) It seems we are failing to authenticate to the lists.wikimedia.org server: ` 2025-01-20 16:20:13 1tZuVU-00435P-1X ** cloud...
[16:27:24] cloud-services-team, Cloud-VPS: prometheus wmcloud alerts stopped sending emails - https://phabricator.wikimedia.org/T380901#10477135 (taavi) Probably due to this: `lang=shell-session taavi@runko:~ $ host wmflabs.org wmflabs.org has address 185.15.56.49 wmflabs.org mail is handled by 10 mx1001.wikimedia....
[16:39:03] (PS2) Andrew Bogott: Re-enable the network panel for instance creation [openstack/horizon/horizon] (2024.1) - https://gerrit.wikimedia.org/r/1110464 (https://phabricator.wikimedia.org/T380081)
[16:39:03] (PS2) Andrew Bogott: launch-instance-model: support default network ID in the network panel [openstack/horizon/horizon] (2024.1) - https://gerrit.wikimedia.org/r/1110465 (https://phabricator.wikimedia.org/T380081)
[17:27:27] (PS1) Vgutierrez: secret: Add dummy pki.goog staging private key [labs/private] - https://gerrit.wikimedia.org/r/1112814
[17:28:03] (CR) Vgutierrez: [V:+2 C:+2] secret: Add dummy pki.goog staging private key [labs/private] - https://gerrit.wikimedia.org/r/1112814 (owner: Vgutierrez)
[17:38:14] cloud-services-team, Toolforge: Toolforge jobs: increased exit code 137 rate since 2024-12-14 - https://phabricator.wikimedia.org/T382865#10477377 (JJMC89)
[17:49:05] cloud-services-team, Toolforge: Toolforge jobs: increased exit code 137 rate since 2024-12-14 - https://phabricator.wikimedia.org/T382865#10477423 (JJMC89) If they're being OOM killed, it is not due to needing to exceed the requested resources. Also, I think the email will say that it was OOM killed when...
[18:07:03] cloud-services-team, Toolforge: Toolforge jobs: increased exit code 137 rate since 2024-12-14 - https://phabricator.wikimedia.org/T382865#10477456 (dcaro) >>! In T382865#10477423, @JJMC89 wrote: > If they're being OOM killed, it is not due to needing to exceed the requested resources. Also, I think the e...
[18:18:06] FIRING: [2x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_tool_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[18:18:28] cloud-services-team, Toolforge (Toolforge iteration 17): [infra,k8s] Workers are not rotating the /var/log/acctount/pacct log and it's growing - https://phabricator.wikimedia.org/T384250 (dcaro) NEW
[18:19:31] cloud-services-team, Toolforge (Toolforge iteration 17): [jobs-cli] If the pod exists and it has no logs, read the message status from it and output that - https://phabricator.wikimedia.org/T384251 (dcaro) NEW
[18:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[18:20:55] cloud-services-team, Toolforge (Toolforge iteration 17): [jobs-emailer] If the pod is in error status, try to get the status.message field in the email, otherwise just 'error' is not that useful - https://phabricator.wikimedia.org/T384252 (dcaro) NEW
[18:23:06] FIRING: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[18:28:06] RESOLVED: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[18:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[18:33:31] (open) bd808: phabricator: Catch JSONDecodeError [toolforge-repos/gitlab-account-approval] - https://gitlab.wikimedia.org/toolforge-repos/gitlab-account-approval/-/merge_requests/16
[18:33:50] cloud-services-team, Toolforge (Toolforge iteration 17): [infra,k8s] Workers are not rotating the /var/log/acctount/pacct log and it's growing - https://phabricator.wikimedia.org/T384250#10477549 (dcaro) Did a quick look, these are the workers currently having issues: {F58231015} Used the query `node_fi...
[19:37:17] (03update) 10raymond-ndibe: [jobs-api] support http health check [repos/cloud/toolforge/jobs-api] (support_script_health_check_for_all_jobs) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/136 (https://phabricator.wikimedia.org/T362621) [19:38:04] (03update) 10raymond-ndibe: [jobs-api] support http health check [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/136 (https://phabricator.wikimedia.org/T362621) [19:38:21] (03close) 10raymond-ndibe: [jobs-api] support script healthcheck for all job types [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/135 (https://phabricator.wikimedia.org/T377420) [19:55:50] (03update) 10raymond-ndibe: [jobs-cli] support http healthcheck for continuous jobs [repos/cloud/toolforge/jobs-cli] (support_script_healthcheck_for_all_jobs) - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/81 (https://phabricator.wikimedia.org/T362621) [19:56:43] (03update) 10raymond-ndibe: [jobs-cli] support http healthcheck for continuous jobs [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/81 (https://phabricator.wikimedia.org/T362621) [19:56:57] (03close) 10raymond-ndibe: [jobs-cli] support script healthcheck for all jobs [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/80 (https://phabricator.wikimedia.org/T377420) [20:14:17] (03update) 10raymond-ndibe: [jobs-api] support http health check [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/136 (https://phabricator.wikimedia.org/T362621) [21:04:26] (03update) 10raymond-ndibe: [jobs-cli] support http healthcheck for continuous jobs [repos/cloud/toolforge/jobs-cli] - 
https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/81 (https://phabricator.wikimedia.org/T362621)
[21:04:57] (update) raymond-ndibe: [jobs-api] support http health check [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/136 (https://phabricator.wikimedia.org/T362621)
[21:09:45] (merge) bd808: phabricator: Catch JSONDecodeError [toolforge-repos/gitlab-account-approval] - https://gitlab.wikimedia.org/toolforge-repos/gitlab-account-approval/-/merge_requests/16
[21:21:39] FIRING: [2x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[21:26:39] RESOLVED: [4x] ProbeDown: Service tools-legacy-redirector-2:443 has failed probes (http_tools_wmflabs_org_main_page_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#tools-legacy-redirector-2:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[21:30:05] Tool-ranker, translatewiki.net, LPL Essential (LPL Essential 2024 Nov-Dec), Unplanned-Sprint-Work: Add Ranker to translatewiki.net - https://phabricator.wikimedia.org/T384061#10477892 (Wangombe) Open→In progress a:Wangombe
[21:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[22:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records -
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[22:17:30] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[22:19:20] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29349 bytes in 0.927 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[22:25:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-58 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[22:35:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-58 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[23:00:30] cloud-services-team, wikitech.wikimedia.org, Infrastructure-Foundations, Epic: Make Wikitech an SUL wiki - https://phabricator.wikimedia.org/T161859#10478032 (bd808) @Ladsgroup I have made some progress on the tasks I volunteered for in our discussion last week. I created a new wikitech-sul-migr...
[23:16:22] wikitech.wikimedia.org, serviceops-radar, SRE, SRE-Unowned: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10478045 (Andrew) On @CDanis's suggestion, 'static wikitech-static' can now be built in a docker container using https://gitlab.wikimedia.org/repos/sre/wikitech-static-...
[23:17:52] (open) bd808: dev: Upgrade to mwclient 0.11.0 [toolforge-repos/gitlab-account-approval] - https://gitlab.wikimedia.org/toolforge-repos/gitlab-account-approval/-/merge_requests/17 (https://phabricator.wikimedia.org/T372311)
[23:22:37] (merge) bd808: dev: Upgrade to mwclient 0.11.0 [toolforge-repos/gitlab-account-approval] - https://gitlab.wikimedia.org/toolforge-repos/gitlab-account-approval/-/merge_requests/17 (https://phabricator.wikimedia.org/T372311)
[23:27:12] Tool-gitlab-account-approval, Patch-For-Review: Upgrade to mwclient 0.11.0 - https://phabricator.wikimedia.org/T372311#10478055 (bd808) Open→Resolved p:Triage→Medium a:bd808
[23:33:47] Striker, Tool-phab-ban, Bitu, MediaWiki-Action-API, Stashbot: Removal of writeapi from siteinfo output breaks all mwclient-based bots, including stashbot (Server Admin Log) - https://phabricator.wikimedia.org/T371977#10478063 (bd808)