[00:52:20] PROBLEM - Host cloudcephosd1016 is DOWN: PING CRITICAL - Packet loss = 100% [00:53:48] RECOVERY - Host cloudcephosd1016 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [00:58:28] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.osd.undrain_node [01:38:51] (03update) 10raymond-ndibe: [helm image publish]: publish to reggie repo if PR owner not repo owner [repos/cloud/cicd/gitlab-ci] - 10https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/61 (https://phabricator.wikimedia.org/T394595) [01:40:38] (03update) 10raymond-ndibe: [helm image publish]: publish to reggie repo if PR owner not repo owner [repos/cloud/cicd/gitlab-ci] - 10https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/61 (https://phabricator.wikimedia.org/T394595) [01:42:19] (03update) 10raymond-ndibe: [helm image publish]: publish to reggie repo if PR owner not repo owner [repos/cloud/cicd/gitlab-ci] - 10https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/61 (https://phabricator.wikimedia.org/T394595) [01:44:11] (03update) 10raymond-ndibe: [DO NOT MERGE] testing gitlab ci changes [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/213 [01:49:17] (03update) 10raymond-ndibe: [helm image publish]: publish to reggie repo if PR owner not repo owner [repos/cloud/cicd/gitlab-ci] - 10https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/61 (https://phabricator.wikimedia.org/T394595) [01:51:29] (03update) 10raymond-ndibe: [DO NOT MERGE] testing gitlab ci changes [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/213 [02:00:44] (03open) 10raymond-ndibe: [registry-admission] add releng registry to local allowedRegistries [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/958 
(https://phabricator.wikimedia.org/T394595) [02:03:46] (03update) 10raymond-ndibe: [toolforge_deploy_mr.py] support deploy of MRs from external contributors [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/273 (https://phabricator.wikimedia.org/T394595) [02:27:16] (03update) 10raymond-ndibe: [toolforge_deploy_mr.py] support deploy of MRs from external contributors [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/273 (https://phabricator.wikimedia.org/T394595) [02:59:56] (03update) 10raymond-ndibe: [DO NOT MERGE] testing gitlab ci changes [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/213 [03:03:04] FIRING: ObjectStorageSizeQuotaFull: Object storage quota by 'size' is 80.84% full for project tools - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/ObjectStorageSizeQuotaFull - https://grafana.wikimedia.org/d/7120b794-4638-49f5-bccd-9716efc60f24/wmcs-object-storage-quotas - https://alerts.wikimedia.org/?q=alertname%3DObjectStorageSizeQuotaFull [03:09:33] (03update) 10raymond-ndibe: [helm image publish]: publish to reggie repo if PR owner not repo owner [repos/cloud/cicd/gitlab-ci] - 10https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/61 (https://phabricator.wikimedia.org/T394595) [03:11:33] (03update) 10raymond-ndibe: [DO NOT MERGE] testing gitlab ci changes [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/213 [03:21:28] FIRING: InstanceDown: Project tools instance tools-prometheus-9 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [03:23:55] (03update) 10raymond-ndibe: [helm image publish]: publish to reggie repo if PR owner not repo owner [repos/cloud/cicd/gitlab-ci] - 
10https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/61 (https://phabricator.wikimedia.org/T394595) [03:24:47] (03update) 10raymond-ndibe: [DO NOT MERGE] testing gitlab ci changes [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/213 [03:28:21] (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. [labs/tools/Isa] - 10https://gerrit.wikimedia.org/r/1185912 (owner: 10L10n-bot) [03:50:20] (03update) 10raymond-ndibe: [toolforge_deploy_mr.py] support deploy of MRs from external contributors [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/273 (https://phabricator.wikimedia.org/T394595) [03:53:12] (03update) 10raymond-ndibe: [helm image publish]: publish to reggie repo if PR owner not repo owner [repos/cloud/cicd/gitlab-ci] - 10https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/61 (https://phabricator.wikimedia.org/T394595) [03:54:27] (03update) 10raymond-ndibe: [helm image publish]: publish to reggie repo if PR owner not repo owner [repos/cloud/cicd/gitlab-ci] - 10https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/61 (https://phabricator.wikimedia.org/T394595) [03:55:15] (03update) 10raymond-ndibe: [helm image publish]: publish to reggie repo if PR owner not repo owner [repos/cloud/cicd/gitlab-ci] - 10https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/61 (https://phabricator.wikimedia.org/T394595) [03:55:30] (03update) 10raymond-ndibe: [toolforge_deploy_mr.py] support deploy of MRs from external contributors [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/273 (https://phabricator.wikimedia.org/T394595) [03:55:49] (03update) 10raymond-ndibe: [toolforge_deploy_mr.py] support deploy of MRs from external contributors [repos/cloud/toolforge/lima-kilo] - 
10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/273 (https://phabricator.wikimedia.org/T394595) [03:56:05] (03update) 10raymond-ndibe: [registry-admission] add releng registry to local allowedRegistries [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/958 (https://phabricator.wikimedia.org/T394595) [03:57:51] (03update) 10raymond-ndibe: [toolforge_deploy_mr.py] support deploy of MRs from external contributors [repos/cloud/toolforge/lima-kilo] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/273 (https://phabricator.wikimedia.org/T394595) [06:06:15] 10Tool-archive-externa-links: Dashboard creation - https://phabricator.wikimedia.org/T399889#11161085 (10poro26) a:03poro26 >>! In T399889#11155459, @Aklapper wrote: > I missed this ticket as it was not tagged with #Project-Admins or #Phabricator (please don't assign tasks to me without my agreem... 
[07:03:04] FIRING: ObjectStorageSizeQuotaFull: Object storage quota by 'size' is 80.84% full for project tools - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/ObjectStorageSizeQuotaFull - https://grafana.wikimedia.org/d/7120b794-4638-49f5-bccd-9716efc60f24/wmcs-object-storage-quotas - https://alerts.wikimedia.org/?q=alertname%3DObjectStorageSizeQuotaFull [08:02:40] (03update) 10dcaro: package: upgrade deps [repos/cloud/toolforge/volume-admission] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/volume-admission/-/merge_requests/35 [08:03:44] (03approved) 10filippo: quota: use the same value for `request` and `limit` [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/74 (owner: 10dcaro) [08:06:08] (03merge) 10dcaro: quota: use the same value for `request` and `limit` [repos/cloud/toolforge/maintain-kubeusers] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/74 [08:07:21] (03approved) 10dcaro: [registry-admission] add releng registry to local allowedRegistries [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/958 (https://phabricator.wikimedia.org/T394595) (owner: 10raymond-ndibe) [08:09:05] (03open) 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: maintain-kubeusers: bump to 0.0.181-20250909080634-8d3f947f [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/959 (https://phabricator.wikimedia.org/T403962) [08:11:28] RESOLVED: InstanceDown: Project tools instance tools-prometheus-9 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [08:17:01] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component mainain-kubeusers [08:17:05] !log dcaro@cloudcumin1001 toolsbeta END (FAIL) - Cookbook 
wmcs.toolforge.component.deploy (exit_code=99) for component mainain-kubeusers [08:17:11] !log dcaro@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers [08:33:55] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete stub secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1182695 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [08:36:05] !log dcaro@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component maintain-kubeusers [08:45:38] PROBLEM - mysqld processes on an-redacteddb1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [08:50:19] 06cloud-services-team, 10Toolforge: toolforge ssh login hangs right before prompt - https://phabricator.wikimedia.org/T404047 (10Magnus) 03NEW [08:52:45] 06cloud-services-team, 10Toolforge: toolforge ssh login hangs right before prompt - https://phabricator.wikimedia.org/T404047#11161526 (10dcaro) a:03dcaro Looking [08:54:11] 06cloud-services-team, 10Toolforge: toolforge ssh login hangs right before prompt - https://phabricator.wikimedia.org/T404047#11161532 (10dcaro) NFS went away, rebooting: ` [Tue Sep 9 05:02:50 2025] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying ` [08:55:10] !log dcaro@acme tools START - Cookbook wmcs.vps.instance.force_reboot (T404047) [08:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [08:55:15] T404047: toolforge ssh login hangs right before prompt - https://phabricator.wikimedia.org/T404047 [08:55:17] !log dcaro@acme tools END (PASS) - Cookbook wmcs.vps.instance.force_reboot (exit_code=0) (T404047) [08:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [08:57:12] (03PS1) 10David Caro: vps.instance.force_reboot: add cookbook [cloud/wmcs-cookbooks] - 
10https://gerrit.wikimedia.org/r/1186443 [09:02:15] (03CR) 10Filippo Giunchedi: "LGTM, just a minor thing inline" [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186443 (owner: 10David Caro) [09:34:36] !log dcaro@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers [09:38:07] 06cloud-services-team, 10Cloud-VPS: Add fqdn input to instance-related wmcs cookbooks - https://phabricator.wikimedia.org/T404052 (10fgiunchedi) 03NEW [09:49:12] 06cloud-services-team, 10Toolforge, 13Patch-For-Review: jobs-api limits quota in restrictive ways - https://phabricator.wikimedia.org/T403962#11161727 (10dcaro) The requests limit change has been deployed: ` tools.wm-lol@tools-bastion-13:~$ kubectl describe quota Name: tool-wm-lol Namespace... [09:51:02] 10Toolforge (Quota-requests): Request increased quota for cluebotng-review Toolforge tool - https://phabricator.wikimedia.org/T403964#11161730 (10dcaro) > Generally it would be useful to have some more quota for this tool though, we could get away with slightly less than specified above if things are very tight,... 
[09:54:30] (03CR) 10Volans: "left couple of suggestions" [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186443 (owner: 10David Caro) [09:56:02] !log dcaro@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component maintain-kubeusers [09:56:09] 06cloud-services-team, 10Toolforge: Improve detection of failing ssh to toolforge bastions - https://phabricator.wikimedia.org/T404054 (10fgiunchedi) 03NEW [09:56:36] 06cloud-services-team, 10Toolforge: toolforge ssh login hangs right before prompt - https://phabricator.wikimedia.org/T404047#11161754 (10fgiunchedi) 05Open→03Resolved Optimistically resolving, followup for improved alerting is at {T404054} [09:56:59] 06cloud-services-team, 10Toolforge: Improve detection of failing ssh to toolforge bastions - https://phabricator.wikimedia.org/T404054#11161758 (10fgiunchedi) [10:05:42] (03CR) 10David Caro: vps.instance.force_reboot: add cookbook (031 comment) [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186443 (owner: 10David Caro) [10:05:59] (03CR) 10David Caro: vps.instance.force_reboot: add cookbook (031 comment) [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186443 (owner: 10David Caro) [10:12:51] (03PS2) 10David Caro: vps.instance.force_reboot: add cookbook [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186443 [10:12:51] (03CR) 10David Caro: vps.instance.force_reboot: add cookbook (032 comments) [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186443 (owner: 10David Caro) [10:13:57] (03CR) 10David Caro: vps.instance.force_reboot: add cookbook (031 comment) [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186443 (owner: 10David Caro) [10:26:41] (03CR) 10Volans: "replies inline" [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186443 (owner: 10David Caro) [10:36:41] RECOVERY - mysqld processes on an-redacteddb1001 is OK: PROCS OK: 8 processes with command name mysqld 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [11:03:04] FIRING: ObjectStorageSizeQuotaFull: Object storage quota by 'size' is 81.21% full for project tools - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/ObjectStorageSizeQuotaFull - https://grafana.wikimedia.org/d/7120b794-4638-49f5-bccd-9716efc60f24/wmcs-object-storage-quotas - https://alerts.wikimedia.org/?q=alertname%3DObjectStorageSizeQuotaFull [11:24:58] (03CR) 10David Caro: vps.instance.force_reboot: add cookbook (031 comment) [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186443 (owner: 10David Caro) [11:46:58] (03PS1) 10David Caro: global: minor cleanups [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186483 [11:47:07] (03CR) 10CI reject: [V:04-1] global: minor cleanups [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186483 (owner: 10David Caro) [11:47:52] (03PS3) 10David Caro: vps.instance.force_reboot: add cookbook [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186443 [11:51:56] FIRING: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown [11:52:28] FIRING: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [11:55:07] (03PS4) 10David Caro: vps.instance.force_reboot: add cookbook [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186443 [11:55:07] (03PS2) 10David Caro: global: minor cleanups [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186483 [11:55:49] 06cloud-services-team, 10Toolforge: disable_tool fails to archive mpic-alpha-demo - https://phabricator.wikimedia.org/T404072 (10fgiunchedi) 03NEW [11:58:47] (03CR) 10CI reject: [V:04-1] global: minor cleanups [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186483 (owner: 10David Caro) [12:07:21] !log dcaro@cloudcumin1001 paws 
START - Cookbook wmcs.openstack.cloudvirt.vm_console [12:09:09] !log dcaro@cloudcumin1001 paws END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0) [12:15:17] !log dcaro@acme paws START - Cookbook wmcs.vps.instance.force_reboot [12:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL [12:15:25] !log dcaro@acme paws END (PASS) - Cookbook wmcs.vps.instance.force_reboot (exit_code=0) [12:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL [12:15:41] !log dcaro@cloudcumin1001 paws START - Cookbook wmcs.openstack.cloudvirt.vm_console [12:16:14] !log dcaro@acme paws START - Cookbook wmcs.vps.instance.force_reboot [12:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL [12:16:16] !log dcaro@acme paws END (PASS) - Cookbook wmcs.vps.instance.force_reboot (exit_code=0) [12:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL [12:16:55] RESOLVED: PawsJupyterHubDown: PAWS JupyterHub is down https://wikitech.wikimedia.org/wiki/PAWS/Admin - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPawsJupyterHubDown [12:17:28] RESOLVED: TargetDown: Job jupyterhub is unreachable in project paws instance hub-paws.wmcloud.org:443 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown [12:20:55] !log dcaro@cloudcumin1001 paws END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0) [12:25:55] (03CR) 10Majavah: vps.instance.force_reboot: add cookbook (031 comment) [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186443 (owner: 10David Caro) [12:30:52] 10PAWS: [paws] 2025-09-09 unexpected downtime - https://phabricator.wikimedia.org/T404076#11162455 (10dcaro) [12:31:56] 10cloud-services-team (FY2025/26-Q1), 10PAWS: [paws] 2025-09-09 unexpected downtime - https://phabricator.wikimedia.org/T404076#11162464 (10dcaro) [12:32:10] 10cloud-services-team (FY2025/26-Q1), 10PAWS: [paws] 2025-09-09 
unexpected downtime - https://phabricator.wikimedia.org/T404076#11162467 (10dcaro) p:05Triage→03High [12:32:32] 10cloud-services-team (FY2025/26-Q1), 10PAWS: [paws] 2025-09-09 unexpected downtime - https://phabricator.wikimedia.org/T404076#11162468 (10dcaro) The cluster came up after force-rebooting the two not ready nodes. [12:46:56] (03update) 10vriaa: feat: Make editor responsive [toolforge-repos/centralnotice-banner-editor] - 10https://gitlab.wikimedia.org/toolforge-repos/centralnotice-banner-editor/-/merge_requests/21 [13:07:15] (03PS3) 10David Caro: global: minor cleanups [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186483 [13:10:01] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.ceph.upgrade_osds (T402190) [13:10:08] T402190: [ceph,eqiad1] upgrade from pacific->quincy - https://phabricator.wikimedia.org/T402190 [13:11:05] (03CR) 10CI reject: [V:04-1] global: minor cleanups [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186483 (owner: 10David Caro) [13:12:30] PROBLEM - Host cloudcephosd1016 is DOWN: PING CRITICAL - Packet loss = 100% [13:13:50] RECOVERY - Host cloudcephosd1016 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [13:14:07] (03approved) 10raymond-ndibe: [registry-admission] add releng registry to local allowedRegistries [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/958 (https://phabricator.wikimedia.org/T394595) [13:14:08] (03merge) 10raymond-ndibe: [registry-admission] add releng registry to local allowedRegistries [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/958 (https://phabricator.wikimedia.org/T394595) [13:14:39] (03approved) 10raymond-ndibe: [cli] add tool config to deployment object [repos/cloud/toolforge/components-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/58 
(https://phabricator.wikimedia.org/T400064) [13:16:27] (03close) 10raymond-ndibe: [DO NOT MERGE] testing gitlab ci changes [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/210 [13:17:10] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [13:22:02] !log dcaro@acme toolsbeta START - Cookbook wmcs.vps.instance.force_reboot vm toolsbeta-cumin-1 (cluster eqiad1, project toolsbeta) [13:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [13:22:15] !log dcaro@acme toolsbeta END (PASS) - Cookbook wmcs.vps.instance.force_reboot (exit_code=0) vm toolsbeta-cumin-1 (cluster eqiad1, project toolsbeta) [13:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL [13:22:40] (03PS5) 10David Caro: vps.instance.force_reboot: add cookbook [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186443 [13:22:40] (03PS4) 10David Caro: global: minor cleanups [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186483 [13:23:36] (03CR) 10David Caro: vps.instance.force_reboot: add cookbook (031 comment) [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186443 (owner: 10David Caro) [13:24:47] (03CR) 10David Caro: vps.instance.force_reboot: add cookbook (031 comment) [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186443 (owner: 10David Caro) [13:26:01] (03CR) 10CI reject: [V:04-1] global: minor cleanups [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186483 (owner: 10David Caro) [13:36:35] (03open) 10vriaa: feat: Persist banners in local storage [toolforge-repos/centralnotice-banner-editor] - 
10https://gitlab.wikimedia.org/toolforge-repos/centralnotice-banner-editor/-/merge_requests/25 [13:47:07] 06cloud-services-team, 10Data-Services, 06Data-Engineering, 06Data-Engineering-Icebox, 10Datasets-General-or-Unknown: Provide dumps using bittorrent - https://phabricator.wikimedia.org/T29653#11162783 (10Broken-Viking) Subscribing to this as I'm the person currently taking on making [[https://academictor... [13:50:34] (03CR) 10Rehan_khan_78: "Please review my new patch." [labs/tools/Isa] - 10https://gerrit.wikimedia.org/r/1181245 (https://phabricator.wikimedia.org/T316197) (owner: 10Rehan_khan_78) [14:01:17] (03update) 10raymond-ndibe: [status] make job status an enum, with clearly defined states [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/208 (https://phabricator.wikimedia.org/T401172) [14:06:13] (03PS5) 10David Caro: global: minor cleanups [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186483 [14:14:26] (03update) 10raymond-ndibe: [status] make job status an enum, with clearly defined states [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/208 (https://phabricator.wikimedia.org/T401172) [14:18:13] (03CR) 10Aklapper: "Please don't add random (?) people as reviewers - thanks!" [labs/tools/Isa] - 10https://gerrit.wikimedia.org/r/1181245 (https://phabricator.wikimedia.org/T316197) (owner: 10Rehan_khan_78) [14:27:07] 06cloud-services-team, 10Data-Services: [wikireplicas] clouddb1015 replication lag when applying ALTER TABLE - https://phabricator.wikimedia.org/T404090 (10fnegri) 03NEW [14:28:17] 10cloud-services-team (FY2025/26-Q1), 10Data-Services: [wikireplicas] clouddb1015 replication lag when applying ALTER TABLE - https://phabricator.wikimedia.org/T404090#11163070 (10fnegri) 05Open→03In progress Replication lag is now going down quite quickly and I expect it to be in sync in a few hours. 
[14:28:25] 10cloud-services-team (FY2025/26-Q1), 10Data-Services: [wikireplicas] clouddb1015 replication lag when applying ALTER TABLE - https://phabricator.wikimedia.org/T404090#11163074 (10fnegri) p:05Triage→03High [14:35:28] FIRING: PuppetStaleCertificates: Found non-revoked Puppet certificates for 1 deleted instances on tools-puppetserver-01 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [14:47:15] 06cloud-services-team, 10Data-Services, 06Data-Persistence, 06Privacy Engineering, and 6 others: Title of suppressed recentchanges entries can be viewed through the wiki replicas - https://phabricator.wikimedia.org/T402283#11163147 (10sbassett) [15:08:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [15:09:19] RESOLVED: ObjectStorageSizeQuotaFull: Object storage quota by 'size' is 81.27% full for project tools - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/ObjectStorageSizeQuotaFull - https://grafana.wikimedia.org/d/7120b794-4638-49f5-bccd-9716efc60f24/wmcs-object-storage-quotas - https://alerts.wikimedia.org/?q=alertname%3DObjectStorageSizeQuotaFull [15:15:23] 10Tool-archive-externa-links, 10Wikidata, 10Wikidata-Gadgets: [Documentation] Change to the import line of code for the ArchiveExternaLinks user script - https://phabricator.wikimedia.org/T404095 (10poro26) 03NEW [15:15:51] PROBLEM - Host cloudcephosd1037 is DOWN: PING CRITICAL - Packet loss = 100% [15:18:21] RECOVERY - Host cloudcephosd1037 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [15:23:31] 10Tool-archive-externa-links, 10Wikidata, 10Wikidata-Gadgets: [Documentation] Change to the 
import line of code for the ArchiveExternaLinks user script - https://phabricator.wikimedia.org/T404095#11163352 (10poro26) 05Open→03Resolved Change made: https://www.wikidata.org/wi... [15:34:29] 10cloud-services-team (FY2025/26-Q1), 10Data-Services: [wikireplicas] clouddb1015 replication lag when applying ALTER TABLE - https://phabricator.wikimedia.org/T404090#11163434 (10fnegri) 05In progress→03Resolved Replication is back in sync. {F65992339} [16:18:17] FIRING: JobUnavailable: Reduced availability for job openstack in cloud@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:18:25] 06cloud-services-team: JobUnavailable Reduced availability for job openstack in cloud@codfw - https://phabricator.wikimedia.org/T404109 (10phaultfinder) 03NEW [16:19:25] (03CR) 10David Caro: global: minor cleanups (031 comment) [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1186483 (owner: 10David Caro) [16:23:17] RESOLVED: JobUnavailable: Reduced availability for job openstack in cloud@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:45:28] FIRING: InstanceDown: Project tools instance tools-prometheus-9 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown [16:47:38] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.ceph.upgrade_osds (exit_code=99) [16:48:17] FIRING: JobUnavailable: Reduced availability for job openstack in cloud@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:48:20] (03open) 10fnegri: Increase quotas for tool cluebotng-review 
[repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/960 (https://phabricator.wikimedia.org/T403964) [16:48:47] 10Toolforge (Quota-requests), 13Patch-For-Review: Request increased quota for cluebotng-review Toolforge tool - https://phabricator.wikimedia.org/T403964#11163959 (10fnegri) 05Open→03In progress p:05Triage→03Medium [16:53:17] RESOLVED: JobUnavailable: Reduced availability for job openstack in cloud@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:09:21] (03approved) 10dcaro: Increase quotas for tool cluebotng-review [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/960 (https://phabricator.wikimedia.org/T403964) (owner: 10fnegri) [18:08:39] RESOLVED: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning [18:54:33] 10Tool-archive-externa-links, 10Wikidata, 10Wikidata-Gadgets: [Documentation] Change to the import line of code for the ArchiveExternaLinks user script - https://phabricator.wikimedia.org/T404095#11164657 (10poro26) [19:08:56] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [19:16:23] 10Tool-quickcategories, 10MediaWiki-Core-AuthManager, 10MediaWiki-extensions-OAuth, 
06MediaWiki-Platform-Team: Several mwapi (Python) based tools are failing to edit: badtoken: Invalid CSRF token. - https://phabricator.wikimedia.org/T403519#11164854 (10matmarex) I set up the OAuth Hello World! app (https:/... [19:30:28] RESOLVED: PuppetStaleCertificates: Found non-revoked Puppet certificates for 1 deleted instances on tools-puppetserver-01 - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/PuppetStaleCertificates - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetStaleCertificates [19:30:31] 10cloud-services-team (FY2025/26-Q1), 10Toolforge (Toolforge iteration 24), 05Goal: [harbor] Move harbor data to object storage service - https://phabricator.wikimedia.org/T350687#11164904 (10Raymond_Ndibe) [19:30:49] 10cloud-services-team (FY2025/26-Q1), 10Toolforge (Toolforge iteration 24), 05Goal: [harbor] Move harbor data to object storage service - https://phabricator.wikimedia.org/T350687#11164906 (10Raymond_Ndibe) 05In progress→03Resolved [19:48:41] RESOLVED: CloudVPSDesignateLeaks: Detected 4 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [20:47:22] FIRING: HAProxyBackendUnavailable: HAProxy service radosgw-api_backend backend cloudcontrol1011.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [20:52:22] RESOLVED: HAProxyBackendUnavailable: HAProxy service radosgw-api_backend backend cloudcontrol1011.private.eqiad.wikimedia.cloud is DOWN - https://wikitech.wikimedia.org/wiki/HAProxy - TODO - https://alerts.wikimedia.org/?q=alertname%3DHAProxyBackendUnavailable [21:39:29] 06Toolforge-standards-committee: Adoption request for fireflytools - https://phabricator.wikimedia.org/T403814#11165389 (10bd808) 
https://en.wikipedia.org/wiki/Wikipedia:Arbitration_Committee/Noticeboard/Archive_14#Resignation_of_Firefly;_effective_Dec_31,_2024 seems likely related. [21:43:31] 10Cloud-VPS (Quota-requests), 10Release-Engineering-Team (Radar): Additional floating IPs for gitlab-cloud-runner testing in testlabs project - https://phabricator.wikimedia.org/T404150 (10dduvall) 03NEW [22:37:24] (03PS1) 10BryanDavis: docker: Simplify Bitu setup instructions [labs/striker] - 10https://gerrit.wikimedia.org/r/1186628 [22:37:24] (03PS1) 10BryanDavis: phab_attach: Give notice when Phabricator account is already in use [labs/striker] - 10https://gerrit.wikimedia.org/r/1186629 (https://phabricator.wikimedia.org/T319500) [22:42:11] 06cloud-services-team, 10Striker, 13Patch-For-Review: Attaching Phabricator account to a second Developer account via Striker results in a fatal error - https://phabricator.wikimedia.org/T319500#11165588 (10bd808) 05Open→03In progress p:05Triage→03Medium a:03bd808 [23:00:50] 10Toolforge (Toolforge iteration 24), 13Patch-For-Review: [lima-kilo] fix permission of tool's home dir - https://phabricator.wikimedia.org/T403513#11165623 (10Raymond_Ndibe) 05Open→03Invalid [23:01:54] 10Toolforge (Toolforge iteration 24), 13Patch-For-Review: [lima-kilo] fix permission of tool's home dir - https://phabricator.wikimedia.org/T403513#11165636 (10Raymond_Ndibe) 05Invalid→03Resolved [23:02:41] 10Toolforge (Toolforge iteration 24): [components-api] Queue builds when the build queue is full - https://phabricator.wikimedia.org/T402568#11165637 (10Raymond_Ndibe) a:03Raymond_Ndibe [23:23:55] (03update) 10vriaa: feat: Make editor responsive [toolforge-repos/centralnotice-banner-editor] - 10https://gitlab.wikimedia.org/toolforge-repos/centralnotice-banner-editor/-/merge_requests/21 [23:40:13] 06cloud-services-team, 10Toolforge, 07Upstream: `webservice shell` sometimes loses output - https://phabricator.wikimedia.org/T403286#11165734 (10bd808) [23:42:58] 06cloud-services-team, 
10Toolforge, 13Patch-For-Review: Missing Perl packages on dev.toolforge.org for anomiebot workflows - https://phabricator.wikimedia.org/T360488#11165738 (10bd808) a:05bd808→03None I need to unlick this cookie for now. If anyone wants to take over https://gitlab.wikimedia.org/toolfo... [23:46:54] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS, 13Patch-For-Review: [tofu-cloudvps] cloudvps_puppet_prefix.hiera settings show dirty diffs based on YAML canonicalization - https://phabricator.wikimedia.org/T398643#11165749 (10bd808) @taavi Can I help you prepare a new release of tofu-cloudvps so that I ca... [23:47:57] 10cloud-services-team (FY2025/26-Q1), 10Cloud-VPS, 07Documentation: [tofu-cloudvps] Document using `cloudvps_puppet_project` to manage project-wide and instance specific puppet classes and hiera settings - https://phabricator.wikimedia.org/T397994#11165750 (10bd808) 05In progress→03Resolved [23:55:34] 10Tool-bridgebot, 07Upstream: Bridgebot freaks out and sends double messages from IRC to Telegram - https://phabricator.wikimedia.org/T305487#11165762 (10bd808) Still waiting on a new upstream release. Things are not looking great at this point with bug reports and PRs accumulating upstream with no sign of the... [23:55:42] 10Toolforge (Toolforge iteration 24): [builds-api, maintain-harbor] fix build/image cleanup - https://phabricator.wikimedia.org/T404157 (10Raymond_Ndibe) 03NEW [23:55:45] 10Tool-bridgebot, 07Upstream: Bridgebot freaks out and sends double messages from IRC to Telegram - https://phabricator.wikimedia.org/T305487#11165775 (10bd808) a:05bd808→03None [23:59:22] 10Wikibugs, 10GitLab (Integrations), 10Release-Engineering-Team (Priority Backlog 📥): Connect WikiBugs IRC bot to Wikimedia GitLab - https://phabricator.wikimedia.org/T288381#11165776 (10bd808) 05Open→03Resolved Let's call this done. {T364615} tracks the biggest missing aspect which itself needs {T36...