[00:41:33] (03update) 10raymond-ndibe: [deployment] add config to deployment [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/112 (https://phabricator.wikimedia.org/T400064) [01:10:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-36 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [01:50:56] FIRING: SystemdUnitDown: The systemd unit kiwix-mirror-update.service on node clouddumps1001 has been failing for more than two hours. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [01:51:07] 06cloud-services-team: SystemdUnitDown The systemd unit kiwix-mirror-update.service on node clouddumps1001 has been failing for more than two hours. - https://phabricator.wikimedia.org/T401363 (10phaultfinder) 03NEW [04:20:56] RESOLVED: SystemdUnitDown: The service unit kiwix-mirror-update.service is in failed status on host clouddumps1001. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [04:20:56] RESOLVED: SystemdUnitDown: The systemd unit kiwix-mirror-update.service on node clouddumps1001 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [06:25:38] 06cloud-services-team, 10Toolforge, 10ISA, 03Wikimania-Hackathon-2025: Request to transfer isa-tool GitHub repository to toolforge organization - https://phabricator.wikimedia.org/T401304#11067184 (10Dactylantha) Apologies for the lack of introduction - we are at Wikimania and are working on changes to the... [06:35:02] 06cloud-services-team, 10Striker: 500 Internal Server Error when trying to access ssh keys on toolsadmin - https://phabricator.wikimedia.org/T401318#11067188 (10Soni) I'm not sure if someone else added my key, or I added it somehow. But my correct key shows up on https://idm.wikimedia.org/keymanagement/ now. I... [07:34:21] 06cloud-services-team, 10Striker: Rebuild Striker demo server - https://phabricator.wikimedia.org/T329687#11067215 (10Aklapper) @Arendpieter: I don't see how this has anything to do with Phab code or settings? [07:57:21] 06cloud-services-team, 10Toolforge, 10ISA, 03Wikimania-Hackathon-2025: Request to transfer isa-tool GitHub repository to toolforge organization - https://phabricator.wikimedia.org/T401304#11067265 (10taavi) Any maintainer of the tool can create a GitLab repository via https://toolsadmin.wikimedia.org/tools... 
[08:05:56] (03open) 10dcaro: logs: use logs-api for logs [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/121 [08:14:27] (03update) 10dcaro: logs: use logs-api for logs [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/121 [08:16:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-67 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [08:27:44] 06cloud-services-team, 10Striker: 500 Internal Server Error when trying to access ssh keys on toolsadmin - https://phabricator.wikimedia.org/T401318#11067394 (10dcaro) Ldap is still giving the old one that starts with `AAAAB4NzaC1yc2EAAAADAQABAAABAQCaBSmjYcMzZ9...` (https://ldap.toolforge.org/user/soni), maybe... [08:44:01] 06cloud-services-team, 10Striker: 500 Internal Server Error when trying to access ssh keys on toolsadmin - https://phabricator.wikimedia.org/T401318#11067442 (10Soni) >>! In T401318#11067188, @Soni wrote: > I'm not sure if someone else added my key, or I added it somehow. But my correct key shows up on https:/... 
[08:47:46] (03update) 10dcaro: global: first commit [repos/cloud/toolforge/logs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/logs-api/-/merge_requests/1 (https://phabricator.wikimedia.org/T127367) [08:50:36] (03CR) 10David Caro: [C:03+2] deploy: skip buster bastion when deploying webservice [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1176291 (owner: 10David Caro) [08:55:40] (03Merged) 10jenkins-bot: deploy: skip buster bastion when deploying webservice [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1176291 (owner: 10David Caro) [09:05:18] 10Toolforge (Toolforge iteration 23), 13Patch-For-Review: Persist important toolforge k8s components logs - https://phabricator.wikimedia.org/T383081#11067508 (10dcaro) 05In progress→03Resolved Will revisit when we decide on {T97861}, this was simple enough and prevents losing relevant logs (it does no... [09:05:41] 10Toolforge (Toolforge iteration 23), 13Patch-For-Review: Persist important toolforge k8s components logs - https://phabricator.wikimedia.org/T383081#11067513 (10dcaro) Also updated the docs at https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Harbor/maintain-harbor [09:29:08] (03update) 10dcaro: logs: use logs-api for logs [repos/cloud/toolforge/jobs-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/121 [09:30:12] (03open) 10dcaro: logs-api: add new component [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/911 [09:30:52] (03update) 10dcaro: logs-api: add new component [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/911 [09:31:00] (03update) 10dcaro: logs-api: add new component [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/911 [09:31:54] (03update) 10dcaro: logs-api: add new component 
[repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/911 [09:35:09] (03update) 10dcaro: Updates from Bryan's use of the deploy process [repos/cloud/toolforge/webservice-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/webservice-cli/-/merge_requests/84 (https://phabricator.wikimedia.org/T400616) (owner: 10bd808) [09:35:36] 10Toolforge (Toolforge iteration 23), 13Patch-For-Review: [jobs-cli,builds-cli,toolforge-cli,components-cli,envvars-cli,webservice-cli] move the packaging scripts to bookworm - https://phabricator.wikimedia.org/T400616#11067568 (10dcaro) [09:40:44] (03approved) 10dcaro: Updates from Bryan's use of the deploy process [repos/cloud/toolforge/webservice-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/webservice-cli/-/merge_requests/84 (https://phabricator.wikimedia.org/T400616) (owner: 10bd808) [09:53:02] 06cloud-services-team, 10Striker: 500 Internal Server Error when trying to access ssh keys on toolsadmin - https://phabricator.wikimedia.org/T401318#11067587 (10taavi) I removed the problematic key from LDAP, pasted here for backup/troubleshooting purposes: ` sshPublicKey: ssh-rsa AAAAB4NzaC1yc2EAAAADAQABAAABA... 
[09:59:01] 10Toolforge (Toolforge iteration 23): [components-api] bump the openapi version on every change - https://phabricator.wikimedia.org/T401374 (10dcaro) 03NEW [10:00:17] 10Toolforge (Toolforge iteration 23), 13Patch-For-Review: [components-api] store the config used for the deployment in the deployment themselves - https://phabricator.wikimedia.org/T400064#11067613 (10dcaro) [10:11:01] (03open) 10taavi: Drop outdated cmd-checklist tests [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/194 [10:16:35] (03approved) 10dcaro: [deployment] add config to deployment [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/112 (https://phabricator.wikimedia.org/T400064) (owner: 10raymond-ndibe) [10:16:58] (03update) 10dcaro: [deployment] add config to deployment [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/112 (https://phabricator.wikimedia.org/T400064) (owner: 10raymond-ndibe) [10:17:28] (03update) 10dcaro: global: first commit [repos/cloud/toolforge/logs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/logs-api/-/merge_requests/1 (https://phabricator.wikimedia.org/T127367) [10:17:38] (03update) 10dcaro: logs-api: add new component [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/911 [10:17:46] (03update) 10dcaro: logs_api: add the option to enable logs-api [repos/cloud/toolforge/api-gateway] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/75 [10:20:15] (03update) 10dcaro: runtime.k8s.image: periodically refresh image-config data [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/160 (https://phabricator.wikimedia.org/T357112) (owner: 10raymond-ndibe) [10:20:57] 
(03approved) 10dcaro: Drop outdated cmd-checklist tests [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/194 (owner: 10taavi) [10:22:48] (03update) 10taavi: Drop outdated cmd-checklist tests [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/194 [10:32:46] (03merge) 10taavi: Drop outdated cmd-checklist tests [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/194 [10:35:27] (03open) 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620: jobs-api: bump to 0.0.396-20250807103254-610e60b8 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/912 [10:36:39] !log taavi@cloudcumin1001 toolsbeta START - Cookbook wmcs.toolforge.component.deploy for component jobs-api [10:37:16] !log taavi@cloudcumin1001 toolsbeta END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api [10:38:05] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.component.deploy for component jobs-api [10:38:36] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-api [10:38:53] (03merge) 10taavi: jobs-api: bump to 0.0.396-20250807103254-610e60b8 [repos/cloud/toolforge/toolforge-deploy] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/912 (owner: 10group_203_bot_f4d95069bb2675e4ce1fff090c1c1620) [10:52:56] FIRING: SystemdUnitDown: The service unit kiwix-mirror-update.service is in failed status on host clouddumps1002. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1002 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [11:16:47] 06cloud-services-team, 10Striker: 500 Internal Server Error when trying to access ssh keys on toolsadmin - https://phabricator.wikimedia.org/T401318#11067831 (10Soni) It works now! Added my key through https://toolsadmin.wikimedia.org/profile/settings/ssh-keys/ and now I can ssh. Thank you! I suspect there'... [11:25:48] 06cloud-services-team, 10Toolforge, 10ISA, 03Wikimania-Hackathon-2025: Request to transfer isa-tool GitHub repository to toolforge organization - https://phabricator.wikimedia.org/T401304#11067849 (10Dactylantha) I think we have resolved this - thank you! [11:25:50] 06cloud-services-team, 10Striker: 500 Internal Server Error when trying to access ssh keys on toolsadmin - https://phabricator.wikimedia.org/T401318#11067850 (10SLyngshede-WMF) @Soni I suspect that you are correct, ssh-keygen even claims that it's not a valid public key. [11:26:05] 06cloud-services-team, 10Toolforge, 10ISA, 03Wikimania-Hackathon-2025: Request to transfer isa-tool GitHub repository to toolforge organization - https://phabricator.wikimedia.org/T401304#11067852 (10Dactylantha) 05Open→03Resolved a:03Dactylantha [12:12:18] 06cloud-services-team, 10Toolforge: `toolforge jobs logs` appears to break on utf-8 characters - https://phabricator.wikimedia.org/T401242#11067995 (10taavi) Is this still happening with the backend migration to Loki done? [12:25:13] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-eqiad, 06SRE: hw troubleshooting: disk sdj failure for cloudcephosd1013.eqiad.wmnet - https://phabricator.wikimedia.org/T401319#11068026 (10Jclark-ctr) 05Open→03Resolved @fnegri i have removed failed drive. I have installed drive from decom serve... 
[12:25:45] (03update) 10l10n-bot: Localisation updates from https://translatewiki.net. [toolforge-repos/wd-image-positions] - 10https://gitlab.wikimedia.org/toolforge-repos/wd-image-positions/-/merge_requests/42 [12:25:55] (03open) 10l10n-bot: Localisation updates from https://translatewiki.net. [toolforge-repos/ranker] - 10https://gitlab.wikimedia.org/toolforge-repos/ranker/-/merge_requests/23 [12:25:57] (03open) 10l10n-bot: Localisation updates from https://translatewiki.net. [toolforge-repos/lexeme-forms] - 10https://gitlab.wikimedia.org/toolforge-repos/lexeme-forms/-/merge_requests/8 [12:27:48] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [labs/tools/commons-mass-description] - 10https://gerrit.wikimedia.org/r/1176460 (owner: 10L10n-bot) [12:27:49] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [labs/tools/massmailer] - 10https://gerrit.wikimedia.org/r/1176464 (owner: 10L10n-bot) [12:34:22] 06cloud-services-team, 10Toolforge: `toolforge jobs logs` appears to break on utf-8 characters - https://phabricator.wikimedia.org/T401242#11068057 (10DamianZaremba) I am currently seeing logs from `--follow` (was nothing before). I will let it run while I have lunch and if it doesn't crash them we can close... [12:46:36] 06cloud-services-team, 10Toolforge: toolforge components - support providing ref in webhook - https://phabricator.wikimedia.org/T401388 (10DamianZaremba) 03NEW [12:47:56] FIRING: SystemdUnitDown: The systemd unit kiwix-mirror-update.service on node clouddumps1002 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1002 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [12:48:03] 06cloud-services-team: SystemdUnitDown The systemd unit kiwix-mirror-update.service on node clouddumps1002 has been failing for more than two hours. - https://phabricator.wikimedia.org/T401389 (10phaultfinder) 03NEW [12:54:07] 06cloud-services-team, 10Toolforge: toolforge components - support providing ref in webhook - https://phabricator.wikimedia.org/T401388#11068145 (10taavi) Note that this has a non-obvious security implication: Currently deployment tokens are relatively harmless as they only let you trigger a new build off of t... [13:10:19] 06cloud-services-team, 10Toolforge: Logs that are present in `--follow` are missing when not `--follow` - https://phabricator.wikimedia.org/T401244#11068197 (10DamianZaremba) 05Open→03Resolved a:03DamianZaremba After T400916 there is only 1 backend now so this shouldn't be possible. Closing for now. [13:14:26] 06cloud-services-team, 10Toolforge: toolforge components - support providing ref in webhook - https://phabricator.wikimedia.org/T401388#11068231 (10DamianZaremba) This is a good point regarding refs, however the same is true if for example an SSH key or LDAP password was compromised. Having other authenticati... [13:17:56] FIRING: SystemdUnitDown: The service unit backup_glance_images.service is in failed status on host cloudbackup1003. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [13:18:43] 06cloud-services-team, 10Toolforge: `toolforge jobs logs` appears to break on utf-8 characters - https://phabricator.wikimedia.org/T401242#11068243 (10DamianZaremba) It has been running for ~45min without a crash or hang, so this appears to be resolved. [13:18:50] 06cloud-services-team, 10Toolforge: `toolforge jobs logs` appears to break on utf-8 characters - https://phabricator.wikimedia.org/T401242#11068244 (10DamianZaremba) 05Open→03Resolved a:03DamianZaremba [13:38:41] FIRING: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [13:52:56] RESOLVED: SystemdUnitDown: The service unit backup_glance_images.service is in failed status on host cloudbackup1003. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=cloudbackup1003 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [14:12:17] 06cloud-services-team, 10Cloud-VPS: cloudbackup100[34] still do not actually do 'backy2 cleanup - https://phabricator.wikimedia.org/T394618#11068437 (10Andrew) [14:13:42] 06cloud-services-team, 10Cloud-VPS: cloudbackup100[34] still do not actually do 'backy2 cleanup - https://phabricator.wikimedia.org/T394618#11068445 (10Andrew) [14:18:41] RESOLVED: CloudVPSDesignateLeaks: Detected 3 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [14:20:48] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-67 [14:26:46] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-67 [14:33:34] (03update) 10raymond-ndibe: runtime.k8s.image: periodically refresh image-config data [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/160 (https://phabricator.wikimedia.org/T357112) [14:34:51] (03update) 10raymond-ndibe: runtime.k8s.image: periodically refresh image-config data [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/160 (https://phabricator.wikimedia.org/T357112) [14:36:06] (03update) 10raymond-ndibe: runtime.k8s.image: periodically refresh image-config data [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/160 (https://phabricator.wikimedia.org/T357112) [14:37:54] 06cloud-services-team, 10Toolforge, 
05Cloud-Services-Origin-Team, 07Cloud-Services-Worktype-Maintenance: [wmcs-cookbooks,toolforge,nfs] automate cleanup of D state webservices by deleting the stuck pod - https://phabricator.wikimedia.org/T348662#11068550 (10Andrew) [14:39:09] (03update) 10raymond-ndibe: runtime.k8s.image: periodically refresh image-config data [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/160 (https://phabricator.wikimedia.org/T357112) [14:40:20] (03approved) 10dcaro: d/changelog: bump to 0.3.7 [repos/cloud/toolforge/toolforge-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-cli/-/merge_requests/50 (owner: 10raymond-ndibe) [14:40:35] (03approved) 10dcaro: d/changelog: bump to 0.0.23 [repos/cloud/toolforge/builds-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/merge_requests/117 (owner: 10raymond-ndibe) [14:41:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-67 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [14:41:14] (03unapproved) 10dcaro: d/changelog: bump to 0.3.7 [repos/cloud/toolforge/toolforge-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-cli/-/merge_requests/50 (owner: 10raymond-ndibe) [14:41:33] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-67 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [14:43:55] (03update) 10raymond-ndibe: runtime.k8s.image: periodically refresh image-config data [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/160 (https://phabricator.wikimedia.org/T357112) [14:46:33] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-67 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [14:47:33] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-67 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [14:52:33] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-67 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [14:55:43] (03open) 10dcaro: pre-commit: add check for openapi spec version bump [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/116 [14:57:03] (03update) 10dcaro: pre-commit: add check for openapi spec version bump 
[repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/116 [14:57:25] (03update) 10dcaro: pre-commit: add check for openapi spec version bump [repos/cloud/toolforge/components-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/116 [15:03:11] (03approved) 10dcaro: d/changelog: bump to 0.3.7 [repos/cloud/toolforge/toolforge-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-cli/-/merge_requests/50 (owner: 10raymond-ndibe) [15:28:09] (03merge) 10bd808: Updates from Bryan's use of the deploy process [repos/cloud/toolforge/webservice-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/webservice-cli/-/merge_requests/84 (https://phabricator.wikimedia.org/T400616) [16:12:41] 06cloud-services-team, 10Toolforge: Ease `toolforge_weld` usage from within pods - https://phabricator.wikimedia.org/T401065#11069046 (10DamianZaremba) The cert does appear to get rotated? The trigger job started failing with: ` Exception: Failed to create coord-legacy-report-interface-import: [400]
RESOLVED: SystemdUnitDown: The systemd unit kiwix-mirror-update.service on node clouddumps1002 has been failing for more than two hours. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/SystemdUnitDown - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=clouddumps1002 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitDown [16:28:18] (03open) 10bd808: build_deb fixes [repos/cloud/toolforge/webservice-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/webservice-cli/-/merge_requests/85 [16:28:23] (03update) 10bd808: build_deb fixes [repos/cloud/toolforge/webservice-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/webservice-cli/-/merge_requests/85 [16:28:55] (03update) 10bd808: Updates from Bryan's use of the deploy process [repos/cloud/toolforge/webservice-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/webservice-cli/-/merge_requests/84 (https://phabricator.wikimedia.org/T400616) [16:29:18] (03update) 10bd808: build_deb fixes [repos/cloud/toolforge/webservice-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/webservice-cli/-/merge_requests/85 [16:29:28] 06cloud-services-team, 10Toolforge: Ease `toolforge_weld` usage from within pods - https://phabricator.wikimedia.org/T401065#11069138 (10dcaro) Note that toolforge_weld was created to run for internal toolforge service, so it's not meant to be used by users (that), it's internal to toolforge and might get chan... [16:31:20] (03update) 10dcaro: build_deb fixes [repos/cloud/toolforge/webservice-cli] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/webservice-cli/-/merge_requests/85 (https://phabricator.wikimedia.org/T400616) (owner: 10bd808) [16:33:31] 06cloud-services-team, 10Toolforge: Ease `toolforge_weld` usage from within pods - https://phabricator.wikimedia.org/T401065#11069159 (10DamianZaremba) I'm fine with using `requests` directly and handling what the shim layer does directly. Perhaps there should be a note on https://pypi.org/project/toolforge-w... 
[16:34:35] 06cloud-services-team, 10Toolforge: Ease toolforge api usage from within pods - https://phabricator.wikimedia.org/T401065#11069162 (10DamianZaremba) [16:35:35] 06cloud-services-team, 10Toolforge: Ease toolforge api usage from within pods - https://phabricator.wikimedia.org/T401065#11069165 (10dcaro) > Perhaps there should be a note on https://pypi.org/project/toolforge-weld/ if this is not intended to be user facing. Yep, we can make it clearer, `Shared Python code... [16:36:35] 06cloud-services-team, 10Toolforge: Ease toolforge api usage from within pods - https://phabricator.wikimedia.org/T401065#11069177 (10DamianZaremba) [16:38:36] (03open) 10dcaro: readme: add note about potential backwards incompatibility [repos/cloud/toolforge/toolforge-weld] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/86 [16:38:42] (03update) 10dcaro: readme: add note about potential backwards incompatibility [repos/cloud/toolforge/toolforge-weld] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/86 [16:38:57] (03update) 10dcaro: readme: add note about potential backwards incompatibility [repos/cloud/toolforge/toolforge-weld] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/86 [16:39:23] 06cloud-services-team, 10Toolforge: Ease toolforge api usage from within pods - https://phabricator.wikimedia.org/T401065#11069188 (10DamianZaremba) [16:41:50] 06cloud-services-team, 10Toolforge: Ease toolforge api usage from within pods - https://phabricator.wikimedia.org/T401065#11069222 (10taavi) →14Duplicate dup:03T321919 [16:42:03] 06cloud-services-team, 10Toolforge, 07Documentation, 07Kubernetes: Figure out and document how to call the Kubernetes API as your tool user from inside a pod - https://phabricator.wikimedia.org/T321919#11069224 (10taavi) [17:00:54] 06cloud-services-team, 10Toolforge: toolforge jobs logs api returns 404 on no log entries - 
https://phabricator.wikimedia.org/T401420 (10DamianZaremba) 03NEW [17:11:28] (03update) 10dcaro: runtime.k8s.image: periodically refresh image-config data [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/160 (https://phabricator.wikimedia.org/T357112) (owner: 10raymond-ndibe) [17:12:23] 06cloud-services-team, 10Toolforge: TjfCliError - toolforge jobs logs broken - https://phabricator.wikimedia.org/T401422 (10DamianZaremba) 03NEW [17:13:42] 06cloud-services-team, 10Toolforge: TjfCliError - toolforge jobs logs broken - https://phabricator.wikimedia.org/T401422#11069304 (10DamianZaremba) ClueBot II handles quite large pages on enwiki, I suspect one of the log entries (containing page data) is larger than some limit in the API.... if my guess is cor... [17:19:22] (03update) 10dcaro: runtime.k8s.image: periodically refresh image-config data [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/160 (https://phabricator.wikimedia.org/T357112) (owner: 10raymond-ndibe) [17:34:29] (03update) 10dcaro: runtime.k8s.image: periodically refresh image-config data [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/160 (https://phabricator.wikimedia.org/T357112) (owner: 10raymond-ndibe) [17:40:56] (03update) 10dcaro: runtime.k8s.image: periodically refresh image-config data [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/160 (https://phabricator.wikimedia.org/T357112) (owner: 10raymond-ndibe) [17:53:56] (03update) 10dcaro: runtime.k8s.image: periodically refresh image-config data [repos/cloud/toolforge/jobs-api] - 10https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/160 (https://phabricator.wikimedia.org/T357112) (owner: 10raymond-ndibe) [19:48:05] (03approved) 10lucaswerkmeister: Localisation updates from 
https://translatewiki.net. [toolforge-repos/ranker] - 10https://gitlab.wikimedia.org/toolforge-repos/ranker/-/merge_requests/23 (owner: 10l10n-bot) [19:48:11] (03merge) 10lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/ranker] - 10https://gitlab.wikimedia.org/toolforge-repos/ranker/-/merge_requests/23 (owner: 10l10n-bot) [19:49:26] (03approved) 10lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/lexeme-forms] - 10https://gitlab.wikimedia.org/toolforge-repos/lexeme-forms/-/merge_requests/8 (owner: 10l10n-bot) [19:49:28] (03merge) 10lucaswerkmeister: Localisation updates from https://translatewiki.net. [toolforge-repos/lexeme-forms] - 10https://gitlab.wikimedia.org/toolforge-repos/lexeme-forms/-/merge_requests/8 (owner: 10l10n-bot) [22:11:16] 10Tool-gitlab-account-approval: Approval job can get stuck and prevent subsequent jobs from firing - https://phabricator.wikimedia.org/T379130#11070009 (10bd808) 05Open→03Resolved a:03bd808 This seems to have been fixed by {T306391} and `timeout: 150` in the job specification. [22:13:11] 10Tool-gitlab-account-approval: Investigate OAuth 2 Resource owner password credentials flow as a replacement for Personal Access Token auth - https://phabricator.wikimedia.org/T358134#11070014 (10bd808) 05Open→03Declined Upstream eventually realized that forcing all PATs to have an expiration unless you... [23:19:19] 06cloud-services-team, 10Cloud-VPS: [tofu-cloudvps] cloudvps_puppet_prefix.hiera settings show dirty diffs based on YAML canonicalization - https://phabricator.wikimedia.org/T398643#11070064 (10bd808) I can haz a reproduction! `counterexample │ Error: Provider produced inconsistent result after apply │ │ When... 
[23:35:50] 06cloud-services-team, 10Cloud-VPS: [tofu-cloudvps] cloudvps_puppet_prefix.hiera settings show dirty diffs based on YAML canonicalization - https://phabricator.wikimedia.org/T398643#11070105 (10bd808) Once a tainted `cloudvps_puppet_prefix` node is in the state all subsequent `tofu plan` stages want to replace...