[00:11:28] (PuppetAgentStaleLastRun) firing: Last Puppet run was over 24 hours ago on instance tf-infra-test in project tf-infra-test - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[00:15:28] (InstanceDown) firing: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:16:28] (PuppetAgentStaleLastRun) resolved: Last Puppet run was over 24 hours ago on instance tf-infra-test in project tf-infra-test - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentStaleLastRun
[00:20:28] (InstanceDown) resolved: Project tf-infra-test instance tf-infra-test is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[00:42:41] (CloudVPSDesignateLeaks) firing: (2) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[00:47:41] (CloudVPSDesignateLeaks) firing: (3) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[00:52:41] (CloudVPSDesignateLeaks) firing: (3) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[00:57:41] (CloudVPSDesignateLeaks) resolved: (3) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[02:59:04] Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-api,jobs-cli] Support job health checks - https://phabricator.wikimedia.org/T335592#9716203 (Raymond_Ndibe) marking as resolved. We can open it again if anyone disagrees
[02:59:19] Toolforge (Toolforge iteration 08), Patch-For-Review: [builds-api] replace all error message models with ResponseMessages - https://phabricator.wikimedia.org/T361901#9716201 (Raymond_Ndibe) In progress→Resolved
[02:59:41] Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-api,jobs-cli] Support job health checks - https://phabricator.wikimedia.org/T335592#9716205 (Raymond_Ndibe) In progress→Resolved
[03:00:52] Toolforge (Toolforge iteration 08): [toolforge-cd] remove duplicated run on tag and push to master (just do one if possible) - https://phabricator.wikimedia.org/T353563#9716209 (Raymond_Ndibe) In progress→Resolved
[05:14:41] (CloudVPSDesignateLeaks) firing: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[05:24:41] (CloudVPSDesignateLeaks) resolved: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[06:14:41] (CloudVPSDesignateLeaks) firing: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[06:24:41] (CloudVPSDesignateLeaks) resolved: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[07:25:08] cloud-services-team (FY2023/2024-Q3-Q4), Data-Services, Quarry: Create db user for Quarry with readonly access to public ToolsDB databases - https://phabricator.wikimedia.org/T348407#9716536 (KCVelaga_WMF) @fnegri any approximate on when this might be prioritized. This will be very helpful for creati...
[08:11:24] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T362613 (Curb_Safe_Charmer) NEW
[08:15:33] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T362613#9716668 (Curb_Safe_Charmer) Open→In progress
[08:15:45] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T362613#9716670 (Curb_Safe_Charmer) p:Triage→Medium
[08:16:01] cloud-services-team, Toolforge (Toolforge iteration 08): Harbor uploads sometimes fail due to tmpfs space on project-proxy - https://phabricator.wikimedia.org/T354116#9716667 (dcaro) Today it got full again: ` root@proxy-03:~# df -h /var/lib/nginx...
[08:16:04] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T362613#9716671 (Curb_Safe_Charmer) Tried restarting refill-api web service, but no change.
[08:22:39] cloud-services-team, Striker, Data-Persistence-Backup, DBA, Patch-For-Review: Create a database for Striker test instance - https://phabricator.wikimedia.org/T360149#9716676 (ABran-WMF) Open→Resolved
[08:28:36] (CR) Muehlenhoff: [C:+1] delete graphite.discovery.wmnet dummy key [labs/private] - https://gerrit.wikimedia.org/r/1019889 (https://phabricator.wikimedia.org/T360414) (owner: Dzahn)
[08:29:57] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T362613#9716691 (Curb_Safe_Charmer) Following steps that @TheresNoTime left in T359159 to identify problem with pod.
[08:35:37] cloud-services-team (FY2023/2024-Q3-Q4), Data-Services, Quarry: Create db user for Quarry with readonly access to public ToolsDB databases - https://phabricator.wikimedia.org/T348407#9716712 (dcaro) @fnegri just verifying, the `quarry_readonly` user only has to have access to the public databases (no...
[08:39:15] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T362613#9716737 (Curb_Safe_Charmer) tools.refill-api@tools-sgebastion-10:~$ kubectl get pods NAME READY STATUS RESTARTS AGE refill-api-79cc6f87b4-92p9k...
[08:41:24] Toolforge (Toolforge iteration 08), Patch-For-Review: [infra,builds-builder] "failed to create fsnotify watcher: too many open files" - https://phabricator.wikimedia.org/T361519#9716740 (dcaro) I'll close this for now, but please re-open (or create a new task) if the issues come back
[08:42:01] Toolforge (Toolforge iteration 08), Patch-For-Review: [infra,builds-builder] "failed to create fsnotify watcher: too many open files" - https://phabricator.wikimedia.org/T361519#9716742 (dcaro) In progress→Resolved
[08:48:30] cloud-services-team, Infrastructure-Foundations, SRE, vm-requests, Patch-For-Review: Site: 1 VM for codfw1dev bitu deployment - https://phabricator.wikimedia.org/T362128#9716756 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1002 for host cloudidm2001...
[08:51:27] Wikibugs: Hashar does not like grey foreground color for distinguishing closed status events - https://phabricator.wikimedia.org/T360353#9716792 (Peachey88) I'm not the biggest fan of the strike though as it makes the text harder to read (probably worse than the grey, but that might be a personal prefere...
[08:53:04] cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS, Cumin, Infrastructure-Foundations, Patch-For-Review: [cumin] [openstack] Openstack backend fails when project is not set - https://phabricator.wikimedia.org/T346453#9716800 (Volans)
[08:54:31] !log taavi@cloudcumin1001 tools START - Cookbook wmcs.toolforge.apt.copy_to_main_repo for package 'python3-toolforge-weld' version '1.5.0'
[08:54:41] !log taavi@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.apt.copy_to_main_repo (exit_code=0) for package 'python3-toolforge-weld' version '1.5.0'
[08:55:22] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T362613#9716806 (Curb_Safe_Charmer) Used kubectl scale to terminate all pods. No change - still 'Waiting for an available worker.' @TheresNoTime morning Sammy, suggestions on what to try next, or are you aro...
[08:55:44] cloud-services-team, Cloud-VPS, Cumin, Infrastructure-Foundations, Patch-For-Review: Cumin/Openstack: multi-project commands are extremely slow - https://phabricator.wikimedia.org/T325773#9716795 (Volans) Merging this with T346453 as the testing plan outlined in T346453#9713036 will cover...
[08:55:49] cloud-services-team, Cloud-VPS, Cumin, Infrastructure-Foundations, Patch-For-Review: Cumin/Openstack: multi-project commands are extremely slow - https://phabricator.wikimedia.org/T325773#9716798 (Volans) →Duplicate dup:T346453
[08:58:40] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T362613#9716815 (dcaro) @Curb_Safe_Charmer I don't see anything wrong with the k8s side, the tool pod is up and running, responding on https://refill-api.toolforge.org/, so can you elaborate on the error you...
[09:02:55] cloud-services-team, Striker, Data-Persistence-Backup, DBA, Patch-For-Review: Create a database for Striker test instance - https://phabricator.wikimedia.org/T360149#9716836 (ABran-WMF) Resolved→Open @taavi would you be OK to drop & recreate that db so we discard the underscore?
[09:06:11] cloud-services-team, Striker, Data-Persistence-Backup, DBA, Patch-For-Review: Create a database for Striker test instance - https://phabricator.wikimedia.org/T360149#9716871 (taavi) Sure, just let me know what the new name is so I can update the app config.
[09:06:30] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T362613#9716877 (Curb_Safe_Charmer) @dcaro it has started working now! Not sure why.
[09:07:37] cloud-services-team, Striker, Data-Persistence-Backup, DBA, Patch-For-Review: Create a database for Striker test instance - https://phabricator.wikimedia.org/T360149#9716879 (Marostegui) >>! In T360149#9716836, @ABran-WMF wrote: > @taavi would you be OK to drop & recreate that db so we discar...
[09:08:34] Toolforge: Support HTTP health checks in jobs framework - https://phabricator.wikimedia.org/T362621 (taavi) NEW p:Triage→High
[09:08:44] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T362613#9716910 (dcaro) xd, it's a [[ https://en.wikipedia.org/wiki/Heisenbug | heisenbug ]]!
[09:08:46] Toolforge: Support HTTP health checks in jobs framework - https://phabricator.wikimedia.org/T362621#9716912 (taavi) p:High→Triage
[09:08:57] Toolforge: Support HTTP health checks in jobs framework - https://phabricator.wikimedia.org/T362621#9716915 (taavi)
[09:09:04] Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-api,jobs-cli] Support job health checks - https://phabricator.wikimedia.org/T335592#9716914 (taavi)
[09:09:05] Toolforge, Epic: [jobs-api,webservice] Run webservices via the jobs framework - https://phabricator.wikimedia.org/T348755#9716916 (taavi)
[09:21:16] cloud-services-team, Infrastructure-Foundations, SRE, vm-requests, Patch-For-Review: Site: 1 VM for codfw1dev bitu deployment - https://phabricator.wikimedia.org/T362128#9716947 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1002 for host cloudidm2001-dev...
[09:29:51] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T362613#9716981 (Curb_Safe_Charmer) In progress→Resolved
[09:30:00] cloud-services-team, Cumin, Infrastructure-Foundations: Cumin: create external backend for WMCS Puppet API - https://phabricator.wikimedia.org/T179816#9716983 (Volans)
[09:35:27] Toolforge (Toolforge iteration 08): [cicd,infra] pre-cache all the pre-commit hooks - https://phabricator.wikimedia.org/T362314#9717020 (dcaro) Note also {T362314}, that is impacting the times to build things. Anyhow, this has been alleviated in two ways: * Creating a ci runner image that pre-caches the pre...
[09:35:51] Toolforge (Toolforge iteration 08): [cicd,infra] pre-cache all the pre-commit hooks - https://phabricator.wikimedia.org/T362314#9717022 (dcaro) In progress→Resolved
[09:36:14] Toolforge (Toolforge iteration 08): I can't connect to Toolforge DB replicas from my PC using MySQL Workbench - https://phabricator.wikimedia.org/T360839#9717023 (dcaro) @MBH were you able to setup the tunnel properly?
[09:39:33] cloud-services-team, Toolforge (Toolforge iteration 08): Harbor uploads sometimes fail due to tmpfs space on project-proxy - https://phabricator.wikimedia.org/T354116#9717049 (dcaro) a:dcaro
[09:40:14] cloud-services-team, Toolforge (Toolforge iteration 08): Harbor uploads sometimes fail due to tmpfs space on project-proxy - https://phabricator.wikimedia.org/T354116#9717051 (dcaro) Open→In progress
[09:40:18] Toolforge (Toolforge iteration 08), Patch-For-Review: [infra] Add alert when workers have a sustained large amount of D processes - https://phabricator.wikimedia.org/T362093#9717057 (CodeReviewBot) dcaro merged https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/11 kubernetes: add...
[09:42:41] (CloudVPSDesignateLeaks) firing: (2) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:47:41] (CloudVPSDesignateLeaks) firing: (3) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:52:41] (CloudVPSDesignateLeaks) firing: (3) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:56:02] cloud-services-team, Toolforge, Cumin, Infrastructure-Foundations: Allow interacting with Toolforge PuppetDB from wmcs-cookbooks - https://phabricator.wikimedia.org/T362629 (taavi) NEW
[09:57:41] (CloudVPSDesignateLeaks) resolved: (3) Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:58:13] (CR) David Caro: [C:+1] "LGTM" [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1018272 (https://phabricator.wikimedia.org/T360932) (owner: Majavah)
[09:58:54] (CR) Majavah: [C:+2] vps: remove_instance: Silence metricsinfra alerts before deleting [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1018272 (https://phabricator.wikimedia.org/T360932) (owner: Majavah)
[10:01:16] Tool-refill: Refill tool stuck "waiting for an available worker" - https://phabricator.wikimedia.org/T362613#9717171 (TheresNoTime) Glad its resolved @Curb_Safe_Charmer :-) I'll do some digging a bit later to try to figure out what happened this time
[10:02:06] (Merged) jenkins-bot: vps: remove_instance: Silence metricsinfra alerts before deleting [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1018272 (https://phabricator.wikimedia.org/T360932) (owner: Majavah)
[10:06:25] (CR) FNegri: [C:+1] build: Remove explicit types-requests pin [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1019728 (owner: Majavah)
[10:07:03] (CR) Majavah: [C:+2] build: Require Spicerack 8.5 or newer [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1019723 (owner: Majavah)
[10:07:06] (CR) Majavah: [C:+2] build: Remove explicit types-requests pin [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1019728 (owner: Majavah)
[10:10:18] (Merged) jenkins-bot: build: Require Spicerack 8.5 or newer [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1019723 (owner: Majavah)
[10:10:19] (Merged) jenkins-bot: build: Remove explicit types-requests pin [cloud/wmcs-cookbooks] - https://gerrit.wikimedia.org/r/1019728 (owner: Majavah)
[10:14:41] Toolforge (Toolforge iteration 08): I can't connect to Toolforge DB replicas from my PC using MySQL Workbench - https://phabricator.wikimedia.org/T360839#9717211 (MBH) I don't and I won't be able to try until May 6, I'm not at home.
[10:15:55] Toolforge (Toolforge iteration 08): I can't connect to Toolforge DB replicas from my PC using MySQL Workbench - https://phabricator.wikimedia.org/T360839#9717219 (dcaro) okok, let me know when you do :)
[10:30:46] cloud-services-team (FY2023/2024-Q3-Q4), Data-Services, Quarry: Create db user for Quarry with readonly access to public ToolsDB databases - https://phabricator.wikimedia.org/T348407#9717305 (fnegri) @KCVelaga_WMF I'm sorry there was no progress on this so far, it is still in my backlog. I plan to fi...
[10:45:03] (ToolforgeKubernetesWorkerTooManyDProcesses) firing: Kubernetes worker tools-k8s-worker-nfs-1 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[11:07:33] !log dcaro@urcuchillay tools START - Cookbook wmcs.toolforge.k8s.worker.drain for node tools-k8s-worker-nfs-1
[11:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:08:17] !log dcaro@urcuchillay tools END (PASS) - Cookbook wmcs.toolforge.k8s.worker.drain (exit_code=0) for node tools-k8s-worker-nfs-1
[11:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:55:03] (ToolforgeKubernetesWorkerTooManyDProcesses) resolved: Kubernetes worker tools-k8s-worker-nfs-1 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[12:07:09] cloud-services-team (FY2023/2024-Q3-Q4), Data-Services, Quarry: Create db user for Quarry with readonly access to public ToolsDB databases - https://phabricator.wikimedia.org/T348407#9717627 (KCVelaga_WMF) @fnegri that'll be amazing, thank you! Also, a quick question, does this also enable [[ https:/...
[12:09:07] cloud-services-team, Cloud-VPS, Toolforge: Taavi knowledge transfer: cloud-vps monitoring - https://phabricator.wikimedia.org/T362452#9717632 (dcaro)
[12:15:42] cloud-services-team (FY2023/2024-Q3-Q4), Data-Services, Quarry: Create db user for Quarry with readonly access to public ToolsDB databases - https://phabricator.wikimedia.org/T348407#9717683 (fnegri) Superset must be configured separately, but it can reuse the same credentials.
[12:23:18] cloud-services-team (FY2023/2024-Q3-Q4), Data-Services, Quarry: Create db user for Quarry with readonly access to public ToolsDB databases - https://phabricator.wikimedia.org/T348407#9717714 (dcaro) >>! In T348407#9717305, @fnegri wrote: > @KCVelaga_WMF I'm sorry there was no progress on this so far,...
[12:27:16] Toolforge: [jobs-cli,components-api] Provide YAML schema file for toolforge-jobs definition files - https://phabricator.wikimedia.org/T314729#9717744 (dcaro)
[12:27:23] Cloud Services Proposals, cloud-services-team (FY2023/2024-Q3-Q4), Toolforge, Cloud-Services-Origin-Team, and 3 others: [Epic,builds-api,component-api,webservice,jobs-api] Make Toolforge a proper platform as a service with push-to-deploy and build p... - https://phabricator.wikimedia.org/T194332#9717745
[12:27:30] Cloud Services Proposals, cloud-services-team (FY2023/2024-Q3-Q4), Toolforge, Cloud-Services-Origin-Team, and 3 others: [Epic,builds-api,components-api,webservice,jobs-api] Make Toolforge a proper platform as a service with push-to-deploy and build ... - https://phabricator.wikimedia.org/T194332#9717746
[12:28:41] Toolforge, Epic: [components-api] First iteration of the component API - https://phabricator.wikimedia.org/T362051#9717752 (dcaro)
[12:28:53] Toolforge: [components-api] add one-off, scheduled and continuous jobs support to the yaml + api - https://phabricator.wikimedia.org/T362075#9717754 (dcaro)
[12:28:58] Toolforge: [components-api] Get a minimal version of the config with build-only data - https://phabricator.wikimedia.org/T362070#9717767 (dcaro)
[12:29:07] Toolforge: [components-api] Get a skeleton of API webservice and implement `/tool//deploy` with build-only features - https://phabricator.wikimedia.org/T362069#9717768 (dcaro)
[12:29:16] Toolforge: [components-api] Develop the webhook mechanism to trigger a deployment - https://phabricator.wikimedia.org/T362066#9717769 (dcaro)
[12:29:35] Toolforge: [components-api] Extend the list of build triggers (unrefined) - https://phabricator.wikimedia.org/T362071#9717771 (dcaro)
[12:29:45] Toolforge: [components-api] Add support for non-public services - https://phabricator.wikimedia.org/T362072#9717772 (dcaro)
[12:29:50] Toolforge: [components-api] Add support for pre-build images (to refine) - https://phabricator.wikimedia.org/T362076#9717774 (dcaro)
[12:29:57] Toolforge: [components-api] Add webservice support (to refine) - https://phabricator.wikimedia.org/T362077#9717775 (dcaro)
[12:31:44] cloud-services-team (FY2023/2024-Q3-Q4), Data-Services, Quarry: Create db user for Quarry with readonly access to public ToolsDB databases - https://phabricator.wikimedia.org/T348407#9717779 (fnegri) > We might want to give a different user to avoid confusion (ex. who is running this huge query that...
[13:01:42] cloud-services-team, Cloud-VPS, Toolforge: Taavi knowledge transfer: toolforge job investigation - https://phabricator.wikimedia.org/T362446#9717869 (dcaro)
[13:08:14] cloud-services-team (Hardware), DC-Ops, ops-codfw, SRE: Q3:rack/setup/install cloudcontrol2009-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896#9717912 (Papaul) a:Jhancock.wm→Papaul
[13:13:41] (CloudVPSDesignateLeaks) firing: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:28:41] (CloudVPSDesignateLeaks) resolved: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[13:40:47] Toolforge, Security: builds-api allows impersonating any user by bypassing local TLS termination - https://phabricator.wikimedia.org/T362525#9718084 (aborrero) also, the request doesn't use a TLS certificate on the client side. By looking at the nginx deployment, it has `ssl_verify_client on`, I woul...
[13:51:18] cloud-services-team, decommission-hardware, ops-codfw, SRE, Patch-For-Review: decommission cloudbackup200[12].codfw.wmnet - https://phabricator.wikimedia.org/T362438#9718113 (Jhancock.wm) @Papaul @Andrew what are we doing with cloudbackup2001-array1 and cloudbackup2002-array1?
[13:55:35] Toolforge (Toolforge iteration 08): [jobs-cli] output logs on stderr - https://phabricator.wikimedia.org/T362153#9718156 (fnegri) a:fnegri
[13:57:34] Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-cli] output logs on stderr - https://phabricator.wikimedia.org/T362153#9718157 (CodeReviewBot) fnegri opened https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/29 Output logs to stderr instead of stdout
[13:58:46] Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-cli] output logs on stderr - https://phabricator.wikimedia.org/T362153#9718161 (fnegri) Open→In progress
[13:59:05] Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-cli] output logs on stderr - https://phabricator.wikimedia.org/T362153#9718164 (taavi) Related: {T359963}
[14:00:27] cloud-services-team, Toolforge, Cumin, Infrastructure-Foundations: Allow interacting with Toolforge PuppetDB from wmcs-cookbooks - https://phabricator.wikimedia.org/T362629#9718190 (Volans) The change would not be very small as to make it general we would need to make cumin support multiple insta...
[14:02:00] cloud-services-team, Toolforge, Cumin, Infrastructure-Foundations: Allow interacting with Toolforge PuppetDB from wmcs-cookbooks - https://phabricator.wikimedia.org/T362629#9718200 (Volans) p:Triage→Low
[14:02:42] cloud-services-team (FY2023/2024-Q3-Q4), Toolforge (Toolforge iteration 08), Patch-For-Review: [jobs-cli] output logs on stderr - https://phabricator.wikimedia.org/T362153#9718213 (fnegri)
[14:03:13] Toolforge: [harbor, builds-builder] Audit robot account permissions - https://phabricator.wikimedia.org/T361708#9718215 (dcaro)
[14:04:00] cloud-services-team, decommission-hardware, ops-codfw, SRE, Patch-For-Review: decommission cloudbackup200[12].codfw.wmnet - https://phabricator.wikimedia.org/T362438#9718218 (Andrew) >>! In T362438#9718112, @Jhancock.wm wrote: > what are we doing with cloudbackup2001-array1 and cloudbackup200...
[14:10:30] Toolforge (Toolforge iteration 08): [builds-cli,builds-api] `build quota` fails if tool has no builds - https://phabricator.wikimedia.org/T353701#9718250 (dcaro) Open→Stalled
[14:22:10] Toolforge, Security: builds-api allows impersonating any user by bypassing local TLS termination - https://phabricator.wikimedia.org/T362525#9718346 (dcaro) >>! In T362525#9718084, @aborrero wrote: > also, the request doesn't use a TLS certificate on the client side. By looking at the nginx deploymen...
[14:25:34] Toolforge (Toolforge iteration 08), Patch-For-Review: [builds-api, envvars-api] add oapi-codegen installation to makefile - https://phabricator.wikimedia.org/T362290#9718360 (dcaro) p:Triage→Medium
[14:26:12] Toolforge (Toolforge iteration 09): [infra] Add alert when workers have a sustained large amount of D processes - https://phabricator.wikimedia.org/T362093#9718365 (dcaro)
[14:26:43] Toolforge (Toolforge iteration 09), Patch-For-Review: [builds-api,jobs-api,envvars-api,api-gateway] Figure out and document how to do non-backwards compatible changes - https://phabricator.wikimedia.org/T356974#9718363 (dcaro)
[14:27:06] Toolforge (Toolforge iteration 09), Patch-For-Review: [jobs-api,jobs-cli] Support services in jobs - https://phabricator.wikimedia.org/T348758#9718371 (dcaro)
[14:27:10] Toolforge (Toolforge iteration 09), Patch-For-Review: [builds-api, envvars-api] add oapi-codegen installation to makefile - https://phabricator.wikimedia.org/T362290#9718361 (dcaro)
[14:27:46] cloud-services-team (FY2023/2024-Q3-Q4), Toolforge (Toolforge iteration 09), Patch-For-Review: [jobs-cli] output logs on stderr - https://phabricator.wikimedia.org/T362153#9718367 (dcaro)
[14:27:49] Toolforge (Toolforge iteration 09), Patch-For-Review: [k8s] Add node anti-affinity topologySpreadConstraints to infrastructure components where relevant - https://phabricator.wikimedia.org/T358203#9718369 (dcaro)
[14:29:50] cloud-services-team, decommission-hardware, ops-codfw, SRE, Patch-For-Review: decommission cloudbackup200[12].codfw.wmnet - https://phabricator.wikimedia.org/T362438#9718387 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[14:30:11] cloud-services-team, decommission-hardware, ops-codfw, SRE, Patch-For-Review: decommission cloudbackup200[12].codfw.wmnet - https://phabricator.wikimedia.org/T362438#9718400 (Jhancock.wm) ty!
[14:32:08] cloud-services-team, wikitech.wikimedia.org, Patch-For-Review: Disable email address changes in Wikitech - https://phabricator.wikimedia.org/T360883#9718409 (taavi) Open→Resolved
[14:55:37] VPS-project-Codesearch, serviceops, Patch-For-Review: Add docker production images repo to codesearch - https://phabricator.wikimedia.org/T362567#9718568 (Scott_French) operations/docker-images/production-images is now available in codesearch.
[14:55:40] VPS-project-Codesearch, serviceops, Patch-For-Review: Add docker production images repo to codesearch - https://phabricator.wikimedia.org/T362567#9718569 (Scott_French) Open→Resolved
[15:14:56] Cloud-VPS (Debian Buster Deprecation), collaboration-services, Patch-For-Review: replace buster machines in devtools project - https://phabricator.wikimedia.org/T360964#9718698 (hashar)
[15:16:01] cloud-services-team (FY2023/2024-Q3-Q4), Toolforge (Toolforge iteration 09), Patch-For-Review: [jobs-cli] output logs on stderr - https://phabricator.wikimedia.org/T362153#9718730 (CodeReviewBot) fnegri merged https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/29 Output log...
[15:17:13] Cloud-VPS (Debian Buster Deprecation), collaboration-services, Patch-For-Review: replace buster machines in devtools project - https://phabricator.wikimedia.org/T360964#9718714 (hashar) `gerrit-prod-1001.devtools.eqiad1.wikimedia.cloud` can be deleted: was shutdown because Gerrit used T330312. I used...
[15:21:15] Toolforge, Mismatch Finder: rsync missing from dev.toolforge.org - https://phabricator.wikimedia.org/T362679 (Lucas_Werkmeister_WMDE) NEW
[15:21:55] Toolforge, Mismatch Finder: rsync missing from dev.toolforge.org - https://phabricator.wikimedia.org/T362679#9718785 (Lucas_Werkmeister_WMDE)
[15:23:50] Toolforge: Cannot connect to dev.toolforge.org using Mosh with custom locale - https://phabricator.wikimedia.org/T362680 (Lucas_Werkmeister_WMDE) NEW
[15:27:04] Toolforge, Mismatch Finder: rsync missing from dev.toolforge.org - https://phabricator.wikimedia.org/T362679#9718819 (dcaro) Might be related https://gerrit.wikimedia.org/r/c/operations/puppet/+/1012615
[15:27:32] Toolforge: Cannot connect to dev.toolforge.org using Mosh with custom locale - https://phabricator.wikimedia.org/T362680#9718835 (Lucas_Werkmeister_WMDE) I also get a locale error (“manpath: can't set the locale; make sure $LC_* and $LANG are correct”) when connecting to toolforge-dev with normal SSH: `lang...
[15:29:53] Toolforge (Software install/update), Mismatch Finder: rsync missing from dev.toolforge.org - https://phabricator.wikimedia.org/T362679#9718843 (JJMC89)
[15:34:10] Toolforge: Cannot connect to dev.toolforge.org using Mosh with custom locale - https://phabricator.wikimedia.org/T362680#9718881 (dcaro) Maybe related https://gerrit.wikimedia.org/r/c/operations/puppet/+/1012615
[15:37:45] Toolforge: Cannot connect to dev.toolforge.org using Mosh with custom locale - https://phabricator.wikimedia.org/T362680#9718910 (Lucas_Werkmeister_WMDE) > Previous bastions had (due to the grid) all the locales which I'd like to avoid. I guess that’s what I used to rely on. If it isn’t going to be supporte...
[15:39:26] cloud-services-team, VPS-Projects, Puppet (Puppet 7.0): Update puppet-dev project puppetmaster - https://phabricator.wikimedia.org/T361593#9718920 (Andrew) Open→Resolved a:Andrew 10:28 AM taavi, jhathaway, moritzm, is the puppet-dev project effectively defunct now that jbond has depar...
[15:40:10] (03CR) 10Krinkle: [C:03+2] frontend: Implement /_health [labs/codesearch] - 10https://gerrit.wikimedia.org/r/1017384 (owner: 10Krinkle) [15:40:15] (03CR) 10Krinkle: [C:03+2] app.py: redirect _health to frontend, frontend: add link to view JSON [labs/codesearch] - 10https://gerrit.wikimedia.org/r/1017412 (owner: 10Krinkle) [15:40:19] (03CR) 10Krinkle: [C:03+2] app.py: remove DESCRIPTIONS from old UI, add "Hound" to title [labs/codesearch] - 10https://gerrit.wikimedia.org/r/1017413 (owner: 10Krinkle) [15:42:41] (CloudVPSDesignateLeaks) firing: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [15:44:34] (03Merged) 10jenkins-bot: frontend: Implement /_health [labs/codesearch] - 10https://gerrit.wikimedia.org/r/1017384 (owner: 10Krinkle) [15:44:35] (03Merged) 10jenkins-bot: app.py: redirect _health to frontend, frontend: add link to view JSON [labs/codesearch] - 10https://gerrit.wikimedia.org/r/1017412 (owner: 10Krinkle) [15:44:48] (03Merged) 10jenkins-bot: app.py: remove DESCRIPTIONS from old UI, add "Hound" to title [labs/codesearch] - 10https://gerrit.wikimedia.org/r/1017413 (owner: 10Krinkle) [16:01:24] 10Toolforge (Toolforge iteration 08): [infra] NFS hangs in some workers until the worker is rebooted - https://phabricator.wikimedia.org/T362690 (10dcaro) 03NEW [16:01:31] 10Toolforge (Toolforge iteration 08): [infra] NFS hangs in some workers until the worker is rebooted - https://phabricator.wikimedia.org/T362690#9719130 (10dcaro) p:05Triage→03High [16:01:56] 10Toolforge (Toolforge iteration 08): [infra] NFS hangs in some workers until the worker is rebooted - https://phabricator.wikimedia.org/T362690#9719142 (10dcaro) [16:03:03] 06cloud-services-team, 10VPS-Projects, 10Puppet (Puppet 7.0): Update pki project puppetmaster - 
https://phabricator.wikimedia.org/T361591#9719151 (10Andrew) a:03Andrew This project was managed by jbond -- for now I will do this upgrade. [16:09:24] 10Wikibugs: Hashar does not like grey foreground color for distinguishing closed status events - https://phabricator.wikimedia.org/T360353#9719202 (10bd808) >>! In T360353#9716792, @Peachey88 wrote: > I'm not the biggest fan of the strike though as it makes the text harder to read (probably worse than the gr... [16:12:41] (CloudVPSDesignateLeaks) firing: (2) Detected 33 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [16:16:19] 10Cloud-VPS (Debian Buster Deprecation), 06collaboration-services, 13Patch-For-Review: replace buster machines in devtools project - https://phabricator.wikimedia.org/T360964#9719254 (10Dzahn) [16:17:41] (CloudVPSDesignateLeaks) firing: (3) Detected 25 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [16:54:13] vivian-rook closed https://github.com/toolforge/paws/pull/400 [16:56:33] 06cloud-services-team, 10VPS-project-devtools, 06collaboration-services, 13Patch-For-Review, and 2 others: Update devtools project puppetmaster - https://phabricator.wikimedia.org/T360470#9719580 (10Dzahn) puppetmaster-1003 is down again :/ tried to soft reboot it... [16:58:10] 06cloud-services-team, 10VPS-Projects, 10Puppet (Puppet 7.0): Update pki project puppetmaster - https://phabricator.wikimedia.org/T361591#9719582 (10Andrew) 05Open→03Resolved puppetserver is upgraded but everything in this project is Buster so puppet 7 will be unhappy until that's fixed. 
[17:01:49] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-codfw, 06SRE: Q3:rack/setup/install cloudcontrol2009-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896#9719616 (10Papaul) [17:03:49] 10Tools, 06Tech-Docs-Team, 07Documentation, 03Wikimedia-Hackathon-2024: [Hackathon 2024] Improve technical documentation of tools - https://phabricator.wikimedia.org/T358040#9719640 (10TBurmeister) Status update: * Finished post-review changes and removed draft status from the [[ https://www.mediawiki.org/... [17:08:49] 10Cloud-VPS (Debian Buster Deprecation), 06collaboration-services, 13Patch-For-Review: replace buster machines in devtools project - https://phabricator.wikimedia.org/T360964#9719655 (10Dzahn) After chatting about the puppetdb server with John et al, I shut it down, ran puppet on all agents to confirm there... [17:10:13] 10Cloud-VPS (Debian Buster Deprecation), 06collaboration-services, 13Patch-For-Review: replace buster machines in devtools project - https://phabricator.wikimedia.org/T360964#9719659 (10Dzahn) [18:15:21] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-codfw, 06SRE: Q3:rack/setup/install cloudcontrol2009-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896#9719964 (10Papaul) [18:58:12] 06cloud-services-team, 10VPS-Projects, 10Puppet (Puppet 7.0): Update mariadbtest project puppetmaster - https://phabricator.wikimedia.org/T361594#9720110 (10Andrew) 05Open→03Resolved a:03Andrew [19:05:54] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudweb.set_maintenance (T356287) [19:06:02] T356287: Upgrade cloud-vps openstack to version 'Bobcat' - https://phabricator.wikimedia.org/T356287 [19:06:35] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudweb.set_maintenance (exit_code=99) (T356287) [19:09:01] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices2004.eqiad.wmnet' (T356287) [19:09:01] !log andrew@cloudcumin1001 admin 
END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices2004.eqiad.wmnet' (T356287) [19:09:24] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices2004.codfw.wmnet' (T356287) [19:09:24] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices2004.codfw.wmnet' (T356287) [19:09:54] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices2004-dev.codfw.wmnet' (T356287) [19:13:16] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) on host 'cloudservices2004-dev.codfw.wmnet' (T356287) [19:13:22] T356287: Upgrade cloud-vps openstack to version 'Bobcat' - https://phabricator.wikimedia.org/T356287 [19:14:17] (03PS1) 10Andrew Bogott: upgrade_openstack_node: Upgrade designate db on cloudcontrols [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1020328 [19:15:42] (03CR) 10Andrew Bogott: [C:03+2] upgrade_openstack_node: Upgrade designate db on cloudcontrols [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1020328 (owner: 10Andrew Bogott) [19:18:46] (03Merged) 10jenkins-bot: upgrade_openstack_node: Upgrade designate db on cloudcontrols [cloud/wmcs-cookbooks] - 10https://gerrit.wikimedia.org/r/1020328 (owner: 10Andrew Bogott) [19:20:48] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices2004-dev.codfw.wmnet' (T356287) [19:20:54] T356287: Upgrade cloud-vps openstack to version 'Bobcat' - https://phabricator.wikimedia.org/T356287 [19:24:00] (OpenstackAPIResponse) firing: Openstack API average response time is too high. 
- https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [19:28:24] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudservices2004-dev.codfw.wmnet' (T356287) [19:28:30] T356287: Upgrade cloud-vps openstack to version 'Bobcat' - https://phabricator.wikimedia.org/T356287 [19:47:28] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudservices2005-dev.codfw.wmnet' (T356287) [19:47:35] T356287: Upgrade cloud-vps openstack to version 'Bobcat' - https://phabricator.wikimedia.org/T356287 [19:55:46] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudservices2005-dev.codfw.wmnet' (T356287) [19:55:52] T356287: Upgrade cloud-vps openstack to version 'Bobcat' - https://phabricator.wikimedia.org/T356287 [19:56:47] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudcontrol2001-dev.codfw.wmnet' (T356287) [20:12:43] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudcontrol2001-dev.codfw.wmnet' (T356287) [20:12:55] T356287: Upgrade cloud-vps openstack to version 'Bobcat' - https://phabricator.wikimedia.org/T356287 [20:17:41] (CloudVPSDesignateLeaks) firing: (3) Detected 46 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks [20:27:09] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-codfw, 06SRE, 13Patch-For-Review: Q3:rack/setup/install 
cloudcontrol2009-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896#9720438 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcontrol2... [20:47:34] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudcontrol2004-dev.codfw.wmnet' (T356287) [20:47:41] T356287: Upgrade cloud-vps openstack to version 'Bobcat' - https://phabricator.wikimedia.org/T356287 [21:03:21] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudcontrol2004-dev.codfw.wmnet' (T356287) [21:03:29] T356287: Upgrade cloud-vps openstack to version 'Bobcat' - https://phabricator.wikimedia.org/T356287 [21:15:15] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudcontrol2005-dev.codfw.wmnet' (T356287) [21:15:22] T356287: Upgrade cloud-vps openstack to version 'Bobcat' - https://phabricator.wikimedia.org/T356287 [21:31:01] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudcontrol2005-dev.codfw.wmnet' (T356287) [21:31:08] T356287: Upgrade cloud-vps openstack to version 'Bobcat' - https://phabricator.wikimedia.org/T356287 [21:35:35] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudnet2005-dev.codfw.wmnet' (T356287) [21:43:47] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudnet2005-dev.codfw.wmnet' (T356287) [21:43:54] T356287: Upgrade cloud-vps openstack to version 'Bobcat' - https://phabricator.wikimedia.org/T356287 [21:44:42] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudnet2006-dev.codfw.wmnet' (T356287) [21:52:54] !log 
andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudnet2006-dev.codfw.wmnet' (T356287) [21:53:02] T356287: Upgrade cloud-vps openstack to version 'Bobcat' - https://phabricator.wikimedia.org/T356287 [21:54:29] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-codfw, 06SRE, 13Patch-For-Review: Q3:rack/setup/install cloudcontrol2009-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896#9720683 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudcontrol2009-... [22:15:52] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-codfw, 06SRE, 13Patch-For-Review: Q3:rack/setup/install cloudcontrol2009-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896#9720737 (10Papaul) [22:25:15] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-codfw, 06SRE, 13Patch-For-Review: Q3:rack/setup/install cloudcontrol2009-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896#9720760 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcontrol2... [22:50:03] (ToolforgeKubernetesWorkerTooManyDProcesses) firing: Kubernetes worker tools-k8s-worker-nfs-42 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [23:07:12] 10cloud-services-team (Hardware), 06DC-Ops, 10ops-codfw, 06SRE, 13Patch-For-Review: Q3:rack/setup/install cloudcontrol2009-dev.codfw.wmnet - https://phabricator.wikimedia.org/T354896#9720868 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudcontrol2009-... 
[23:15:03] (ToolforgeKubernetesWorkerTooManyDProcesses) resolved: Kubernetes worker tools-k8s-worker-nfs-42 has many processes stuck on IO (probably NFS) - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [23:24:00] (OpenstackAPIResponse) firing: Openstack API average response time is too high. - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/OpenstackAPIResponse - https://grafana.wikimedia.org/d/UUmLqqX4k - https://alerts.wikimedia.org/?q=alertname%3DOpenstackAPIResponse [23:58:50] 10Tools, 05Community-Wishlist-Survey-2023, 03Wikimedia Wishathon: Investigate Dabfix tool implementation - https://phabricator.wikimedia.org/T336545#9720973 (10srishakatux) @Soda @Gopavasanth Since both of you worked on this tool during the Wishathon, could you please share more updates about the progress on...