[00:42:25] tool-wscontest, good first task: Add sortable column for WSContest contest page - https://phabricator.wikimedia.org/T331509#10451868 (Samwilson) Open→Resolved Merged.
[00:43:20] tool-wscontest, Accessibility, Voice & Tone: [[Wikimedia:Wscontest-click-here-link/en]] accessibility issue - https://phabricator.wikimedia.org/T367634#10451870 (Samwilson) Open→Resolved
[02:50:41] FIRING: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[04:19:52] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:21:44] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29349 bytes in 1.138 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 2 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[05:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[05:29:26] (PS1) Andrew Bogott: Re-enable the network and port panel for instance creation [openstack/horizon/horizon] (2024.1) - https://gerrit.wikimedia.org/r/1110464 (https://phabricator.wikimedia.org/T380081)
[05:29:28]
(PS1) Andrew Bogott: launch-instance-model: support default network ID in the network panel [openstack/horizon/horizon] (2024.1) - https://gerrit.wikimedia.org/r/1110465 (https://phabricator.wikimedia.org/T380081)
[05:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[06:37:33] tool-wscontest: Add health-check-script for scores command runner - https://phabricator.wikimedia.org/T383304#10451996 (Samwilson) Open→Resolved The score command failed just now with "SQLSTATE[HY000]: General error: 2006 MySQL server has gone away", but it was correctly restarted and everything...
[07:54:36] cloud-services-team, Toolforge: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.31 - https://phabricator.wikimedia.org/T372697#10452076 (Slst2020) a:Slst2020→None
[08:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[08:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[08:58:01] cloud-services-team: Supporting AI, LLM, and data models on WMCS - https://phabricator.wikimedia.org/T336905#10452154 (Fnielsen) Some of the so-called open LLMs have a questionable license clause about non-competition.
For instance, the LLAMA license https://github.com/meta-llama/llama/blob/main/LICENSE has...
[09:30:00] cloud-services-team: Supporting AI, LLM, and data models on WMCS - https://phabricator.wikimedia.org/T336905#10452207 (Slst2020) @Isaac I know you are following this space quite closely – any new thoughts since your comment from 2023?
[10:11:37] wikitech.wikimedia.org, Growth-Team, Notifications, serviceops, and 2 others: Wikitech notifications failing to load cross-wiki - https://phabricator.wikimedia.org/T376305#10452427 (Ladsgroup) Open→Resolved a:Ladsgroup
[10:12:15] (approved) dcaro: cli: Improve deploy-token command UX and safety [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/6 (https://phabricator.wikimedia.org/T380706) (owner: sstefanova)
[10:28:15] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10452542 (Slst2020) So I have a Wikitech account (Slavina Stefanova) that is not linked to any SUL account right now. I tried to create a SUL account with the same name, but am getting the...
[10:48:46] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10452640 (Ladsgroup) I can rename Slst2020@wikitech to another username (e.g. Slst2020 (usurped)@wikitech) and then rename Slavina Stefanova to Slst2020 in wikitech. Would that work for you?
[10:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[10:52:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-50 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[10:56:49] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10452675 (Slst2020) >>! In T376267#10452640, @Ladsgroup wrote: > I can rename Slst2020@wikitech to another username (e.g. Slst2020 (usurped)@wikitech) and then rename Slavina Stefanova to S...
[11:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[11:08:24] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10452728 (Ladsgroup) >>! In T376267#10452675, @Slst2020 wrote: >>>! In T376267#10452640, @Ladsgroup wrote: >> I can rename Slst2020@wikitech to another username (e.g. Slst2020 (usurped)@wik...
[11:17:26] cloud-services-team: Supporting AI, LLM, and data models on WMCS - https://phabricator.wikimedia.org/T336905#10452779 (mfossati) >>! In T336905#10449455, @valerio.bozzolan wrote: >> How much disk, RAM, CPU might be needed?
Can we meet those needs with our existing hardware? >>Are GPUs required? If so, how ma...
[11:38:29] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:43:29] RESOLVED: InstanceDown: Project tools instance tools-k8s-worker-nfs-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[11:46:45] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10452939 (Slst2020) Thanks @Ladsgroup – everything seems to work as expected (successfully logged in with SUL, wikitech, gitlab), at least on Chrome with my personal profile. On the Wikime...
[12:02:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-50 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[12:21:38] (CR) CI reject: [V:-1] Localisation updates from https://translatewiki.net. [labs/tools/commons-mass-description] - https://gerrit.wikimedia.org/r/1110762 (owner: L10n-bot)
[12:44:21] cloud-services-team, Horizon: Horizon: obsessive redirects during logins - https://phabricator.wikimedia.org/T383370#10453066 (dcaro) I've seen a similar behavior with horizon.wikimedia.org, it has not gotten to minutes yet though, but ~20s for sure.
[13:07:23] (CR) Abijeet Patro: [V:+2] Localisation updates from https://translatewiki.net.
[labs/tools/commons-mass-description] - https://gerrit.wikimedia.org/r/1110762 (owner: L10n-bot)
[13:19:20] (update) sstefanova: cli: Improve deploy-token command UX and safety [repos/cloud/toolforge/components-cli] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-cli/-/merge_requests/6 (https://phabricator.wikimedia.org/T380706)
[13:29:57] (update) sstefanova: functional tests: add components-api tests [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/631 (https://phabricator.wikimedia.org/T379092)
[13:39:07] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned: [openstack] 2025-01-08 nova-api-metadata.service down on cloudcontrol1005 - https://phabricator.wikimedia.org/T383203#10453204 (dcaro) Open→Resolved >>! In T383203#10445...
[13:42:45] (approved) dcaro: [toolforge-deploy] add more test cases to job loads [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/646 (https://phabricator.wikimedia.org/T364204) (owner: raymond-ndibe)
[13:42:52] (approved) dcaro: [jobs-api] replicas default to 1 in NewJob model [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/132 (https://phabricator.wikimedia.org/T364204) (owner: raymond-ndibe)
[13:55:13] cloud-services-team, Cloud-VPS: CSI Cinder issues causing periodic failures on Magnum cluster - https://phabricator.wikimedia.org/T383560 (Proc) NEW
[13:56:24] cloud-services-team, Cloud-VPS: CSI Cinder issues causing periodic failures on Magnum cluster - https://phabricator.wikimedia.org/T383560#10453254 (Proc)
[14:00:02] cloud-services-team, Openstack-Magnum: CSI Cinder issues causing periodic failures on Magnum cluster - https://phabricator.wikimedia.org/T383560#10453262
(taavi)
[14:14:03] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 17), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238#10453291 (dcaro) I finished up rebalancing the ceph node the morning after, wh...
[14:14:10] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 17), Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Unplanned: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238#10453293 (dcaro) In progress→Resolved
[14:17:46] (approved) dcaro: [maintain-harbor] get_example_config() return content of .env file [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/41 (owner: raymond-ndibe)
[14:34:29] cloud-services-team: Supporting AI, LLM, and data models on WMCS - https://phabricator.wikimedia.org/T336905#10453409 (Isaac) > Their FAQ includes a section on known compliant systems. It lists 5 that have passed their Validation phase of analysis: > ... > I take this to mean that there are currently no syst...
[14:54:28] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:55:35] cloud-services-team (FY2024/2025-Q1-Q2), SRE, SRE-Access-Requests, Patch-For-Review: Add permissions for Komla to run WMCS cookbooks - https://phabricator.wikimedia.org/T379159#10453508 (fnegri) Declined→Open Reopening after discussing with @joanna_borun and the rest of the WMCS team. Whi...
[14:55:45] cloud-services-team (FY2024/2025-Q3-Q4), SRE, SRE-Access-Requests, Patch-For-Review: Add permissions for Komla to run WMCS cookbooks - https://phabricator.wikimedia.org/T379159#10453510 (fnegri)
[15:00:23] FIRING: ToolforgeKubernetesNodeNotReady: Kubernetes node tools-k8s-worker-nfs-7 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[15:04:28] RESOLVED: InstanceDown: Project tools instance tools-k8s-worker-nfs-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:05:23] RESOLVED: ToolforgeKubernetesNodeNotReady: Kubernetes node tools-k8s-worker-nfs-7 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[15:11:48] cloud-services-team: Supporting AI, LLM, and data models on WMCS - https://phabricator.wikimedia.org/T336905#10453607 (Huji) May I offer a different perspective? While it is pretty clear that we want "programs" run on WMCS to meet OSI requirements, it doesn't have to be the case that the AI model itself woul...
[15:28:27] (update) dcaro: [maintain-harbor] persist log [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/42 (https://phabricator.wikimedia.org/T383081) (owner: raymond-ndibe)
[15:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[15:59:00] !log dcaro@urcuchillay admin START - Cookbook wmcs.ceph.roll_restart_osd_daemons
[15:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[16:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[16:01:26] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Patch-For-Review: Remove hardcoded NFT rules related to PAWS workers - https://phabricator.wikimedia.org/T383261#10454014 (fnegri)
[16:01:29] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Patch-For-Review: Kernel error metrics have overlapping definitions - https://phabricator.wikimedia.org/T382961#10454016 (fnegri)
[16:01:34] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Patch-For-Review: [wmcs-cookbooks] wmcs.openstack.cloudvirt.vm_console cookbook is not working from cloudcumin hosts - https://phabricator.wikimedia.org/T379570#10454018 (fnegri)
[16:01:34] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Data-Services: tofu-infra: replace wmcs-wikireplica-dns.py with tofu - https://phabricator.wikimedia.org/T374953#10454022 (fnegri)
[16:01:37] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS,
Patch-For-Review: tofu-infra: refactor repo structure - https://phabricator.wikimedia.org/T375283#10454020 (fnegri)
[16:06:09] FIRING: CephClusterInWarning: Ceph cluster in eqiad is in warning status - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/CephClusterInWarning - https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&search=open&tag=ceph&tag=health&tag=WMCS - https://alerts.wikimedia.org/?q=alertname%3DCephClusterInWarning
[16:19:47] cloud-services-team, Toolforge: Toolforge jobs: increased exit code 137 rate since 2024-12-14 - https://phabricator.wikimedia.org/T382865#10454098 (JJMC89)
[16:31:55] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge (Toolforge iteration 17): Intermittent redis connection timeouts in Toolforge - https://phabricator.wikimedia.org/T318479#10454212 (fnegri) In progress→Stalled
[16:34:42] cloud-services-team, Cloud-VPS, IPv6, Patch-For-Review: horizon: enable the UI to select networks on VM creation panel - https://phabricator.wikimedia.org/T380081#10454228 (taavi) What use case there is for the manual port selection form? If there is none, I think only re-enabling the network for...
[16:43:23] cloud-services-team, Cloud-VPS: VM nova records attached to incorrect cloudcephmon IPs - https://phabricator.wikimedia.org/T383583 (Andrew) NEW
[16:43:44] cloud-services-team, Cloud-VPS: VM nova records attached to incorrect cloudcephmon IPs - https://phabricator.wikimedia.org/T383583#10454278 (Andrew) p:Triage→High
[16:50:36] cloud-services-team, Cloud-VPS: VM nova records attached to incorrect cloudcephmon IPs - https://phabricator.wikimedia.org/T383583#10454330 (dcaro) Originally, the VM was in `ERROR` state, and was showing the log: ` | fault | {'code': 500, 'created': '2025-01-13T02:31:06Z',...
[16:53:58] (CR) Majavah: [C:+2] Convert tool metadata to Codex tables [labs/striker] - https://gerrit.wikimedia.org/r/1108153 (https://phabricator.wikimedia.org/T380114) (owner: Majavah)
[16:55:24] cloud-services-team, Cloud-VPS: VM nova records attached to incorrect cloudcephmon IPs - https://phabricator.wikimedia.org/T383583#10454344 (Andrew) Many VMs have the old IPs and one or more of the new ones. Those VMs don't seem to be in danger. That leaves only VMs without IPs of new mons. ` mysql:roo...
[17:23:28] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:25:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-16 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[17:29:23] FIRING: ToolforgeKubernetesNodeNotReady: Kubernetes node tools-k8s-worker-nfs-7 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[17:30:03] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-16 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses
[17:33:23] cloud-services-team, Cloud-VPS: VM nova records attached to
incorrect cloudcephmon IPs - https://phabricator.wikimedia.org/T383583#10454613 (Andrew) For my tests, I'm experimenting on the little-used 9da6e185-0068-4bf0-9fcf-56440625d285/paws-puppetserver-1: ` mysql:root@localhost [nova_eqiad1]> select...
[17:41:29] Cloud-Services: Block crawlers on cyberbot project - https://phabricator.wikimedia.org/T383592#10454649 (Cyberpower678) a:Andrew The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more sp...
[17:43:28] RESOLVED: InstanceDown: Project tools instance tools-k8s-worker-nfs-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[17:43:50] cloud-services-team, Cloud-VPS, InternetArchiveBot: Block crawlers on cyberbot project - https://phabricator.wikimedia.org/T383592#10454679 (bd808)
[17:44:22] cloud-services-team, Cloud-VPS, InternetArchiveBot: Block crawlers on cyberbot project (iabot.wmcloud.org) - https://phabricator.wikimedia.org/T383592#10454687 (bd808)
[17:44:23] RESOLVED: ToolforgeKubernetesNodeNotReady: Kubernetes node tools-k8s-worker-nfs-7 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[17:46:30] (update) raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648
[17:46:55] cloud-services-team, Cloud-VPS: VM nova records attached to incorrect cloudcephmon IPs - https://phabricator.wikimedia.org/T383583#10454718 (Andrew) I tried migrating a host 3b85fc66-ff29-486b-9eed-1c6893a4fc40/metricsinfra-puppetserver-1 before anything was wrong with it, and it seems to
have corrected...
[17:52:33] (approved) raymond-ndibe: [maintain-harbor] get_example_config() return content of .env file [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/41
[17:52:40] (merge) raymond-ndibe: [maintain-harbor] get_example_config() return content of .env file [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/41
[17:53:25] (approved) raymond-ndibe: [jobs-api] replicas default to 1 in NewJob model [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/132 (https://phabricator.wikimedia.org/T364204)
[17:53:34] (merge) raymond-ndibe: [jobs-api] replicas default to 1 in NewJob model [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/132 (https://phabricator.wikimedia.org/T364204)
[17:53:54] cloud-services-team, Toolforge: Toolforge jobs: increased exit code 137 rate since 2024-12-14 - https://phabricator.wikimedia.org/T382865#10454799 (JJMC89)
[17:55:17] (open) group_203_bot_4866fc124f4b41659f667468a6115cf3: maintain-harbor: bump to 0.0.20-20250113175254-7d5dce92 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/654
[17:56:33] (open) group_203_bot_4866fc124f4b41659f667468a6115cf3: jobs-api: bump to 0.0.345-20250113175346-77c98100 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/655 (https://phabricator.wikimedia.org/T364204)
[17:57:19] cloud-services-team, Cloud-VPS: VM nova records attached to incorrect cloudcephmon IPs - https://phabricator.wikimedia.org/T383583#10454831 (Andrew) cold migration seems to resolve the issue without problem (other than system reboot).
live migration seems to get things stuck in 'migration' state, even t...
[18:01:42] (update) raymond-ndibe: [toolforge-deploy] add more test cases to job loads [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/646 (https://phabricator.wikimedia.org/T364204)
[18:01:44] (update) raymond-ndibe: [toolforge-deploy] add more test cases to job loads [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/646 (https://phabricator.wikimedia.org/T364204)
[18:02:23] (update) raymond-ndibe: [jobs-api] convert all quotas to appropriate units [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/119 (https://phabricator.wikimedia.org/T361120)
[18:02:50] (update) raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648
[18:06:54] Tool-wikiqanda, Future-Audiences: Data collection for external release - https://phabricator.wikimedia.org/T380780#10454943 (etz) a:etz
[18:12:25] Tool-wikiqanda, Future-Audiences: Refactoring btw Slack & Discord - https://phabricator.wikimedia.org/T381795#10454965 (DLin-WMF) Open→Resolved
[18:12:31] Tool-wikiqanda, Future-Audiences: Add provenance parameters to bot links - https://phabricator.wikimedia.org/T382019#10454969 (DLin-WMF) Open→Resolved a:DLin-WMF
[18:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[18:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records -
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[18:39:27] (PS1) Majavah: templates: Fix some unnecessary margins [labs/striker] - https://gerrit.wikimedia.org/r/1110831
[18:39:27] (PS1) Majavah: templates: Link tool maintainers to tool pages [labs/striker] - https://gerrit.wikimedia.org/r/1110832
[18:41:23] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10455124 (Arnoldokoth) |**Wikitech account/LDAP:**| AOkoth | |**SUL account**| AOkoth (WMF) | |**Account linked on [[ https://idm.wikimedia.org/ | IDM ]]** |Y| |**I have visited [[ https://...
[19:32:31] cloud-services-team, Cloud-VPS: VM nova records attached to incorrect cloudcephmon IPs - https://phabricator.wikimedia.org/T383583#10455338 (Andrew) Also this seems to work w/out a reboot: ` openstack server migrate --shared-migration --wait 871ab13f-51df-4bc8-917f-0828ac98b3c1 ` I'm doing that to all...
[19:43:58] FIRING: [2x] MetricsinfraAlertmanagerDown: Metricsinfra alertmanager is unreachable #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/MetricsinfraAlertmanagerDown - TODO - https://alerts.wikimedia.org/?q=alertname%3DMetricsinfraAlertmanagerDown
[19:44:11] cloud-services-team: MetricsinfraAlertmanagerDown Metricsinfra alertmanager is unreachable # page - https://phabricator.wikimedia.org/T383616 (phaultfinder) NEW
[19:53:59] RESOLVED: [2x] MetricsinfraAlertmanagerDown: Metricsinfra alertmanager is unreachable #page - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/MetricsinfraAlertmanagerDown - TODO - https://alerts.wikimedia.org/?q=alertname%3DMetricsinfraAlertmanagerDown
[19:54:50] FIRING: [3x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[19:54:57] PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 324 bytes in 60.007 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[19:58:53] FIRING: ToolsNFSDown: No tools nfs services running found - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNFSDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsNFSDown
[19:59:50] FIRING: [5x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[20:08:53] RESOLVED: ToolsNFSDown: No tools nfs services running found -
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsNFSDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolsNFSDown
[20:14:21] FIRING: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown
[20:22:54] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance tools-k8s-worker-nfs-3 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[20:22:54] FIRING: PuppetAgentNoResources: No Puppet resources found on instance enc-1 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[20:22:54] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-cumin-1 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[20:23:47] cloud-services-team, Toolforge: All Toolforge tool folders (and almost all user folders) disappeared - https://phabricator.wikimedia.org/T383623 (MBH) NEW
[20:23:54] FIRING: PuppetAgentNoResources: No Puppet resources found on instance runner-1029 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[20:24:54] FIRING: PuppetAgentNoResources: No Puppet resources found on instance proxy-04 on project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[20:27:54] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance enc-1 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[20:27:54] FIRING: [6x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-cumin-1 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[20:27:54] FIRING: [15x] PuppetAgentNoResources: No Puppet
resources found on instance tools-docker-registry-8 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[20:29:50] FIRING: [5x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_admin_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown
[20:29:54] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance maps-proxy-03 on project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[20:32:54] FIRING: [32x] PuppetAgentNoResources: No Puppet resources found on instance tools-acme-chief-4 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[20:32:54] FIRING: [16x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-acme-chief-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[20:33:54] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance runner-1027 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources
[20:36:55] cloud-services-team, Toolforge: All Toolforge tool folders (and almost all user folders) disappeared - https://phabricator.wikimedia.org/T383623#10455701 (MBH) Open→Resolved a:MBH Looks like fixed.
[20:37:07] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-72 [20:37:21] RECOVERY - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.806 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [20:37:54] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance enc-1 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [20:37:54] FIRING: [47x] PuppetAgentNoResources: No Puppet resources found on instance tools-acme-chief-4 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [20:37:54] FIRING: [18x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-acme-chief-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [20:38:28] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-1 (T383238) [20:38:34] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238 [20:39:02] cloud-services-team, Toolforge: All Toolforge tool folders (and almost all user folders) disappeared - https://phabricator.wikimedia.org/T383623#10455739 (Don-vip) Yes it was discussed and fixed on IRC: ` [21:33:09] dcaro: sorry, force restart what? [21:33:11] the whole... 
[20:39:25] cloud-services-team, Cloud-VPS: tools nfs outage - https://phabricator.wikimedia.org/T383625 (Andrew) NEW [20:39:50] RESOLVED: [2x] ProbeDown: Service tools-k8s-haproxy-5:30000 has failed probes (http_this_tool_does_not_exist_toolforge_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://prometheus-alerts.wmcloud.org/?q=alertname%3DProbeDown [20:39:54] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance maps-proxy-03 on project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [20:40:17] cloud-services-team, Toolforge: All Toolforge tool folders (and almost all user folders) disappeared - https://phabricator.wikimedia.org/T383623#10455759 (fnegri) [20:40:22] cloud-services-team, Cloud-VPS: tools nfs outage - https://phabricator.wikimedia.org/T383625#10455760 (fnegri) [20:41:21] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-72 [20:41:43] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-8 [20:42:02] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-1 (T383238) [20:42:07] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-20 (T383238) [20:42:11] !log andrew@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-k8s-worker-nfs-20 (T383238) [20:42:54] FIRING: [18x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-acme-chief-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [20:42:54] FIRING: [46x] PuppetAgentNoResources: No Puppet resources found on instance tools-acme-chief-4 on project tools - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [20:42:54] RESOLVED: [2x] PuppetAgentNoResources: No Puppet resources found on instance enc-1 on project cloudinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [20:42:57] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-20 (T383625) [20:43:01] !log andrew@cloudcumin1001 tools END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-k8s-worker-nfs-20 (T383625) [20:43:02] T383625: tools nfs outage - https://phabricator.wikimedia.org/T383625 [20:43:31] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-21 (T383625) [20:43:44] cloud-services-team, Cloud-VPS: [2025-01-13] tools nfs outage - https://phabricator.wikimedia.org/T383625#10455779 (dcaro) [20:43:54] FIRING: [2x] PuppetAgentNoResources: No Puppet resources found on instance runner-1027 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [20:44:54] RESOLVED: [2x] PuppetAgentNoResources: No Puppet resources found on instance maps-proxy-03 on project project-proxy - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [20:47:05] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-8 [20:47:51] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-58 (T383238) [20:47:54] FIRING: [45x] PuppetAgentNoResources: No Puppet resources found on instance tools-acme-chief-4 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [20:47:54] FIRING: [18x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-acme-chief-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [20:47:54] T383238: [nfs] 2025-01-08 
tools-nfs outage - https://phabricator.wikimedia.org/T383238 [20:48:53] FIRING: PuppetAgentNoResources: No Puppet resources found on instance metricsinfra-prometheus-3 on project metricsinfra - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [20:48:54] RESOLVED: [2x] PuppetAgentNoResources: No Puppet resources found on instance runner-1027 on project gitlab-runners - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [20:48:54] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-21 (T383625) [20:48:58] T383625: [2025-01-13] tools nfs outage - https://phabricator.wikimedia.org/T383625 [20:52:04] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-36 (T383625) [20:52:54] FIRING: [45x] PuppetAgentNoResources: No Puppet resources found on instance tools-acme-chief-4 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [20:52:54] RESOLVED: [18x] PuppetAgentNoResources: No Puppet resources found on instance toolsbeta-acme-chief-2 on project toolsbeta - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [20:53:13] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-58 (T383238) [20:53:14] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-16 (T383238) [20:53:16] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238 [20:53:51] RESOLVED: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown [20:53:53] RESOLVED: PuppetAgentNoResources: No Puppet resources found on instance metricsinfra-prometheus-3 on project metricsinfra - 
https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [20:54:03] FIRING: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-72 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [20:56:33] FIRING: [5x] ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-36 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcess [20:57:28] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-36 (T383625) [20:57:32] T383625: [2025-01-13] tools nfs outage - https://phabricator.wikimedia.org/T383625 [20:58:37] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-16 (T383238) [20:58:38] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-13 (T383238) [20:58:40] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238 [21:01:08] FIRING: [15x] PuppetAgentNoResources: No Puppet resources found on instance tools-acme-chief-4 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [21:03:31] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-13 (T383238) [21:03:32] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for 
tools-k8s-worker-nfs-35 (T383238) [21:05:55] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-45 (T383625) [21:05:58] T383625: [2025-01-13] tools nfs outage - https://phabricator.wikimedia.org/T383625 [21:08:20] Tool-video-answer-tool, Future-Audiences: [Video tool] Create UI for small image layouts - https://phabricator.wikimedia.org/T379646#10455890 (etz) Open→Resolved [21:08:50] cloud-services-team, Cloud-VPS: VM nova records attached to incorrect cloudcephmon IPs - https://phabricator.wikimedia.org/T383583#10455891 (Andrew) >>! In T383583#10455338, @Andrew wrote: > Also this seems to work w/out a reboot: > > > ` > openstack server migrate --shared-migration --wait 871ab13f-51... [21:08:53] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-35 (T383238) [21:08:54] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-2 (T383238) [21:08:57] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238 [21:10:40] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-45 (T383625) [21:13:02] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-75 (T383238) [21:14:04] !log andrew@cloudcumin1001 tools END (ERROR) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=97) for tools-k8s-worker-nfs-75 (T383238) [21:14:09] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238 [21:14:14] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-2 (T383238) [21:14:15] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-21 (T383238) [21:14:17] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for 
tools-k8s-worker-nfs-75 (T383625) [21:14:23] T383625: [2025-01-13] tools nfs outage - https://phabricator.wikimedia.org/T383625 [21:15:21] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-75 (T383625) [21:16:08] FIRING: [4x] PuppetAgentNoResources: No Puppet resources found on instance tools-k8s-worker-nfs-19 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [21:18:36] !log andrew@cloudcumin1001 tools END (ERROR) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=97) for tools-k8s-worker-nfs-21 (T383238) [21:18:44] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-21 (T383625) [21:19:32] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-21 (T383625) [21:19:37] T383625: [2025-01-13] tools nfs outage - https://phabricator.wikimedia.org/T383625 [21:20:13] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-74 (T383625) [21:21:08] FIRING: [3x] PuppetAgentNoResources: No Puppet resources found on instance tools-k8s-worker-nfs-19 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [21:24:36] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-19 (T383238) [21:24:40] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238 [21:24:54] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-74 (T383625) [21:24:57] T383625: [2025-01-13] tools nfs outage - https://phabricator.wikimedia.org/T383625 [21:25:33] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-38 (T383625) [21:29:56] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot 
(exit_code=0) for tools-k8s-worker-nfs-19 (T383238) [21:30:00] T383238: [nfs] 2025-01-08 tools-nfs outage - https://phabricator.wikimedia.org/T383238 [21:30:13] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-38 (T383625) [21:30:18] T383625: [2025-01-13] tools nfs outage - https://phabricator.wikimedia.org/T383625 [21:31:03] !log andrew@cloudcumin1001 tools START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-47 (T383625) [21:35:45] !log andrew@cloudcumin1001 tools END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-47 (T383625) [21:35:50] T383625: [2025-01-13] tools nfs outage - https://phabricator.wikimedia.org/T383625 [21:36:08] RESOLVED: [2x] PuppetAgentNoResources: No Puppet resources found on instance tools-k8s-worker-nfs-19 on project tools - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentNoResources [21:40:21] cloud-services-team, Cloud-VPS, IPv6, Patch-For-Review: horizon: enable the UI to select networks on VM creation panel - https://phabricator.wikimedia.org/T380081#10456066 (Andrew) >>! In T380081#10454228, @taavi wrote: > What use case there is for the manual port selection form? If there is none... 
[21:44:42] cloud-services-team, Cloud-VPS, InternetArchiveBot: Block crawlers on cyberbot project (iabot.wmcloud.org) - https://phabricator.wikimedia.org/T383592#10456097 (Peachey88) [21:49:34] (open) raymond-ndibe: [toolviews] add tools on-wiki edits to toolviews [toolforge-repos/toolviews] - https://gitlab.wikimedia.org/toolforge-repos/toolviews/-/merge_requests/10 (https://phabricator.wikimedia.org/T317953) [21:50:32] (update) raymond-ndibe: [toolviews] add tools on-wiki edits to toolviews [toolforge-repos/toolviews] - https://gitlab.wikimedia.org/toolforge-repos/toolviews/-/merge_requests/10 (https://phabricator.wikimedia.org/T317953) [22:25:33] RESOLVED: ToolforgeKubernetesWorkerTooManyDProcesses: Node tools-k8s-worker-nfs-74 has at least 12 procs in D state, and may be having NFS/IO issues - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesWorkerTooManyDProcesses - https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses [23:02:51] (update) raymond-ndibe: [maintain-harbor] persist log [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/42 (https://phabricator.wikimedia.org/T383081) [23:04:53] (approved) raymond-ndibe: [toolforge-deploy] add more test cases to job loads [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/646 (https://phabricator.wikimedia.org/T364204) [23:04:58] (merge) raymond-ndibe: [toolforge-deploy] add more test cases to job loads [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/646 (https://phabricator.wikimedia.org/T364204) [23:05:11] (update) raymond-ndibe: [toolforge-deploy] add maintain-harbor image retention tests 
[repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/648 [23:07:07] (update) raymond-ndibe: [maintain-harbor] persist log [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/42 (https://phabricator.wikimedia.org/T383081) [23:10:29] (update) raymond-ndibe: [maintain-harbor] persist log [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/42 (https://phabricator.wikimedia.org/T383081) [23:16:59] cloud-services-team, Toolforge, Epic: [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge tools - https://phabricator.wikimedia.org/T127367#10456392 (bd808) [23:27:46] (update) raymond-ndibe: [maintain-harbor] persist log [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/42 (https://phabricator.wikimedia.org/T383081) [23:57:42] cloud-services-team, Cloud-VPS, InternetArchiveBot: Block crawlers on cyberbot project (iabot.wmcloud.org) - https://phabricator.wikimedia.org/T383592#10456511 (bd808) `lang=irc [18:02] any chance you can log more of the headers on your end? Or is that really all you get? [18:03] <...