[00:09:06] (update) raymond-ndibe: [toolviews] add tools on-wiki edits to toolviews [toolforge-repos/toolviews] - https://gitlab.wikimedia.org/toolforge-repos/toolviews/-/merge_requests/10 (https://phabricator.wikimedia.org/T317953)
[00:17:58] (update) raymond-ndibe: [toolviews] add tools on-wiki edits to toolviews [toolforge-repos/toolviews] - https://gitlab.wikimedia.org/toolforge-repos/toolviews/-/merge_requests/10 (https://phabricator.wikimedia.org/T317953)
[00:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[00:53:36] (update) raymond-ndibe: [toolviews] add tools on-wiki edits to toolviews [toolforge-repos/toolviews] - https://gitlab.wikimedia.org/toolforge-repos/toolviews/-/merge_requests/10 (https://phabricator.wikimedia.org/T317953)
[01:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[01:08:58] (update) raymond-ndibe: [toolviews] add tools on-wiki edits to toolviews [toolforge-repos/toolviews] - https://gitlab.wikimedia.org/toolforge-repos/toolviews/-/merge_requests/10 (https://phabricator.wikimedia.org/T317953)
[02:09:52] cloud-services-team, Cloud-VPS: VM nova records attached to incorrect cloudcephmon IPs - https://phabricator.wikimedia.org/T383583#10456692 (Andrew) As far as I can tell, a cold migration is the only reliable way to repair this. Repairing the connection_info in the database makes it possible to reboot th...
[02:19:45] (update) raymond-ndibe: [toolviews] add tools on-wiki edits to toolviews [toolforge-repos/toolviews] - https://gitlab.wikimedia.org/toolforge-repos/toolviews/-/merge_requests/10 (https://phabricator.wikimedia.org/T317953)
[02:30:13] cloud-services-team, Cloud-VPS: VM nova records attached to incorrect cloudcephmon IPs - https://phabricator.wikimedia.org/T383583#10456700 (Andrew) I'm sitting on the following email which I don't love but which is probably needed: > Due to a latent configuration error, many VMs need to be rebooted....
[03:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[04:05:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[04:13:54] (update) raymond-ndibe: [toolviews] add tools on-wiki edits to toolviews [toolforge-repos/toolviews] - https://gitlab.wikimedia.org/toolforge-repos/toolviews/-/merge_requests/10 (https://phabricator.wikimedia.org/T317953)
[04:18:19] (update) raymond-ndibe: [toolviews] add tools on-wiki edits to toolviews [toolforge-repos/toolviews] - https://gitlab.wikimedia.org/toolforge-repos/toolviews/-/merge_requests/10 (https://phabricator.wikimedia.org/T317953)
[04:28:08] (update) raymond-ndibe: [toolviews] add tools on-wiki edits to toolviews [toolforge-repos/toolviews] - https://gitlab.wikimedia.org/toolforge-repos/toolviews/-/merge_requests/10 (https://phabricator.wikimedia.org/T317953)
[05:01:16] (update) raymond-ndibe: [toolviews] add tools on-wiki edits to toolviews [toolforge-repos/toolviews] - https://gitlab.wikimedia.org/toolforge-repos/toolviews/-/merge_requests/10 (https://phabricator.wikimedia.org/T317953)
[05:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[06:00:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:24:29] cloud-services-team, Cloud-VPS: VM nova records attached to incorrect cloudcephmon IPs - https://phabricator.wikimedia.org/T383583#10457009 (dcaro) Hmm... I wonder where is it storing that data to do the live migration, maybe it reads the xml from libvirt? If so, editing that xml would work? (can that be...
[09:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[09:58:41] Toolforge (Toolforge iteration 17): [builds-cli,builds-api] `build quota` fails if tool has no builds - https://phabricator.wikimedia.org/T353701#10457110 (Slst2020) a:Slst2020→None
[09:59:15] cloud-services-team, Toolforge, Patch-For-Review: [toolforge-cli,jobs-cli,builds-cli,envvars-cli] Explore OpenAPI SDK tooling for client consolidation - https://phabricator.wikimedia.org/T356261#10457113 (Slst2020) a:Slst2020→None
[09:59:22] wikitech.wikimedia.org: Wikitech displays desktop site on mobile devices - https://phabricator.wikimedia.org/T383656 (KBach) NEW
[09:59:38] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge (Toolforge iteration 17), Epic: [Hypotesis] 6.3.5 Develop the sustainability score - https://phabricator.wikimedia.org/T376896#10457125 (Slst2020) a:Slst2020→None
[10:00:34] cloud-services-team, Toolforge, Epic: [jobs-cli,builds-cli,toolforge-cli,webservice] Consolidate the Toolforge CLIs - https://phabricator.wikimedia.org/T356262#10457128 (Slst2020) a:Slst2020→None
[11:04:35] (update) dcaro: scheduled jobs: add timeout option [repos/cloud/toolforge/jobs-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/129 (https://phabricator.wikimedia.org/T306391)
[11:17:18] (update) dcaro: jobs-api: bump to 0.0.345-20250113175346-77c98100 [repos/cloud/toolforge/toolforge-deploy] - https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/655 (https://phabricator.wikimedia.org/T364204) (owner: group_203_bot_4866fc124f4b41659f667468a6115cf3)
[11:25:38] cloud-services-team, Cloud-VPS, VPS-Projects: Wikidocumentaries wiki is VERY slow - https://phabricator.wikimedia.org/T223378#10457359 (PixDeVl) Open→Resolved Seems to be fine -some 5 years later-.
[11:46:10] cloud-services-team (FY2024/2025-Q3-Q4), SRE, SRE-Access-Requests, Patch-For-Review: Add permissions for Komla to run WMCS cookbooks - https://phabricator.wikimedia.org/T379159#10457419 (joanna_borun) Approved
[11:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[12:05:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[12:38:23] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10457575 (Ladsgroup) @Arnoldokoth I renamed your wikitech account to `AOkoth (WMF)` to match SUL username. Please, try logging in with the new username (and password of AOkoth, if it doesn'...
[12:53:02] (update) sstefanova: deploy-token: prevent accidental token overwrites [repos/cloud/toolforge/components-api] - https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/merge_requests/49
[13:43:35] cloud-services-team (FY2024/2025-Q1-Q2), Toolforge (Toolforge iteration 17), Epic: [Hypothesis] WE6.3.4 If we enable the automatic deployment of a minimal tool, we will be able to evaluate the end to end flow and set the groundwork for adding support ... - https://phabricator.wikimedia.org/T375199#10457746
[14:03:04] cloud-services-team (FY2024/2025-Q3-Q4), SRE, SRE-Access-Requests, Patch-For-Review: Add permissions for Komla to run WMCS cookbooks - https://phabricator.wikimedia.org/T379159#10457879 (fnegri) Open→Resolved
[14:07:20] cloud-services-team, Toolforge, Temporary accounts, Trust and Safety Product Team: Check impact of Temporary Accounts on Toolforge tools - https://phabricator.wikimedia.org/T378516#10457895 (fnegri) This wiki page was created to track and review tools that are impacted: https://www.mediawiki.org/...
[14:10:01] cloud-services-team, Toolforge, Temporary accounts, Trust and Safety Product Team: Check impact of Temporary Accounts on Toolforge tools - https://phabricator.wikimedia.org/T378516#10457900 (fnegri)
[14:20:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[14:24:17] cloud-services-team, Toolforge, Temporary accounts, Trust and Safety Product Team: Check impact of Temporary Accounts on Toolforge tools - https://phabricator.wikimedia.org/T378516#10457994 (fnegri) @sgrabarczuk I added a link to this task at https://www.mediawiki.org/wiki/Trust_and_Safety_Produc...
[14:24:40] (update) raymond-ndibe: [maintain-harbor] persist log [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/42 (https://phabricator.wikimedia.org/T383081)
[14:30:41] RESOLVED: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[14:32:28] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:37:06] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 17), Epic: [components-api] First iteration of the component API - https://phabricator.wikimedia.org/T362051#10458041 (fnegri)
[14:37:09] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 17), Epic: [Hypotesis] 6.3.5 Develop the sustainability score - https://phabricator.wikimedia.org/T376896#10458043 (fnegri)
[14:37:11] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS: [cloudceph] Improve downtime when a switch goes down - https://phabricator.wikimedia.org/T375204#10458047 (fnegri)
[14:37:13] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 17), Patch-For-Review: [components-api] Add functional tests for the components api - https://phabricator.wikimedia.org/T379092#10458045 (fnegri)
[14:37:15] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS: [wmcs-cookbooks] changes to openstack cli / auth things broke several cookbooks - https://phabricator.wikimedia.org/T346427#10458051 (fnegri)
[14:37:17] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Ceph, DC-Ops, and 2 others: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#10458049 (fnegri)
[14:37:20] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge: [tools.meta] can't delete file inside cache/wikimedia-wikis.dat - https://phabricator.wikimedia.org/T357098#10458054 (fnegri)
[14:37:22] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS: eqiad1: fix PTR delegations for 185.15.56.0/24 - https://phabricator.wikimedia.org/T341338#10458058 (fnegri)
[14:37:23] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Alert, Cloud-Services-Worktype-Maintenance: [cloudceph] Slow operations - tracking task - https://phabricator.wikimedia.org/T334240#10458060 (fnegri)
[14:37:27] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Toolforge, Observability-Alerting, Goal: Move WMCS off of Icinga and introduce alertmanager - https://phabricator.wikimedia.org/T328502#10458056 (fnegri)
[14:37:28] RESOLVED: InstanceDown: Project tools instance tools-k8s-worker-nfs-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[14:37:31] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS (Debian Buster Deprecation), Toolforge (Toolforge iteration 17), Epic, Goal: Toolforge: migrate to Debian Bullseye or later - https://phabricator.wikimedia.org/T311897#10458064 (fnegri)
[14:45:58] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Maintenance: [ceph] export number of bad sectors per-disk - https://phabricator.wikimedia.org/T348716#10458103 (fnegri)
[14:47:20] cloud-services-team, Cloud-VPS, Infrastructure-Foundations, Puppet-Core: Normalise hiera default values - https://phabricator.wikimedia.org/T289665#10458106 (fnegri) In progress→Open
[14:51:29] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 17), Goal, Patch-For-Review: [infra] Decommission the Grid Engine infrastructure - https://phabricator.wikimedia.org/T314664#10458124 (fnegri)
[14:51:49] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 17): Intermittent redis connection timeouts in Toolforge - https://phabricator.wikimedia.org/T318479#10458129 (fnegri)
[15:01:25] cloud-services-team, Cloud-VPS: VM nova records attached to incorrect cloudcephmon IPs - https://phabricator.wikimedia.org/T383583#10458166 (Andrew) editing the xml file does not seem to make a difference, much to my surprise
[15:14:20] Toolforge (Toolforge iteration 17), Patch-For-Review: Persist maintain-harbor logs - https://phabricator.wikimedia.org/T383081#10458210 (taavi) Did you consider using `successfulJobsHistoryLimit` and `failedJobsHistoryLimit` to persist pod objects and the logs they include for some amount of time?
[15:14:32] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10458212 (Arnoldokoth) @Ladsgroup Hmm. I tried logging in and it failed. So I did the password reset. Though this https://wikitech.wikimedia.org/wiki/Special:MergeAccount fails due to a pas...
[15:16:14] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10458223 (taavi) >>! In T376267#10458212, @Arnoldokoth wrote: > @Ladsgroup Hmm. I tried logging in and it failed. So I did the password reset. Though this https://wikitech.wikimedia.org/wik...
[15:20:38] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS: openstack: updates to horizon for vxlan migration - https://phabricator.wikimedia.org/T374824#10458244 (fnegri)
[15:20:39] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS: Migrate Cloud VPS instances to VXLAN based networks - https://phabricator.wikimedia.org/T364725#10458246 (fnegri)
[15:20:41] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS: [wmcs-backup] Backup snapshots of deleted volumes are never cleaned up - https://phabricator.wikimedia.org/T358774#10458250 (fnegri)
[15:20:43] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Patch-For-Review: cloudgw: add cloud-private subnet support - https://phabricator.wikimedia.org/T338334#10458248 (fnegri)
[15:20:47] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Cumin, Infrastructure-Foundations, Patch-For-Review: [cumin] [openstack] Openstack backend fails when project is not set - https://phabricator.wikimedia.org/T346453#10458252 (fnegri)
[15:28:21] Tool-ducttape: Delete stage should delete web proxy created in the configure stage - https://phabricator.wikimedia.org/T334701#10458308 (SDunlap) Open→Invalid
[15:35:15] (update) raymond-ndibe: [maintain-harbor] persist log [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/42 (https://phabricator.wikimedia.org/T383081)
[15:35:28] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:36:24] (update) raymond-ndibe: [toolviews] add tools on-wiki edits to toolviews [toolforge-repos/toolviews] - https://gitlab.wikimedia.org/toolforge-repos/toolviews/-/merge_requests/10 (https://phabricator.wikimedia.org/T317953)
[15:40:28] RESOLVED: InstanceDown: Project tools instance tools-k8s-worker-nfs-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[15:41:42] cloud-services-team (Hardware), Cloud-VPS, DC-Ops, Cloud-Services-Origin-Team, Cloud-Services-Worktype-Project: [cookbooks.ceph] create a script to get the list of rbd images affected by stuck/inactive PGs - https://phabricator.wikimedia.org/T331636#10458410 (dcaro) a:dcaro→None I did...
[15:42:10] (update) raymond-ndibe: [maintain-harbor] persist log [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/42 (https://phabricator.wikimedia.org/T383081)
[15:51:39] (update) raymond-ndibe: [toolviews] add tools on-wiki edits to toolviews [toolforge-repos/toolviews] - https://gitlab.wikimedia.org/toolforge-repos/toolviews/-/merge_requests/10 (https://phabricator.wikimedia.org/T317953)
[15:56:45] cloud-services-team, Toolforge: Toolforge jobs: increased exit code 137 rate since 2024-12-14 - https://phabricator.wikimedia.org/T382865#10458533 (JJMC89)
[15:57:30] cloud-services-team, Cloud-VPS: VM nova records attached to incorrect cloudcephmon IPs - https://phabricator.wikimedia.org/T383583#10458547 (fnegri) > editing the xml file does not seem to make a difference, much to my surprise Is it possible that OpenStack cached the old value somewhere? Have you tried...
[15:58:01] cloud-services-team, Cloud-VPS, SRE Observability (FY2024/2025-Q3): Remove librenms -> graphite integration, replace with gnmi - https://phabricator.wikimedia.org/T372457#10458564 (lmata)
[16:01:11] cloud-services-team, SRE Observability (FY2024/2025-Q3): cloud: prometheus: investigate weirdness with metrics and alertmanager - https://phabricator.wikimedia.org/T374599#10458616 (lmata)
[16:04:23] (update) raymond-ndibe: [toolviews] add tools on-wiki edits to toolviews [toolforge-repos/toolviews] - https://gitlab.wikimedia.org/toolforge-repos/toolviews/-/merge_requests/10 (https://phabricator.wikimedia.org/T317953)
[16:17:57] (update) raymond-ndibe: [toolviews] add tools on-wiki edits to toolviews [toolforge-repos/toolviews] - https://gitlab.wikimedia.org/toolforge-repos/toolviews/-/merge_requests/10 (https://phabricator.wikimedia.org/T317953)
[16:25:09] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10458792 (Arnoldokoth) @taavi Yes, it works now. That's the password I needed. Thank you.
[18:06:43] cloud-services-team (FY2024/2025-Q3-Q4), Toolforge (Toolforge iteration 17): [infra,k8s] Upgrade Toolforge Kubernetes to version 1.29 - https://phabricator.wikimedia.org/T362868#10459440 (fnegri)
[18:08:32] !log dcaro@urcuchillay admin END (PASS) - Cookbook wmcs.ceph.roll_restart_osd_daemons (exit_code=0)
[18:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[18:08:49] (update) raymond-ndibe: [maintain-harbor] persist log [repos/cloud/toolforge/maintain-harbor] - https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/42 (https://phabricator.wikimedia.org/T383081)
[18:31:18] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on cloudcephosd1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[18:48:28] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[18:52:35] (CR) Majavah: [C:+2] templates: Fix some unnecessary margins [labs/striker] - https://gerrit.wikimedia.org/r/1110831 (owner: Majavah)
[18:52:42] (CR) Majavah: [C:+2] templates: Link tool maintainers to tool pages [labs/striker] - https://gerrit.wikimedia.org/r/1110832 (owner: Majavah)
[18:53:28] RESOLVED: InstanceDown: Project tools instance tools-k8s-worker-nfs-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[18:54:57] (Merged) jenkins-bot: templates: Fix some unnecessary margins [labs/striker] - https://gerrit.wikimedia.org/r/1110831 (owner: Majavah)
[18:55:27] (Merged) jenkins-bot: templates: Link tool maintainers to tool pages [labs/striker] - https://gerrit.wikimedia.org/r/1110832 (owner: Majavah)
[19:24:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[19:24:24] cloud-services-team: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723 (phaultfinder) NEW
[19:29:11] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[19:47:28] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[19:52:28] RESOLVED: InstanceDown: Project tools instance tools-k8s-worker-nfs-7 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[19:55:35] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10460122 (Don-vip) I can't login on wikitech and don't understand what I need to do, can anyone please help me? |**Wikitech account/LDAP:**| Don-vip| |**SUL account**| Don-vip| |**Account...
[20:15:49] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10460231 (Ladsgroup) I just force attached your account given that it's clear (connection in phabricator) both belong to the same person.
[20:16:43] FIRING: InstanceDown: Project cvn instance cvn-app12 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:16:47] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-33 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:17:43] FIRING: [2x] InstanceDown: Project cloudinfra instance cloudinfra-idp-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:21:43] FIRING: [2x] InstanceDown: Project tools instance tools-k8s-worker-nfs-33 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:21:43] RESOLVED: InstanceDown: Project cvn instance cvn-app12 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:22:23] FIRING: [2x] ToolforgeKubernetesNodeNotReady: Kubernetes node tools-k8s-worker-nfs-33 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[20:22:43] RESOLVED: [2x] InstanceDown: Project cloudinfra instance cloudinfra-idp-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:24:13] FIRING: [3x] InstanceDown: Project cloudinfra instance cloudinfra-cloudvps-puppetserver-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:26:11] FIRING: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[20:26:43] RESOLVED: [2x] InstanceDown: Project tools instance tools-k8s-worker-nfs-33 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:26:59] (update) raymond-ndibe: [toolviews] add tools on-wiki edits to toolviews [toolforge-repos/toolviews] - https://gitlab.wikimedia.org/toolforge-repos/toolviews/-/merge_requests/10 (https://phabricator.wikimedia.org/T317953)
[20:27:23] FIRING: [2x] ToolforgeKubernetesNodeNotReady: Kubernetes node tools-k8s-worker-nfs-33 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[20:42:23] RESOLVED: [2x] ToolforgeKubernetesNodeNotReady: Kubernetes node tools-k8s-worker-nfs-33 is not ready - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesNodeNotReady - https://grafana.wmcloud.org/d/8GiwHDL4k/kubernetes-cluster-overview?orgId=1 - https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesNodeNotReady
[20:43:48] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1055.eqiad.wmnet' (T383583)
[20:43:54] T383583: VM nova records attached to incorrect cloudcephmon IPs - https://phabricator.wikimedia.org/T383583
[20:49:13] RESOLVED: InstanceDown: Project cloudinfra instance cloudinfra-cloudvps-puppetserver-1 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[20:53:58] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1055.eqiad.wmnet' (T383583)
[20:54:05] T383583: VM nova records attached to incorrect cloudcephmon IPs - https://phabricator.wikimedia.org/T383583
[20:54:55] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1055.eqiad.wmnet}' (T383583)
[20:55:37] !log andrew@cloudcumin1001 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) on hosts matched by 'D{cloudvirt1055.eqiad.wmnet}' (T383583)
[20:59:17] PROBLEM - Host cloudvirt1055 is DOWN: PING CRITICAL - Packet loss = 100%
[21:00:51] RECOVERY - Host cloudvirt1055 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms
[21:06:37] wikitech.wikimedia.org: ☂ Wikitech account linking and SUL error reporting - https://phabricator.wikimedia.org/T376267#10460389 (Don-vip) It works now, thank you @Ladsgroup!
[21:26:32] !log andrew@cloudcumin1001 admin START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (T383583)
[21:26:39] T383583: VM nova records attached to incorrect cloudcephmon IPs - https://phabricator.wikimedia.org/T383583
[21:26:41] !log andrew@cloudcumin1001 admin END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) (T383583)
[21:27:14] FIRING: KernelError: Server cloudvirt1055 may have kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Kernel_panic - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-panic-detector?orgId=1&var-instance=cloudvirt1055 - https://alerts.wikimedia.org/?q=alertname%3DKernelError
[21:27:14] FIRING: KernelWarning: Server cloudvirt1055 may have kernel errors - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Kernel_panic - https://grafana.wikimedia.org/d/b013af4c-d405-4d9f-85d4-985abb3dec0c/wmcs-kernel-panic-detector?orgId=1&var-instance=cloudvirt1055 - https://alerts.wikimedia.org/?q=alertname%3DKernelWarning
[21:27:26] cloud-services-team: KernelError Server cloudvirt1055 may have kernel errors - https://phabricator.wikimedia.org/T383739 (phaultfinder) NEW
[21:32:41] RESOLVED: Temperature: Inlet Temp issue on clouddumps1001:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=clouddumps1001 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[21:42:14] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS, Ceph, DC-Ops, and 2 others: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#10460491 (wiki_willy) Hi @dcaro - because this was taking so long, I escalated this up to our account team again l...
[21:57:05] (update) raymond-ndibe: [toolviews] add tools on-wiki edits to toolviews [toolforge-repos/toolviews] - https://gitlab.wikimedia.org/toolforge-repos/toolviews/-/merge_requests/10 (https://phabricator.wikimedia.org/T317953)
[21:59:39] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS: eqiad1: fix PTR delegations for 185.15.56.0/24 - https://phabricator.wikimedia.org/T341338#10460533 (cmooney) >>! In T341338#10033899, @Andrew wrote: > @cmooney can you advise what (if anything) needs doing here? Somehow had missed this one. Em yeah D...
[22:08:35] cloud-services-team (FY2024/2025-Q3-Q4), Cloud-VPS: eqiad1: fix PTR delegations for 185.15.56.0/24 - https://phabricator.wikimedia.org/T341338#10460548 (cmooney) >>! In T341338#9961545, @Andrew wrote: > It's not clear to me that I can delete 56.15.185.in-addr.arpa. while 0-25.56.15.185.in-addr.arpa. exis...
[22:49:20] cloud-services-team, Toolforge: toolforge webservice logs -f not robust to invalid output - https://phabricator.wikimedia.org/T383742 (Don-vip) NEW
[23:03:39] (open) raymond-ndibe: [toolviews] refactor in preparation for new features [toolforge-repos/toolviews] - https://gitlab.wikimedia.org/toolforge-repos/toolviews/-/merge_requests/11 (https://phabricator.wikimedia.org/T317953)
[23:03:52] (update) raymond-ndibe: [toolviews] refactor in preparation for new features [toolforge-repos/toolviews] - https://gitlab.wikimedia.org/toolforge-repos/toolviews/-/merge_requests/11 (https://phabricator.wikimedia.org/T317953)
[23:04:14] (update) raymond-ndibe: [toolviews] add tools on-wiki edits to toolviews [toolforge-repos/toolviews] (major_refactor) - https://gitlab.wikimedia.org/toolforge-repos/toolviews/-/merge_requests/10 (https://phabricator.wikimedia.org/T317953)
[23:04:28] (update) raymond-ndibe: [toolviews] add tools on-wiki edits to toolviews [toolforge-repos/toolviews] (major_refactor) - https://gitlab.wikimedia.org/toolforge-repos/toolviews/-/merge_requests/10 (https://phabricator.wikimedia.org/T317953)
[23:07:06] (update) raymond-ndibe: [toolviews] add tools on-wiki edits to toolviews [toolforge-repos/toolviews] (major_refactor) - https://gitlab.wikimedia.org/toolforge-repos/toolviews/-/merge_requests/10 (https://phabricator.wikimedia.org/T317953)
[23:50:41] FIRING: CloudVPSDesignateLeaks: Detected 1 stray dns records - https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Designate_record_leaks - https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack - https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks
[23:59:13] cloud-services-team, Toolforge, Patch-For-Review: add on-wiki edits of toolforge tools to toolviews report - https://phabricator.wikimedia.org/T317953#10460784 (bd808)