[04:00:44] Puppet, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, Patch-For-Review: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10351489 (fnegri) Resolved→Open p:Triage→High This has just caused a WMCS proxy outage, beca...
[07:22:38] <_joe_> FYI, I've added two renewed/new keys for hpe packages and k8s, respectively, to reprepro; I have also temporarily removed the pyall component as the gpg key is expired. I did so in a way that should be easy to revert/recover
[07:34:17] thanks, the pyall component isn't used anymore, I'll remove the underlying profile from puppet later
[08:25:27] Mail, collaboration-services, Infrastructure-Foundations, vrts, Znuny: spamassassin broken for VRTS - https://phabricator.wikimedia.org/T380396#10351598 (Krd) This looks much! better. Thank you.
[08:50:00] qq about perccli - we shouldn't install it and deploy the related raid monitoring on supermicro hosts with the SAS controller
[08:50:27] but I don't get from puppet where we force it to all hosts
[08:50:54] I saw default_facts.yml but it also lists a SAS controller-related class, that is deployed only on some ms-be nodes
[08:57:14] not sure what you mean with "we shouldn't install it" ?
[08:57:32] the raid fact detects controllers based on PCI IDs
[08:58:12] in the past we had some heuristics based on device names exported by the controllers, but that became very tricky since some controllers ended up reusing some device names
[08:58:22] modules/raid/lib/facter/raid.rb
[08:58:51] so if we have issues with these controllers on Supermicro, maybe we need to update the list of PCI IDs?
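Editorial sketch of the PCI-ID-based detection described above. This is not the code in modules/raid/lib/facter/raid.rb (which is a Ruby fact); it is a minimal Python illustration, assuming the IDs are read from `lspci -n` style output. The `PCI_ID_TO_TOOL` table, its entries, and the function name `detect_raid_tools` are hypothetical and chosen only for the example.

```python
import re

# Hypothetical mapping of "vendor:device" PCI IDs to a management CLI.
# The entries are illustrative only; the real list lives in the raid fact.
PCI_ID_TO_TOOL = {
    "1000:10e2": "perccli",
    "1000:005d": "megacli",
}

# Matches `lspci -n` lines such as "01:00.0 0104: 1000:10e2 (rev 01)".
LSPCI_LINE = re.compile(
    r"^\S+\s+[0-9a-f]{4}:\s+([0-9a-f]{4}:[0-9a-f]{4})", re.IGNORECASE
)


def detect_raid_tools(lspci_output):
    """Return the CLI tools implied by the PCI IDs found in lspci output."""
    tools = []
    for line in lspci_output.splitlines():
        match = LSPCI_LINE.match(line.strip())
        if not match:
            continue
        tool = PCI_ID_TO_TOOL.get(match.group(1).lower())
        # Keep first-seen order and avoid duplicates for multi-controller hosts.
        if tool and tool not in tools:
            tools.append(tool)
    return tools
```

Because the lookup key is only the PCI ID, two controllers from different vendors' chassis that share an ID end up with the same tool, which is the situation discussed in the log.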
[09:02:10] moritzm: yes yes I came up with the same file, I didn't know about raid.rb, I was confused since perccli seemed to be included regardless but there is some logic behind
[09:02:41] the sas controller is not listed in there, I am trying to see what "cli" supports it
[09:03:29] my bad ok I get it
[09:04:01] the SAS39xx is listed and perccli is associated to it, but it doesn't support anything without a BBU
[09:04:14] going to fix it
[09:04:16] it's a little hard to find by grepping for the class name since "class raid" does an include based on the fact name
[09:04:30] I was missing all these things, thanks for the explanation :)
[09:05:37] can we determine the absence of a BBU on the OS level? then let's add a check to the raid fact
[09:07:21] Checking a little better, I see "Failed to execute ['/usr/local/lib/nagios/plugins/get-raid-status-perccli']: KeyError 'System Overview'", I mentioned BBU since I recalled that Jaime found out about the BBU absence of the Supermicro config J from puppet
[09:07:26] but it may be different
[09:08:32] https://phabricator.wikimedia.org/T377853
[09:09:00] it may have changed since we flipped BIOS -> UEFI
[09:09:52] and also because we flipped all disks to JBOD
[09:10:12] does the SAS39xx have a different PCI ID compared to our existing perccli hosts from Dell?
[09:10:35] same from what I can see, this is why perccli is installed
[09:10:55] ok
[09:11:32] these tools tend to be renamed every few years, but the underlying codebase is likely mostly the same (all those lovely camel-based option names!)
[09:12:07] but maybe storcli is in fact a red herring and we can also use it with perccli and only need fixes in the Nagios check to adapt for UEFI
[09:12:14] okok I am slow on Monday - IIUC Jaime changed /usr/local/lib/nagios/plugins/get-raid-status-perccli to use storcli (found on Broadcom's website) and that failed with KeyError BBU etc..
[09:12:52] so the SAS controller that was shipped with config J (some ms-be nodes, thanos-be basically)
[09:13:13] doesn't have the BBU, and we are going to leave it this way since Matthew wants to use JBOD only
[09:13:54] we weren't sure at the time IIRC about the absence of the BBU, now it is a fact, so probably storcli + a modification of get-raid-status to allow the absence of it could be ok
[09:15:16] but if the PCI IDs are identical and we can drive the old systems with perccli, that would indicate that we can also drive the Supermicro controllers with perccli, right?
[09:15:36] after all if the PCI ID is the same, to the kernel they are 100% the same
[09:16:12] or we move to storcli in general
[09:16:35] I think the latter, check what Jaime wrote in the task description of T377853
[09:16:36] T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853
[09:16:36] but I'd want to avoid some scheme where some systems use different CLI tools for essentially the same hardware
[09:16:58] with perccli64 "Response Data" misses "System Overview"
[09:17:08] on the Supermicro SAS controllers
[09:21:43] anyway nothing incredibly urgent, but I'll follow up on it to figure out what's best
[09:25:40] Mail, Bitu, Infrastructure-Foundations: Don't get password reset emails for my alt through IDM - https://phabricator.wikimedia.org/T371612#10351755 (SLyngshede-WMF) Open→Stalled Cannot reproduce and we were not able to find any indication in the logs that the emails were not sent. Pleas...
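Editorial sketch of how a check could tolerate the missing "System Overview" section instead of raising KeyError. It is not the actual get-raid-status-perccli code. The keys "Response Data" and "System Overview" are the ones quoted in the log; the top-level "Controllers" list and the function name `get_system_overview` are assumptions made for illustration.

```python
import json


def get_system_overview(raw_json):
    """Return the "System Overview" section, or None when it is absent.

    Assumes (not confirmed by the log) that the CLI prints a JSON document
    with a top-level "Controllers" list whose entries carry "Response Data".
    """
    data = json.loads(raw_json)
    controllers = data.get("Controllers", [])
    if not controllers:
        return None
    response = controllers[0].get("Response Data", {})
    # dict.get avoids the KeyError reported for the Supermicro SAS controllers.
    return response.get("System Overview")
```

A caller could then treat `None` as "no overview available" and report an UNKNOWN or reduced status rather than crashing, which is one way to read the "modification of get-raid-status" suggested above.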
[09:25:55] Mail, Bitu, Infrastructure-Foundations: Don't get password reset emails for my alt through IDM - https://phabricator.wikimedia.org/T371612#10351758 (SLyngshede-WMF) Stalled→Declined
[09:55:12] Puppet, SRE-tools, DC-Ops, Infrastructure-Foundations, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10351959 (MatthewVernon) It's worth noting here that this is causing icinga to never be happy on the ne...
[10:17:36] <_joe_> moritzm: I *think* hashar's tox image uses pyall tbh
[10:17:50] <_joe_> but as I said, that should be dismissed
[10:21:20] netops, Cloud-Services, DC-Ops, Infrastructure-Foundations, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10352059 (cmooney) Link has been clean since the optic was replaced: {F57745141 width=600} I'll sug...
[10:22:44] I have switched the CI images to use pyenv to install various python versions
[10:23:41] pyall is still in the releng/tox-buster image which is still used in some jobs
[10:23:45] which surely should be migrated
[10:23:53] so essentially, feel free to drop the pyall component :)
[10:32:38] <_joe_> hashar: yeah that's what I meant :)
[10:39:19] at a quick glance, one of those images is used to test Cergen which is being phased out
[10:39:27] so most probably the remaining usages are in a similar situation
[10:39:41] Puppet, SRE-tools, DC-Ops, Infrastructure-Foundations, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10352131 (jcrespo) >>! In T377853#10351959, @MatthewVernon wrote: > It's worth noting here that this is...
[11:17:55] pyall as a repository component can remain (it simply won't be updated but it also doesn't seem to get updates "upstream"), we only discard those when a full distro is EOLed
[11:18:23] I was referring to the puppet profile, which is likely not used on CI (but PCC will tell me otherwise)
[11:31:24] netops, Cloud-Services, DC-Ops, Infrastructure-Foundations, and 2 others: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54 - https://phabricator.wikimedia.org/T380503#10352382 (cmooney) Ok the BGP downpref policy has been reverted, and we have routed traffic back runn...
[11:59:35] <_joe_> hashar: I've created https://phabricator.wikimedia.org/T380730
[13:26:06] thanks but it is already tracked
[13:26:26] and iirc the blocker is decommissioning Zuul
[13:26:48] anyway, the removal of python-all is not a blocker ;)
[17:04:01] Puppet, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, Patch-For-Review: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10354356 (Andrew) From Gerrit, @dcaro writes: > > Did a quick test, there's three functions we use to res...
[17:04:06] Puppet, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, Patch-For-Review: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10354358 (fnegri) a:fnegri→Andrew Assigning this task to @Andrew as he's currently working on a patch.
[18:08:23] SRE-tools, Infrastructure-Foundations, Spicerack: Tab completion for cookbook names - https://phabricator.wikimedia.org/T367230#10354676 (JMeybohm) I'm pretty happy with this. If it is not 100% correct, I did not notice so far: `lang=bash _cookbook_completion() { local cur cur="${COMP_WORDS[...
[18:22:42] Puppet, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, Patch-For-Review: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10354753 (Andrew) Nameserver is missing from the following hosts: cn-staging-1.centralnotice-staging.eqiad1...
[19:49:03] SRE-tools, Discovery-Search, SRE, Data-Platform-SRE (2024.11.09 - 2024.11.29): Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507#10355108 (bking)
[19:49:07] SRE-tools, Discovery-Search, SRE, Data-Platform-SRE (2024.11.09 - 2024.11.29): Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507#10355104 (bking) Per IRC conversation with @dcausse , we now have [[ https://wikitech.wikimedia.org/wiki/Search/CirrusS...
[19:49:59] SRE-tools, SRE, Data-Platform-SRE (2024.11.09 - 2024.11.29), Discovery-Search (Current work): Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507#10355110 (bking)