[08:52:41] Puppet, netbox, Infrastructure-Foundations, SRE, and 3 others: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (ayounsi) >>! In T329272#8614270, @jbond wrote: >> alarms: true we can set based on the device model (false by default as we...
[09:43:14] netops, DBA, Data-Persistence, Infrastructure-Foundations, and 10 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (MoritzMuehlenhoff)
[10:49:55] hi all, I have a question on pcc and how facter variables are handled.
[10:50:12] We have added a fact to get the airflow_version and compare it for some logic using versioncmp. While testing this on pcc the ::airflow_version fact result is always Undef,
[10:50:12] raising the question: is it possible to define facts and test on pcc? or should they be defined first, merged, then we can test?
[10:53:07] steve_munene: PCC can't be used to test new facts, because it does not have access to the machines themselves so it can't actually resolve the facts. Instead there is a systemd timer which uploads the latest facts from production PuppetDB to PCC once a day.
[10:53:38] so you would need to merge the fact, run puppet on the affected machines, and then either wait for the timers to run or activate them manually
[10:54:23] ahaa, thanks taavi
[10:55:52] steve_munene: further to what taavi mentioned you can also write an rspec test to test facts https://wikitech.wikimedia.org/wiki/Puppet/Testing#Rspec https://github.com/wikimedia/operations-puppet/blob/production/examples/spec/classes/commented_class_spec.rb
[10:59:11] from the commented example above you would need to replace #12 with something like `let(:facts) { os_facts.merge(myfact: 'value') }`
[11:03:00] thanks for the rec jbond, checking it out.
[11:03:02] SRE-tools, Infrastructure-Foundations: improvments to firmware upgrade cookbook - https://phabricator.wikimedia.org/T329722 (jbond) p:Triage→Medium
[11:09:48] SRE-tools, Infrastructure-Foundations: improvements to firmware upgrade cookbook - https://phabricator.wikimedia.org/T329722 (Aklapper)
[12:04:52] Puppet, netbox, Infrastructure-Foundations, SRE, and 3 others: Netbox: use the netbox to also sync networks - https://phabricator.wikimedia.org/T329669 (ayounsi) > i was also curious what is the is_pool property used for? It's best effort when created prefixes and not used for anything, it can be...
[12:15:43] netops, Infrastructure-Foundations, SRE, User-jbond: Sporadic RST drops in the ulogd logs - https://phabricator.wikimedia.org/T238823 (Ladsgroup) FWIW, mw should not send this many cross-dc connections to databases but I assume it's a different aspect of this problem.
[13:04:08] jbond, moritzm: could I ask one of you to review user-add to analytics-privatedata for me?
[13:04:10] https://gerrit.wikimedia.org/r/c/operations/puppet/+/889535
[13:05:28] * jbond looking
[13:07:03] topranks: done
[13:07:26] jbond: thanks that's great :)
[13:24:05] XioNoX: interesting results on those tests with cloudflare!
[13:24:52] seems like them merely announcing our prefix/routing traffic doesn't affect Russia, but when mitigation is engaged it's actively dropping traffic from there?
[13:27:51] yeah exactly
[13:37:14] the challenge thing makes me wonder if they're not mixing up their L7 and L4 mitigations
[14:01:07] SRE-tools, Infrastructure-Foundations, Patch-For-Review: improvements to firmware upgrade cookbook - https://phabricator.wikimedia.org/T329722 (jbond) > in the same situation of above, if selecting download from dell it then fails because tries to install the same version and apparently the iDRAC is...
[14:10:03] XioNoX: yeah that is odd, I wasn't sure how to interpret the 'challenge' name there.
[14:10:07] you could well be right
[14:11:06] jbond: do you know how we manage files in /srv/firmware/ on the cumin hosts?
[14:11:42] I see cumin2002 has a much healthier selection available (I suspect due to pa.paul and dcops using that more)
[14:12:40] topranks: it's not really managed as such. the firmware-upgrade cookbook downloads files on demand into that directory.
[14:13:12] hmm ok yeah I was looking at that code a bit, seen that
[14:14:52] I guess why I'm asking, for a recent re-image I was involved with we had to update the NIC firmware
[14:15:49] when I did so from cumin1001 it did not give me an option of the "known good" BCM firmware, 21.85.21
[14:16:14] it did offer 2 more recent versions, and I selected the latest
[14:16:32] for whatever reason that only seemed to update the on-board NIC, not PCIe card
[14:17:11] (I'm assuming perhaps cos that option only valid for on-board, didn't look deeper)
[14:17:29] Running on cumin2002 I get the option for 21.85.21 version, I expect cos the files are in /srv/firmware/ there
[14:17:58] jbond: so my goal was to get that file on cumin1001 too, so it would also be given as an option there?
[14:18:20] a simple rsync / scp is an option but I didn't want to mess anything up
[14:21:27] topranks: 1) if you select "Download file" when presented with a list of files it should offer you the chance to download the latest. this wouldn't work in your case though as you need to use not the latest version. As to upgrading embedded vs onboard nic, that's a bug i'm trying to fix
[14:22:23] 2) i don't have any objections to setting up rsync, however it would need to be multi directional, i.e. all cumin hosts should push any missing files to all other cumin hosts
[14:23:41] jbond: On 1) ok, is the on-board being done a known issue? i.e. it wasn't the version I chose, it would have only done on-board even had the appropriate one been an option?
[14:24:38] in terms of 2) I don't think we need anything particularly advanced right now, what I'd suggest I do for now is just copy over the particular NIC firmware we know works (given they seem to be a bit more picky) from cumin2002 to cumin1001
[14:25:08] re 2) oh of course feel free to do a one-off copy, that will not cause any issue
[14:25:09] not suggesting a proper sync service isn't a good idea, but just for the reimages WMCS are doing now be good if they had the right option on both boxes
[14:25:17] cool
[14:26:29] the issue with 1) is a bit more in depth. currently the cookbook doesn't really ask users which nics they want to upgrade. The nic that actually gets upgraded depends on the upgrade file you select, i.e. if you select one that targets the onboard, that gets upgraded. if you pick one that targets the pci card, that one gets upgraded
[14:26:52] i'm currently looking at this issue now to see if we can be a bit more intelligent about thos
[14:26:56] *this
[14:26:56] ah ok
[14:30:57] actually looking at what I got, running the cookbook now gives me different options than on Monday:
[14:30:59] https://phabricator.wikimedia.org/P44669
[14:31:33] In terms of how the cookbook works, upgrading firmware for device based on selected file, that seems fine (if maybe a little unclear to new users)
[14:32:18] And given they are both broadcom devices, and maybe not knowing what branding is used "net-extreme" etc, it wasn't clear to me the option I was selecting applied to only on-board device
[14:32:42] Anyway I kind of suspected it was something like that, it was in a break-fix of sorts so I just did it through the web gui and was fine
[14:33:14] topranks: if a file already exists in the firmware directory then you will first be presented with those files
[14:33:28] if no file exists then we grab a list of files from dell and ask the user to download
[14:34:09] however i'm trying to change this at the moment so we always select the most recent file and then present the user
with a list of files
[14:34:40] and that means i need to work out how to better identify the nic actually installed in the server so i can auto select the correct firmware file
[14:37:09] jbond: gotcha
[14:37:29] so no files existed the other day, the list I was presented with was a list of those available to download
[14:37:41] yes
[14:38:11] and because I selected one today the menu gives me that option, plus a 'download' one
[14:38:45] exactly, if you select the download option you should get the same list you got yesterday
[14:39:21] yep just did to check
[14:39:37] so that's fine really, sure we could have some improvements but once you know how it works it's ok
[14:40:01] it definitely needs improving :) but yes to date the main users have been dc-ops
[14:40:14] I think there is a slight issue in that the only download option for the BCM card is the newer one we know has a problem
[14:40:39] I will copy across that file I mentioned to cumin1001 so the known good one is presented from the "already downloaded" list just in case
[14:40:50] yeah I was only doing it cos it was somewhat break-fix
[14:40:55] thanks!
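The menu behaviour described above (show files already present in the firmware directory plus a download option, otherwise fall back to the list fetched from the vendor) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual cookbook code: `build_menu`, `fetch_remote` and `DOWNLOAD_OPTION` are hypothetical names, and only the "Download file" label comes from the conversation.

```python
from pathlib import Path
from typing import Callable, List

# Label taken from the chat; the constant name itself is hypothetical.
DOWNLOAD_OPTION = "Download file"


def build_menu(firmware_dir: Path, fetch_remote: Callable[[], List[str]]) -> List[str]:
    """Return the choices to present to the user (hypothetical helper).

    If firmware files already exist locally, offer those plus a download
    option. Otherwise offer whatever the vendor listing returns.
    """
    local_files: List[str] = []
    if firmware_dir.is_dir():
        # Only regular files count as already-downloaded firmware.
        local_files = sorted(p.name for p in firmware_dir.iterdir() if p.is_file())

    if local_files:
        return local_files + [DOWNLOAD_OPTION]

    # Nothing cached locally: fall back to the remote listing.
    return fetch_remote()
```

With this shape, copying a known good file into the directory on another host is enough for it to show up in that host's "already downloaded" list, which matches the one-off copy agreed on above.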
[14:41:13] tbh i'm not sure the best way to resolve that 1) the way the dell web pages work (we are scraping) makes it hard to discover older driver files
[14:42:01] 2) i'm not sure if it makes sense to encode this in the cookbook/spicerack (i don't want to have to create a cr every time we have a broken driver) or if we just keep it on wikitech
[14:43:50] yeah it's tough, the options pulled from Dell gave me the latest 2
[14:44:02] currently it's on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Urgent_Firmware_Revision_Notices: (but that is not the easiest to find)
[14:44:22] in an ideal world it'd be able to pull more but I think given it's mostly dc-ops doing this, and they know where to find that page, we're probably good
[14:45:38] dell seems to go quite far out of their way to make it hard to discover any of this stuff automatically (anti scraping techniques, no public api, obfuscated private api etc) preferring to try and push their sub par product
[14:46:20] there are hooks in the code to add all of this, i just need to work out the correct way to backwards engineer the dell web site :)
[14:56:11] yeah they sure don't make it easy!
[14:56:26] btw the whole firmware update stuff worked great for me though, really nice work :)
[14:57:48] +1 :)
[14:58:40] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (ayounsi) Circling back on the network side config now that there are a few patches out to improve the server side....
[14:59:27] thanks
[17:24:14] CAS-SSO, Infrastructure-Foundations, SRE, serviceops-collab, and 3 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (brennen)
[17:24:28] CAS-SSO, Infrastructure-Foundations, SRE, serviceops-collab, and 3 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (demon)
[17:32:04] SRE-tools, Infrastructure-Foundations, Spicerack: spicerack dnsdisc.Discovery attempts to query depooled/disabled dns auth servers - https://phabricator.wikimedia.org/T329773 (CDanis)
[17:41:17] Puppet, Infrastructure-Foundations: Tidy up the taskgen script - https://phabricator.wikimedia.org/T329777 (jbond) p:Triage→Low
[17:44:43] CAS-SSO, Infrastructure-Foundations, SRE, serviceops-collab, and 3 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (thcipriani) @demon looking into what's need on the GitLab side; maybe "just" configuration 😂
[21:58:19] netops, Infrastructure-Foundations, SRE: Add network-layer protections to avoid inadvertently lowering IRB MTU - https://phabricator.wikimedia.org/T329799 (cmooney) p:Triage→Medium
[21:59:42] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add network-layer protections to avoid inadvertently lowering IRB MTU - https://phabricator.wikimedia.org/T329799 (cmooney)
[22:01:39] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add network-layer protections to avoid inadvertently lowering IRB MTU - https://phabricator.wikimedia.org/T329799 (cmooney) The above patch addresses the issue by ensuring Homer adds an MTU of 9192 on any L2 switch ports which don't have...