[04:08:41] Traffic, SRE, SRE Observability (FY2021/2022-Q2), Sustainability (Incident Followup): Per-country Frontend Traffic dashboards - https://phabricator.wikimedia.org/T286554 (lmata)
[04:11:34] Traffic, SRE, SRE Observability, Sustainability (Incident Followup): Per-country Frontend Traffic dashboards - https://phabricator.wikimedia.org/T286554 (lmata)
[08:42:32] vgutierrez: hi! I'm running into a problem with broadcom 10G and bullseye installer in T285835 and know you updated nic firmware in the past, do you have pointers on how to do that ?
[08:42:32] T285835: Thanos bucket operations sporadic errors - https://phabricator.wikimedia.org/T285835
[08:43:11] godog: basically you need to get into the iDRAC web interface and upload the FW there for the NIC
[08:44:52] vgutierrez: ack, thanks, no way to update from within the OS heh ? :(
[08:45:22] nope AFAIK
[08:45:34] sad_trombone.wav
[08:45:41] so essentially this https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Updating_Firmware
[08:45:42] megacli might be able, not sure
[08:45:47] dcops guys will know more tricks
[08:46:01] to connect to the UI you can follow https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Connecting_to_mgmt_interface
[08:46:11] the browser part
[08:46:13] godog: that's right.. that's what I usually do
[08:46:40] ack, thanks all! giving it a try now
[08:46:53] these nics are the gift that keeps on giving
[08:47:14] we ran into a similar/same issue on buster upgrade didn't we?
[08:47:56] godog: question, do you need new firmware on the nic or might be an issue with the kernel driver?
[08:49:49] volans: I don't know for sure, it could be a bullseye kernel driver issue too
[08:50:02] got it
[08:51:10] oh wow the newer idrac web interface is fancy
[08:56:50] don't click too much around, you might crash it :D
[08:57:59] lolz
[08:59:17] ok I apologize in advance and not trying to be dense here, which firmware format is required from the dell's download page for the 10G nic? also
[08:59:33] Broadcom NetXtreme vs Broadcom NetXtreme-E ?
[09:00:12] which hardware do you have?
[09:01:18] if you go to the dell's page with the serial does it still show you both?
[09:01:29] there is a shortlink in netbox
[09:02:01] like from https://netbox.wikimedia.org/dcim/devices/2634/
[09:02:02] yeah the shortlink is super useful, I'm on the downloads page but both are shown
[09:02:15] ah... dell
[09:02:17] Broadcom Adv. Dual 10Gb Ethernet
[09:04:12] on the host I see
[09:04:12] product: BCM57412 NetXtreme-E 10Gb RDMA Ethernet Controller
[09:04:24] from lshw
[09:04:48] if we're talking about thanos-fe2001
[09:05:17] the web interface will refuse to use the wrong FW, so it isn't that bad :)
[09:05:41] volans: yeah that's the host
[09:05:49] ok so netxtreme-e it is, thank you
[09:06:04] going to document all of this for posterity
[09:06:11] thx
[09:06:58] sadly I can't seem to find the same info from the idrac web interface heh
[09:07:31] ah no there it is, nevermind
[09:22:57] yep that was it, the nic fw upgrade -.-
[09:25:18] yay
[09:26:13] added very minimal docs here, PTAL https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#NICs
[09:39:05] ack, thanks!
[12:00:13] godog: above thread is interesting, wondering did the firmware update go ok? and did it fix the issue with link detection?
[12:12:43] topranks: it did! firmware upgrade was fine and now bullseye works
[12:13:17] excellent! nice work, good to know if I run into such an issue :)
[12:14:12] hehe thank you, I did very little in this case and luckily there was a lot of prior art but good to know indeed
[12:29:26] Traffic: Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 - https://phabricator.wikimedia.org/T286713 (Dzahn)
[12:31:44] Traffic: Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 - https://phabricator.wikimedia.org/T286713 (Dzahn) This is a Letsencrypt cert, so probably auto-renews. But we still have 2 alerts in Icinga expecting this needs attention if it expires in under 30 days. I acked those.
[12:46:15] Traffic, SRE: Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 - https://phabricator.wikimedia.org/T286713 (Vgutierrez) indeed, it's auto-renewed by acme-chief, we should tune those checks. The new cert has been issued already and it's being staged to avoid client-side clock skew issues: ` Ju...
[12:47:10] Traffic, SRE, good first task: Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 - https://phabricator.wikimedia.org/T286713 (Vgutierrez) p:Triage→Low
[12:56:43] Traffic, SRE, good first task: Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 - https://phabricator.wikimedia.org/T286713 (Dzahn) Should I just remove those checks or adjust them to stop caring about cert expiry? Or should they be kept but with lower threshold? If traffic doesn't need...
[13:26:58] godog: I did some digging with Cumin and we have over 400 servers with that NIC..
[13:26:59] I'll open a task for a) awareness b) tracking and if this really affects all 400 c) figuring out some automation
[13:27:19] and thanos-fe2001 was in fact the first server with that type of NIC to get reimaged to bullseye
[13:27:51] so much for my "I wouldn't expect any further surprises" yesterday :-)
[13:29:55] moritzm: hahah! indeed, the infinite source of surprises
[13:31:58] lol
[13:32:05] * volans also hides for the automation part
[13:33:42] well they are no longer surprises now that we know about it
[13:38:50] Traffic, SRE: Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 - https://phabricator.wikimedia.org/T286713 (Aklapper) @Vgutierrez: A #good_first_task is a self-contained, non-controversial task with a clear approach. It should be well-described with pointers to help a completely new contributo...
[13:42:44] Traffic, SRE: Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 - https://phabricator.wikimedia.org/T286713 (Vgutierrez) >>! In T286713#7215004, @Dzahn wrote: > Should I just remove those checks or adjust them to stop caring about cert expiry? Or should they be kept but with lower threshold?...
[13:45:23] Traffic, SRE: Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 - https://phabricator.wikimedia.org/T286713 (Dzahn) ACK! So.. we still want to monitor if TLS works on planet and phabricator, we just don't want to deal with cert expiry anymore. We need to create a new checkcommand probably. On...
[13:45:42] Traffic, SRE: Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 - https://phabricator.wikimedia.org/T286713 (Dzahn) a:Dzahn
[13:46:15] Traffic, SRE: (adjust cert monitoring on planet and phabricator) Certificate *.wikipedia.org valid until 2021-08-14 08:01:46 - https://phabricator.wikimedia.org/T286713 (Dzahn)
[13:59:34] opened https://phabricator.wikimedia.org/T286722
[14:08:59] * volans subscribed
[14:09:07] should we add some tags for the owners of the hosts?
[14:09:58] that'll be practically all and a bit spammy, I'll add a note for the next SRE meeting
[14:12:18] ack