[06:12:58] yes, the reimage cookbook already prompts for the management (drac) password, so it's likely ok to prompt for the mysql one as well
[08:42:53] Amir1: as taa.vi said, if it's ok that the password lives in a config file on disk on the cumin hosts and in the private puppet repo, cookbooks can easily read config files generated from puppet in that profile, see for example https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/puppet/sync-netbox-hiera.py#127
[08:43:38] thanks
[11:24:04] Hi, everyone. zhwiki is discussing creating a special signing portal for users in mainland China. (Details: ) Some suggested that this portal should use domain fronting to prevent blocks from the government. I'm coming to ask if it is feasible to do it (say the portal is really deployed)?
[11:25:37] If which part is feasible?
[11:25:44] about domain fronting
[12:20:28] Reedy: Probably I can give a more detailed explanation here.... (full message at )
[12:21:24] cc SCP-2000-
[13:36:04] If no one objects, I'm going to canary bullseye on a sessionstore host in codfw at ~15:00 UTC (I'll de-pool first Just In Case™) /cc cwhite sukhe jbond
[13:37:20] urandom: thanks and gl!
[13:49:10] +!
[13:49:12] +1
[14:06:24] oh my, what are the implications of a +! ? :)
[14:08:12] if that's never entered any online lexicon, we should introduce it now
[14:09:46] factorial approval
[14:11:55] That seems dangerous
[14:36:08] claime: dangerous good, or dangerous bad?
[14:36:18] urandom: Just dangerous.
[14:36:24] Approval overflow
[14:36:26] so good. +1
[15:00:22] Ok, I'm going to proceed with depooling codfw sessionstore, going, going...
[15:03:42] ...gone
[16:23:40] Sooo... I was reimaging a host, and it failed (it failed waiting for the reboot into the installer, I believe). For some reason, I cannot locate the log file (not in `/var/log/spicerack/sre/hosts/reimage`), and re-running it seems to require `--new`
[16:24:38] I feel like this might be a good time to seek guidance :)
[16:26:15] urandom: on cumin1001, where it seems like you ran it
[16:26:19] /var/log/spicerack/sre/hosts/reimage.log
[16:26:42] grep eevans returns your run there
[16:27:55] oh, duh, ok
[16:28:19] I was seeing past logs that corresponded with reimages I did, and assumed this would be the same
[16:29:12] oh, I guess those aren't exactly logs
[16:29:42] sukhe: should I just rerun with `--new`?
[16:30:06] these are some huge log files btw, going all the way back to 2022
[16:30:07] (that log file was as unhelpful as the console output :))
[16:30:32] urandom: since it's probably easier to parse from the output to the console, what was the error?
[16:30:50] and yeah, --new should work if you want to run it again but I think we should look at why it failed
[16:31:12] urandom, sukhe: need a hand?
[16:31:13] it just reached the limit of retries waiting for the reboot into installer
[16:31:22] volans: oh hai!
[16:31:30] which host?
[16:31:37] sessionstore2001
[16:31:38] sessionstore2001.codfw.wmnet?
[16:32:41] urandom: volans might have a better idea but for the ones I ran into this issue, attaching to the console might yield some helpful output
[16:33:07] console is always a good starting point
[16:33:09] from the cumin host,
[16:33:11] I'm looking at the logs
[16:33:11] sudo /usr/local/bin/install_console sessionstore2001.mgmt.codfw.wmnet
[16:33:17] and then running the cookbook
[16:33:32] sukhe: oh, cool
[16:33:58] urandom: bullseye upgrade?
[16:34:13] sukhe: yes
[16:34:41] if it's the R440, might need a NIC firmware upgrade
[16:34:44] have you done that?
[16:34:48] nope
[16:35:21] (and it is an R440)
[16:35:22] let's check that
[16:35:23] yeah
[16:35:24] yeah didn't get to d-i so probably didn't PXE boot
[16:35:34] yeah, so NIC firmware most likely then
[16:35:38] so firmware... OK
[16:35:48] [protip] for hosts in codfw it is slightly quicker to run the cookbook from cumin2002 ;)
[16:40:17] urandom: yeah it's failing for me too
[16:40:36] one other thing to be careful about in this is
[16:40:36] > Broadcom NetExtremeE firmware for 10G nic should only upgrade to 21.85.21.92, as 22.00.07.60 breaks installer.
[16:40:55] what's happening here is that the iDRAC firmware is old as well and the cookbook is failing
[16:41:30] 🤯
[16:41:54] sukhe: so... upgrade idrac, then nic?
[16:41:58] iDRAC Firmware Version 3.21.21.21
[16:42:00] yeah, this is it
[16:42:18] urandom: yep
[16:42:27] Broadcom Gigabit Ethernet BCM5720 - D0:94:66:8F:CA:FE 20.8.4
[16:42:28] Broadcom Gigabit Ethernet BCM5720 - D0:94:66:8F:CA:FF 20.8.4
[16:42:36] this needs to be 21+
[16:42:39] so nic first, then the firmware
[16:42:55] but I think the cookbook for the NIC won't work with this because of the redfish API version
[16:43:04] so the iDRAC needs to be manually updated
[16:43:17] oh.
[16:43:19] heh
[16:44:06] yeah the cookbook is a life saver but needs the basic minimum idrac firmware
[16:44:16] want me to do it? we suffered through this for the cp hosts so I have the right firmware and all
[16:44:22] and then I can share how to do it for future
[16:44:47] sukhe: oh, yeah, that would be great!
[16:44:51] on it
[16:47:00] hardware is the worst
[16:47:11] urandom: since we suffered through this for the cp ones and no one else should :P
[16:47:22] the cookbook works best with idrac firmware 5+
[16:47:37] usually puppetboard has the version of the firmware but since in this case the cookbook failed
[16:47:40] ssh -L 8000:sessionstore2001.mgmt.codfw.wmnet:443 cumin1001.eqiad.wmnet
[16:47:57] you can find the versions here (username root, password mgmt password)
[16:48:11] and then we need to manually upload the idrac firmware, install, reboot mgmt interface
[16:48:25] then run cookbook for NIC upgrade, making sure to not do anything other than 21.85.21.92
[16:48:56] don't run anything now as the idrac is rebooting
[16:49:02] so it will fail
[16:49:11] +1
[16:49:44] we had so much fun (/s) with this for T321309
[16:49:45] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309
[16:49:58] will an upgrade via the cookbook install a version of nic firmware that is too new?
[16:50:17] if you are already above 5+, then yes, you can go to six
[16:50:22] but anything below 4 needs a manual upgrade
[16:50:33] which is what everyone used to do before j.bond and v.olans worked on the cookbook :)
[16:50:59] this is the DRAC though right?
[16:51:05] yeah
[16:51:30] You indicated we could use the cookbook for the NIC, but not to exceed 21.85.21.92
[16:51:35] yep that will work
[16:51:48] I meant for the iDRAC, the cookbook won't work for upgrading iDRAC if it is below 5
[16:52:03] so first manual iDRAC upgrade to 6 or something, then cookbook for upgrading NIC
[16:52:11] the cookbook draws from a default store of these somewhere, doesn't it? Does that contain the version that is too new?
[16:52:43] or was that a general warning in case I might upgrade via the mgmt interface "while I was there" :)
[16:52:52] yeah that has newer versions that fail the bullseye upgrade :)
[16:53:00] outstanding
[16:53:16] the cookbook will show you 22.x as well but we have seen that break d-i. 21.85 is the only one that works
[16:53:35] oh, the cookbook will prompt then?
[16:53:41] yeah it will ask you for the version #
[16:54:00] gotcha, that was the piece I was missing (I don't think I've yet done this)
[16:55:09] yeah, brett and I discovered this through trial and error with help from dc-ops
[16:55:24] that was a lot of fun
[16:55:31] those were the days
[16:55:32] haha
[16:55:35] * brett strokes picture frame
[16:56:33] "firmware" and "trial and error", is there a better recipe for fun? :)
[16:58:37] :)
[16:58:49] install worked fine but still can't get the right version
[16:59:01] brett: any chance you remember how to racreset on really old versions?
[17:00:09] ah oh, help told me
[17:00:20] good, coz I don't remember :)
[17:00:26] you are not alone!
[17:00:57] it's not uncommon to repress tragic memories
[17:01:14] s/tragic/traumatic/ ?
[17:01:19] haha
[17:01:23] urandom: is it fine to powercycle the host?
[17:01:28] I am guessing yes but wanted to check
[17:01:31] yes
[17:12:38] sukhe: is it still powercycling?
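The ordering agreed on above (manual iDRAC upgrade first when the iDRAC is too old for the cookbook, then the NIC via the cookbook, pinned to 21.85.21.92) can be sketched roughly as follows. This is only an illustration of the reasoning in the chat, not part of any cookbook: the function and constant names are hypothetical, and the thresholds are taken from the messages above ("below 5" for the iDRAC, 21.85.21.92 for the NIC).

```python
from typing import List, Tuple

# Thresholds quoted in the discussion above; the names are hypothetical.
MIN_IDRAC_MAJOR_FOR_COOKBOOK = 5
PINNED_NIC_VERSION = "21.85.21.92"


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '3.21.21.21' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def plan_firmware_steps(idrac_version: str, nic_version: str) -> List[str]:
    """Return the ordered steps suggested in the chat for this kind of host."""
    steps = []
    # An iDRAC below major version 5 cannot be driven by the cookbook.
    if parse_version(idrac_version)[0] < MIN_IDRAC_MAJOR_FOR_COOKBOOK:
        steps.append("manually upload and install newer iDRAC firmware")
    # Only upgrade the NIC up to the pinned version, never beyond it.
    if parse_version(nic_version) < parse_version(PINNED_NIC_VERSION):
        steps.append(f"run the firmware cookbook for the NIC, choosing {PINNED_NIC_VERSION}")
    steps.append("rerun the reimage cookbook")
    return steps


if __name__ == "__main__":
    # Versions reported for sessionstore2001 in the chat.
    for step in plan_firmware_steps("3.21.21.21", "20.8.4"):
        print(step)
```

A NIC that is already on 22.x is not handled here; the chat only says that such versions break the installer, not how to get back from them.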
[17:13:53] urandom: tried uploading the firmware again and waiting for it to finish
[17:13:59] should work this time, let's see
[17:14:08] 🤞
[17:14:09] it's a coin toss when it fails the first time :)
[17:21:37] urandom:
[17:21:44] try the firmware now for NIC
[17:21:46] sudo cookbook sre.hardware.upgrade-firmware "sessionstore2001.codfw.wmnet" --new --no-reboot -c nic
[17:21:50] 0: /srv/firmware/poweredge-r440/NETWORK/Network_Firmware_RXP80_WN64_21.85.21.92.EXE
[17:21:53] 1: /srv/firmware/poweredge-r440/NETWORK/Network_Firmware_XKX9M_WN64_22.31.6.EXE
[17:21:56] 2: /srv/firmware/poweredge-r440/NETWORK/Network_Firmware_230WD_WN64_22.21.07.80_01.EXE
[17:21:59] 3: /srv/firmware/poweredge-r440/NETWORK/Network_Firmware_DFF48_WN64_22.00.6.EXE
[17:22:02] 0 here
[17:22:05] try it please
[17:22:23] 👍
[17:22:43] effie: joe has been wondering what is going to happen to jobrunner load if we turn off pre-generation requests to the parsoid cluster. When we do that, more parsing will happen on the jobrunners. But we don't know how much more...
[17:23:15] sukhe: seems to be working!
[17:23:42] 🤦‍♂️
[17:23:44] effie: I realized that we can test for this by telling the Parsoid endpoints to not write to the parser cache. Then the jobrunners will have to do all the work again. We will be parsing twice, but we will know how much load parsing causes on job runner.
[17:24:19] urandom: nice!
[17:24:26] ...there is a knob I can turn to make this happen probabilistically, we could disable cache writes for 20% of the cases, or 50%... what do you think? Would that be a good experiment?
[17:24:26] I think you can try now but attach to the console
[17:24:28] to see the output
[17:24:30] and it should work
[17:24:36] sukhe: no, it failed.
[17:24:42] cookbook upgrade failed?
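Picking option 0 from the cookbook prompt above amounts to finding the one candidate whose version is exactly 21.85.21.92. A minimal sketch of that check, assuming the version always follows `_WN64_` in the file name (a pattern inferred only from the four paths listed above); the helper names are hypothetical and not part of the cookbook:

```python
import re
from typing import List, Optional

PINNED_NIC_VERSION = "21.85.21.92"


def extract_version(path: str) -> Optional[str]:
    """Pull the dotted version that follows '_WN64_' in a firmware file name."""
    match = re.search(r"_WN64_(\d+(?:\.\d+)+)", path)
    return match.group(1) if match else None


def pinned_index(candidates: List[str], pinned: str = PINNED_NIC_VERSION) -> int:
    """Return the menu index of the candidate carrying the pinned version."""
    for index, path in enumerate(candidates):
        if extract_version(path) == pinned:
            return index
    raise ValueError(f"no candidate with version {pinned}")


if __name__ == "__main__":
    # The four choices shown by the cookbook in the chat.
    choices = [
        "/srv/firmware/poweredge-r440/NETWORK/Network_Firmware_RXP80_WN64_21.85.21.92.EXE",
        "/srv/firmware/poweredge-r440/NETWORK/Network_Firmware_XKX9M_WN64_22.31.6.EXE",
        "/srv/firmware/poweredge-r440/NETWORK/Network_Firmware_230WD_WN64_22.21.07.80_01.EXE",
        "/srv/firmware/poweredge-r440/NETWORK/Network_Firmware_DFF48_WN64_22.00.6.EXE",
    ]
    print(pinned_index(choices))
```

Matching on the exact version string rather than "newest available" is the point: the newest file in the list is one of the 22.x builds that the chat says break d-i.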
[17:24:49] https://www.irccloud.com/pastebin/S9UgfVOy/
[17:24:52] yes
[17:25:04] interesting
[17:25:08] first time :P
[17:25:21] joy
[17:25:47] check with jb.ond
[17:26:05] the message says the firmware is not for that model
[17:26:16] trying
[17:27:03] yeah
[17:27:11] urandom: in a meeting so will look shortly again
[17:27:19] but volan.s is right, j.bond might know better
[17:28:18] heh, you both went to lengths to avoid summoning them, not sure how to read that :)
[17:32:51] jbond: if you are perchance still around; does the above error make sense to you?
[17:43:09] Puppet or any other automation doesn't maintain state on DRACs, does it? I changed the PXE nic of wdqs2021 and reimaged it yesterday and the PXE nic seems to have been changed back
[17:44:13] either that or the change did nothing, and the 3x failed reimages before that were due to other transient problems
[17:58:20] inflatador: I believe the reimage cookbook did probably reset your PXE settings. because "Force next boot to go via PXE via IPMI" .. "Unless --no-pxe is set"
[17:59:26] and it does have the mgmt password to do that
[18:11:18] urandom:
[18:11:25] trying one more thing, then we wait for j.bond
[18:11:27] sorry about this :)
[18:13:37] sukhe: I can't even connect to the web interface, can you? (I get internal server error)
[18:13:57] urandom: nope!
[18:14:14] seems like something is broken
[18:14:15] checking
[18:15:24] so weird, and this is more than the weird firmware stuff in general!
[18:15:28] reset also didn't change things
[18:15:38] oh, yeah I tried that too
[18:15:58] sad_trombone.wav
[18:16:17] yeah indeed
[18:17:42] urandom: I think it's time to file a task for this host
[18:17:53] dc-ops needs to do a hard reset probably
[18:18:13] what is a hard reset in this context?
[18:19:40] something like T158131
[18:19:41] T158131: hard-reset DRAC gadolinium.mgmt.eqiad.wmnet - https://phabricator.wikimedia.org/T158131
[18:19:49] (credit to mutante)
[18:20:22] unplug, reset
[18:20:38] :) so afair there is soft and hard reset of the DRAC
[18:20:47] or there might be 3 levels
[18:21:04] soft (regular), hard (with --hard) and really hard (only dcops can do it locally)
[18:21:04] mutante: aren't you on sabbatical?
[18:21:15] brett: no, it's starting next month
[18:21:21] oh, good
[18:21:35] the soft level does not change the password
[18:21:39] (that you're not responding during it, not that it's next month)
[18:21:42] the 'really hard' level might set it back to default password
[18:21:56] so better to ask dcops at this point
[18:22:30] there is that wikitech page about IPMI, sukhe
[18:22:35] it has troubleshooting steps
[18:22:47] like "first try ipmi from remote" then "try ipmi from local"
[18:22:56] did you get to that already?
[18:23:15] mutante: good idea
[18:23:17] let's try that
[18:23:22] brett: :) no worries, I think I will actually quit from IRC, like shut down the client that day :)
[18:23:42] https://wikitech.wikimedia.org/wiki/Management_Interfaces#Reset_the_management_card
[18:23:56] mutante: oh looks like we can't do that, since we can't ssh to the host
[18:23:58] ah, so it's called "cold" , not "hard"
[18:23:59] --cold-reset
[18:24:01] and this doesn't work for the manual one
[18:24:05] er, management one
[18:24:14] yeah let's try that
[18:24:30] the "really hard" step is called "drain"
[18:24:33] in that page
[18:24:48] the "last resort"
[18:29:47] urandom: fine to paste your https://www.irccloud.com/pastebin/S9UgfVOy/ link in the task?
[18:30:07] ok, nothing PII in it, doing it
[18:30:11] sure!
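The three reset levels described above (soft, cold, drain) form a simple escalation ladder. A small sketch of that ordering, using only what the chat states; the class and function names are hypothetical, and no real reset command is issued here:

```python
from enum import IntEnum
from typing import Optional


class ResetLevel(IntEnum):
    """Reset levels as named in the chat, least disruptive first."""
    SOFT = 1   # regular reset; does not change the password
    COLD = 2   # the --cold-reset option mentioned from the wikitech page
    DRAIN = 3  # power drain done on site by dc-ops; the last resort


def next_level(last_tried: Optional[ResetLevel]) -> Optional[ResetLevel]:
    """Return the next level to try, or None once every level was tried."""
    if last_tried is None:
        return ResetLevel.SOFT
    if last_tried is ResetLevel.DRAIN:
        return None
    return ResetLevel(last_tried + 1)


def needs_dcops(level: ResetLevel) -> bool:
    """Only the drain step needs someone physically at the data center."""
    return level is ResetLevel.DRAIN


if __name__ == "__main__":
    level = next_level(None)
    while level is not None:
        print(level.name, "(dc-ops)" if needs_dcops(level) else "")
        level = next_level(level)
```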
[18:30:21] thanks
[18:30:24] used to have to do power drains a lot at my old job...DRACs would get in a weird state where they'd accept your commands, but not actually do anything
[18:30:49] sounds familiar
[18:31:33] it can also be in a weird state where IPMI doesnt work over the network but still does from the local host
[18:31:36] afair
[18:32:06] re: reimage cookbook, I'll have to look at the code more closely. I would expect it to change boot order, but not which NIC is allowed to PXE boot (different setting)
[18:32:39] I just assumed that "set back to PXE boot" also includes "default to first NIC" or something like that
[18:32:46] but agreed
[18:32:58] * urandom sighs
[18:33:23] a wise man once said: "hardware is the worst" (hint: it was bd.808 )
[18:33:33] urandom: we talked about your luck once right?
[18:33:33] :P
[18:33:52] we did around ~172 hosts I think for Traffic and had to upgrade the firmware for them
[18:33:58] yes :(
[18:33:59] we had our share of fun but yeah, this is next level
[18:34:12] yea, but not like we wouldn't be debugging stuff as well if the machines were virtual, just would be stuff like "doesnt come back from reboot after we added the second disk", heh
[18:34:34] Making It Someone Else's Problem™
[18:34:49] SEP status
[18:35:25] made me search for "firmware cookbook" in phabricator
[18:35:32] T331135: firmware-upgrade cookbook fails after successful upgrade
[18:35:33] T331135: firmware-upgrade cookbook fails after successful upgrade - https://phabricator.wikimedia.org/T331135
[18:35:58] "could just be down to idrac being a bit flaky" :)
[18:36:21] yeah we had those after an upgrade
[18:36:28] where the idrac would take forever to come back
[18:37:46] Dell has a bash script that can do the firmware updates from the Linux layer. I've updated dozens or hundreds of hosts that way...but it presumes you have an OS to work with
[18:38:10] inflatador: I think in this case though, the fact that we can't even open the https interface suggests a deeper problem
[18:38:21] the running the cookbook thing comes in later (and assumes a working connection)
[18:39:17] sukhe ACK, even if you have SSH to the machine, you probably don't want to try and do any FW updates in this state
[18:39:23] you can try -dcops channel or ping Willy and see if someone is at the DC in person right now
[18:39:28] to reseat it
[18:39:41] that could well fix the web UI
[18:40:00] do we know if the web UI ever worked on this specific host though? or is it new
[18:40:07] it worked
[18:40:10] ok
[18:40:16] that's how I updated the idrac firmware
[18:40:19] so there's that at least
[18:40:39] oh, so basically it broke with the update?
[18:40:49] was fine even after that
[18:40:56] ok
[18:40:56] but I think broke after running the cookbook to upgrade the nic
[18:41:05] nod
[18:41:22] not 100% sure though but I was still clicking around after the upgrade
[19:10:52] random reminder of the day: this special wiki page on mediawiki.org is actually _config_ for getting automatically added as reviewers on patches.. depending on your own rules. I am saying it because when I look at https://www.mediawiki.org/wiki/Git/Reviewers#operations%2Fpuppet not all SRE might be aware of it (anymore). it can be a massive improvement to review turnaround time though. consider "subscribing" to one or the other repo or class that you care about.