[07:44:41] I had to run a BMC reset on a couple of HP dbs to get the management interface up
[07:45:41] commenting it because there are a few ms-be mgmt hosts down too since the same time - maybe that will fix it for those too
[08:05:51] something to "look forward" to when I come to reimage them, I guess
[08:08:13] while down, not only ssh was unavailable, also remote ipmi
[08:11:47] where did you do the BMC reset from, then?
[08:12:33] the actual host
[08:13:14] it is documented at: https://wikitech.wikimedia.org/wiki/Management_Interfaces#Reset_the_management_card
[08:14:35] thanks - I was foolishly looking at https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Hardware_Troubleshooting_Runbook to no avail...
[08:15:07] that whole page is great, recommend having a general look
[08:15:36] otherwise it will be difficult to find what can be done when you suffer from ir
[08:15:41] *it
[09:24:15] Just reimaged a host OK, and re-ran wmf-update-known-hosts-production, but my ssh config still has the old wrong hostkey. I thought this was meant to automatically work?
[09:24:30] ah, got there now, needed more patience...
[09:26:09] Emperor: puppet on the puppetmaster regenerates the list of ssh keys, so it can take up to 30mins for it to update. (or you can just `run-puppet-agent` on puppetmaster1001 to get it sooner)
[09:26:35] Emperor: wmf-update-known-hosts-production gets the list from config-master.w.o, so until puppet runs there it's not updated
[09:30:03] that's puppetmaster1001/2001 at the moment fwiw
[09:30:55] thanks, that will enable less patience in future ;-)
[09:31:29] we could add a forced puppet run there to the reimage cookbook if that's helpful
[09:33:48] I might be unusual, but I usually want to log into the newly-reimaged host almost immediately (to check everything looks OK before repooling) :)
[09:55:45] Emperor: for you, sir :) https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/786894
[10:04:03] TY :)
[10:41:27] doing the codfw frontends now (again, except the swiftrepl node)
[11:40:38] Emperor: did the patched reimage work as expected?
[11:42:16] volans: yes, thanks :)
[11:43:25] glad it helped :)
[13:20:09] Hm, this reimage is not going very well. It timed out the first time, and now on retry I am watching on virtual console and it's got to "Probing EDD (edd=off to disable)... ok" and now seems to have stopped
[13:20:23] [on ms-be2040]
[13:21:23] godog: would you expect backend reimages to Basically Work?
[13:22:30] marostegui: can i get you to have a look at T306983 and the related CR, pls?
[13:22:31] T306983: Reboot pc1012 - https://phabricator.wikimedia.org/T306983
[13:22:47] marostegui: i figure as the reboot period is so short, there's no real reason i need to wait a few days for pc1014 to populate
[13:22:52] (correct me if i'm wrong :))
[13:23:00] Emperor: I'd expect so yeah
[13:23:24] godog: I might just be impatient, but ms-be2040 reinstall seems to have just hung :(
[13:23:43] kormat: I would not wait no
[13:26:56] Emperor: mmhh sth that comes to mind (thinking out loud) is old hardware + firmware and newer kernels, is it hanging pre d-i ?
[13:27:25] old hardware... could have buggy firmware. we've run into that a bunch of times.
[13:28:00] aye, this host is from early 2018 heh
[13:28:13] godog: yes, it's hanging really early in the boot
[13:28:41] godog: gets to "Probing EDD (edd=off to disable)... ok" and that's it
[13:29:08] something like that happened to me - I was unable to PXE boot it, although I don't remember at what point it couldn't load
[13:30:03] I searched 'probing edd' on phab and got back this https://phabricator.wikimedia.org/T260370#6598624 though no mention of what was wrong and/or how to fix it
[13:31:03] I'm starting to despair of ever getting swift upgraded :(
[13:31:34] in many cases a firmware upgrade helped, if it's the same issue
[13:31:41] you can ask dcops to do it
[13:32:56] I guess I can try that; it's a lot of hosts if they all need firmware upgrades
[13:35:19] [is there a specific form of ticket for that, or just a regular phab task? e.g. hardware decom has a particular form to fill out; wikitech search suggests not, but...]
[13:36:19] Emperor: an example task: T286226
[13:36:20] T286226: Upgrade db1104 firmware - https://phabricator.wikimedia.org/T286226
[13:45:55] I assume these hosts must have been PXE-bootable in the past, I guess it was optimistic to think that meant they would still be PXE-bootable :-/
[13:52:26] Emperor: that's being delusional :D
[13:53:52] * Emperor has made T306988
[13:54:34] I mean, I kind of knew this upgrade task was going to be a can of bizarre yak-worm hybrids, but still
[13:55:53] verily, you are wise
[13:57:19] it's not like they all go out of security support soon or anything...
[14:00:31] kormat: I'll save that comment for my next ITC ;-)
[14:00:55] Emperor: you might want to drop the attribution, if you want it to _help_ you. ;)
[14:01:06] lol
[15:12:24] dbstore1005 is alerting, btullis
[15:13:33] jynus: Thanks. Looking into it now.
[15:14:14] maybe restarting replication was just missed for x1 after restart
[15:18:59] I didn't know that `skip-slave` was the default :-)
[15:20:15] AFAIK it was preferred like that for production (for crashes), but I believe it is configurable