[07:44:41] <jynus>	 I had to run a bmc reset on a couple of HP dbs to get the management interface up
[07:45:41] <jynus>	 commenting on it because there are a few ms-be mgmt hosts down too since the same time - maybe that will fix it for those too
[08:05:51] <Emperor>	 something to "look forward" to when I come to reimage them, I guess
[08:08:13] <jynus>	 while down, not only was ssh unavailable, but remote ipmi too
[08:11:47] <Emperor>	 where did you do the BMC reset from, then?
[08:12:33] <jynus>	 the actual host
[08:13:14] <jynus>	 it is documented at: https://wikitech.wikimedia.org/wiki/Management_Interfaces#Reset_the_management_card
[08:14:35] <Emperor>	 thanks - I was foolishly looking at https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Hardware_Troubleshooting_Runbook to no avail...
[08:15:07] <jynus>	 that whole page is great, recommend having a general look
[08:15:36] <jynus>	 otherwise it will be difficult to find what can be done when you suffer from it
[09:24:15] <Emperor>	 Just reimaged a host OK, and re-ran wmf-update-known-hosts-production, but my ssh config still has the old wrong hostkey. I thought this was meant to automatically work?
[09:24:30] <Emperor>	 ah, got there now, needed more patience...
[09:26:09] <kormat>	 Emperor: puppet on the puppetmaster regenerates the list of ssh keys, so it can take up to 30mins for it to update. (or you can just `run-puppet-agent` on puppetmaster1001 to get it sooner)
[09:26:35] <volans>	 Emperor: wmf-update-known-hosts-production gets the list from config-master.w.o, so until puppet runs there it's not updated
[09:30:03] <volans>	 that's puppetmaster1001/2001 at the moment fwiw
[09:30:55] <Emperor>	 thanks, that will enable less patience in future ;-)
[09:31:29] <volans>	 we could add a forced puppet run there to the reimage cookbook if that's helpful
[09:33:48] <Emperor>	 I might be unusual, but I usually want to log into the newly-reimaged host almost immediately (to check everything looks OK before repooling) :)
[09:55:45] <volans>	 Emperor: for you, sir :) https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/786894
[10:04:03] <Emperor>	 TY :)
[10:41:27] <Emperor>	 doing the codfw frontends now (again, except the swiftrepl node)
[11:40:38] <volans>	 Emperor: did the patched reimage work as expected?
[11:42:16] <Emperor>	 volans: yes, thanks :)
[11:43:25] <volans>	 glad it helped :)
[13:20:09] <Emperor>	 Hm, this reimage is not going very well. It timed out the first time, and now on retry I am watching on virtual console and it's got to "Probing EDD (edd=off to disable)... ok" and now seems to have stopped
[13:20:23] <Emperor>	 [on ms-be2040]
[13:21:23] <Emperor>	 godog: would you expect backend reimages to Basically Work?
[13:22:30] <kormat>	 marostegui: can i get you to have a look at T306983 and the related CR, pls?
[13:22:31] <stashbot>	 T306983: Reboot pc1012 - https://phabricator.wikimedia.org/T306983
[13:22:47] <kormat>	 marostegui: i figure as the reboot period is so short, there's no real reason i need to wait a few days for pc1014 to populate
[13:22:52] <kormat>	 (correct me if i'm wrong :))
[13:23:00] <godog>	 Emperor: I'd expect so yeah
[13:23:24] <Emperor>	 godog: I might just be impatient, but ms-be2040 reinstall seems to have just hung :(
[13:23:43] <marostegui>	 kormat: I would not wait no
[13:26:56] <godog>	 Emperor: mmhh sth that comes to mind (thinking out loud) is old hardware + firmware and newer kernels, is it hanging pre d-i ?
[13:27:25] <kormat>	 old hardware... could have buggy firmware. we've run into that a bunch of times.
[13:28:00] <godog>	 aye, this host is from early 2018 heh
[13:28:13] <Emperor>	 godog: yes, it's hanging really early in the boot
[13:28:41] <Emperor>	 godog: gets to "Probing EDD (edd=off to disable)... ok" and that's it
[13:29:08] <jynus>	 something like that happened to me - I was unable to PXE boot it, although I don't remember at what point it couldn't load
[13:30:03] <godog>	 I searched 'probing edd' on phab and got back this https://phabricator.wikimedia.org/T260370#6598624 though no mention of what was wrong and/or how to fix it
[13:31:03] <Emperor>	 I'm starting to despair of ever getting swift upgraded :(
[13:31:34] <volans>	 in many cases a firmware upgrade helped, if it's the same issue
[13:31:41] <volans>	 you can ask dcops to do it
[13:32:56] <Emperor>	 I guess I can try that; it's a lot of hosts if they all need firmware upgrades
[13:35:19] <Emperor>	 [is there a specific form of ticket for that, or just a regular phab task? e.g. hardware decom has a particular form to fill out; wikitech search suggests not, but...]
[13:36:19] <kormat>	 Emperor: an example task: T286226
[13:36:20] <stashbot>	 T286226: Upgrade db1104 firmware - https://phabricator.wikimedia.org/T286226
[13:45:55] <Emperor>	 I assume these hosts must have been PXE-bootable in the past, I guess it was optimistic to think that meant they would still be PXE-bootable :-/
[13:52:26] <volans>	 Emperor: that's being delusional :D
[13:53:52] * Emperor has made T306988 
[13:54:34] <Emperor>	 I mean, I kind of knew this upgrade task was going to be a can of bizarre yak-worm hybrids, but still
[13:55:53] <kormat>	 verily, you are wise
[13:57:19] <Emperor>	 it's not like they all go out of security support soon or anything...
[14:00:31] <Emperor>	 kormat: I'll save that comment for my next ITC ;-)
[14:00:55] <kormat>	 Emperor: you might want to drop the attribution, if you want it to _help_ you. ;)
[14:01:06] <Emperor>	 lol
[15:12:24] <jynus>	 dbstore1005 is alerting, btullis
[15:13:33] <btullis>	 jynus: Thanks. Looking into it now.
[15:14:14] <jynus>	 maybe restarting replication was just missed for x1 after restart
[15:18:59] <btullis>	 I didn't know that `skip-slave` was the default :-)
[15:20:15] <jynus>	 AFAIK it was preferred like that for production (for crashes), but I believe it is configurable