[08:20:24] I'm trying to bring ms-be2045 back into service, and puppet is failing ( https://phabricator.wikimedia.org/P17432 ). This is because the swift filesystems were wiped (wipefs -a IIRC) when we thought the h/w was doomed. Now I could recreate the filesystems by hand, but I'm trying to work out why puppet is failing. AFAICS, swift::init_device should make the filesystem "unless => "xfs_admin -l ${dev}"". But if I do "sudo xfs_admin -l /dev/sda3" then I get an error message and an RC of 1. I guess I'm asking _should_ this work; if not, should I be re-mkfsing myself, or is there something else I should be doing here?
[08:20:36] hopefully godog knows the answer :)
[08:21:46] (I see why Swift::Label_filesystem is failing, I'm not so clear why swift::init_device isn't making a new fs on these devices)
[08:23:41] Emperor: yes it should work, i.e. puppet will recreate partitions and filesystems when they are missing
[08:23:48] I'm taking a look too
[08:24:40] init_device isn't making a new fs I think because xfs_admin -l fails
[08:26:35] Emperor: an alternative route would be to reimage the host and that's it, probably simpler at this point
[08:30:10] I'm misunderstanding, then - I thought the unless clause meant that it would try to mkfs when xfs_admin returns non-zero?
[08:33:15] ah yeah nevermind I misunderstood too
[08:34:02] godog: reimage would be 'sudo -E wmf-auto-reimage --no-verify -p T290881 ms-be2045.codfw.wmnet' from cumin2002 ? [sorry, still a bit new, want to check before destroying things]
[08:34:03] T290881: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881
[08:35:09] Emperor: it used to be that, yes; my understanding is that it changed recently and now reimaging is a cookbook (never tried to run it yet though)
[08:35:16] I'm looking at https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage
[08:36:49] godog: I think volan.s has said the same commands should work, just replacing wmf-auto-reimage with the cookbook
[08:37:04] re: mkfs and labels, the 'gotcha' there is that sda/sdb are formatted by d-i, therefore puppet leaves those alone as far as formatting goes, though it's debatable whether that is the right thing to do
[08:37:18] RhinosF1: ack thanks! that's useful
[08:37:39] godog: I would double check what the docs say but I think I read that
[08:37:39] but yes, one more reason to just reimage the host
[08:37:44] Sigh, here was me thinking I had useful experience of doing a reimage before :)
[08:38:08] You don't need .codfw.wmnet though
[08:38:39] hehe
[08:38:40] should I be running from a cumin server still?
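A note on the guard discussed above (08:20-08:30): Puppet's "unless" parameter runs the guard command and applies the exec only when that command exits non-zero, so a wiped device where xfs_admin -l fails should get a fresh filesystem. The shell sketch below illustrates that behaviour under stated assumptions; the device path and mkfs options are illustrative, not taken from the actual swift::init_device manifest in operations/puppet.

    # Sketch only: how a Puppet "unless" guard maps to shell exit codes.
    dev=/dev/sda3    # example device from the conversation
    if ! xfs_admin -l "$dev" >/dev/null 2>&1; then   # guard fails on a wiped device
        mkfs.xfs -f "$dev"                           # so the filesystem is recreated
    fi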
[08:39:02] 'sudo cookbook sre.hosts.reimage --no-verify -p T290881 ms-be2045'
[08:39:06] Emperor: believe so
[08:39:17] That's my guess of the new command from docs and being nosey
[08:39:19] Emperor: yes that's my understanding
[08:39:40] I usually use cumin1001 FWIW
[08:39:43] Because that server will already have DHCP entry
[08:39:49] there's a mandatory --os option, which is stretch
[08:41:15] Emperor: so if you want to use the new way, not yet fully streamlined, it's sre.experimental.reimage
[08:41:21] --os is not yet mandatory
[08:41:28] as your host has the hardcoded DHCP
[08:41:31] already in puppet
[08:42:11] we're literally in the migration days, so both old and new methods work ;)
[08:42:27] and if you use cumin2002 it's slightly quicker
[08:42:30] being same DC
[08:42:41] but it's absolutely the same between cumin1001 and cumin2002
[08:42:47] volans: I was just working my way through the "huh, sre.hosts.reimage doesn't exist, wait there's sre.experimental.reimage" process, so thanks :)
[08:43:33] mmh I think I confused things more than helped, apologies Emperor
[08:43:34] I'm writing the wikitech docs as we speak and planning to send it over today via mailing lists and flip the switch on Monday at this point
[08:45:49] there is no more --no-verify option as the cookbook is smart enough to detect that
[08:46:01] So I think "sudo cookbook sre.experimental.reimage --os stretch -t T290881 ms-be2045" on cumin2002 [does it start a screen for me?] - look ok volans / godog?
[08:46:02] T290881: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881
[08:46:37] Emperor: you have to start your own tmux/screen, and no need for --os, it will be a noop in your case
[08:47:12] I'd like you to run it; it should fail early, telling you that the host has disappeared from puppetdb and that you should use --new, but it would be nice if you could test this scenario
[08:47:45] (and if you don't open a tmux/screen it will fail even earlier, telling you that btw)
[08:48:04] OK, let's see...
[08:50:42] rebooted via IPMI, waiting for it to come up
[08:54:08] [:tumbleweed:]
[08:54:30] * volans drum rolling
[08:54:43] Emperor: did the check for --new work?
[08:54:44] 27/120 retries...
[08:54:58] volans: it didn't require --new
[08:55:45] (perhaps because I'd re-enabled puppet on this system earlier?)
[08:56:50] disable-puppet succeeds although puppet ca --disable deletes nothing
[08:56:54] ahhh yes
[08:57:07] if you re-enabled puppet it got re-added to puppetdb, so it was no longer an orphan
[08:57:27] thought that might be it :)
[09:40:13] godog: ms-be2045 is up; not all the swift filesystems are empty (e.g. sdc1 is 74% full), and sdl1 is sad (mount fails with "Structure needs cleaning", and xfs_repair /dev/sdl1 says "you ought to mount the FS to replay the log before running this")...
[09:41:52] oh, and now it's gone down (crashed?)
[09:44:27] siiiigh
[09:45:11] re: the filesystems, I'm guessing at the time the drives were unreachable and didn't get wiped
[09:45:36] it is task-reopen time, looks like the host crashed indeed
[09:46:22] whew, host is one month away from the 3y warranty
[09:46:29] racadm getsel has nothing in this time
[09:47:08] let's see if it comes back up
[09:47:25] (but yes, inclined to think reopen task, make papaul happy...)
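For the "Structure needs cleaning" / dirty-log situation mentioned at 09:40, the usual XFS recovery sequence looks roughly like the sketch below. This is a general outline with a hypothetical mount point, not a record of what was actually run on ms-be2045.

    # Mounting an XFS filesystem replays its journal; xfs_repair wants a clean log.
    dev=/dev/sdl1
    mnt=/srv/swift-storage/sdl1           # hypothetical mount point
    mount "$dev" "$mnt" && umount "$mnt"  # mount/unmount to replay the log
    xfs_repair "$dev"                     # then repair the now-clean filesystem
    # If the mount itself fails, xfs_repair -L zeroes the log first,
    # at the cost of losing whatever metadata was still in it.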
[09:48:35] hehe
[09:49:35] Hm, have a login prompt on console com2 but it's not pingable
[09:50:02] and errors from the drives
[09:50:07] not drives, net devices
[09:51:08] OK, on as root from console, all network interfaces down
[09:51:20] so it didn't crash, but all its networks went away
[09:51:26] (uptime 32 minutes)
[09:52:20] :(
[09:52:36] let me see if I can extract this kernel barf-o-gram
[09:58:12] godog: https://phabricator.wikimedia.org/P17433 (and it's also now unhappy about the XFS on sdc1 too)
[09:58:57] tempted to try rebooting and see if the network comes back? But this looks like a h/w problem to me?
[09:59:50] I concur, very much looks like busted hw
[10:00:01] sure, feel free to reboot/wipe at will
[10:04:58] if I'm passing T290881 back to papaul (which is now done), should I do anything else to the system or just leave it as-is? I know you wiped the non-os drives and powered off last time...
[10:04:59] T290881: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881
[10:05:56] (but given it never made it back into service, perhaps that is unnecessary?)
[10:08:26] yeah, with the host out of the swift rings we can leave it as-is
[10:08:36] cool.
[16:08:27] mut :(
[16:08:51] 😭
[16:10:28] mutante: 🎉 🍾 🎆
[16:10:36] :)
[19:40:15] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 57.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[19:49:01] PROBLEM - MariaDB sustained replica lag on m1 on db2078 is CRITICAL: 283 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[19:52:47] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[19:55:19] RECOVERY - MariaDB sustained replica lag on m1 on db2078 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321