[08:20:24] I'm trying to bring ms-be2045 back into service, and puppet is failing ( https://phabricator.wikimedia.org/P17432 ). This is because the swift filesystems were wiped (wipefs -a IIRC) when we thought the h/w was doomed. Now I could recreate the filesystems by hand, but I'm trying to work out why puppet is failing. AFAICS, swift::init_device should make the filesystem "unless => "xfs_admin -l ${dev}"". But if I do "sudo xfs_admin -l /dev/sda3" then I get an error message and an RC of 1. I guess I'm asking _should_ this work; if not, should I be re-mkfsing myself, or is there something else I should be doing here?
[08:20:36] hopefully godog knows the answer :)
[08:21:46] (I see why Swift::Label_filesystem is failing, I'm not so clear why swift::init_device isn't making a new fs on these devices)
[08:23:41] Emperor: yes it should work, i.e. puppet will recreate partitions and filesystems when they are missing
[08:23:48] I'm taking a look too
[08:24:40] init_device isn't making a new fs I think because xfs_admin -l fails
[08:26:35] Emperor: an alternative route would be to reimage the host and that's it, probably simpler at this point
[08:30:10] I'm misunderstanding, then - I thought the unless clause meant that it would try to mkfs when xfs_admin returns non-zero?
[08:33:15] ah yeah nevermind I misunderstood too
[08:34:02] godog: reimage would be 'sudo -E wmf-auto-reimage --no-verify -p T290881 ms-be2045.codfw.wmnet' from cumin2002 ? [sorry, still a bit new, want to check before destroying things]
[08:34:03] T290881: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881
[08:35:09] Emperor: it used to be that, yes; my understanding is that it changed recently and now reimaging is a cookbook (never tried to run it yet though)
[08:35:16] I'm looking at https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage
[08:36:49] godog: I think volan.s has said the same commands should work, just replacing wmf-auto-reimage with the cookbook
[08:37:04] re: mkfs and labels, the 'gotcha' there is that sda/sdb are formatted by d-i, therefore puppet leaves those alone as far as formatting goes, though it's debatable whether that is the right thing to do
[08:37:18] RhinosF1: ack thanks! that's useful
[08:37:39] godog: I would double check what the docs say but I think I read that
[08:37:39] but yes, one more reason to just reimage the host
[08:37:44] Sigh, here was me thinking I had useful experience of doing a reimage before :)
[08:38:08] You don't need .codfw.wmnet though
[08:38:39] hehe
[08:38:40] should I be running from a cumin server still?
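A note on the guard discussed above (08:20-08:30): Puppet's "unless" parameter runs the guard command and applies the exec only when that command exits non-zero, so a wiped device where xfs_admin -l fails should get a fresh filesystem. The shell sketch below illustrates that behaviour under stated assumptions; the device path and mkfs options are illustrative, not taken from the actual swift::init_device manifest in operations/puppet.

    # Sketch only: how a Puppet "unless" guard maps to shell exit codes.
    dev=/dev/sda3    # example device from the conversation
    if ! xfs_admin -l "$dev" >/dev/null 2>&1; then   # guard fails on a wiped device
        mkfs.xfs -f "$dev"                           # so the filesystem is recreated
    fi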
[08:39:02] 'sudo cookbook sre.hosts.reimage --no-verify -p T290881 ms-be2045'
[08:39:06] Emperor: believe so
[08:39:17] That's my guess of the new command from docs and being nosey
[08:39:19] Emperor: yes that's my understanding
[08:39:40] I usually use cumin1001 FWIW
[08:39:43] Because that server will already have DHCP entry
[08:39:49] there's a mandatory --os option, which is stretch
[08:41:15] Emperor: so if you want to use the new way, not yet fully streamlined, it's sre.experimental.reimage
[08:41:21] --os is not yet mandatory
[08:41:28] as your host has the hardcoded DHCP
[08:41:31] already in puppet
[08:42:11] we're literally in the migration days, so both old and new methods work ;)
[08:42:27] and if you use cumin2002 it's slightly quicker
[08:42:30] being same DC
[08:42:41] but it's absolutely the same between cumin1001 and cumin2002
[08:42:47] volans: I was just working my way through the "huh, sre.hosts.reimage doesn't exist, wait there's sre.experimental.reimage" process, so thanks :)
[08:43:33] mmh I think I confused things more than helped, apologies Emperor
[08:43:34] I'm writing the wikitech docs as we speak and planning to send it over today via mailing lists and flip the switch on Monday at this point
[08:45:49] there is no more --no-verify option as the cookbook is smart enough to detect that
[08:46:01] So I think "sudo cookbook sre.experimental.reimage --os stretch -t T290881 ms-be2045" on cumin2002 [does it start a screen for me?] - look ok volans / godog?
[08:46:02] T290881: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881
[08:46:37] Emperor: you have to start your own tmux/screen, and no need for --os, it will be a noop in your case
[08:47:12] I'd like you to run it; it should fail early, telling you that the host has disappeared from puppetdb and that you should use --new, but it would be nice if you could test this scenario
[08:47:45] (and if you don't open a tmux/screen it will fail even earlier, telling you that btw)
[08:48:04] OK, let's see...
[08:50:42] rebooted via IPMI, waiting for it to come up
[08:54:08] [:tumbleweed:]
[08:54:30] * volans drum rolling
[08:54:43] Emperor: did the check for --new work?
[08:54:44] 27/120 retries...
[08:54:58] volans: it didn't require --new
[08:55:45] (perhaps because I'd re-enabled puppet on this system earlier?)
[08:56:50] disable-puppet succeeds although puppet ca --disable deletes nothing
[08:56:54] ahhh yes
[08:57:07] if you re-enabled puppet it got re-added to puppetdb, so it was no longer an orphan
[08:57:27] thought that might be it :)
[09:40:13] godog: ms-be2045 is up; not all the swift filesystems are empty (e.g. sdc1 is 74% full), and sdl1 is sad (mount fails with "Structure needs cleaning", and xfs_repair /dev/sdl1 says "you ought to mount the FS to replay the log before running this")...
[09:41:52] oh, and now it's gone down (crashed?)
[09:44:27] siiiigh
[09:45:11] re: the filesystems, I'm guessing at the time the drives were unreachable and didn't get wiped
[09:45:36] it is task-reopen time, looks like the host crashed indeed
[09:46:22] whew, host is one month away from the 3y warranty
[09:46:29] racadm getsel has nothing in this time
[09:47:08] let's see if it comes back up
[09:47:25] (but yes, inclined to think reopen task, make papaul happy...)
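For the "Structure needs cleaning" / dirty-log situation mentioned at 09:40, the usual XFS recovery sequence looks roughly like the sketch below. This is a general outline with a hypothetical mount point, not a record of what was actually run on ms-be2045.

    # Mounting an XFS filesystem replays its journal; xfs_repair wants a clean log.
    dev=/dev/sdl1
    mnt=/srv/swift-storage/sdl1           # hypothetical mount point
    mount "$dev" "$mnt" && umount "$mnt"  # mount/unmount to replay the log
    xfs_repair "$dev"                     # then repair the now-clean filesystem
    # If the mount itself fails, xfs_repair -L zeroes the log first,
    # at the cost of losing whatever metadata was still in it.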
[09:48:35] hehe
[09:49:35] Hm, have a login prompt on console com2 but it's not pingable
[09:50:02] and errors from the drives
[09:50:07] not drives, net devices
[09:51:08] OK, on as root from console, all network interfaces down
[09:51:20] so it didn't crash, but all its networks went away
[09:51:26] (uptime 32 minutes)
[09:52:20] :(
[09:52:36] let me see if I can extract this kernel barf-o-gram
[09:58:12] godog: https://phabricator.wikimedia.org/P17433 (and it's also now unhappy about the XFS on sdc1 too)
[09:58:57] tempted to try rebooting and see if the network comes back? But this looks like a h/w problem to me?
[09:59:50] I concur, very much looks like busted hw
[10:00:01] sure, feel free to reboot/wipe at will
[10:04:58] if I'm passing T290881 back to papaul (which is now done), should I do anything else to the system or just leave it as-is? I know you wiped the non-os drives and powered off last time...
[10:04:59] T290881: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881
[10:05:56] (but given it never made it back into service, perhaps that is unnecessary?)
[10:08:26] yeah, with the host out of the swift rings we can leave it as-is
[10:08:36] cool.
[16:08:27] mut :(
[16:08:51] 😭
[16:10:28] mutante: 🎉 🍾 🎆
[16:10:36] :)
[19:40:15] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 57.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[19:49:01] PROBLEM - MariaDB sustained replica lag on m1 on db2078 is CRITICAL: 283 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[19:52:47] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[19:55:19] RECOVERY - MariaDB sustained replica lag on m1 on db2078 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321