[09:31:45] apergos: hey, is it ok if I remove candrew from the ops-dums list? it's bouncing emails with account not found
[09:34:56] sure
[09:35:23] who's on there from wmcs? we ought to make sure there's someone
[09:35:33] dcaro:
[09:37:29] apergos: there's the whole team xd
[09:37:37] ok then :-D
[09:39:18] 👍 done
[14:23:00] klausman: you haven't puppet-merged your patch. On it?
[14:23:22] in https://gerrit.wikimedia.org/r/labs/private
[14:23:23] I was about to and then you showed up :) Just merge it along with yours :)
[14:23:53] awesome. Done now :D
[14:26:07] Merci!
[16:04:38] volans, we're in a maintenance window and tripping over T304434; iirc the workaround is to run with --no-pxe and hope for the best the second time around?
[16:04:38] T304434: reimage cookbook failure due to ipmi settings - https://phabricator.wikimedia.org/T304434
[16:05:56] I'm trying that and now it seems stuck on 'Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb.<locals>.poll_puppetdb' raised: Nagios_host resource with title cloudnet1003 not found yet'
[16:13:40] andrewbogott: sorry, in a meeting, will look in a bit
[16:13:56] np, I think I can work around it for the moment
[16:30:13] andrewbogott: back, sorry
[16:30:38] so what's the problem?
[16:32:44] the 0004000000 flag is the force PXE boot
[16:34:47] volans: ok, but when that host reboots it doesn't pxe boot
[16:34:52] possible the flags are different for hp?
[16:35:49] it should be vendor-agnostic as it's IPMI, I can check though
[16:35:58] did you solve the immediate problem or are you currently blocked?
[16:36:50] andrewbogott: cannot help a lot, but in the past I had some hw complain while doing the right thing
[16:37:29] I'm not currently blocked although we're on the verge of failing over network service to that host at which point we won't be able to test anymore
[16:37:31] hm...
[16:37:37] Let me reboot again and see if it's trying to pxe and failing
[16:37:53] the last reimage seems to have passed
[16:38:10] so I guess it should be ok now, the reimage already does a reboot after the first puppet run
[16:38:38] yeah, but I want to make sure it's not trying pxe and then falling through due to no dhcp
[16:38:40] * andrewbogott doing that now
[16:38:44] k
[16:40:21] nope, went straight to grub
[16:40:54] so, I'm going to move ahead with our maintenance. volans you can probably check the ipmi settings on that server (cloudnet1003.eqiad.wmnet) to see what's going on but please don't reboot it :)
[16:41:06] ack
[17:12:06] volans: which host are you talking about?
[17:12:37] cloudnet1004
[17:13:20] ok. So you think there's a real bug here where the bit isn't unset?
[17:13:36] IMO it nevertheless doesn't try to pxe boot when in this state
[17:14:05] it might be a different behaviour from HP, that doesn't automatically unset it on reboot, I'm checking
[17:14:22] (as evidenced by the reimage working properly so far, it seems to be doing puppet things)
[17:15:45] andrewbogott: I've manually unset it so to not let your reimage fail
[17:15:52] will investigate a bit more around
[17:15:55] thanks :)
[17:31:27] andrewbogott: sorry, got some issues with my bouncer, I don't have any reply you might have sent to me
[17:31:53] I didn't say anything other than 'thanks'
[17:32:22] still waiting on initial puppet run
[17:32:33] ack :)
[17:47:42] volans: reimage is finished, all good.
[17:47:51] perfect
[17:48:07] That doesn't leave you with a good test case for T304434 but I'm happy :)
[17:48:08] T304434: reimage cookbook failure due to ipmi settings - https://phabricator.wikimedia.org/T304434
[17:48:41] yeah but I have a theory and I can test it on any HP host, so I'll find one less critical for that
[17:50:22] ok
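
Note on the 0004000000 value mentioned at 16:32:44: a minimal sketch of how such a five-byte boot-flags string could be decoded, assuming it follows the IPMI 2.0 "boot flags" layout (system boot options parameter 5). The function name `decode_boot_flags` and the `BOOT_DEVICES` table are hypothetical, written for illustration only; they are not part of the reimage cookbook, and the table lists only a few selector values.

```python
# Hypothetical helper, not taken from the reimage cookbook.
# Assumes the IPMI 2.0 boot flags layout:
#   byte 1: bit 7 = flags valid, bit 6 = persistent, bit 5 = EFI boot
#   byte 2: bits 5:2 = boot device selector
BOOT_DEVICES = {
    0b0000: "no override",
    0b0001: "force PXE",
    0b0010: "force default hard drive",
    0b0101: "force default CD/DVD",
    0b0110: "force BIOS setup",
}


def decode_boot_flags(hex_data: str) -> dict:
    """Decode a 10-hex-digit boot flags string such as '0004000000'."""
    raw = bytes.fromhex(hex_data)
    if len(raw) != 5:
        raise ValueError("expected exactly 5 bytes of boot flag data")
    data1, data2 = raw[0], raw[1]
    selector = (data2 >> 2) & 0x0F  # bits 5:2 of the second byte
    return {
        "valid": bool(data1 & 0x80),
        "persistent": bool(data1 & 0x40),
        "efi": bool(data1 & 0x20),
        "device": BOOT_DEVICES.get(selector, f"other (0b{selector:04b})"),
    }


if __name__ == "__main__":
    print(decode_boot_flags("0004000000"))
```

Under these assumptions the value from the log decodes to a "force PXE" device selector with the "valid" bit cleared. Whether that accounts for the host going straight to grub is not established in the conversation; it is only what this decoding shows.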