[10:57:50] taavi: Ah of course. I thought it was like the puppet compiler where we used similar-ish data. Thanks for the tip.
[11:59:05] 10netops, 10Infrastructure-Foundations, 10Traffic: Network issues for users in the UK and Ireland - https://phabricator.wikimedia.org/T354065 (10cmooney) p:05Triage→03Low >>! In T354065#9433535, @Sideswipe9th wrote: > Hey. > > I'm the user from Northern Ireland who sent the email cmooney copied above....
[12:20:41] 10netops, 10Infrastructure-Foundations, 10SRE: Cannot enter configuration mode on cr2-drmrs - https://phabricator.wikimedia.org/T354340 (10cmooney) p:05Triage→03Medium
[12:22:24] 10netops, 10Infrastructure-Foundations, 10SRE: Cannot enter configuration mode on cr2-drmrs - https://phabricator.wikimedia.org/T354340 (10cmooney)
[12:53:29] o/ I've been fighting the reimage cookbook and I have Questions
[12:53:55] the problem is that when converting some old HW from MW servers to k8s nodes, the last reboot takes forever and the cookbook times out
[12:55:47] the question is, where are the retry params for the reboot wait set in the cookbook? I've tried going down the rabbit hole of the retry decorator but they're not any of the defaults and I don't see anything overriding them
[14:04:54] 10netops, 10Infrastructure-Foundations, 10SRE, 10Traffic: Test depool of drmrs - https://phabricator.wikimedia.org/T344968 (10ayounsi) 05Open→03Resolved a:03ayounsi Depooled esams for 1h and everything went well.
[14:29:15] kamila_: do you have a hostname I can check the logs for?
[14:30:05] volans: any of mw13[77-83].eqiad.wmnet in theory, but in practice I've reimaged them to insetup in the meantime
[14:30:33] (and then rebooted mw1377 after changing the role back to k8s worker and it's been rebooting for 2.5h and still isn't up...)
[14:30:53] so the first reimage of any of those, ok I can have a look
[14:31:06] thank you, much appreciated
[14:31:44] as for the specific question, when calling wait_reboot_since() the parameters are defined in spicerack here:
[14:31:47] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/spicerack/+/refs/heads/master/spicerack/remote.py#550
[14:33:06] and you can calculate the total time it will wait using the formulas described in the docs for the backoff_mode parameter in https://doc.wikimedia.org/wmflib/master/api/wmflib.decorators.html#wmflib.decorators.retry
[14:33:24] oh, thanks, I was looking at the wrong wait_reboot_since '^^
[14:37:09] I don't know why it takes forever, but if that can't be fixed, would it make sense to add a parameter to the cookbook and use spice.decorators.set_retries() or something to override it?
[14:37:39] that's totally not normal, a reboot takes a few minutes
[14:37:43] clearly something's wrong here
[14:37:47] yeah, okay
[14:38:16] have you tried to log in via mgmt during the wait?
[14:38:39] I get permission denied
[14:38:54] to those mgmt or any mgmt?
[14:39:56] apparently any
[14:41:20] oh, for context, the timeout happens for all 7 of those hosts in eqiad, but does not happen for other hosts in codfw or eqiad that I've reimaged before
[14:41:23] then you have to fix your ssh config or reinstall the wmf laptop package if on linux
[14:41:26] ack
[14:42:29] mw1377's console is currently black, no root login prompt or anything
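For reference on the wait_reboot_since()/retry discussion above: a minimal sketch of how the total wait could be estimated from the decorator parameters. The per-attempt sleep formulas below are an assumption based on a reading of the wmflib backoff_mode documentation linked at 14:33 (verify there before trusting the numbers), and the example values are hypothetical, not the ones spicerack actually sets on wait_reboot_since().

# Rough estimate of how long a @retry-decorated call can wait in total.
# Assumption: the per-attempt sleep formulas approximate what the wmflib
# backoff_mode docs describe; check the linked documentation for the exact ones.
def total_wait(tries: int, delay: float, backoff_mode: str) -> float:
    """Sum of the sleeps between attempts, in seconds (time spent inside each attempt excluded)."""
    total = 0.0
    for attempt in range(1, tries):  # no sleep after the final attempt
        if backoff_mode == "constant":
            sleep = delay
        elif backoff_mode == "linear":
            sleep = delay * attempt
        elif backoff_mode == "power":
            sleep = delay * 2 ** (attempt - 1)
        elif backoff_mode == "exponential":
            sleep = delay ** attempt
        else:
            raise ValueError(f"unknown backoff_mode: {backoff_mode}")
        total += sleep
    return total

# Hypothetical values, not the ones used by wait_reboot_since():
print(total_wait(tries=25, delay=10.0, backoff_mode="linear") / 60, "minutes")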
[14:42:42] 10netops, 10Infrastructure-Foundations, 10SRE: Cannot enter configuration mode on cr2-drmrs - https://phabricator.wikimedia.org/T354340 (10cmooney) 05Open→03Resolved Problem has resolved following device reboot. It looked like killing the mgd processes in "lockf" state was working, but I made an error a...
[14:43:53] if you want I can force another reboot and we can watch it from the console
[14:44:41] I suppose it can't make things much worse... :D
[14:45:32] though, maybe using say mw1378 would be better
[14:45:41] as you want
[14:45:43] it should be in the same state
[14:45:51] let me check
[14:45:56] I'm curious to see if 1377 comes up in a working state eventually
[14:46:31] I can log in to mw1378
[14:46:40] yes, I haven't rebooted that one yet
[14:46:47] but have reimaged to insetup, changed role and run puppet
[14:47:28] so my theory is that if we try to reboot it, it'll get stuck
[14:47:46] because it should be the same as 1377 was before
[14:48:04] at which time did you change the role and run puppet?
[14:48:22] for mw1378
[14:48:25] around 12:30Z I guess?
[14:48:39] don't know exactly, sorry :-/
[14:48:57] actually should be shortly after I merged it
[14:48:58] which was...
[14:48:59] because I see that there was a failed run at 12:46 and then it took another 2 runs with changes before starting to be a noop
[14:49:02] https://puppetboard.wikimedia.org/node/mw1378.eqiad.wmnet
[14:49:36] mhm, 12:46 sounds like that might have been the one I triggered
[14:49:38] hm
[14:50:21] wait, so is the waiting about waiting for puppet to settle?
[14:50:24] that'd be sad
[14:50:36] no I don't think so, console is blank
[14:50:41] right
[14:51:01] do I need to downtime mw1378 or is it already downtimed?
[14:51:08] * volans assuming it's not pooled
[14:52:03] it's not pooled, should be downtimed too
[14:52:11] unless it expired in the meantime
[14:52:13] yeah, it did
[14:56:16] ack, downtimed for 4h
[14:57:11] thanks
[14:58:07] I see it inactive in conftool for kubesvc, can I assume it was already removed from mediawiki pools?
[14:58:34] it's inactive there
[14:58:42] ok
[14:58:46] rebooting it then
[15:00:21] rebooting
[15:03:26] [57118.905101] watchdog: watchdog0: watchdog did not stop!
[15:03:36] then it goes through the bios reboot
[15:03:41] and then
[15:03:47] UEFI0082: The system was reset due to a timeout from the watchdog timer.
[15:03:50] Check the System Event Log (SEL) or crash dumps from Operating System to
[15:03:53] identify the source that triggered the watchdog timer reset. Update the
[15:03:56] firmware or driver for the identified device.
[15:04:11] and then it presents a menu
[15:04:12] Available Actions:
[15:04:13] F1 to Continue and Retry Boot Order
[15:04:14] ...
[15:04:28] trying F1
[15:04:59] it's booting now
[15:05:13] reboot done
[15:05:19] host accessible
[15:05:25] kamila_: ^^^
[15:05:38] huh
[15:05:42] weird
[15:05:55] I have no idea what's happening
[15:05:58] thank you
[15:06:06] if the hardware is not too old we can try to run the provision cookbook to see if they have some wrong settings at the bios/idrac level
[15:06:17] I guess
[15:06:28] it should be PowerEdge R440
[15:06:36] from 2019-09
[15:06:41] I'll try a new reboot first
[15:06:53] to see if it's one time or always
[15:10:19] it's reproducible, trying the cookbook in a minute
[15:10:31] thanks a lot <3
[15:10:33] (wtf)
[15:11:28] also.. I've a question for you, are k8s hosts provisioned with virtualization enabled or not?
[15:11:53] because that's set by the provision cookbook and mw hosts have that disabled ofc
[15:12:50] I don't think it's needed but just double checking
[15:13:44] good question
[15:13:50] I have no idea
[15:14:42] could you verify it maybe with your team?
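On the virtualization question: one way to see what a host's BIOS currently has, independent of the provision cookbook, is to ask the iDRAC over Redfish. This is a minimal sketch only: the endpoint path, the ProcVirtualization attribute name and the credentials are assumptions based on typical Dell Redfish layouts, and the cookbooks themselves use spicerack's own Redfish support rather than raw requests.

# Minimal sketch (not the provision cookbook): read the BIOS virtualization
# setting from a Dell iDRAC via Redfish. Endpoint and attribute name are
# assumptions; credentials below are placeholders.
import requests

MGMT_HOST = "mw1378.mgmt.eqiad.wmnet"  # hypothetical target mgmt interface

# Dell iDRACs commonly expose the current BIOS settings under this resource;
# self-signed certs are usual on the mgmt network, hence verify=False.
resp = requests.get(
    f"https://{MGMT_HOST}/redfish/v1/Systems/System.Embedded.1/Bios",
    auth=("root", "REDACTED"),
    verify=False,
    timeout=30,
)
resp.raise_for_status()
attributes = resp.json().get("Attributes", {})
print("ProcVirtualization:", attributes.get("ProcVirtualization"))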
[15:16:55] * kamila_ asked
[15:17:11] thx, running it without for now, we can re-run if needed
[15:17:46] doh, found an unexpected bug for cumin1002, running from 2002
[15:18:25] moritzm: we need to fix one thing for the provision cookbook, it tests the mgmt password trying to connect to the same host you run the cookbook from
[15:18:35] and 1002 is a VM, so no redfish... :D
[15:18:59] I need to check if other cookbooks have the same logic
[15:19:27] ah, good catch!
[15:20:26] volans: should be fine without virtualization, unless I'm wrong
[15:20:39] ack
[15:21:01] it doesn't seem to set any value that should affect the issue we're having, but let's see
[15:21:58] something that shouldn't be relevant but who knows is that the k8s nodes have a different partitioning setup
[15:22:39] https://gerrit.wikimedia.org/g/operations/puppet/+/refs/changes/45/984645/2/modules/profile/data/profile/installserver/preseed.yaml#349
[15:23:20] I don't think I screwed that up because the other nodes boot
[15:23:43] just mentioning because it's all weird
[15:23:45] it boots after pressing F1 so yeah that shouldn't be the problem
[15:23:48] right
[15:29:44] ok cookbook finished, retrying a reboot although I don't think it changed anything, next step would be upgrading firmware
[15:30:14] hm
[15:30:25] this is booting from disk, not network, right?
[15:30:30] yes
[15:30:40] there is also the fact that the k8s nodes have a different network setup, but then that should be irrelevant too
[15:30:52] indeed
[15:31:13] I can put the netbox bgp flag back to normal just to check, if you want
[15:31:22] but yeah, shouldn't matter
[15:31:24] nah it's way before network
[15:31:26] ok
[15:31:29] at bios time
[15:31:29] nvm then
[15:34:09] from googling it might be an issue with an older bios, I'll try to run the firmware upgrade cookbook for the bios on mw1378
[15:34:17] ok, thanks, sounds good
[15:37:52] soooo... are all the mw servers in that batch (including the ones I didn't reimage) time bombs now? '^^
[15:38:10] just asking :D
[15:38:24] possibly
[15:38:37] :D
[15:41:05] kamila_: you can check the bios and idrac versions also in puppetdb or via facter with cumin
[15:41:15] ok, on it, thanks
[15:41:38] bios_version and firmware_idrac
[15:41:42] facts
[15:42:54] kamila_: in general for this kind of repurpose (and on reimages even more in general) I think it would be good to do firmware upgrades. But also please get in touch with dcops to check what they suggest.
[15:43:30] right...
[15:44:09] BIOS update was successful.
[15:45:21] cool
[15:45:40] thanks
[15:47:24] trying a new reboot, not sure if we need a cold boot though
[15:47:29] testing a normal reboot first
[15:49:17] thank you
[15:50:16] mmh no luck and facter reports the old bios version still... trying a cold one
[15:50:23] mhm
[15:52:16] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: cr1-esams:fpc0 errors - https://phabricator.wikimedia.org/T346779 (10ayounsi) a:03ayounsi Error logs stopped showing up after the linecard reboot. Monitoring it for a bit before closing the task.
[15:57:31] testing reboot after a cold power on (that worked fine)
[15:59:20] ok, so it was that?
[15:59:53] :( no luck, same issue, let me try to upgrade idrac too while at it and then I'll dig a bit more on the logs
[16:00:06] eh :-/
[16:00:18] in that case I won't reboot a random mw server just yet :D
[16:00:44] thank you for looking into it <3
[16:01:15] and lmk if I can help with anything, I'm clueless but eager (run!)
:D
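A sketch of the bios_version/firmware_idrac check suggested at 15:41, going through PuppetDB's v4 facts endpoint rather than cumin. The PuppetDB URL is a placeholder and everything except the fact names from the log is an assumption; in practice something like sudo cumin 'mw13[77-83].eqiad.wmnet' 'facter -p bios_version' would answer the same question.

# Minimal sketch, not an existing tool: pull two hardware facts for the
# mw1377-83 hosts from PuppetDB. URL below is a placeholder.
import json
import requests

PUPPETDB = "https://puppetdb.example.wmnet:8443"  # hypothetical endpoint

# AST query matching the certnames of the seven hosts being repurposed.
query = json.dumps(["~", "certname", "^mw13(7[7-9]|8[0-3])\\.eqiad\\.wmnet$"])

for fact in ("bios_version", "firmware_idrac"):
    resp = requests.get(
        f"{PUPPETDB}/pdb/query/v4/facts/{fact}",
        params={"query": query},
        timeout=30,
    )
    resp.raise_for_status()
    for row in sorted(resp.json(), key=lambda r: r["certname"]):
        print(f"{row['certname']}  {fact}={row['value']}")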
[16:02:58] idrac is too old for the automatic update, needs a manual update
[16:03:35] ugh :-/
[16:04:04] have you fixed your mgmt access in the meanwhile? It's better to have it working before doing any reboot/reimage in general ;)
[16:05:44] it worked once...
[16:06:22] and then I edited ssh config and got distracted :D
[16:06:26] oops :D
[16:06:32] lol
[16:30:25] FYI I did a quick dirty hack to have the upgrade firmware cookbook work in this case and I'm upgrading idrac now, let's see if the combo helps
[16:31:23] thanks for letting me know
[16:33:54] but starting to run out of ideas... one possible next step could be to give it to dcops and see if they have suggestions
[16:39:10] ok idrac updated, last test
[16:40:22] shut down the host from the OS, run racadm serveraction hardreset and let's see what happens
[16:42:20] host up, let's see if a final reboot does work or not
[16:42:40] * kamila_ crosses fingers
[16:42:51] weird that the bios fact still has the old value, I can't explain that
[16:43:01] I saw the new version in the console output
[16:43:01] huh
[16:43:15] * volans finger crossed
[16:43:35] volans: what does dmidecode show, same thing for bios version?
[16:43:37] just how flexible is your finger?!
[16:44:07] lol, apparently very! :D
[16:44:34] jhathaway: I'll check as soon as the host comes back, I didn't dig that bit yet, E_TOO_MANY_BRANCHES in this rabbit hole
[16:44:41] :)
[16:44:45] but from the console BIOS Version: 2.17.1
[16:44:51] that's clearly correct and the new one
[16:45:01] and yet the reboot gets stuck
[16:45:11] unless I press F1 sad_trombone.wav
[16:45:13] eh :-/
[16:46:15] jhathaway: dmidecode is correct
[16:46:28] so maybe so bogus facter caching?
[16:46:29] facter's bios_version is wrong
[16:46:30] *some
[16:46:33] could be
[16:46:48] maybe we do have a longer cache on those to avoid hitting the BMC all the time
[16:46:51] and it would also make sense
[16:46:56] so yeah it's probably that
[16:47:20] still surprising after a reboot, sounds buggy
[16:47:35] ok
[16:47:43] I'll leave that bit to you :D
[16:47:47] :)
[16:48:23] fwiw dmi.bios has the same values as the legacy names
[16:48:35] I didn't check how that's populated
[16:49:07] wait... I kinda forgot that the machines were fine when I reimaged into the insetup role
[16:49:31] even after they got stuck after my first reimage attempt
[16:49:43] kamila_: I'm sorry I'm out of ideas for today, mw1378 has a recent idrac and bios (not the latest, because I picked from the list of already installed versions to avoid bleeding-edge bugs)
[16:49:55] ack
[16:49:58] it's consistent though at every reboot
[16:50:08] and the message is always the same one I reported earlier here
[16:50:17] thank you <3
[16:50:31] I'll go poke at it some more
[16:50:54] I'd ask dcops to have a look and also investigate the reported UEFI error some more
[16:51:12] lmk if you have enough info to update the task or need more
[16:51:47] ok, but I'll first check that all 7 machines are the same and also see what happens if I reimage with a different role...
[16:52:00] thanks a lot, really <3
[16:52:00] ok
[16:52:13] what's the task?
[16:53:58] I assumed T351074 :)
[16:53:58] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[16:54:17] oh, right, I thought you'd made a new one, sorry
[16:55:39] no, as it wasn't clear what the issue was, sorry
[16:56:15] np, I was just confused
[16:56:27] (and it's still not clear :-D)
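On the stale bios_version fact vs dmidecode discussed around 16:43-16:48: a quick way to compare the two on the host itself (needs root). A minimal sketch using only the commands mentioned in the log; whether the mismatch really comes from facter caching is the open question above.

# Compare what the SMBIOS tables report with what facter returns, to spot a
# stale or cached fact like the bios_version mismatch seen on mw1378.
import subprocess


def run(cmd):
    """Run a command and return its trimmed stdout."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout.strip()


dmi_version = run(["dmidecode", "-s", "bios-version"])
fact_version = run(["facter", "-p", "bios_version"])

print(f"dmidecode: {dmi_version}")
print(f"facter:    {fact_version}")
print("match" if dmi_version == fact_version else "MISMATCH (stale/cached fact?)")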