[08:40:52] I have a host which, as soon as it finishes the Debian installer, reboots and goes into the Debian installer again... has anyone seen this before?
[08:55:15] haven't seen that so far, no. is that a new model or a machine type we already use?
[08:56:36] No, it is a normal host, I've reimaged plenty of them
[08:56:38] It is a dbproxy
[09:00:01] It sounds like some issue/failure in setting (or resetting) the boot order via IPMI? the reimage cookbook configures a one time selection of PXE and it sounds like in this case the "only select PXE once" aspect fails
[09:00:27] it's probably best if Riccardo has a look when he's around
[09:06:34] moritzm: Yeah, I will wait. It is interesting cause when I hit abort after "Unable to verify that the host rebooted into the new OS, it might still be in the Debian installer, please verify manually with: sudo install-console dbproxy1022.eqiad.wmnet", then the installer finishes and the host reboots into the OS
[09:07:08] I could probably complete the steps manually after I get to the OS via install-console, but I'd rather get this figured out first
[09:09:56] I am going to manually remove the pxe override and see what happens once this debian installer run finishes
[09:12:00] marostegui: you can try this manually when D-I is running to narrow down the issue: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Changing_BIOS_Boot_Order
[09:12:10] so when it reboots it goes to the disk
[09:13:52] Yeah I just did it via IPMI tool
[09:14:04] thanks XioNoX
[09:21:45] I just logged in via the install console and can confirm it's currently still in d-i, installing base packages ATM
[09:22:57] yeah
[09:23:06] And it looks like forcing an override doesn't work
[09:23:18] I still get Boot parameter data: 8000020000
[09:23:40] After executing the "bootdev none", which returns correctly
[09:23:44] Password:
[09:23:45] Set Boot Device to none
[09:23:45] XD
[09:24:44] cdanis:
re: pontoon bootstrap, it'll need some work to run on bookworm/puppetserver, I've filed https://phabricator.wikimedia.org/T352640
[09:26:28] Who owns reviewer-bot? It just added _ joe_ and j.bond as reviewers to https://gerrit.wikimedia.org/r/c/operations/puppet/+/979893 and probably should no longer be adding john :-/
[09:26:34] * volans reading backlog marostegui
[09:27:50] Emperor: that's based on https://www.mediawiki.org/wiki/Git/Reviewers#operations/puppet
[09:28:03] it's opt-in and people manage their own
[09:29:11] volans: I think we've identified that the PXE settings are the ones creating issues
[09:29:34] volans: OK, well if he still wants spamming with my rubbish puppet changes, he's welcome to keep getting the emails :)
[09:30:44] marostegui: the cookbook does check that the bios params are back to normal and should log if not, IIRC
[09:30:47] checking
[09:31:19] volans: it does
[09:31:22] volans: but they aren't
[09:31:27] from what I can see with a get 5 manually
[09:31:30] mmmh
[09:31:30] 2023-12-04 08:55:17,950 marostegui 444718 [INFO] Checked BIOS boot parameters are back to normal
[09:31:36] yeah but they aren't
[09:31:42] what's the value?
[09:31:47] Boot parameter data: 8000020000
[09:32:02] According to wikitech: In case of overrides the Boot parameter data bitmask will be different from 0000000000 and the line below will show the overridden values.
[09:32:18] that's not entirely accurate
[09:33:13] marostegui: see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/spicerack/+/refs/heads/master/spicerack/ipmi.py#18
[09:33:22] those are unrelated bits
[09:33:41] Then the question is back to...why is it pxebooting all the time?
[09:33:45] if it shouldn't?
[09:34:10] that's because the hardware is not "getting" what IPMI is setting apparently, it happened to me a few times with particularly old hardware
[09:34:30] volans: Should I try an ipmi restart?
[09:34:37] this is a new host, weird
[09:34:49] I'd try to reset the ipmi yeah
[09:34:53] and see if that helps
[09:34:58] ok, let me do that
[09:39:32] Boot parameter data: 0000000000
[09:39:55] at least something changed :D
[09:40:00] let's see
[09:40:12] yeah, still finishing the installer
[09:40:16] let's see on reboot
[09:40:33] is the reimage cookbook still ongoing or did it timeout?
[09:40:50] it timed out, but I will hit "go" once it is on the OS login
[09:41:29] +1
[09:41:31] thx
[09:45:30] nope
[09:45:34] It went again into the installer
[09:45:37] and Boot parameter data: 8000020000
[09:45:39] wtf?
[09:46:41] let me check things from another angle
[09:47:00] so I bet that the host has the fixed value to boot into pxe
[09:47:10] the one we set via ipmi is just the override one
[09:47:23] marostegui: can I try one thing?
[09:47:41] volans: you can try anything you want
[09:47:58] * volans trying the provision cookbook to see if it would want to change any value
[09:48:04] and then checking via redfish the other params
[09:48:40] marostegui: dbproxy1022 right?
[09:48:45] (double checking)
[09:48:53] yep
[09:50:16] Updated value for attribute BIOS.Setup.1-1 -> SetBootOrderEn: NIC.Embedded.1-1-1,HardDisk.List.1-1 => HardDisk.List.1-1,NIC.Embedded.1-1-1
[09:50:27] Updated value for attribute BIOS.Setup.1-1 -> BiosBootSeq (marked Set On Import to True): NIC.Embedded.1-1-1, HardDisk.List.1-1 => HardDisk.List.1-1, NIC.Embedded.1-1-1
[09:51:02] the host was not properly set up AFAICT or it was changed afterwards or a firmware upgrade changed some setting...
[09:51:10] I'll let you know once it's done
[09:51:19] I have the installer opened too
[09:51:32] volans: I wonder if some other dbproxy will have the same issue (so far it is the only one)
[09:51:36] the host might or might not get rebooted (depends on the idrac)
[09:51:40] The only one I have seen though
[09:52:51] we could check the logs for their runs when provisioned the first time or re-check them if you want, it would not take too long to do
[09:53:02] volans: nah, it is okay
[09:57:16] volans: no pxe this time \o/
[09:57:44] volans: would it be worth it to add how you fixed this to the wikitech doc?
[09:57:44] the cookbook is still finishing, not sure if idrac has rebooted the host mid-reimage though
[09:58:18] volans: i don't think so, I had it opened
[09:58:19] sorry I meant mid-d-i
[09:58:29] ok cookbook finished
[09:58:37] I just ran
[09:58:37] sudo cookbook sre.hosts.provision --no-dhcp --no-users --no-switch dbproxy1022
[09:58:43] I just hit "go"
[09:58:53] volans: Ah ok! Thanks
[09:58:56] and that can be run anytime, as long as you are ok with the host being rebooted
[09:59:14] * marostegui adds it to his notes
[09:59:17] thanks volans
[09:59:21] the only bit that is use-dependent (if you add it to the docs) is the --enable-virtualization flag
[09:59:37] that we enable only where needed
[10:00:36] yeah, I am going to add it to the docs
[10:00:40] Just in case
[10:00:49] ack, thx
[10:00:54] thank you for the help
[10:02:50] anytime :)
[14:30:13] godog: ah thanks, was wondering if it was related to that, but the error message was very unhelpful :)
[14:33:06] yeah quite obscure alright, nothing I've seen before and after some poking I guessed it must be that, I still have a pontoon puppet buster around and things work just fine there
[16:10:32] Is this something worth worrying about:
[16:10:33] [04449ea4-3e03-4d65-a853-81b4e8b4beae] Caught exception of type Wikimedia\Rdbms\DBQueryError
[16:10:51] happened when I tried to save an edit on enwiki
[16:11:30] roy649, https://phabricator.wikimedia.org/T352628
[16:12:22] thanks
[19:11:25] I'm setting up some WDQS benchmarks for https://phabricator.wikimedia.org/T336443 . Was looking at https://wikitech.wikimedia.org/wiki/Kafka/Kafka-main-raid-performance-testing-2019 for I/O testing, is this still best practice or are there other suggestions?
[21:50:46] anyone seen this? "Error: The CRL issued by 'CN=Wikimedia_Internal_Root_CA,OU=Cloud Services,O=Wikimedia Foundation\, Inc,L=San Francisco,ST=California,C=US' has expired, verify time is synchronized" (when running puppet agent?)
[21:51:17] (and the time is correct)
[21:53:01] urandom I wanna say that error really means a cert/key mismatch
[21:53:48] oh. well that would be an unfortunately worded error message then.
[21:55:36] ooooh, would this be the result of an unplanned transition to Puppet 7?
[21:55:51] yeah, that could be it too
[21:56:00] I mean, this host is being stood up new and seems to have defaulted to 7
[21:56:13] but it started when I applied the role.
[21:56:18] is a downgrade possible?
[21:57:10] that I dunno...let me remember what happened, because I def saw that before
[22:01:43] OK, here's how I got that error message, it was after reimaging some hosts a couple weeks back. It does seem puppet 7 related https://phabricator.wikimedia.org/T351354#9340944
[22:02:09] I didn't put the exact error but I'm 99% sure it's the same one you're seeing
[22:05:13] if you do decide to stick with puppet 7, you have to use the --new flag with the cookbook and add some hieradata a la https://gerrit.wikimedia.org/r/c/operations/puppet/+/975824/1/hieradata/hosts/cloudelastic1010.yaml
[22:05:27] FWiW Puppet 7 seems to work fine for us
[22:06:43] would a reimage be necessary?
it was imaged with 7 I think
[22:07:28] it was puppet 7 and puppet runs were clean before I moved out of insetup
[22:11:03] yeah, moritz.m canary-ed one other host in this cluster (without reimaging), and it's a hieradata change like the above
[22:15:53] inflatador: could you sanity-check https://gerrit.wikimedia.org/r/c/operations/puppet/+/980049?
[23:06:13] urandom: the change looks like what we did for other hosts, +1ed
[23:06:34] urandom: regardless, you can also find "fix forward" and "roll back" instructions on https://phabricator.wikimedia.org/T349619
[23:06:47] if you wanted to go another route
[23:07:11] if the host is new and was already 7 it is likely because your "insetup" role is already converted to puppet7 on a role level
[23:08:50] ultimately you would want to convert your entire prod role and remove the hiera stuff at host level. same Hiera keys but in hieradata/role/common/ instead of hieradata/hosts/
[23:09:53] the cookbooks for this are 'sre.puppet.migrate-host' and 'sre.puppet.migrate-role'. so you use those and don't have to do a full reimage
[23:42:21] urandom ACK, just +1'd
[23:50:57] mutante: perfect; thanks!
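
Note on the boot parameter check from the 09:31 exchange: a minimal sketch of how the "Boot parameter data" value could be read and classified. It assumes, based only on the chat, that both 0000000000 and 8000020000 mean "no boot device override"; the names SAFE_BOOT_PARAMS, parse_boot_param_data and has_boot_override are hypothetical and are not spicerack's actual code.

```python
import re

# Assumption taken from the chat: 0000000000 means no override, and the bits
# set in 8000020000 are unrelated to the boot device override. This is an
# illustrative list, not a copy of what spicerack's ipmi.py defines.
SAFE_BOOT_PARAMS = frozenset({"0000000000", "8000020000"})

# Matches the "Boot parameter data: <hex>" line quoted in the log.
_BOOT_PARAM_RE = re.compile(r"Boot parameter data:\s*([0-9A-Fa-f]+)")


def parse_boot_param_data(output: str) -> str:
    """Return the hex string from the 'Boot parameter data' line."""
    match = _BOOT_PARAM_RE.search(output)
    if match is None:
        raise ValueError("no 'Boot parameter data' line found")
    return match.group(1).lower()


def has_boot_override(output: str) -> bool:
    """True when the value is not one of the assumed-safe values."""
    return parse_boot_param_data(output) not in SAFE_BOOT_PARAMS


if __name__ == "__main__":
    sample = "Boot parameter data: 8000020000"
    print(parse_boot_param_data(sample), has_boot_override(sample))
```

As the rest of the log shows, a clean result here would not have caught this case: the host kept PXE booting because of its BIOS boot sequence, not because of an IPMI override.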