[04:47:52] https://mariadb.org/documentation-as-pdf/
[07:06:55] marostegui: have you seen https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/787528 ?
[07:35:16] OK, so DC team have kindly upgraded the firmware on ms-be2040, and it still won't PXE-boot. Still fails at exactly the same point (loads debian-installer/amd64/linux, loads debian-installer/amd64/initrd.gz, probes EDD OK).
[07:36:33] Any further suggestions? We have 8 of these Dell PowerEdge R730xd that are due to be upgraded, and if none of them will PXE-boot, I'm a bit screwed. Even assuming the slightly-less ancient kit (we have 5 different hardware specs on the to-be-upgraded pile) will work :-/
[07:42:11] (plus the codfw ms cluster is now down one host since ms-be2040 is a brick)
[07:42:29] Emperor: how sure are you that they don't boot? e.g. can you ping them?
[07:43:32] kormat: I have the shiny HTML5 console to look at
[07:43:43] kormat: I didn't know. That's good, although I normally disable notifications when doing reimages, so the warn or even critical doesn't page
[07:43:55] Emperor: sure. i'm just thinking that maybe the _console_ is broken, but maybe it does actually boot
[07:44:28] kormat: well, also the reimage cookbook is chuntering through Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' 92/120 and will shortly time out
[07:44:31] marostegui: this + the downtime that the cookbook does should be sufficient
[07:44:38] Emperor: ahh. ok.
[07:45:28] Emperor: i'd blame debian, personally. ;)
[07:45:48] I mean I guess I can send it round again with the HTML5 console disconnected against the slim possibility that's confusing the installer, but...
[07:46:48] i've never heard of anyone using the html5 console here, so that's unusual...
[07:47:08] seems like a very long shot, but i guess it's worth trying?
[07:47:50] well, I'll wait the remaining few minutes for the reimage cookbook to time out then give that a go.
I will be _very_ surprised if it helps, but you never know, and frankly I'm out of other ideas.
[07:48:25] we're into the goat-sacrificing step of troubleshooting
[07:50:22] Emperor: something else you could try is reimaging to buster
[07:50:32] that might at least tell you if it's bullseye-specific
[07:51:44] Mmm; again, seems low-likelihood, given how early in the process it's failing. I'm just sending it round on bullseye without any console connected, which'll take a while to fail
[07:51:53] (timeout is 20 minutes)
[07:53:05] [do we have effective smart remote power? If so I could try turning it actually off for a few minutes]
[07:53:34] Emperor: just what's built into the mgmt interface
[08:07:13] so if I want to try actually turning the whole thing off and on again, I have to ask DC folk nicely?
[08:07:43] yes. though: you can turn off the main chassis, and you can reboot the mgmt interface. which isn't exactly the same thing, but it's pretty close.
[08:09:22] Emperor: btw, i had an issue yesterday where a machine i was reimaging didn't output anything on console for a good 5+ mins after pxe started. j.bond speculated that the installer console settings weren't correct (presumably the initial kernel console params)
[08:09:39] e.g. it could be that the kernel is booting, and then hanging, and because its console settings are wrong you aren't seeing anything
[08:10:05] Mmm
[08:11:44] the existing console settings are probably specified in modules/install_server/files/tftpboot/bullseye-installer/pxelinux.cfg/ttyS1-115200 (or the ttyS0 version..)
[08:13:33] settings there look essentially the same as the stretch-installer (which presumably is how this system was installed as stretch).
[08:16:24] trying a cold reset of the BMC (and then power off)
[08:18:34] * kormat nods
[08:18:46] (i am just spitballing here)
[08:21:05] another thing worth testing is to update the NIC firmware, https://phabricator.wikimedia.org/T286722 is for a different Broadcom NIC model, but it's also a 10G card
[08:21:41] and the symptoms are similar, the card worked fine with an older distro, but the combination of never-updated-NIC-firmware along with the 5.10 kernel failed
[08:21:58] 💡 interesting
[08:22:58] unfortunately I don't know where/if we have docs describing the NIC firmware update, but it's something that does not require DC ops, Jaime only did it yesterday for one of the backup servers
[08:24:09] https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#NICs
[08:25:46] I'd sort-of expect the kernel to boot if that were the issue
[08:26:16] such optimism
[08:27:22] trying again after resetting the BMC and leaving the chassis off for a while
[08:29:44] godog: Hey! Just checked today, tegola with the new container is pretty stable. I think we can stop copying files.
[08:30:37] moritzm: that wikitext page says "With the NIC model you can download the driver from the Netbox shortlink" I'm not sure what it means by that - our netbox system knows where to find NIC firmware updates?
[08:31:38] Emperor: netbox has a link to the dell config page for the server
[08:31:55] (which..
TIL)
[08:36:29] so, at a guess, this is the latest firmware for the 10G nic: https://www.dell.com/support/home/en-uk/drivers/driversdetails?driverid=npnt5&oscode=naa&productcode=poweredge-r730xd
[08:40:45] console com2 has something on this time
[08:41:11] suggesting it has got into the installer, and then something went wrong
[08:41:31] ah
[08:41:49] complaining about non-free firmware files to operate the NIC /o\
[08:42:03] installer is sitting at the "Load missing firmware from removable media" prompt
[08:42:52] Do our installer images not have non-free firmware on?
[08:43:22] bnx2x/bnx2x-e2-7.13.21.0.fw
[08:44:33] so we're using the default d-i images, but when we kludge the firmware tarball into it
[08:45:02] but actually, now that you mention bnx2x fw in specific
[08:46:25] I tried "" at the load non-free firmware prompt, no joy (it just gave me the same thing again), so I tried "", and the installation is at least progressing...
[08:46:41] this might actually be caused by the fix for https://phabricator.wikimedia.org/T306148
[08:46:48] the background is:
[08:46:51] science 🧪!
[08:47:13] these Broadcom cards have optional firmware for some features we don't use
[08:47:42] the base operation works just fine without it (and IIRC the modules are also not currently packaged in firmware-nonfree)
[08:48:04] so this triggered an interactive prompt in T306148
[08:48:05] T306148: clouddb1021 missing network firmware bnx2x/bnx2x-e2-7.13.21.0.fw in Debian 11 Bullseye - https://phabricator.wikimedia.org/T306148
[08:48:19] let me revert the patch and then re-attempt the installation of ms-be1040
[08:48:58] can't wait for this crap to finally be resolved with shipping the firmware in default install media
[08:49:04] moritzm: do you want me to do something to abort the current install?
[08:50:35] firmware> yeah, I broadly agree with Steve M's blogpost on the subject
[08:50:56] wrt "it just gave me the same thing again), so I tried "", and the installation is at least progressing."
[08:51:20] -> that actually means that my patch at https://gerrit.wikimedia.org/r/c/operations/puppet/+/784259 didn't work as expected
[08:51:50] moritzm: at least you have some useful data from my pain :)
[08:51:57] but the upside is that manually connecting to the mgmt and choosing "No" is an adequate workaround to get this server installed
[08:52:11] the full story is:
[08:53:03] in the past before Bullseye missing firmware was simply silently failing to load in d-i
[08:53:49] but with various current GPUs not even being able to render a framebuffer for graphic d-i in the absence of AMD firmware
[08:53:56] workaround> I have 7 more of this class to reimage; but they all need doing one-at-a-time and then waiting for swift to sort itself out, so I can do all of them thus if necessary.
[08:54:08] hw-detect introduced this https://tracker.debian.org/news/1245038/accepted-hw-detect-1145-source-into-unstable/
[08:54:31] and this now detects that firmware is required and prompts for it
[08:54:40] but for this specific NIC model type
[08:54:47] that's only half of the story
[08:54:56] ah, yes
[08:55:07] since it _does_ work perfectly fine without firmware
[08:55:44] but still the metadata in the kernel refers to the optional firmware and thus prompts the prompt we're seeing
[08:55:53] so we need some sort of "don't worry about these missing firmwares" knob to twiddle
[08:56:16] yes, that's what https://gerrit.wikimedia.org/r/c/operations/puppet/+/784259 was supposed to do
[08:56:31] but it seems I either made a mistake or something else is needed
[08:57:37] you're sure it's not a cached old cfg or somesuch?
[08:57:41] I'll poke at this later when I'm done with the Ganeti update, but in the interim let's simply select the "no" prompt as a workaround until a proper fix is found
[08:57:49] moritzm: +1
[08:58:21] the DHCP config gets written out by Puppet, it should be up-to-date
[08:58:34] maybe the syntax is different, I'll poke at hw-detect later
[08:58:53] or maybe it's simply broken in d-i and no one noticed :-)
[08:59:02] always a possibility :)
[08:59:26] on the bright side
[09:00:04] I'm now wondering if I missed this set of failures yesterday and the BIOS upgrade wasn't necessary to get it to this point.
[09:00:14] I guess I can try another host later
[09:00:31] I'm pretty sure we got misled
[09:00:31] Emperor: it seems most likely to me that you've run into 2 separate issues
[09:00:48] so on the bright side we might not need firmware updates
[09:00:56] that would be good
[09:01:03] soo many colourful hardware errors to run into :-)
[09:01:10] never gets boring
[09:01:18] The Sanger's kit was all "the text console is hopeless, always try HTML5"
[09:01:45] also, you have to think to try ^L at the text console before you get the error message
[09:02:29] If this host finishes installing OK and swift looks alright, I can try an eqiad host to see if it'll reimage without the f/w upgrade
[09:02:37] John is currently working on automating firmware updates via Spicerack, so hopefully we can at some point simply run these via a cookbook (or even fold them as a regular step into the reimage cookbooks)
[09:02:44] ack, sounds good
[09:03:09] if not, Willy is expecting bad news from me...
[09:06:16] Huh, happy 12-year anniversary of my Erdős number
[09:08:31] moritzm: could it be tab vs space in the 'boolean false' part?
[09:10:22] nemo-yiannis: ack!
{{done}} and created T307184 for followups
[09:10:23] T307184: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184
[09:11:39] thanks godog
[09:15:53] sure np
[09:18:37] volans|off: maybe, I'll have a closer look in a bit
[09:20:44] godog: puppet is failing on ms-be2040 post-reimage something about xfs labelling not working...
[09:21:38] /dev/sda4 on /srv/swift-storage/sdb4 type xfs (rw,noatime,nodiratime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
[09:22:07] is probably not quite right (likewise /dev/sdb4 is mounted at /srv/swift-storage/sda4) - worth a reboot to see if they come back the right way round, or has something gone badly wrong here?
[09:23:26] * Emperor tries a reboot
[09:24:38] Emperor: yeah worth a reboot, I'm guessing the first puppet run labelled them one way and then post-reboot they came back swapped
[09:27:01] looking better post-reboot let's see if puppet completes now
[09:28:18] yes.
[09:28:40] it'll be interesting to see how long it takes to re-populate the swift partitions on the SSDs
[09:30:03] reasonably fast IME, in the order of a couple of hours IIRC
[09:42:17] marostegui: poke re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/775330
[10:01:45] ms-be1040 (eqiad host, same vintage) gets to the same firmware-needed prompt; let's see if the install works
[10:03:29] well that's concerning: https://phabricator.wikimedia.org/P27009
[10:05:49] server on strike
[10:06:27] ok, it claims it's on now, but there's zero output on the console
[10:07:54] kormat: maybe try a hard reset
[10:08:13] /topic all hardware is terrible
[10:08:35] marostegui: oh, good idea, trying.
[10:10:54] Emperor: 💯
[10:11:30] marostegui: any idea how long until a hardreset takes effect?
[10:11:36] coz i'm still staring into the void
[10:12:56] should usually take at most a minute
[10:13:19] ok. it's been ~5. maybe a `racadm racreset` for good measure?
[10:14:39] trying it
[10:14:45] yeah, if that fails that needs a dc ops ticket
[10:22:42] feh, nothing. dc ops it is.
[10:23:07] worth trying a poke at the web-IPMI? It has more buttons...
[10:23:26] Emperor: what kind of buttons?
[10:23:57] Depends a bit, but often resetting bits of the IPMI system
[10:25:19] ugh, ms-be1040 came up with drives mounted in the wrong place, let's see if a reboot helps
[10:27:45] Emperor: the d-i setting itself seems just fine to me, but we could try https://gerrit.wikimedia.org/r/c/operations/puppet/+/787704/ on the next swift with such a Broadcom NIC?
[10:29:17] moritzm: certainly.
[10:32:21] Emperor: oh! https://phabricator.wikimedia.org/T307198#7890645
[10:33:12] fsck it, the reimage didn't get the ownership right
[10:54:05] where do reimage logs end up? The changes I made in https://gerrit.wikimedia.org/r/c/757025 obviously aren't working, but it's difficult to know why...
[11:02:04] Emperor: for cookbooks logs see https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Logs, but late_command is executed by d-i, so will not be there
[11:02:32] volans|off: Mmm, it's the late_command output (if any) I'd like to see - it _ought_ to be working, but clearly isn't
[11:02:35] you can check the logs in the d-i env if they have something, before d-i completes
[11:03:05] late command is run via d-i preseed/late_command
[11:03:28] ah, it's in /var/log/installer
[11:03:36] /tmp/late_command: line 57: stat: not found
[11:03:37] FFS
[11:04:07] busybox has stat, though.
[11:04:26] does d-i have a non-standard busybox or something?
[11:05:05] at which stage is d-i? are you in the installer environment?
[11:05:11] the new OS is in /target
[11:05:18] the chroot
[11:06:31] volans|off: late_command
[11:06:34] see https://www.debian.org/releases/stable/amd64/apbs05.en.html
[11:07:03] B.5.1.
[11:08:28] volans|off: late_command has /target available to it, but I thought was running busybox sh, so should have busybox utils available to it?
[11:09:12] it's run inside the chroot of the new OS AFAIK, not busybox
[11:09:42] all the commands that have in-target
[11:09:54] should be run inside the chroot, but I'm no d-i expert, sorry
[11:10:11] your patch doesn't have in-target AFAICT
[11:10:30] volans|off: I don't want it running in the target, I want the installer to mount a filesystem and call stat on the contents
[11:11:18] and then I call in-target groupadd/useradd based on the outcome
[11:12:52] it's successfully mounting the FS OK, but somehow isn't finding a "stat" to call, which I don't understand because isn't the installer shell busybox which has a "stat" builtin?
[11:13:32] maybe PATH is not set there and you need the full path?
[11:13:48] /usr/bin/stat I guess
[11:14:02] but I see other commands working fine
[11:14:04] like ip
[11:14:10] log excerpts at https://phabricator.wikimedia.org/T300057#7890740
[11:14:31] Isn't the point of busybox that they're all builtins?
[11:24:19] the commands offered by busybox are all controlled by build flags, it's probably the udeb missing stat?
[11:25:01] I wonder if there's any other way of solving this problem then :-/
[11:25:08] the whole concept of udebs is really moot at this point, simply using the default debs would reduce so much complexity
[11:25:37] and the days of installer hardware needing to squeeze out a few kilobytes are also over for a long time...
[11:26:39] one workaround would be to install coreutils in the late install script and then use stat from there?
[11:27:37] https://salsa.debian.org/installer-team/busybox/-/blob/master/debian/config/pkg/udeb#L315 <-- confirms that the busybox udeb doesn't build stat :(
[11:29:04] meh
[11:29:08] moritzm: and then 'swiftuid=$(/target/usr/bin/stat -c '... ?
[11:29:20] worth a try, I guess
[11:29:52] yeah, it's not elegant, but then none of the late install script handling swift UIDs is pretty to begin with :-)
[11:33:43] I'll put a patch together after lunch, then.
[11:53:44] I merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/787704, let me know if you still see the prompt for the next reimage
[12:04:49] https://gerrit.wikimedia.org/r/c/operations/puppet/+/787717 +1 perhaps?
[12:13:48] godog: ms-be1040 has a bunch of very unhappy xfs partitions :( e.g. sdj1 won't mount, and xfs_repair says to try and mount it before re-attempting xfs_repair
[12:14:00] (and offers -L to discard the log with scary warnings about data loss)
[12:14:13] at least 4 partitions in this state
[12:18:12] dunno if this is latent corruption from stretch we just missed, damage from upgrade or what
[12:22:49] godog: do you have a feel if it's worth trying xfs_repair -L ? No media issues reported in kernel.log
[12:26:08] Emperor: no idea tbh but if the partition isn't mountable anyways then might as well try
[12:34:46] OK, will give it a go.
[12:37:00] Emperor: are the disks in the expected order though? not related to the corruption but relevant if we're re-formatting of course
[12:49:09] godog: oh, no, they were on one reboot, but seem out again. Argh, this is getting very tedious
[12:49:54] * Emperor much prefers UUID-based mounting
[12:51:27] this is all a mess :(
[12:51:49] once xfs_repair has done its thing I'll reboot again again and see if the disks come back more sensibly.
[12:52:42] godog: I notice ms-be2040's disks are still mixed up too, and that's after I rebooted it once to get sd{a,b} right again :(
[12:56:33] Emperor: yeah I've seen that happen too, it is unfortunate alright
[13:27:08] marostegui: sounds like db1164 is going to be out of service for a while, do we need to do any rebalancing in the meantime?
T307198
[13:27:08] T307198: db1164 fails to POST/boot/etc - https://phabricator.wikimedia.org/T307198
[13:27:13] reboot hasn't fixed it, different set of drives permuted
[13:28:26] kormat: no, it should be fine
[13:28:35] marostegui: ok cool, thanks
[13:33:27] godog: I've rebooted ms-be2040 4 times now, and each time the drives don't come up in the correct order, and it's a different permutation each time.
[13:36:50] Emperor: mmhh which drives permute ? the ssd or hdd or a mix ?
[13:37:00] hdds
[13:37:14] other than the initial post-install reboot when the SSDs were wrong, they have remained correct
[13:38:32] we've had m->j, j->i, i->k, k->m ; f->g, g->h, h->f, l->m, m->l ; and d->e, e->c, c->d, n->m, m->n, l->k, k->l on the last few reboots
[13:41:59] Emperor: ack, IME one/two reboots are sufficient to fix the order, though nothing is immediately wrong due to labels, can wait next week I think
[13:42:19] godog: it may be that with current kernels the order is never going to be stable
[13:44:07] that's certainly possible too
[13:50:03] (in the mean time, going to try another codfw backend upgrade, since that cluster is happy)
[14:07:49] moritzm: I'm afraid ms-be2041's installer has still got to the non-free firmware prompt under Detect network hardware
[14:08:36] meh
[14:08:47] then the setting itself is probably broken
[14:10:01] once the install's finished, you're welcome to the logs from it :)
[14:16:21] I think I'll just axe the setting, it only affects a handful of hosts, so we can treat it as a known bug and with bookworm the whole firmware mess will have vanished
[14:17:02] even if we track it down and land a fix, it would need a backport to bullseye's d-i accepted and would only be available in the subsequent bullseye point release
[14:17:18] doesn't really seem worth it
[14:17:20] Mmm, it's not the most annoying thing about reimaging these systems :-/
[14:30:58] swift uid/gid are set right, though.
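(Editorial sketch, not part of the channel log.) The drive permutations quoted at 13:38:32 are easier to compare once each reboot's list is broken into cycles. The Python below is a minimal sketch that assumes the notation is exactly as quoted: `;` separates reboots, `,` separates moves, and `a->b` is one move. The function names `parse_reboots` and `cycles` are hypothetical and do not come from any tool mentioned in the log.

```python
def parse_reboots(text):
    """Split 'a->b, b->a ; c->d' into one {source: destination} dict per reboot."""
    reboots = []
    for chunk in text.split(";"):
        moves = {}
        for token in chunk.split(","):
            token = token.strip()
            if not token:
                continue
            src, dst = (part.strip() for part in token.split("->"))
            moves[src] = dst
        if moves:
            reboots.append(moves)
    return reboots


def cycles(moves):
    """Follow each move until it loops back, returning the cycles as lists."""
    seen = set()
    found = []
    for start in sorted(moves):
        if start in seen:
            continue
        cycle = []
        node = start
        # Stop when we return to a visited letter or fall off the mapping.
        while node in moves and node not in seen:
            seen.add(node)
            cycle.append(node)
            node = moves[node]
        found.append(cycle)
    return found


if __name__ == "__main__":
    quoted = ("m->j, j->i, i->k, k->m ; f->g, g->h, h->f, l->m, m->l ; "
              "d->e, e->c, c->d, n->m, m->n, l->k, k->l")
    for number, moves in enumerate(parse_reboots(quoted), start=1):
        print(number, cycles(moves))
```

Under those assumptions the three quoted reboots come out as one 4-cycle, then a 3-cycle plus a swap, then a 3-cycle plus two swaps, which matches the observation that the permutation differs on every reboot.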
[14:33:15] godog: ms-be2041 reimaged OK, but its hdds are in a jumbled order again
[14:33:44] siiiigh
[15:32:34] (in other news, we have about 2 billion objects in the ms- cluster)
[15:56:03] godog: ms-be2042 is repeatedly putting its SSDs in the "wrong" place, which is making puppet fail, which is stopping the reimage from completing.
[15:56:14] On reboot #3 to try fix this :-/
[15:59:58] #4
[16:00:41] it's starting to look like this system is going to consistently put the SSDs the other way round from the installer
[16:04:28] which means puppet will never work because it wants to label the partitions but they're not where it expects them to be
[16:10:07] and, indeed, it's trying to mkfs on /dev/sdc1 but there's already a fs there
[16:15:06] xfs_admin -L swift-sda3 /dev/sda3 keeps failing because /dev/sda3 is already labelled swift-sdb3
[16:16:11] puppet is never going to work here, and I've rebooted 6 times now
[16:17:35] and there's a chunk of data in these filesystems already
[16:18:17] I guess I could try stopping swift, unmounting the filesystems and adjusting the labels.
[16:24:46] done so, puppet now runs to completion. But this approach is v. v. fallible and doesn't scale
[16:25:57] I remain confused about why puppet seems to mind mis-labelled hdds less than the SSDs
[16:26:25] I mean /dev/sdn1 here is labelled swift-sdl1 and puppet isn't catching fire about that
[16:33:30] I'm making a couple of schema changes on db1156 live (T276292)
[16:33:31] T276292: Schema change for renaming new_name_timestamp to rc_new_name_timestamp in recentchanges - https://phabricator.wikimedia.org/T276292
[18:24:35] I've left ms-be1040 doing a bunch of xfs_repair in a tmux
[18:26:07] (and Ack'd the icinga alert until Tuesday)
[18:39:16] * Emperor will try and leave it alone for the rest of the weekend
[18:42:14] Emperor: if it helps, I broke backup1002 too :-(
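(Editorial sketch, not part of the channel log.) Two symptoms recur in the log: partitions mounted under the wrong `/srv/swift-storage/` directory, and filesystem labels such as `swift-sdb3` sitting on `/dev/sda3`. The Python below is a minimal sketch of how both could be spotted from text already on hand. It assumes the `swift-<partition>` label convention implied by the quoted `xfs_admin` command and the mount line format quoted at 09:21:38. The function names are hypothetical, and how the device-to-label mapping is collected (for example from `blkid` output) is left open.

```python
import re

# Matches lines like:
# /dev/sda4 on /srv/swift-storage/sdb4 type xfs (rw,noatime,...)
MOUNT_RE = re.compile(r"^(/dev/\S+) on (/srv/swift-storage/\S+) type xfs\b")


def swapped_mounts(mount_output):
    """Return (device, mountpoint) pairs whose final path components differ."""
    swapped = []
    for line in mount_output.splitlines():
        match = MOUNT_RE.match(line.strip())
        if not match:
            continue
        device, mountpoint = match.groups()
        if device.rsplit("/", 1)[-1] != mountpoint.rsplit("/", 1)[-1]:
            swapped.append((device, mountpoint))
    return swapped


def mislabelled(labels):
    """Given {device: label}, return entries whose label is not swift-<partition>."""
    return {
        device: label
        for device, label in labels.items()
        if label != "swift-" + device.rsplit("/", 1)[-1]
    }


if __name__ == "__main__":
    mounts = (
        "/dev/sda4 on /srv/swift-storage/sdb4 type xfs (rw,noatime)\n"
        "/dev/sdc1 on /srv/swift-storage/sdc1 type xfs (rw,noatime)\n"
    )
    print(swapped_mounts(mounts))
    print(mislabelled({"/dev/sda3": "swift-sdb3", "/dev/sdn1": "swift-sdl1"}))
```

A check like this only reports the mismatch; it does not decide whether the label or the kernel's device name is the one to trust, which is the open question raised at 16:25:57.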