[02:30:40] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2028:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:10:16] ^-- expected? seems to have been unhappy for a while now
[08:20:32] I silenced it a few hours ago
[08:20:46] It is a debian trixie testing host
[08:21:33] MariaDB has been stopped since yesterday
[09:01:16] Emperor: o/ sretest2010 seems back to life, ok if I test https://gerrit.wikimedia.org/r/c/operations/puppet/+/1200284 ?
[09:01:49] elukey: please go ahead
[09:02:48] perfect
[09:08:54] Emperor: if you have a moment, could you give it a sanity check?
[09:09:11] pebkac prevention scheme :D
[09:10:09] 👀
[09:11:56] thanks :)
[09:12:59] Emperor: last thing - IIUC from your tests, if I try to reimage a couple of times I should hit the grub mduuid not found issue, right?
[09:13:27] what I want to understand atm is if the partman_early_command.sh plays a role
[09:26:13] elukey: if only it were that predictable; but roughly, yes.
[09:26:44] interesting things to look at are what ends up on the first partition of the two SSDs, and what efibootmgr (or somesuch) says about boot ordering of those two vs netboot &c
[09:28:41] ack. Not sure if you saw the updates in the task but I suspect that the issue that we are seeing may be related to a supermicro bug that we have been working on
[09:29:06] namely, d-i's efi boot settings are not preserved (sometimes) after the reboot to load the os
[09:29:15] I've been a bit distracted by other (also) urgent things, but I did see something like that go past
[09:29:24] elukey: ugh
[09:30:15] usually the only downside that we have seen in the past was the need for two reimages, because the first one ended up HTTP EFI booting twice and debian-installing again etc..
[09:30:32] but nothing like the bug reported by your task
[09:30:44] so maybe this is a combo of that problem + partman_early_command
[09:32:34] seems a little unlikely, but worth a look
[09:33:29] well I didn't see anything like you reported when I reimaged the host, I may have been really unlucky but I was fairly convinced the host was behaving the right way
[09:39:14] If anyone's got a minute, could they give a +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1200288 please? I'll not merge 'til Monday, but this gets two more SM boxes with replaced disk controllers back into the rings
[09:55:15] looking
[09:55:52] I was checking it
[09:55:59] but ok
[10:17:16] thanks both :)
[11:27:46] So I think I am now happy with the latest patch, feature-wise, with transfer.py
[11:28:21] I had to undo the escaping and redo it a different way, but now it works (before, it failed on files with spaces)
[11:28:41] and added unit tests to check escaping works in most cases
[11:29:12] this was due to os.path.join/basedir not working correctly with a previously escaped string
[11:29:51] I now only have to add unit tests to the new firewall code and I am ready for a release
[14:31:33] I am running a recovery towards db2202 (but I won't touch the mysqld process)
[15:00:09] it finished correctly
[15:00:25] have a nice day
[15:08:43] so unfortunate that that's an anagram of "have a cyanide"
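
Note on the 11:28-11:29 escaping messages: the sketch below is not the transfer.py code, only a minimal illustration of why path helpers can misbehave on a string that was shell-escaped beforehand. It assumes `shlex.quote` as the escaping step (the log does not say which mechanism was used), and the function names and the example path are hypothetical.

```python
import os.path
import shlex


def split_then_quote(path):
    """Do the path work on the raw string, then quote each piece last."""
    directory, name = os.path.split(path)
    return shlex.quote(directory), shlex.quote(name)


def quote_then_split(path):
    """Problematic order: os.path treats the added quote characters as
    ordinary characters of the path, so each half gets one stray quote."""
    return os.path.split(shlex.quote(path))


raw = "/srv/backups/my dump.sql"  # hypothetical file name containing a space
print(split_then_quote(raw))
print(quote_then_split(raw))
```

With the second ordering, both halves come back with an unbalanced quote, which is one way a file name with spaces could break once the pieces are reused in a shell command. Escaping only at the very end, after all joins and splits, avoids that.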