[01:02:13] * bd808 off
[05:43:02] dhinus: I'm about to go to bed. I've reimaged up through cloudvirt1045 but 1043 and 1044 seem to be cursed -- T351171
[05:43:02] T351171: cloudvirt1043 + cloudvirt1044 reimage failures - https://phabricator.wikimedia.org/T351171
[05:43:28] You can go ahead and continue to drain and reimage, and save the edge cases for when I'm back at work.
[05:45:03] Also -- 1051 is depooled due to a crash but I think we should just reimage it and repool.
[05:48:09] taavi, I tried to wipe out the existing DB on cloudcontrol2005-dev and force it to re-sync, which it did, but it seems to be locked up again.
[10:07:58] what is systemd-machined? just got a bunch of alerts of it being down on some cloudvirts
[10:13:16] * taavi again resets git status on cloudcumin1001 cookbooks dir to make puppet happy :/
[10:45:59] no idea about systemd-machined :/
[10:46:42] re: cloudcumin, I will send a patch to work around the amtool issue _without_ having to patch the cookbooks dir
[11:56:34] I tried restarting systemd-machined on one host (cloudvirt1056) and it seems happy
[11:56:54] but I would like to understand why it started failing on several hosts
[11:58:11] the man page says "systemd-machined is a system service that keeps track of locally running virtual machines and containers."
[11:58:54] it's part of the systemd-container package
[12:12:59] aand they're back down
[12:13:02] let me file a task etc
[12:15:04] or no, it's just alertmanager sending a repeat alert. derp
[12:57:45] i found an incorrect port in the galera firewall rules that might or might not explain the clustering issues we're seeing in codfw1dev: https://gerrit.wikimedia.org/r/c/operations/puppet/+/974175/
[13:07:38] thanks, did it fix the clustering issue?
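(Editor's note: the systemd-machined triage described above -- restart one host, then watch the alerts -- can be sketched with standard systemd tooling. This is a generic diagnostic sequence, not necessarily the exact commands used; the hostname comes from the log.)

```shell
# Inspect why systemd-machined is failing before restarting it.
# systemd-machined tracks locally running VMs and containers; on a
# cloudvirt the libvirt guests register with it, so a crash here is
# worth understanding rather than just clearing.
systemctl status systemd-machined.service
journalctl -u systemd-machined.service -n 50 --no-pager

# If it simply crashed, a restart is usually safe (what was done on
# cloudvirt1056 above):
systemctl restart systemd-machined.service

# Confirm it recovered and still sees the running machines:
machinectl list
```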
[13:08:27] it's still applying :/
[13:18:32] hmm, I'm trying to reboot cloudcontrol2005-dev to get rid of any lingering mariadb processes and it seems to be stuck waiting for the mariadb systemd unit to stop :/
[13:19:51] there we go, now it's booting I think
[13:26:31] didn't seem to help
[13:31:12] :/
[13:31:47] I think the entire cluster is in some weird deadlock state
[13:33:58] Nov 14 13:33:38 cloudcontrol2004-dev mariadbd[1860]: 2023-11-14 13:33:38 0 [Warning] WSREP: Member 2.0 (cloudcontrol2001-dev.private.codfw.wikimedia.cloud) requested state transfer from '*any*', but it is impossible to select State Transfer donor: Resource temporarily unavailable
[13:35:46] I have never worked with Galera clusters before so I'm quite lost, I would wait for a.ndrew to see if he has any tips that worked in the past
[13:41:18] I tried to follow https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Galera_won't_start_up, but that doesn't seem to help
[13:41:36] yeah, I'll leave this for andrewbogott to sort out, I think I'll just break more things if I continue
[13:45:14] hi all, could i get a review on the 3 changes starting at https://gerrit.wikimedia.org/r/c/operations/puppet/+/973840/1? (these, or at least the first one, are currently blocking the puppet7 migration)
[13:45:30] * jbond for wmcs not everyone
[13:46:23] looking
[13:46:31] thanks taavi <3
[13:55:26] jbond: +1'd all three
[14:01:59] I'm reimaging cloudvirt1046 but it's stuck waiting for reboot :/ the com2 console is blank
[14:02:05] cheers taavi
[14:51:44] dhinus: I am almost awake :) I see that you dropped cloudvirt1046 onto T351171, which seems right.
[14:51:46] T351171: cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171
[14:51:52] Any other immediate reimage/upgrade issues?
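(Editor's note: the WSREP warning quoted above -- "impossible to select State Transfer donor" -- generally means no node is in a Synced state that could serve a state transfer to the joiner. A sketch of how to inspect each node's view of the cluster; these are standard Galera status variables, not commands taken from the log.)

```shell
# Run on each cloudcontrol to compare their views of the cluster:
mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status'"       # Primary vs non-Primary
mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment'"  # Synced / Donor / Joining / ...
mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size'"         # how many members each node sees

# If no node reports Synced, no donor can be selected for an SST/IST,
# which matches the "Resource temporarily unavailable" warning and the
# apparent deadlock described in the log.
```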
[15:23:22] taavi: if you're not working on the galera thing I think I'm going to try dropping it back to a single node and rebuilding the cluster
[15:23:33] andrewbogott: go for it
[15:24:46] did you wind up with an opinion about which node is the most caught up? 2001-dev and 2004-dev both claim that they are...
[15:25:05] no, no idea, sorry
[15:25:21] but I was trying to recover the cluster on 2004, so it might be somewhat confused
[15:25:34] ok
[17:48:13] taavi: andrewbogott: is it ok to re-enable puppet on codfw1dev to do some migrations?
[17:52:24] on the cloudcontrols? puppet will just hang forever due to a galera issue
[17:52:26] work in progress
[17:53:41] andrewbogott: ack, ok, I'll abort and roll back for now
[17:53:49] thanks, sorry
[17:53:53] hopefully will be resolved later today
[17:54:02] no problem
[17:54:50] i already migrated wmcs::openstack::codfw1dev::backups but I'll leave it for the day nw
[19:07:54] * bd808 lunch
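(Editor's note: "dropping it back to a single node and rebuilding the cluster" is the standard Galera full-cluster bootstrap. A generic sketch, not a record of what was actually run; paths assume the default MariaDB datadir, and the "most caught up" node must be chosen by comparing seqno values, which is exactly the ambiguity between 2001-dev and 2004-dev above.)

```shell
# With mariadbd stopped on ALL nodes, find the most advanced node by
# comparing the last committed transaction seqno on each:
grep -H 'seqno' /var/lib/mysql/grastate.dat
# If seqno is -1 (unclean shutdown), recover the real position:
mariadbd --wsrep-recover

# On the most advanced node ONLY, mark it as safe to bootstrap and
# start a new single-node cluster:
sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat
galera_new_cluster

# The remaining nodes then rejoin with a normal start and will take a
# state transfer from the bootstrapped node:
systemctl start mariadb
```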