[01:02:13] * bd808 off
[05:43:02] dhinus: I'm about to go to bed. I've reimaged up through cloudvirt1045 but 1043 and 1044 seem to be cursed -- T351171
[05:43:02] T351171: cloudvirt1043 + cloudvirt1044 reimage failures - https://phabricator.wikimedia.org/T351171
[05:43:28] You can go ahead and continue to drain and reimage, and save the edge cases for when I'm back at work.
[05:45:03] Also -- 1051 is depooled due to a crash but I think we should just reimage it and repool.
[05:48:09] taavi, I tried to wipe out the existing DB on cloudcontrol2005-dev and force it to re-sync, which it did, but it seems to be locked up again.
[10:07:58] what is systemd-machined? just got a bunch of alerts of it being down on some cloudvirts
[10:13:16] * taavi again resets git status on cloudcumin1001 cookbooks dir to make puppet happy :/
[10:45:59] no idea about systemd-machined :/
[10:46:42] re: cloudcumin, I will send a patch to work around the amtool issue _without_ having to patch the cookbooks dir
[11:56:34] I tried restarting systemd-machined on one host (cloudvirt1056) and it seems happy
[11:56:54] but I would like to understand why it started failing on several hosts
[11:58:11] the man page says "systemd-machined is a system service that keeps track of locally running virtual machines and containers."
[11:58:54] it's part of the systemd-container package
[12:12:59] aand they're back down
[12:13:02] let me file a task etc
[12:15:04] or no, it's just alertmanager sending a repeat alert. derp
[12:57:45] i found an incorrect port in the galera firewall rules that might or might not explain the clustering issues we're seeing in codfw1dev: https://gerrit.wikimedia.org/r/c/operations/puppet/+/974175/
[13:07:38] thanks, did it fix the clustering issue?
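(Editor's note: the systemd-machined triage described above -- restart one host, then watch the alerts -- can be sketched with standard systemd tooling. This is a generic diagnostic sequence, not necessarily the exact commands used; the hostname comes from the log.)

```shell
# Inspect why systemd-machined is failing before restarting it.
# systemd-machined tracks locally running VMs and containers; on a
# cloudvirt the libvirt guests register with it, so a crash here is
# worth understanding rather than just clearing.
systemctl status systemd-machined.service
journalctl -u systemd-machined.service -n 50 --no-pager

# If it simply crashed, a restart is usually safe (what was done on
# cloudvirt1056 above):
systemctl restart systemd-machined.service

# Confirm it recovered and still sees the running machines:
machinectl list
```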
[13:08:27] it's still applying :/
[13:18:32] hmm, I'm trying to reboot cloudcontrol2005-dev to get rid of any lingering mariadb processes and it seems to be stuck waiting for the mariadb systemd unit to stop :/
[13:19:51] there we go, now it's booting I think
[13:26:31] didn't seem to help
[13:31:12] :/
[13:31:47] I think the entire cluster is in some weird deadlock state
[13:33:58] Nov 14 13:33:38 cloudcontrol2004-dev mariadbd[1860]: 2023-11-14 13:33:38 0 [Warning] WSREP: Member 2.0 (cloudcontrol2001-dev.private.codfw.wikimedia.cloud) requested state transfer from '*any*', but it is impossible to select State Transfer donor: Resource temporarily unavailable
[13:35:46] I have never worked with Galera clusters before so I'm quite lost, I would wait for a.ndrew to see if he has any tips that worked in the past
[13:41:18] I tried to follow https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Galera_won't_start_up, but that doesn't seem to help
[13:41:36] yeah, I'll leave this for andrewbogott to sort out, I think I'll just break more things if I continue
[13:45:14] hi all, could i get a review on the 3 changes starting at https://gerrit.wikimedia.org/r/c/operations/puppet/+/973840/1? (these, or at least the first one, are currently blocking the puppet7 migration)
[13:45:30] * jbond for wmcs not everyone
[13:46:23] looking
[13:46:31] thanks taavi <3
[13:55:26] jbond: +1'd all three
[14:01:59] I'm reimaging cloudvirt1046 but it's stuck waiting for reboot :/ the com2 console is blank
[14:02:05] cheers taavi
[14:51:44] dhinus: I am almost awake :) I see that you dropped cloudvirt1046 onto T351171, which seems right.
[14:51:46] T351171: cloudvirt104[346] reimage failures - https://phabricator.wikimedia.org/T351171
[14:51:52] Any other immediate reimage/upgrade issues?
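(Editor's note: the WSREP warning quoted above -- "impossible to select State Transfer donor" -- generally means no node is in a Synced state that could serve a state transfer to the joiner. A sketch of how to inspect each node's view of the cluster; these are standard Galera status variables, not commands taken from the log.)

```shell
# Run on each cloudcontrol to compare their views of the cluster:
mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status'"       # Primary vs non-Primary
mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment'"  # Synced / Donor / Joining / ...
mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size'"         # how many members each node sees

# If no node reports Synced, no donor can be selected for an SST/IST,
# which matches the "Resource temporarily unavailable" warning and the
# apparent deadlock described in the log.
```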
[15:23:22] taavi: if you're not working on the galera thing I think I'm going to try dropping it back to a single node and rebuilding the cluster
[15:23:33] andrewbogott: go for it
[15:24:46] did you wind up with an opinion about which node is the most caught up? 2001-dev and 2004-dev both claim that they are...
[15:25:05] no, no idea, sorry
[15:25:21] but I was trying to recover the cluster on 2004, so it might be somewhat confused
[15:25:34] ok
[17:48:13] taavi: andrewbogott: is it ok to re-enable puppet on codfw1dev to do some migrations?
[17:52:24] on the cloudcontrols? puppet will just hang forever due to a galera issue
[17:52:26] work in progress
[17:53:41] andrewbogott: ack, ok, I'll abort and roll back for now
[17:53:49] thanks, sorry
[17:53:53] hopefully will be resolved later today
[17:54:02] no problem
[17:54:50] i already migrated wmcs::openstack::codfw1dev::backups but I'll leave it for the day nw
[19:07:54] * bd808 lunch
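(Editor's note: "dropping it back to a single node and rebuilding the cluster" is the standard Galera full-cluster bootstrap. A generic sketch, not a record of what was actually run; paths assume the default MariaDB datadir, and the "most caught up" node must be chosen by comparing seqno values, which is exactly the ambiguity between 2001-dev and 2004-dev above.)

```shell
# With mariadbd stopped on ALL nodes, find the most advanced node by
# comparing the last committed transaction seqno on each:
grep -H 'seqno' /var/lib/mysql/grastate.dat
# If seqno is -1 (unclean shutdown), recover the real position:
mariadbd --wsrep-recover

# On the most advanced node ONLY, mark it as safe to bootstrap and
# start a new single-node cluster:
sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat
galera_new_cluster

# The remaining nodes then rejoin with a normal start and will take a
# state transfer from the bootstrapped node:
systemctl start mariadb
```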