[10:24:53] I'm going to reimage the remaining cloudcontrols to bookworm (cloudcontrol1006 and cloudcontrol1005)
[10:25:21] I'm slightly worried by T350188, but it's probably more of a concern for cloudvirts
[10:25:21] T350188: [openstack] Fix ceph-common version in Bookworm - https://phabricator.wikimedia.org/T350188
[10:26:27] ack
[12:05:13] hey, cloudcontrol1006 is spamming root mail several times per second with a sudo alert
[12:05:46] since 11 minutes ago
[12:06:27] hmm, dhinus is reimaging that host at the moment
[12:06:51] I can't seem to log in as either my normal user or via install-console
[12:13:03] I'm in; it's a puppet ordering issue, as I suspected. I'll send a patch to prevent it from happening in the future
[12:17:26] jynus: https://gerrit.wikimedia.org/r/c/operations/puppet/+/971141/
[12:20:19] thanks, I'm not worried about the fix, I was just concerned in case it kept going that fast
[12:24:38] thanks, the reimage has now completed successfully
[14:10:27] jayme: if you can +1 the patch by taavi above, I will wait for it to be merged before I reimage another cloudcontrol :)
[14:11:28] sorry, I meant to ping jynus :)
[14:11:43] * jayme was already panicking :)
[15:20:06] please don't get blocked on me, I trust you to do the right thing
[15:20:15] I was afk
[15:20:40] I only pinged because I thought it was going to continue at that speed for a while; it's not urgent now
[15:21:08] ok thanks!
[16:39:06] I'm starting the reimage of the last cloudcontrol (1005)
[16:39:58] dhinus: I think you still need to do the fernet key dance on 1006
[16:44:04] thanks, let me have a look
[16:45:07] I was hoping 1007 was just an exception where the keys got messed up
[16:47:30] the way we spotted it on 1007 was through errors in Horizon and the CLI...
[16:47:42] is there a way to check if the keys are in sync without regenerating them?
[16:48:14] dhinus: I think there's an rsync -d in play, so if any one node has an empty directory the emptiness propagates...
I'm not 100% sure that's right though
[16:48:45] it depends on how the sync job works, yeah
[16:48:49] There's no automated way to check consistency, but you could do checksums in the dir
[16:49:09] checksums sound like a good idea, let's try
[16:49:20] or better yet, just force all the rsyncs to happen in the right order right now. Let me find the code that does that...
[16:50:51] dhinus: my gitiles knowledge is failing me right now, but the file you want is modules/profile/manifests/openstack/base/keystone/fernet_keys.pp
[16:51:31] /usr/bin/keystone-manage fernet_rotate --keystone-user keystone --keystone-group keystone
[16:51:39] and then
[16:51:40] /usr/bin/rsync -a --delete rsync://${thishost}/keystonefernetkeys/ /etc/keystone/fernet-keys/
[16:52:10] checksums look fine, I ran "find /etc/keystone/fernet-keys -type f | sort | xargs md5sum" on all 3 hosts and they match
[16:52:15] well... sorry, it's slightly weirder than that, but the code will make more sense
[16:52:22] oh, in that case you don't need to do anything :)
[16:52:35] Is something actually broken right now?
[16:52:49] nope, it did break two days ago when I reimaged 1007
[16:52:56] but today all seems fine
[16:53:07] timing + luck, I suppose
[16:53:25] agreed
[16:53:38] All of the complications you're hitting are due to the OS upgrade; doing an OpenStack version upgrade without a reimage is way easier, I promise
[16:54:23] haha, I trust you! thankfully debian upgrades are not as frequent :)
[16:54:49] I will proceed with reimaging 1005
[17:18:37] taavi: I'm seeing some errors on the first puppet run on cloudcontrol1005, and they are related to Openstack::Patch
[17:18:43] can I ask for reviews of https://gerrit.wikimedia.org/r/c/operations/puppet/+/971211 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/971240?
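The consistency check at 16:52:10 can be sketched as a small script: hash every key file in a stable order, reduce the list to a single digest per host, and compare the digests. The two local directories below stand in for hosts; in practice you would run the same `find | sort | xargs md5sum` pipeline on each cloudcontrol and compare the results (the directory names and file contents here are illustrative only):

```shell
set -e
host_a=$(mktemp -d)
host_b=$(mktemp -d)
# Simulate the same fernet keys present on both "hosts"
for d in "$host_a" "$host_b"; do
    printf 'key-zero\n' > "$d/0"
    printf 'key-one\n'  > "$d/1"
done
# One digest per host: hash each file in sorted order, then hash the listing
sum_a=$(cd "$host_a" && find . -type f | sort | xargs md5sum | md5sum)
sum_b=$(cd "$host_b" && find . -type f | sort | xargs md5sum | md5sum)
if [ "$sum_a" = "$sum_b" ]; then
    echo "in sync"
else
    echo "MISMATCH"
fi
```

Sorting before hashing matters: `find` output order is not guaranteed to be the same across hosts, so without `sort` two identical directories could produce different combined digests.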
developing the next patch in the stack is very annoying if those two are showing up in PCC as changes
[17:28:55] * bd808 early lunch/errands
[17:46:33] the puppet errors on cloudcontrol1005 are no longer there on the second run
[17:46:49] the reimage completed successfully
[17:47:40] fernet keys are empty on 1005, let's see if they sync automatically
[17:49:58] hmm, looks like the first trigger of the timer is tomorrow; I will run it once manually
[17:50:19] adding a trigger to run the fernet rsyncs directly after boot might not be a bad idea
[17:52:12] yes, I was thinking the same
[17:52:21] I ran "systemctl start keystone_sync_keys_from_cloudcontrol1006.eqiad.wmnet.service" and the keys now look consistent across all 3 nodes
[17:55:01] I have to log off and I didn't find the time to debug the toolsdb OOM issue, so it might happen again today...
[17:55:33] I will focus on that issue tomorrow, but if you have any ideas please let me know
[17:56:21] dhinus: I have a free hour so I'll do some post-reimage tests. Have a good evening!
[17:56:28] thanks!
[17:57:01] cloudcontrols are now all on bookworm; all the other hosts are still on bullseye
[17:57:43] * dhinus off
[23:20:21] * bd808 off
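The "trigger directly after boot" idea at 17:50:19 maps naturally onto a systemd timer: `OnBootSec=` fires once shortly after boot, alongside whatever calendar schedule already exists. A minimal sketch only; the unit names and the `daily`/`15min` values are assumptions, not the actual puppet-managed configuration:

```ini
# keystone_sync_keys.timer -- hypothetical name; the real units are
# generated by puppet (modules/profile/manifests/openstack/base/keystone/fernet_keys.pp)
[Unit]
Description=Sync Keystone fernet keys from the primary cloudcontrol

[Timer]
# Existing behaviour: run on a calendar schedule (daily here is an assumption)
OnCalendar=daily
# Proposed addition: also run once shortly after every boot, so a freshly
# reimaged host does not sit with an empty /etc/keystone/fernet-keys/
# until the next calendar trigger
OnBootSec=15min
Unit=keystone_sync_keys_from_cloudcontrol1006.eqiad.wmnet.service

[Install]
WantedBy=timers.target
```

With a boot trigger in place, the manual `systemctl start` run at 17:52:21 would not have been needed after the reimage.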