[14:22:47] the cloud network seems stable, so I will reimage the second cloudnet (cloudnet1005), which is currently the standby host
[14:23:16] nice, lmk if I can help
[14:23:33] thanks, hopefully this one will be smooth
[14:23:51] not related to cloudnet, but do you have opinions/concerns on T350188?
[14:23:52] T350188: [openstack] Fix ceph-common version in Bookworm - https://phabricator.wikimedia.org/T350188
[14:27:59] does ceph upstream have any guidance on whether you should upgrade the servers or the clients first?
[14:30:35] that's a very good question, and I was hoping someone would already know the answer
[14:30:49] but maybe only david does :) I'll do some research!
[15:18:22] dhinus: thanks for looking at the tools-db OOM thing. Did you also try the OOM override? Or is that moot thanks to your other fixes?
[15:19:05] I'm interested in more explanation from bd808 about that, by the way. I tend to assume that if the oom-killer fires /at all/ on a server, that server is in a potentially inconsistent state regardless of what it killed.
[15:19:31] I found that setting was already configured like bd808 was suggesting, but it's not enough
[15:19:53] welp
[15:20:08] seems like your changes must be helping, since I haven't gotten any more alerts...
[15:20:13] I left some more info in the phab task, including some tips from Manuel
[15:20:18] is that just because I wasn't on call?
[15:20:22] yep, reading
[15:20:27] no, it's been running since Friday
[15:20:48] but Manuel said it's likely to have issues again in the future if we don't find the root cause
[15:21:23] also, the mariadb docs seem to suggest most OOMs in mariadb are due to misconfiguration, so we might have to tweak a few more variables
[15:21:43] or be more aggressive in killing long queries; there are some "interesting" ones in the slow query log
[15:21:59] I think we should probably set a timeout, even a long one like 10 mins or so
[15:22:38] I will keep an eye out this week to see if there are any patterns or specific tools that are doing massive queries
[15:22:51] but still, the server should not just "crash" if a user tries a bad query
[15:25:29] Totally agree that maria ought to manage its own memory enough that the OS doesn't need to step in. As to whether it actually /can/ do that, I guess we will see :)
[15:26:00] But it seems like things are already quite a bit better, so nice work!
[15:26:07] the mariadb docs are kinda saying "we totally can do it, but you have to get the right combination of these 20 config variables" :P
[15:26:56] in other news, cloudnet1005 is reimaged and looking fine; the bridges came up correctly on the first boot
[15:27:16] (they needed an extra reboot on cloudnet1006, but it seems that taavi's fix worked!)
[15:27:37] nice
[15:28:17] I guess cloudrabbits are next, should I anticipate the usual cloudrabbit issues with split brains etc.? :)
[15:28:28] * andrewbogott sighs
[15:28:41] yes
[15:28:47] if you're around for the next hour I am slightly less worried :)
[15:29:03] I am actually home today, and moderately less distracted!
[15:29:13] nice. shall I fire off the reimage of the first one (cloudrabbit1001)?
[15:29:33] sure
[15:30:48] starting now
[15:40:00] andrewbogott: btw I see the TfInfraTest has been failing for a while, do we already have a task about it?
[15:40:43] Not that I know of
[15:46:38] btw I have this massive patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/971241 to make the openstack_controllers variable aware of cloud-private addresses. the PCC is a no-op on system changes, which hopefully makes it easier to review
[15:52:02] taavi: it was on my todo list for today, but another pair of eyes from andrewbogott would be useful :)
[15:59:30] the first puppet run after the reimage of cloudrabbit1001 had a couple of ferm errors, but the second run was fine. the server is now rebooting.
[16:03:51] I predict that we'll want to just do a whole reset/recreate after you're done, but you might as well reimage all three first.
[16:17:13] the cookbook has completed successfully on cloudrabbit1001 and puppet looks fine now (the run after the reboot had no changes)
[17:11:40] dhinus: I haven't played minecraft at all, are there APIs so that a giant thing like the CHUNGUS was created algorithmically? Or did the creator literally point and click every one of those logic gates? Is there at least copy/paste?
[17:12:30] I haven't played it either, I assume there was some automation?
[17:12:40] I sure hope so
[17:12:40] (we're talking about this https://www.youtube.com/watch?v=FDiapbD0Xfg)
[17:13:02] it says "7 months of work"... so I wouldn't rule out it's manual :D
[17:13:15] omg it has a raster display!
[17:16:09] taavi: the task I was looking for is T345069
[17:16:10] T345069: Extend "test-cookbook" to support wmcs-cookbooks - https://phabricator.wikimedia.org/T345069
[18:28:51] I need to go to a pre-planned lunch; the openstack APIs may be a bit broken but I will continue to fix things when I'm back.
[19:13:06] * bd808 lunch
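For reference, the query timeout discussed at [15:21:59] corresponds to MariaDB's `max_statement_time` variable (in seconds; 0 means no limit). A minimal sketch of setting it at runtime, assuming a pymysql client; the host and credentials below are placeholders, not the real tools-db values:

```python
# Minimal sketch: apply the ~10-minute statement timeout discussed above via
# MariaDB's max_statement_time variable (value in seconds). Host and
# credentials are hypothetical placeholders.
import pymysql

conn = pymysql.connect(host="tools-db.example.org", user="admin", password="changeme")
try:
    with conn.cursor() as cur:
        # Statements running longer than 600 seconds are aborted by the server.
        cur.execute("SET GLOBAL max_statement_time = 600")
        cur.execute("SELECT @@GLOBAL.max_statement_time")
        print(cur.fetchone())
finally:
    conn.close()
```

Note that `SET GLOBAL` only affects connections opened after the change; a persistent setting would normally live in the server config (here, presumably managed via puppet) rather than be applied by hand like this.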
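Similarly, the "split brain" scenario worried about at [15:28:17] shows up in RabbitMQ as network partitions in `rabbitmqctl cluster_status`. A quick detection sketch, assuming a rabbitmq-server recent enough (3.8+) to support the JSON formatter and a user permitted to query the local node:

```python
# Minimal sketch: check a RabbitMQ cluster member for network partitions
# ("split brain") after a reimage. Assumes `rabbitmqctl cluster_status
# --formatter json` is available and we can talk to the local node.
import json
import subprocess

result = subprocess.run(
    ["rabbitmqctl", "cluster_status", "--formatter", "json"],
    capture_output=True, text=True, check=True,
)
status = json.loads(result.stdout)
partitions = status.get("partitions", [])
if partitions:
    print(f"network partitions detected: {partitions}")
else:
    print("no partitions reported; running nodes:", status.get("running_nodes"))
```

Running this on each of the three cloudrabbit hosts after the reimages would be one way to confirm the "whole reset/recreate" predicted at [16:03:51] is actually needed before doing it.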