[14:22:47] the cloud network seems stable, so I will reimage the second cloudnet (cloudnet1005), which is currently the standby host
[14:23:16] nice, lmk if I can help
[14:23:33] thanks, hopefully this one will be smooth
[14:23:51] not related to cloudnet, but do you have opinions/concerns on T350188?
[14:23:52] T350188: [openstack] Fix ceph-common version in Bookworm - https://phabricator.wikimedia.org/T350188
[14:27:59] does ceph upstream have any guidance on whether you should upgrade the servers or the clients first?
[14:30:35] that's a very good question, and I was hoping someone would already know the answer
[14:30:49] but maybe only david does :) I'll do some research!
[15:18:22] dhinus: thanks for looking at the tools-db OOM thing. Did you also try the OOM override? Or is that moot thanks to your other fixes?
[15:19:05] I'm interested in more explanation from bd808 about that, by the way. I tend to assume that if the oom-killer fires /at all/ on a server, that server is in a potentially inconsistent state regardless of what it killed.
[15:19:31] I found that setting was already configured like bd808 was suggesting, but it's not enough
[15:19:53] welp
[15:20:08] seems like your changes must be helping, since I haven't gotten any more alerts...
[15:20:13] I left some more info in the phab task, including some tips from Manuel
[15:20:18] is that just because I wasn't on call?
[15:20:22] yep, reading
[15:20:27] no, it's been running since Friday
[15:20:48] but Manuel said it's likely to have issues again in the future if we don't find the root cause
[15:21:23] also, the mariadb docs seem to suggest most OOMs in mariadb are due to misconfiguration, so we might have to tweak a few more variables
[15:21:43] or be more aggressive in killing long queries; there are some "interesting" ones in the slow query log
[15:21:59] I think we should probably set a timeout, even a long one like 10 mins or so
[15:22:38] I will keep an eye out this week to see if there are any patterns or specific tools that are doing massive queries
[15:22:51] but still, the server should not just "crash" if a user tries a bad query
[15:25:29] Totally agree that maria ought to manage its own memory enough that the OS doesn't need to step in. As to whether it actually /can/ do that, I guess we will see :)
[15:26:00] But it seems like things are already quite a bit better, so nice work!
[15:26:07] the mariadb docs are kinda saying "we totally can do it, but you have to get the right combination of these 20 config variables" :P
[15:26:56] in other news, cloudnet1005 is reimaged and looking fine; the bridges came up correctly on the first boot
[15:27:16] (they needed an extra reboot on cloudnet1006, but it seems that taavi's fix worked!)
[15:27:37] nice
[15:28:17] I guess cloudrabbits are next, should I anticipate the usual cloudrabbit issues with split brains etc.? :)
[15:28:28] * andrewbogott sighs
[15:28:41] yes
[15:28:47] if you're around for the next hour I am slightly less worried :)
[15:29:03] I am actually home today, and moderately less distracted!
[15:29:13] nice. shall I fire off the reimage of the first one (cloudrabbit1001)?
[15:29:33] sure
[15:30:48] starting now
[15:40:00] andrewbogott: btw I see the TfInfraTest has been failing for a while, do we already have a task about it?
[15:40:43] Not that I know of
[15:46:38] btw I have this massive patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/971241 to make the openstack_controllers variable aware of cloud-private addresses. the PCC is a no-op on system changes, which hopefully makes it easier to review
[15:52:02] taavi: it was on my todo list for today, but another pair of eyes from andrewbogott would be useful :)
[15:59:30] the first puppet run after the reimage of cloudrabbit1001 had a couple of ferm errors, but the second run was fine. the server is now rebooting.
[16:03:51] I predict that we'll want to just do a whole reset/recreate after you're done, but you might as well reimage all three first.
[16:17:13] the cookbook has completed successfully on cloudrabbit1001 and puppet looks fine now (the run after the reboot had no changes)
[17:11:40] dhinus: I haven't played minecraft at all, are there APIs so that a giant thing like the CHUNGUS was created algorithmically? Or did the creator literally point and click every one of those logic gates? Is there at least copy/paste?
[17:12:30] I haven't played it either, I assume there was some automation?
[17:12:40] I sure hope so
[17:12:40] (we're talking about this https://www.youtube.com/watch?v=FDiapbD0Xfg)
[17:13:02] it says "7 months of work"... so I wouldn't rule out it's manual :D
[17:13:15] omg it has a raster display!
[17:16:09] taavi: the task I was looking for is T345069
[17:16:10] T345069: Extend "test-cookbook" to support wmcs-cookbooks - https://phabricator.wikimedia.org/T345069
[18:28:51] I need to go to a pre-planned lunch; the openstack APIs may be a bit broken but I will continue to fix things when I'm back.
[19:13:06] * bd808 lunch
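For reference, the query timeout discussed at [15:21:59] corresponds to MariaDB's `max_statement_time` variable (in seconds; 0 means no limit). A minimal sketch of setting it at runtime, assuming a pymysql client; the host and credentials below are placeholders, not the real tools-db values:

```python
# Minimal sketch: apply the ~10-minute statement timeout discussed above via
# MariaDB's max_statement_time variable (value in seconds). Host and
# credentials are hypothetical placeholders.
import pymysql

conn = pymysql.connect(host="tools-db.example.org", user="admin", password="changeme")
try:
    with conn.cursor() as cur:
        # Statements running longer than 600 seconds are aborted by the server.
        cur.execute("SET GLOBAL max_statement_time = 600")
        cur.execute("SELECT @@GLOBAL.max_statement_time")
        print(cur.fetchone())
finally:
    conn.close()
```

Note that `SET GLOBAL` only affects connections opened after the change; a persistent setting would normally live in the server config (here, presumably managed via puppet) rather than be applied by hand like this.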
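Similarly, the "split brain" scenario worried about at [15:28:17] shows up in RabbitMQ as network partitions in `rabbitmqctl cluster_status`. A quick detection sketch, assuming a rabbitmq-server recent enough (3.8+) to support the JSON formatter and a user permitted to query the local node:

```python
# Minimal sketch: check a RabbitMQ cluster member for network partitions
# ("split brain") after a reimage. Assumes `rabbitmqctl cluster_status
# --formatter json` is available and we can talk to the local node.
import json
import subprocess

result = subprocess.run(
    ["rabbitmqctl", "cluster_status", "--formatter", "json"],
    capture_output=True, text=True, check=True,
)
status = json.loads(result.stdout)
partitions = status.get("partitions", [])
if partitions:
    print(f"network partitions detected: {partitions}")
else:
    print("no partitions reported; running nodes:", status.get("running_nodes"))
```

Running this on each of the three cloudrabbit hosts after the reimages would be one way to confirm the "whole reset/recreate" predicted at [16:03:51] is actually needed before doing it.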