[08:21:35] mmhh thanos-be2004 is basically out of disk space on sdb3 (the ssd for containers), I'm taking a look at what's up with that
[08:23:00] there's a couple of big container dbs in "quarantined" status, that's what I found so far
[08:27:36] :-/
[08:29:13] does thanos have smaller SSDs or just containers with many more entries in?
[08:31:17] ssd size should be the same IIRC, but yeah the tegola/osm containers are huge
[08:31:37] not a new problem sadly, i.e. https://phabricator.wikimedia.org/T307184
[08:40:19] I'll open a task for tracking and move the quarantined container dbs out of the way into the sd[ab]4 partitions
[08:40:41] which we could shrink to make space for the container partitions, if it comes to that
[09:18:33] godog: checking on replication, the frontends are getting ECONNREFUSED when polling the new backend nodes for replication info; do I need to do a rolling restart or somesuch (e.g. to make something notice the change to swift::storagehosts)?
[09:18:42] (this is ms)
[09:25:30] backends seem to have got at least some data on them, so I think they're working OK
[09:27:08] backends have a ferm conf.d file with the swift nodes in it
[09:31:25] Hm, problem seems to be on the backends.
[09:32:40] godog: new object backends lack a listener on port 6022, whereas older ones have something listening (swift-object-server). Going to try restarting swift on one backend
[09:33:57] Aug 01 09:33:24 ms-be2066 object-server[3044374]: Unable to bind to port 6012: >
[09:34:43] oh, red herring, that's not the affected port, and it got there in the end
[09:35:23] right, that fixed it on that node, so I'll do it on the others.
[10:06:23] Emperor: hah! so a reboot did it? or a restart of object-server?
[10:19:35] I restarted swift-* because I wanted a moderately-sized hammer :)
[10:20:09] fair!
[10:24:02] ok so I freed some space on thanos-be2004, though I think depending on how things shuffle there isn't enough space for the tegola containers
[10:24:11] :(
[10:24:26] also, the eqiad dispersion report picked up way too many unmounted disks, going to investigate
[10:25:07] sigh, I'm assuming for the ms cluster?
[10:25:13] yep
[10:26:18] FCOL, some of this is the new nodes having stupid disks. ms-be1071 has 2x swift-sdl1 and 0x swift-sd01
[10:27:34] sigh
[10:28:02] I'll fix it, but :sadface:
[10:43:48] ok, for T314275 I don't see many other short-term solutions but to shrink sd[ab]4 and grow sd[ab]3
[10:43:49] T314275: thanos-be2004 sdb3 fully used - https://phabricator.wikimedia.org/T314275
[10:58:27] that's going to be a bit painful, isn't it?
[11:02:48] sorry, a bit painful?
[11:03:36] well, they're not LVM partitions, so changing them is going to be quite invasive? I think it's probably the only answer, though.
[11:06:27] yeah, I think we're going to have to lose the filesystem on the "4" partition (not a lot of data on those, not a huge deal), though for the "3" partition I think we can extend the partition and grow the filesystem
[11:08:20] I'll test ^ in pontoon after lunch
[11:09:59] I think I've got all the ms drives back into a known-plausible state (except for the two waiting for repair)
[11:10:41] nice!
[11:10:45] * godog lunch, bbiab
[15:15:22] Emperor: ok I got this out, I've tested it on thanos-be-01 and it looks to me like it does the right thing https://gerrit.wikimedia.org/r/c/operations/puppet/+/819095
[15:15:58] 👀
[15:21:56] I think I'd have been tempted to use shell for this, since it's so much gluing commands together :)
[15:23:56] :)
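
A minimal sketch of how one might find the large quarantined container DBs eating sdb3 (08:21-08:40) and park them on the roomier sd[ab]4 partitions. The /srv/swift-storage mountpoint layout and the quarantined/ path are assumptions, not the exact commands used.

```
# Assumed mountpoint layout (/srv/swift-storage/<device>); adjust to the host.
df -h /srv/swift-storage/sdb3

# Largest container DBs on the SSD partition, quarantined ones included
find /srv/swift-storage/sdb3 -name '*.db' -printf '%s\t%p\n' 2>/dev/null \
    | sort -rn | head -20 | numfmt --field=1 --to=iec

# Move a quarantined DB onto the bigger HDD partition to buy headroom
# (illustrative path only; swift no longer serves quarantined data anyway)
mv /srv/swift-storage/sdb3/quarantined/containers/<hash> /srv/swift-storage/sda4/
```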
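The ECONNREFUSED at 09:18 turned out to be missing listeners on the new backends (09:32), fixed by restarting swift there (10:19). A rough sketch of the checks and the "moderately-sized hammer", assuming systemd units matching swift*; host and port values are taken from the log.

```
# From a frontend: is the backend reachable on its replication port at all?
nc -zv ms-be2066 6022

# On the backend: which swift daemons are actually listening?
sudo ss -tlnp | grep -E ':60[0-9]{2}\b'

# The moderately-sized hammer: restart every swift unit on the backend
sudo systemctl restart 'swift*'

# Confirm the object server came back and bound its ports this time
sudo journalctl -u 'swift*' --since '10 min ago' | tail
sudo ss -tlnp | grep ':6022'
```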
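For the dispersion report flagging unmounted disks and ms-be1071's duplicated label (10:24-10:26), a sketch of spotting and fixing a mislabelled XFS partition, assuming the swift filesystems are mounted by label; the device names below are placeholders, not the real ms-be1071 layout.

```
# Which filesystem labels exist, and which ones are duplicated or missing?
lsblk -o NAME,FSTYPE,LABEL,MOUNTPOINT
sudo blkid -s LABEL | sort

# Relabel the partition that wrongly got "swift-sdl1" (placeholder device);
# XFS labels can only be changed while the filesystem is unmounted.
sudo umount /dev/sdX1
sudo xfs_admin -L swift-sdX1 /dev/sdX1
sudo mount -a
```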
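The shrink-sd[ab]4 / grow-sd[ab]3 plan (10:43, 11:06) was ultimately handled by the puppet change above (819095); purely as a back-of-the-envelope illustration of the manual equivalent, assuming the partitions are adjacent, the filesystems are XFS (so "3" can grow online but "4" has to be recreated), and sdb4's data has already been moved off.

```
# NOT the actual change -- a rough manual sketch only.
sudo umount /srv/swift-storage/sdb4
sudo parted -s /dev/sdb rm 4

# Recreate a smaller "4" at the tail of the disk (the 90% boundary is a placeholder)
sudo parted -s /dev/sdb mkpart container4 xfs 90% 100%
sudo mkfs.xfs -f /dev/sdb4
sudo mount /dev/sdb4 /srv/swift-storage/sdb4

# Now extend "3" into the freed gap and grow its filesystem in place
sudo growpart /dev/sdb 3
sudo xfs_growfs /srv/swift-storage/sdb3
```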