[08:58:41] hey @paladox it seems like all your swiftobject11* stores ran out of disk space in the last ~12 hours
[08:59:10] tbh if i were y'all i would figure out why swift is working SO hard, none of it really makes sense imo
[10:11:13] fellas, problems w/ files continue per reports in #general and #support (upload, moving, cache)
[10:11:26] any update?
[10:12:19] I can only guess that Paladox is the only person who can deal w/ that?
[11:02:45] Oh. I didn't look, but if swift is out of storage somebody might want to disable uploads or something
[11:02:52] Not too many people that can do anything about that though
[11:07:24] then we shall wait ...
[11:42:33] the disks on each of swiftobject* look like this
[11:45:58] swiftobject122 theoretically has a bit more space actually
[11:46:37] did you guys do anything around june 4th to rejigger the available space?
[11:47:15] data dumps tho? when the panic started?
[11:47:46] june 4th predates the initial panic
[11:47:58] and the spike today is sort of after everything subsided
[11:48:11] it feels like a whole month passed ...
[11:48:26] I see
[13:43:56] Iirc they were pretty recently either expanded or some space was cleaned up, but idk when that was
[13:47:33] we changed the weight to put more pressure on the largest node because some had filled up.
[13:48:59] i changed the weight yesterday but it seems that buggered things. I need to speak to void when he's online
[13:55:24] If you want I can try to help audit your usage
[13:55:39] we have different disk sizes
[13:55:56] we have 2 SATA disks that we cannot use as they are slow asf
[13:56:08] we have some disks in cloud10
[13:56:27] but i need to speak to void because we want to move off cloud10, and moving the HDD may be difficult
[13:56:28] As an emergency thing you could delete thumbnails that haven't been accessed in the last day. Like how much of your disk space is actually taken up by originals?
[13:56:49] but i can set up some vms which would relieve the pressure if he ok's it
[13:56:58] Before throwing resources at it, I would suggest trying to audit the usage
[13:57:10] you cannot delete when the disk is full
[13:57:56] swift keeps files around for a bit (replication thing). I lowered it from 7 days to 6 hours.
[13:58:08] https://github.com/miraheze/puppet/blob/master/modules/swift/templates/object-server.conf.erb#L22
[13:58:53] we don't actively use replication (we only store one copy, not >1). It's used to load balance data.
[14:00:22] I'm assuming deleting raw disk contents would be bad, but does swift not have a way to access it and be like "delete this file, for real"?
[14:01:17] well you can delete on disk, using swift-object-info to find out what the file is for
[14:02:13] Can you find out how many objects you have in swift and how many of them are thumbs?
[14:02:25] And what the total size is for each group
[14:12:32] https://grafana.miraheze.org/d/OPgmB1Eiz/testing-swift?orgId=1&viewPanel=9
[14:12:46] looks like 4.8 mil objects
[14:14:55] it's going to be difficult to find out how many are thumbs
[14:15:57] there's 8408 thumb containers
[14:25:13] That sharp drop is interesting
[14:26:16] @paladox is there still a public/guest login for icinga?
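A sketch of what the on-disk audit discussed above could look like on one of the object nodes. The "7 days to 6 hours" change is presumably Swift's reclaim_age setting (its stock default is 604800 seconds, i.e. 7 days), but that is an inference from the numbers quoted, not something confirmed here; the object file path, config path, and container name below are likewise illustrative, and the swift client commands assume working credentials.

```sh
# Map an on-disk object file back to its account/container/object,
# per the "swift-object-info to find out what the file is for" suggestion.
# The path is a made-up example of the usual /srv/node layout.
swift-object-info /srv/node/sda3/objects/1024/27e/d41d8cd98f00b204e9800998ecf8427e/1717500000.00000.data

# Check the reclaim setting actually in effect on an object server;
# 604800 s is 7 days and 21600 s would match the "6 hours" mentioned above.
grep -n 'reclaim_age' /etc/swift/object-server.conf

# Count thumbnail-looking containers and get per-container object/byte totals.
# Requires python-swiftclient with valid credentials; "local-thumb" is the
# container naming pattern that comes up later in this conversation.
swift list | grep -c 'local-thumb'
swift stat examplewiki-local-thumb
```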
[14:26:31] i don't think so
[14:26:35] Rip
[14:26:42] we moved it to use ldap
[14:27:15] grafana should no longer go down when the db goes down 😄
[14:30:38] It's fineeeee until ldap goes down
[14:36:46] well it should still work
[14:39:02] If we're trying to move off cloud10, then db101 and db112 need to move at some point; I believe those are the only two vms on there at the moment
[14:49:30] we can't, it requires someone to go to the DC. Disks need installing on cloud13/14
[15:53:45] interesting, do you know the relative sizes of the disks on the different swift servers?
[15:53:56] because it seems like you have a lot more than 1400GB of space available
[16:58:33] also confused by that graph if you zoom out to ~6 months
[18:48:06] <.labster> This is a recurring problem, yes? Do we have alerts set up?
[18:52:33] AFAIK, Icinga does monitor disk space
[18:54:51] Yeah we monitor. But it's pointless when a) we've run out of total disk space and b) Swift doesn't work well with differently sized hard drives (unless you use weights properly, I guess). I thought using a weight of 100 would do the right thing for us, but nope.
[18:55:03] @.labster
[18:56:32] <.labster> There's so much to unpack
[18:58:08] Well we have cloud10 as I explained earlier, but it's more complicated and I need to discuss it with void
[19:17:14] i strongly suggest trying to figure out what is actually taking up that space, or understanding why the graphs are the way they are
[19:17:32] what's the total disk space available to the swift servers?
[19:19:56] There are 5 object servers: 3 have 558GB disks, 1 has a 498GB disk, and 1 has a 931GB disk
[19:23:28] so you have ~3100GB of space available, but the monitoring says you have ~1400GB of files stored
[19:23:42] and you're basically completely out of disk space
[19:24:04] where's the discrepancy?
[19:24:43] system reserved
[19:25:27] Half-full, not half-empty I guess 😛
[19:26:42] the graph also looks really weird if you zoom to 6 months, any idea what's going on there?
[19:28:40] Oh I think the stats may be wrong because of the outage we had
[19:28:56] Both disks failed, causing corruption for the ac server
[19:29:30] According to icinga swiftobject122 still has ~200GB of disk available
[19:29:30] Files are still accessible if you know the full path, but may not show in listings (containers)
[19:29:59] Huh
[19:35:16] Ac server?
[19:35:25] account/container
[19:35:32] Ahh
[19:35:36] uses ssds
[20:16:18] We've got a plan to basically delete every file in local-thumb containers (these can all be regenerated from the original file, iirc) and repair the listing for every other container
[20:38:29] Could it be an idea to purge very very old data dumps?
[20:39:05] Maybe?
[20:39:07] But we
[20:39:34] Or at the very least old data dumps where new ones exist
[20:39:41] But we'd have to find them first, which would be about as much effort
[20:39:52] Can't generate a new data dump without deleting the old one
[20:39:59] Ah ok
[20:41:24] For the record there's about 1381902 local-thumb files on swiftobject111 alone, and I already know how to find them.
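One way the earlier "what's the total size for each group" question could be approximated without crawling every .data file on disk: sum the byte counts the account/container layer reports for each local-thumb container. A sketch, assuming working python-swiftclient credentials; the corrupted container listings mentioned above would make the result a rough lower bound at best, and the parsing of the "Bytes:" line of swift stat output is an assumption about its format.

```sh
# Rough total size of all thumbnail containers, from container stats.
# Listings are reportedly corrupted here, so treat this as approximate.
total=0
for c in $(swift list | grep 'local-thumb'); do
    bytes=$(swift stat "$c" | awk '/Bytes:/ {print $2; exit}')
    total=$((total + ${bytes:-0}))
done
echo "approx bytes in local-thumb containers: $total"
```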
[20:46:20] i also assume thumbs are the smallest files on an individual basis, but hopefully clearing them all and letting them be regenerated on demand buys some time
[21:02:22] i've done https://github.com/miraheze/puppet/commit/c841ad714a8fce7f8a3afa6a2761b66cc3426f04 to try and prevent a full-disk scenario with rsync in the future
[21:07:30] ok there's some fixes that need doing
[21:57:31] be very very careful about this
[21:57:43] it might completely overwhelm your mediawiki servers
[21:58:40] actually i think it will probably be okay, but you should do it at whatever the low-traffic point of the day is
[21:59:01] and if i were you i would grep the nginx logs for some mw* server and get a sense of how many thumbs are being generated per (say) minute
[21:59:28] and please tell me what that number is. i think it's only gonna be about 300 a minute so it's probably okay, but if for some reason it's higher, you may need to be more careful with the plan
[21:59:46] > For the record there's about 1381902 local-thumb files on swiftobject111 alone, and I already know how to find them.
[21:59:46] do you know how much space these take up in total, compared to the non-thumb images?
[22:02:09] I don't have an easy way to guess this without running a script that would take weeks to complete.
[22:02:51] The script to delete the files should be running rather slowly, if the list generation is anything to go by
[22:03:07] I don't think regenerating thumbs would be a huge issue, i would just do it late at night
[22:03:26] I don't think we'd have a huge problem deleting them, but I'm not sure.
[22:03:47] Worst case scenario we have a few moments of high load on mw*
[22:04:28] Timing is a bit of a non-issue here, the script is probably going to be slow, which means it will be running for several days/weeks. Trying to only run it at particular times just doesn't seem possible.
[22:05:50] Yeah, should be fine to run whenever then
[22:27:47] Yeah that's a fair point
[22:30:04] Although I think it will be a fair bit faster than you're anticipating - deleting from an HDD is like orders of magnitude faster than writing
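A sketch of the per-minute measurement suggested above, i.e. grepping an mw* server's nginx access log for thumbnail requests and bucketing them by minute. The log path, the '/thumb/' URL pattern, and the combined-log timestamp in field 4 are all assumptions about the local setup.

```sh
# Requests per minute that look like thumbnail fetches, from the tail of the log.
# Field 4 of the combined log format is "[dd/Mon/yyyy:HH:MM:SS"; taking 17
# characters after the "[" buckets entries by minute.
tail -n 200000 /var/log/nginx/access.log \
  | grep '/thumb/' \
  | awk '{print substr($4, 2, 17)}' \
  | sort | uniq -c | sort -rn | head
```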