[12:58:40] greetings
[14:59:41] \o
[15:24:45] i'm unsure... do we continue fighting with swift-s3 and s3-repository, or do i port the swift-repository we used last time to our current version?
[15:27:12] i suppose it's a question of abandoning the longer-term goal of having a prepared recovery process we can use when necessary
[15:37:11] ebernhardson I think the latest problem is with the thanos swift infra itself, right? We're overloading it?
[15:50:30] inflatador: not clear if we are overloading it or what, but elastic tries multiple times to download the snapshots and gets a different amount of data each time
[15:50:51] suggests to me something is hanging up in the middle, some sort of network overload seems plausible
[15:51:04] (but the cluster overview graphs that show network saturation don't seem bad)
[15:52:34] ebernhardson if it's easy enough to port the old plugin, it's probably worth the effort just to have a comparison. I'm mainly curious if the same problems exist under Elastic 7.10, but I'm not sure of a good way to test that, since deployment-prep can't talk to thanos-swift
[15:53:58] it's probably not too bad to port forward. It's more adjusting for updates in the elastic build system that have to be worked through; the actual code changes minimally
[15:54:44] OK, in that case, I'd be in favor of trying the swift plugin. If nothing else, it gives us a point of comparison
[15:55:07] ok, sounds like a plan
[15:56:55] i suppose another option: does thanos-swift go through some sort of golang middleware related to thanos? Maybe it was never really meant for, or well tested with, large files. Previously we talked to the primary media-storage swift cluster
[15:59:55] ebernhardson I know it uses Minio for its S3 compatibility, but beyond that I don't know
[16:01:41] swift itself has a cap for the size of a single file, but any middleware (we use Minio) "should" handle the file splitting/large object manifest stuff automatically
[16:02:12] elastic also splits the files, although the smallest chunk_size i've used is 1gb. Will try kicking one off with something silly like 50mb and see if it does anything different
[16:02:14] anyway, I don't think that's the issue, but we can always ask the data persistence team to take a look
[16:04:40] digging around in the elasticsearch-snapshot, I do see stuff like "indexname.part0" and "indexname.part1", guessing that is ES doing the split
[16:10:01] so the size of the part does match your chunk size (1 GB). I don't think we're hitting any internal limits for individual chunk size in swift; we'd get an error code back if so
[16:10:43] hmm, yea that seems sane. Warning though: i had elastic start deleting the snapshot that's there so it can start a new one fresh
[16:11:09] ah yeah, that's fine. Default max single-object size for swift is 5 GB FWIW
[16:12:45] not really convinced it will do anything, but since it's just issuing a couple of commands and doing other things while waiting, i'll run another snapshot/restore with lowered settings
[16:13:49] worth a shot. FWIW the bullseye upgrade is finished in cloudelastic and we're back to green
[16:14:00] at least something is going right :)
[16:17:12] yeah, although it brings to light some NIC firmware issues... wish it would just automatically fix those on reimage, but that's another discussion ;(
[17:34:20] well, it's certainly splitting up into lots of tiny files now. Random hope that helps
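(For reference, the chunk_size change discussed above is a single repository-settings call. A minimal sketch, assuming a hypothetical repository name, bucket, and endpoint since the real values aren't in this log; `chunk_size` is a standard setting of the `s3` repository type.)

```python
# Hedged sketch: re-register the snapshot repository with a smaller chunk_size
# so Elasticsearch splits each shard file into many small objects in Swift.
# "swift_test", the bucket, and the endpoint are placeholders, not real values.
import requests

repo_settings = {
    "type": "s3",
    "settings": {
        "bucket": "elastic-snapshots",       # placeholder bucket/container name
        "endpoint": "thanos-swift.example",  # placeholder S3-compatible endpoint
        "chunk_size": "50mb",                # was 1gb before; shows up as indexname.part0, .part1, ...
    },
}

resp = requests.put("http://localhost:9200/_snapshot/swift_test", json=repo_settings, timeout=30)
resp.raise_for_status()
```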
[17:45:28] more chances for it to report the wrong file size ;P
[18:00:21] lol
[19:18:38] random airflow ideas: right now we configure inputs for each table to read in each place it's read. Maybe we could have some sort of centralized bit that we configure once per table about how to read that table, and then functions that create appropriate sensors
[19:41:31] hmm, cirrus errors just spiked, but my script to check cross-cluster seems fine. Something else is complaining.
[19:43:50] various index_not_found exceptions, which are what cross-cluster was erroring with before. running scripts/check_indices.py from cirrus, which will figure out all the indices that should exist and then report if anything is missing
[19:44:40] I see an alert for 'Logstash Elasticsearch indexing errors', that's not related, is it?
[19:46:05] inflatador: shouldn't be, they are separate clusters
[19:46:48] this check script takes a moment, it needs maybe half a sec * 900 wikis to source what should exist
[19:46:54] inflatador: ^ correct, stuff with logstash in the name refers to o11y's cluster
[19:47:56] on the upside, i see commonswiki_file on both clusters so it's not that :)
[19:49:22] hmm, and now i got the recovery email. not sure what would make it a temporary problem, grafana shows it errored for about 15 minutes and then stopped
[19:52:28] nothing surprising in the output, seems sane
[19:53:02] i suppose i'll make a note to check into whatever that was later. Plenty of logstash logs generated
[19:57:29] gotxh
[19:57:38] err, gotcha
[19:58:04] assuming logstash itself wasn't borked during the problems ;)
[19:58:43] i assume backpressure from logstash ES wouldn't cause this type of problem
[20:00:11] guessing not, or there would be a lot more alerts
[20:00:41] yea, it shouldn't. There is certainly some level of error logging that would cause issues for them, but we only generated 50k logs in 10 minutes, which is a lot but nothing crazy
[20:01:02] and those all go into kafka first
[20:13:59] restore for commonswiki_file started again on cloudelastic. It's being pretty gentle now, 75-100MB/s across the cluster to swift
[20:15:13] it still wants to do 18 shards at a time, but each shard going slowly
[20:30:47] maybe optimistic, but it's been running for a while and it hasn't given a shard failure/retry message yet (previously it would fail at least one within 10 min). But maybe it's just because i made it run slower and now it needs 40 min to get to the first failure
[20:41:08] hopefully not, but I guess we'll see
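(The log doesn't say exactly which settings were lowered for the gentler restore; most likely it was the repository chunk_size above. As an illustration of the usual knobs for bounding restore speed and the "18 shards at a time" concurrency, here is a sketch using standard dynamic Elasticsearch cluster settings; the values are illustrative, not what was actually applied.)

```python
# Hedged sketch: cap per-node recovery bandwidth and concurrent incoming shard
# recoveries, which also bound how aggressively a snapshot restore runs.
# These are standard dynamic cluster settings; the values are placeholders.
import requests

settings = {
    "transient": {
        "indices.recovery.max_bytes_per_sec": "40mb",                # illustrative bandwidth cap
        "cluster.routing.allocation.node_concurrent_recoveries": 1,  # fewer shards restoring per node at once
    }
}

resp = requests.put("http://localhost:9200/_cluster/settings", json=settings, timeout=30)
resp.raise_for_status()
print(resp.json())
```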
[20:42:14] anyone know if we are deliberately not using 10G NICs in any hosts? Like we have the 10G NIC but it's unconfigured? I found this one https://netbox.wikimedia.org/dcim/devices/2240/, just wondering if that is because it's in a 1G-only rack or something
[20:45:45] inflatador: there was a set of servers we received and there weren't enough 10G ports to plug them in, so they got 1G since we still had 1G hosts in the rest of the cluster
[20:46:31] we should ask them to plug them in now (i think we put them on some list so they would have enough ports to plug them in, but unsure)
[20:49:02] ebernhardson got it. Just looking at NIC firmware and noticed some boxes don't have it as their primary interface
[20:52:01] also i could be mistaken, but my understanding is we don't have top-of-rack routing like in many other places; instead a whole row is wired into per-row routers of some sort. But that could be dated information
[21:02:17] ryankemper up at meet.google.com/bib-jnpp-xdq whenever. ebernhardson feel free to drop by if you want to watch us torture facter with ansible ;P
[21:04:21] see also https://gitlab.wikimedia.org/repos/search-platform/sre/nic-firmware-audit-t312298
[21:52:22] * ebernhardson wonders about changing airflow status colors, can barely tell the difference between lime and gold
[21:53:34] sadly pink and turquoise also look like the same color, maybe i need new eyes instead :P
[21:53:45] (all colors airflow uses to show task status)
[21:54:17] * ebernhardson thought pink and turquoise were both shades of gray until opening the css inspector
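(Airflow does allow overriding the task-state palette: it picks up an optional STATE_COLORS mapping from airflow_local_settings.py. A minimal sketch; the specific colors below are arbitrary higher-contrast picks, not a recommendation.)

```python
# airflow_local_settings.py -- placed in $AIRFLOW_HOME/config or on PYTHONPATH.
# Hedged sketch: Airflow reads an optional STATE_COLORS dict to color task
# states in the UI; any state not listed keeps its default color.
STATE_COLORS = {
    "success": "#008000",       # darker green instead of lime
    "running": "#00bfff",
    "failed": "#b22222",
    "up_for_retry": "#daa520",  # replaces gold
    "queued": "#708090",
    "scheduled": "#d2b48c",
}
```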