[00:00:45] the part i don't get is that the Content-Length comes from swift. You'd think swift would have some of its own checks if the content-length doesn't match the object it stored
[00:00:57] ebernhardson: so is it safe to say we've ruled out the content length error being caused by stop-the-world gc (given that it's not occurring frequently anymore and we're still seeing the error)?
[00:01:06] ryankemper: i think so, yes
[00:01:37] before, i was thinking the http connections were just hanging up while the GC ran and it didn't respond for sometimes a couple of minutes, but that doesn't seem to be the case
[00:03:33] i'm wary of killing it though, i suppose i'll wait for it to try its best and restore what it can before evaluating
[00:11:07] Feels like the error might be on the swift-s3 side of things, but not quite sure how to troubleshoot
[00:11:31] But the content length thing makes me think that s3 is promising to send `1069931009` bytes and only actually sending `734068736`
[00:13:04] yea, hmm. I wonder if we can try pulling a random file out with the swift cli client; we'd have to try and match it up with some error message
[00:14:44] looks like they've had enough problems that elastic added (sometime in 7.x) a repository _analyze endpoint that looks for incorrect behaviour
[00:14:51] "There are a large number of third-party storage systems available, not all of which are suitable for use as a snapshot repository by Elasticsearch. Some storage systems behave incorrectly, or perform poorly, especially when accessed concurrently by multiple clients as the nodes of an Elasticsearch cluster do."
[00:15:35] looks like it mostly uploads and downloads a variety of blobs to see if it gives back what it stored
[00:16:39] we could maybe try setting chunk_size to something silly, like 100MB, and see if it complains less
[00:17:03] defaults to unlimited, whatever size the source files on disk were
[00:18:02] interesting, elastic does keep trying; it's failed a few shards 5 times now
[00:18:21] trying to figure out if they always fail on the same part of the shard...
[00:21:36] fwiw, looking at 5 failures on shardId 14, each time has a different amount of data received
[00:23:12] each time it expects the same data size, and each time the total size received is different
[00:25:23] ryankemper: any idea where swift logs things? manifests/site.pp in puppet suggests the data is stored on thanos-be100[1234].eqiad.wmnet but i'm not seeing those mentioned in logstash
[00:25:35] wondering if swift is saying anything
[00:27:14] the thanos-store grafana dashboard doesn't suggest it's complaining
[00:27:36] but that might not be swift, it's not particularly clear to me
[00:28:58] the graphs make it look like it's been working harder since 23:30 though, which lines up reasonably with when the restore started
[00:32:23] can try and turn down the parallelism and see if pushing thanos less makes it happier, i dunno :S
[00:46:21] oh, that's a suspicious one. Thanos backend disk utilization rates go from ~50% to almost a steady 100% as soon as we start the restore
[00:46:49] going to try and throttle it down by limiting restore bytes/sec and concurrent restores (and going to delete the commonswiki_file index it's still trying to restore)
[00:52:09] Sounds reasonable
[00:53:00] it still doesn't really make sense that it would close the connections without sending all the data during excessive disk-io, but i'm lacking good ideas :P
[00:57:19] also, stopping the restore doesn't seem to have reduced thanos disk io. Going to wait a bit before starting a new restore i guess and see if it calms down
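A minimal sketch of the two ideas discussed above: re-registering the S3 snapshot repository with a smaller chunk_size plus snapshot/restore throttling, and then running the repository _analyze endpoint (added in 7.12, the "sometime in 7.x" mentioned above). The host, repository name, bucket, and client name are placeholders, not the actual production values, and the _analyze parameters would need tuning.

```python
# Sketch only: re-register the snapshot repo with a smaller chunk_size and
# throttled snapshot/restore rates, then run the repository analyzer.
# Host, repo name, bucket, and client are hypothetical placeholders.
import json
import requests

ES = "http://localhost:9200"     # assumed Elasticsearch endpoint
REPO = "swift_s3_backup"         # placeholder repository name

repo_settings = {
    "type": "s3",
    "settings": {
        "bucket": "elastic-snapshots",        # placeholder bucket
        "client": "default",                  # S3 client config from elasticsearch.yml
        "chunk_size": "100mb",                # per the log, default is unlimited (source file size)
        "max_snapshot_bytes_per_sec": "40mb",
        "max_restore_bytes_per_sec": "40mb",
    },
}
r = requests.put(f"{ES}/_snapshot/{REPO}", json=repo_settings)
r.raise_for_status()

# Repository analyzer: writes and reads back a variety of blobs and reports
# any incorrect behaviour it observes from the storage backend.
analyze = requests.post(
    f"{ES}/_snapshot/{REPO}/_analyze",
    params={"blob_count": 100, "max_blob_size": "100mb", "timeout": "120s"},
)
print(json.dumps(analyze.json(), indent=2))
```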
[01:30:35] heh, went away for a bit and came back. disk io is lowering, but i suspect we pushed it pretty hard. Will poke it after dinner
[01:31:18] oh, actually it looks like it does this every day starting before 00:00, maybe that's daily compaction and unrelated to us
[01:31:36] maybe not every day, i dunno. either way, will try later
[02:00:39] unrelated but curious error from elastic2049: Likely root cause: java.nio.file.FileSystemException: /srv/elasticsearch/production-search-psi-codfw/nodes/0/node.lock: Read-only file system
[02:08:27] seems to have started july 3, maybe a bad disk on the controller or something. who knows
[09:46:44] * cormacparle waves
[09:47:22] I read through the backscroll here and it looks like you were trying to sort out the commonswiki_file index, but I don't really understand where you got to
[09:47:32] is there anyone in search online who can enlighten me?
[13:04:20] greetings
[13:11:37] cormacparle I'm reading up through the scrollback, more details in https://phabricator.wikimedia.org/T309648
[13:58:29] relocating to my sister's house, back in ~45
[16:17:04] @Trey314159: @ebernhardson: oops, my pc shut down. joining in a minute
[16:21:57] e-bernhardson ryankemper I'm starting to reimage cloudelastic to bullseye now, will do one host at a time via the SRE hosts reimage cookbook, manually moving to the next host once the cluster goes back to green
[16:23:03] inflatador: ebernhardson: how come not the rolling operation cookbook?
[16:24:59] ryankemper we just added the reimage flag and it didn't work last time I tried it. Do you think I should try it again? I'm fine with that
[16:25:48] inflatador: yeah, let's try it once and then we can switch to manual if it fails spectacularly. IIRC last time it failed was due to BIOS stuff, but i don't fully remember
[16:28:01] ryankemper ACK, will do. For context, e-bernhardson and I talked this morning about putting the restore on hold and getting cloudelastic and prod up to bullseye before we revisit. If you have concerns about this, let us know
[16:31:12] dropping off kids, back in ~15
[17:03:13] sorry, been back
[17:07:02] FYI I am restarting elastic services on our lone bullseye server (cloudelastic1006) so we can use the --start-datetime flag on the reimage operation and avoid reimaging it again
[17:12:56] cloudelastic is now back in red?!? Not sure what happened there, but watching the reallocation
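For "watching the reallocation", a minimal sketch of the sort of polling that applies here, using the standard cluster health, cat indices, and cat recovery APIs; the endpoint is a placeholder and the polling interval is arbitrary.

```python
# Sketch: watch cluster health, red indices, and active shard recoveries
# while the cluster reallocates. The endpoint is a placeholder.
import time
import requests

ES = "http://localhost:9200"  # assumed cloudelastic endpoint

while True:
    health = requests.get(f"{ES}/_cluster/health").json()
    print(f"status={health['status']} "
          f"initializing={health['initializing_shards']} "
          f"unassigned={health['unassigned_shards']}")

    # Which indices are actually red?
    red = requests.get(f"{ES}/_cat/indices",
                       params={"health": "red", "format": "json"}).json()
    for idx in red:
        print(f"  red index: {idx['index']}")

    # What is currently recovering?
    recoveries = requests.get(f"{ES}/_cat/recovery",
                              params={"active_only": "true", "format": "json"}).json()
    print(f"  active recoveries: {len(recoveries)}")

    if health["status"] == "green":
        break
    time.sleep(30)
```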
[17:16:47] inflatador: looking
[17:19:12] inflatador: `red open commonswiki_file_1647921177 XiIkjW1UTa-xOikTC7CIZQ 32 0 20915531 8146704 261.8gb 261.8gb` Looks like the commonswiki_file index was created without any replication, so that makes sense
[17:19:28] inflatador: I think we should blow away the index and create a new one in its place, and then set the replica count to 2
[17:31:17] go ahead and delete commonswiki_file, that was me attempting a restore this morning from a new snapshot with increased throttling on both snapshot and restore, but it didn't work
[17:31:17] ebernhardson ACK will do
[17:31:17] OK, deleted commonswiki_file_1647921177, we're back to yellow
[17:31:17] quick lunch, back in ~20
[17:48:34] back
[18:10:56] started the reimage via the rolling-operation cookbook per ryankemper's suggestion, it's on cloudelastic1003 atm
[18:43:14] bah, cloudelastic1003 is booting into the interactive installer
[18:44:15] :(
[18:44:55] This happened when we did cloudelastic1006, checking my notes for the fix there
[19:05:57] i wish more developers were on the "explain why" bandwagon. It looks like the fulltext head queries dashboard broke because superset changed from templating being default-on in pre-1.0 to default-off in 1.0. But the patch makes no attempt to explain *why* they turned it off (but i could guess): https://github.com/apache/superset/pull/11172
[19:06:32] * ebernhardson submits short patch and hopes analytics deploys it :P
[20:01:32] relocating, back in ~15
[20:25:55] running the wikidata-query-rdf-maven-release-docker job in jenkins, iirc that's the one that will release a new version of all the jars in wikidata/query/rdf
[20:34:55] for some reason i always find these notes in wikitech amusing: This page was last updated in 2015 and may be outdated. Please update it if you can.
[20:35:56] hmm, failed in the test suite: RdfClientIntegrationTest.retriesOnTimeout » Unexpected exception
[20:45:50] * ebernhardson runs the tests locally. Then remembers he has to kill chrome to prevent OOM when building some java things
[20:48:10] +1 for "explain why".. though I tend to perhaps over-document
[20:48:39] no such thing! Also, back
[21:02:35] running firmware updates on cloudelastic1003, will keep you posted on status
[21:23:14] worked the second time to release the rdf jars, we have intermittently failing tests :(
[21:26:47] ryankemper: Firmware update for cloudelastic1003, expect it to take up to 3 more hrs based on internet chatter. Details on how to check progress at https://phabricator.wikimedia.org/P30932
[21:27:20] inflatador: excellent, thanks. will monitor
[21:28:57] ryankemper np, I'm doing a half day tomorrow (morning only), will check in w/ you then
[21:29:16] inflatador: sounds good, enjoy the time w/ family!
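Circling back to the commonswiki_file cleanup discussed around 17:19-17:31 above, a minimal sketch of dropping the unreplicated index and bumping the replica count on its replacement. The host and the replacement index name are placeholders, not the actual cloudelastic values.

```python
# Sketch of the cleanup discussed at ~17:19/17:31: delete the red,
# unreplicated restore target, then raise the replica count on whatever
# index replaces it. Host and new index name are placeholders.
import requests

ES = "http://localhost:9200"                  # assumed cloudelastic endpoint
OLD_INDEX = "commonswiki_file_1647921177"     # red index, restored with 0 replicas
NEW_INDEX = "commonswiki_file_restored"       # hypothetical replacement index

# Delete the red index so the cluster can go back to yellow/green.
requests.delete(f"{ES}/{OLD_INDEX}").raise_for_status()

# Once a replacement exists, give it two replicas.
requests.put(
    f"{ES}/{NEW_INDEX}/_settings",
    json={"index": {"number_of_replicas": 2}},
).raise_for_status()
```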
[22:54:38] interesting - No Language Left Behind (NLLB) is a first-of-its-kind AI breakthrough project that open-sources models capable of delivering evaluated, high-quality translations directly between 200 languages, including low-resource languages like Asturian, Luganda, Urdu and more.
[23:03:34] the code for NLLB is at https://github.com/facebookresearch/fairseq/tree/nllb/
[23:05:22] i'm not entirely sure what we would do with it, probably nothing in search, but i like the idea (perhaps too much NIH) of hosting translations ourselves instead of outsourcing to google's APIs
[23:06:18] maybe it needs a different term, NHH - not hosted here
[23:06:44] machine translation is never ideal, but yeah, if we could use this to help assist human translations and also avoid closed-source projects, that would be neat.
[23:08:18] I've been in the room when Facebook rug-pulled us before though (hhvm php compat), so I'm extra skeptical about their FOSS commitments
[23:09:14] yea, hhvm getting pulled was a disappointment. It didn't get the traction they were hoping for, so they killed the php side of things to make developing their own language easier
[23:09:36] not sure what they would do here, certainly there is no guarantee they keep releasing updated models that cost potentially millions to build
[23:14:30] It looks like the underlying tech stack for this (PyTorch + "fairseq") only supports nvidia GPUs at the moment -- https://github.com/facebookresearch/fairseq#requirements-and-installation -- which makes it hard for us to train models until someone builds a FOSS user space for nvidia GPUs. (Their kernel extensions are FOSS now, but all the real magic is in the user space bits like CUDA)
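For a rough sense of what "hosting translations ourselves" could look like at inference time, a minimal sketch assuming the Hugging Face transformers port of the published NLLB-200 distilled checkpoint rather than the fairseq training stack discussed above; the model name and the eng_Latn/ast_Latn language codes come from the public model card. Inference like this runs on CPU (slowly); it's training new models where the nvidia/CUDA dependency really bites.

```python
# Sketch: translate a sentence with the NLLB-200 distilled checkpoint via the
# transformers port. Model name and language codes are taken from the public
# model card; this is inference only, not the fairseq training pipeline.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL = "facebook/nllb-200-distilled-600M"   # smallest published checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

# Force the decoder to start in the target language (Asturian here, one of the
# low-resource languages called out in the announcement).
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("ast_Latn"),
    max_length=128,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```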