[00:00:45] the part i don't get is that the Content-Length comes from swift. You'd think swift would have some of its own checks if the content-length doesn't match the object it stored
[00:00:57] ebernhardson: so is it safe to say we've ruled out the content length error being caused by stop-the-world gc (given that it's not occurring frequently anymore and we're still seeing the error)?
[00:01:06] ryankemper: i think so, yes
[00:01:37] before, i was thinking the http connections were just hanging up while the GC ran and it didn't respond for sometimes a couple of minutes, but that doesn't seem to be the case
[00:03:33] i'm wary of killing it though, i suppose i'll wait for it to try its best and restore what it can before evaluating
[00:11:07] Feels like the error might be on the swift-s3 side of things, but not quite sure how to troubleshoot
[00:11:31] But the content length thing makes me think that s3 is promising to send `1069931009` bytes and only actually sending `734068736`
[00:13:04] yea, hmm. I wonder if we can try pulling a random file out with the swift cli client; we'd have to try and match it up with some error message
[00:14:44] looks like they've had enough problems that elastic added (sometime in 7.x) a repository _analyze endpoint that looks for incorrect behaviour
[00:14:51] "There are a large number of third-party storage systems available, not all of which are suitable for use as a snapshot repository by Elasticsearch. Some storage systems behave incorrectly, or perform poorly, especially when accessed concurrently by multiple clients as the nodes of an Elasticsearch cluster do."
[00:15:35] looks like it mostly uploads and downloads a variety of blobs to see if it gives back what it stored
[00:16:39] we could maybe try setting chunk_size to something silly, like 100MB, and see if it complains less
[00:17:03] defaults to unlimited, whatever size the source files on disk were
[00:18:02] interesting, elastic does keep trying; it's failed a few shards 5 times now
[00:18:21] trying to figure out if they always fail on the same part of the shard...
[00:21:36] fwiw, looking at 5 failures on shardId 14, each time has a different amount of data received
[00:23:12] each time it expects the same data size, and each time the total size received is different
[00:25:23] ryankemper: any idea where swift logs things? manifests/site.pp in puppet suggests the data is stored on thanos-be100[1234].eqiad.wmnet but i'm not seeing those mentioned in logstash
[00:25:35] wondering if swift is saying anything
[00:27:14] the thanos-store grafana dashboard doesn't suggest it's complaining
[00:27:36] but that might not be swift, it's not particularly clear to me
[00:28:58] the graphs make it look like it's been working harder since 23:30 though, which lines up reasonably with when the restore started
[00:32:23] can try and turn down the parallelism and see if pushing thanos less makes it happier, i dunno :S
[00:46:21] oh, that's a suspicious one. Thanos backend disk utilization rates go from ~50% to almost a steady 100% as soon as we start the restore
[00:46:49] going to try and throttle it down by limiting restore bytes/sec and concurrent restores (and going to delete the commonswiki_file index it's still trying to restore)
[00:52:09] Sounds reasonable
[00:53:00] it still doesn't really make sense that it would close the connections without sending all the data during excessive disk-io, but i'm lacking good ideas :P
[00:57:19] also, stopping the restore doesn't seem to have reduced thanos disk io. Going to wait a bit before starting a new restore i guess and see if it calms down
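A minimal sketch of the two ideas discussed above: re-registering the S3 snapshot repository with a smaller chunk_size plus snapshot/restore throttling, and then running the repository _analyze endpoint (added in 7.12, the "sometime in 7.x" mentioned above). The host, repository name, bucket, and client name are placeholders, not the actual production values, and the _analyze parameters would need tuning.

```python
# Sketch only: re-register the snapshot repo with a smaller chunk_size and
# throttled snapshot/restore rates, then run the repository analyzer.
# Host, repo name, bucket, and client are hypothetical placeholders.
import json
import requests

ES = "http://localhost:9200"     # assumed Elasticsearch endpoint
REPO = "swift_s3_backup"         # placeholder repository name

repo_settings = {
    "type": "s3",
    "settings": {
        "bucket": "elastic-snapshots",        # placeholder bucket
        "client": "default",                  # S3 client config from elasticsearch.yml
        "chunk_size": "100mb",                # per the log, default is unlimited (source file size)
        "max_snapshot_bytes_per_sec": "40mb",
        "max_restore_bytes_per_sec": "40mb",
    },
}
r = requests.put(f"{ES}/_snapshot/{REPO}", json=repo_settings)
r.raise_for_status()

# Repository analyzer: writes and reads back a variety of blobs and reports
# any incorrect behaviour it observes from the storage backend.
analyze = requests.post(
    f"{ES}/_snapshot/{REPO}/_analyze",
    params={"blob_count": 100, "max_blob_size": "100mb", "timeout": "120s"},
)
print(json.dumps(analyze.json(), indent=2))
```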
[01:30:35] heh, went away for a bit and came back. disk io is lowering, but i suspect we pushed it pretty hard. Will poke it after dinner
[01:31:18] oh, actually it looks like it does this every day starting before 00:00, maybe that's daily compaction and unrelated to us
[01:31:36] maybe not every day, i dunno. either way, will try later
[02:00:39] unrelated but curious error from elastic2049: Likely root cause: java.nio.file.FileSystemException: /srv/elasticsearch/production-search-psi-codfw/nodes/0/node.lock: Read-only file system
[02:08:27] seems to have started july 3, maybe a bad disk on the controller or something. who knows
[09:46:44] * cormacparle waves
[09:47:22] I read through the backscroll here and it looks like you were trying to sort out the commonswiki_file index, but I don't really understand where you got to
[09:47:32] is there anyone in search online who can enlighten me?
[13:04:20] greetings
[13:11:37] cormacparle I'm reading up through the scrollback, more details in https://phabricator.wikimedia.org/T309648
[13:58:29] relocating to my sister's house, back in ~45
[16:17:04] @Trey314159: @ebernhardson: oops, my pc shut down. joining in a minute
[16:21:57] e-bernhardson ryankemper I'm starting to reimage cloudelastic to bullseye now, will do one host at a time via the SRE hosts reimage cookbook, manually moving to the next host once the cluster goes back to green
[16:23:03] inflatador: ebernhardson: how come not the rolling operation cookbook?
[16:24:59] ryankemper we just added the reimage flag and it didn't work last time I tried it. Do you think I should try it again? I'm fine with that
[16:25:48] inflatador: yeah, let's try it once and then we can switch to manual if it fails spectacularly. IIRC last time it failed was due to BIOS stuff, but i don't fully remember
[16:28:01] ryankemper ACK, will do. For context, e-bernhardson and I talked this morning about putting the restore on hold and getting cloudelastic and prod up to bullseye before we revisit. If you have concerns about this, let us know
[16:31:12] dropping off kids, back in ~15
[17:03:13] sorry, been back
[17:07:02] FYI I am restarting elastic services on our lone bullseye server (cloudelastic1006) so we can use the --start-datetime flag on the reimage operation and avoid reimaging it again
[17:12:56] cloudelastic is now back in red?!? Not sure what happened there, but watching the reallocation
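For "watching the reallocation", a minimal sketch of the sort of polling that applies here, using the standard cluster health, cat indices, and cat recovery APIs; the endpoint is a placeholder and the polling interval is arbitrary.

```python
# Sketch: watch cluster health, red indices, and active shard recoveries
# while the cluster reallocates. The endpoint is a placeholder.
import time
import requests

ES = "http://localhost:9200"  # assumed cloudelastic endpoint

while True:
    health = requests.get(f"{ES}/_cluster/health").json()
    print(f"status={health['status']} "
          f"initializing={health['initializing_shards']} "
          f"unassigned={health['unassigned_shards']}")

    # Which indices are actually red?
    red = requests.get(f"{ES}/_cat/indices",
                       params={"health": "red", "format": "json"}).json()
    for idx in red:
        print(f"  red index: {idx['index']}")

    # What is currently recovering?
    recoveries = requests.get(f"{ES}/_cat/recovery",
                              params={"active_only": "true", "format": "json"}).json()
    print(f"  active recoveries: {len(recoveries)}")

    if health["status"] == "green":
        break
    time.sleep(30)
```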
[17:16:47] inflatador: looking
[17:19:12] inflatador: `red open commonswiki_file_1647921177 XiIkjW1UTa-xOikTC7CIZQ 32 0 20915531 8146704 261.8gb 261.8gb` Looks like the commonswiki_file index was created without any replication, so that makes sense
[17:19:28] inflatador: I think we should blow away the index and create a new one in its place, and then set the replica count to 2
[17:31:17] go ahead and delete commonswiki_file, that was me attempting a restore this morning from a new snapshot with increased throttling on both snapshot and restore, but it didn't work
[17:31:17] ebernhardson ACK will do
[17:31:17] OK, deleted commonswiki_file_1647921177, we're back to yellow
[17:31:17] quick lunch, back in ~20
[17:48:34] back
[18:10:56] started the reimage via the rolling-operation cookbook per ryankemper's suggestion, it's on cloudelastic1003 atm
[18:43:14] bah, cloudelastic1003 is booting into the interactive installer
[18:44:15] :(
[18:44:55] This happened when we did cloudelastic1006, checking my notes for the fix there
[19:05:57] i wish more developers were on the "explain why" bandwagon. It looks like the fulltext head queries dashboard broke because superset changed from templating being default-on in pre-1.0 to default-off in 1.0. But the patch makes no attempt to explain *why* they turned it off (but i could guess): https://github.com/apache/superset/pull/11172
[19:06:32] * ebernhardson submits short patch and hopes analytics deploys it :P
[20:01:32] relocating, back in ~15
[20:25:55] running the wikidata-query-rdf-maven-release-docker job in jenkins, iirc that's the one that will release a new version of all the jars in wikidata/query/rdf
[20:34:55] for some reason i always find these notes in wikitech amusing: This page was last updated in 2015 and may be outdated. Please update it if you can.
[20:35:56] hmm, failed in the test suite: RdfClientIntegrationTest.retriesOnTimeout » Unexpected exception
[20:45:50] * ebernhardson runs the tests locally. Then remembers he has to kill chrome to prevent OOM when building some java things
[20:48:10] +1 for "explain why".. though I tend to perhaps over-document
[20:48:39] no such thing! Also, back
[21:02:35] running firmware updates on cloudelastic1003, will keep you posted on status
[21:23:14] worked the second time to release the rdf jars, we have intermittently failing tests :(
[21:26:47] ryankemper: Firmware update for cloudelastic1003, expect it to take up to 3 more hrs based on internet chatter. Details on how to check progress at https://phabricator.wikimedia.org/P30932
[21:27:20] inflatador: excellent, thanks. will monitor
[21:28:57] ryankemper np, I'm doing a half day tomorrow (morning only), will check in w/ you then
[21:29:16] inflatador: sounds good, enjoy the time w/ family!
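Circling back to the commonswiki_file cleanup discussed around 17:19-17:31 above, a minimal sketch of dropping the unreplicated index and bumping the replica count on its replacement. The host and the replacement index name are placeholders, not the actual cloudelastic values.

```python
# Sketch of the cleanup discussed at ~17:19/17:31: delete the red,
# unreplicated restore target, then raise the replica count on whatever
# index replaces it. Host and new index name are placeholders.
import requests

ES = "http://localhost:9200"                  # assumed cloudelastic endpoint
OLD_INDEX = "commonswiki_file_1647921177"     # red index, restored with 0 replicas
NEW_INDEX = "commonswiki_file_restored"       # hypothetical replacement index

# Delete the red index so the cluster can go back to yellow/green.
requests.delete(f"{ES}/{OLD_INDEX}").raise_for_status()

# Once a replacement exists, give it two replicas.
requests.put(
    f"{ES}/{NEW_INDEX}/_settings",
    json={"index": {"number_of_replicas": 2}},
).raise_for_status()
```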
[22:54:38] interesting - No Language Left Behind (NLLB) is a first-of-its-kind AI breakthrough project that open-sources models capable of delivering evaluated, high-quality translations directly between 200 languages, including low-resource languages like Asturian, Luganda, Urdu and more.
[23:03:34] the code for NLLB is at https://github.com/facebookresearch/fairseq/tree/nllb/
[23:05:22] i'm not entirely sure what we would do with it, probably nothing in search, but i like the idea (perhaps too much NIH) of hosting translations ourselves instead of outsourcing to google's APIs
[23:06:18] maybe it needs a different term, NHH - not hosted here
[23:06:44] machine translation is never ideal, but yeah, if we could use this to help assist human translations and also avoid closed-source projects, that would be neat.
[23:08:18] I've been in the room when Facebook rug-pulled us before though (hhvm php compat), so I'm extra skeptical about their FOSS commitments
[23:09:14] yea, hhvm getting pulled was a disappointment. It didn't get the traction they were hoping for, so they killed the php side of things to make developing their own language easier
[23:09:36] not sure what they would do here, certainly there is no guarantee they keep releasing updated models that cost potentially millions to build
[23:14:30] It looks like the underlying tech stack for this (PyTorch + "fairseq") only supports nvidia GPUs at the moment -- https://github.com/facebookresearch/fairseq#requirements-and-installation -- which makes it hard for us to train models until someone builds a FOSS user space for nvidia GPUs. (Their kernel extensions are FOSS now, but all the real magic is in the user space bits like CUDA)
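For a rough sense of what "hosting translations ourselves" could look like at inference time, a minimal sketch assuming the Hugging Face transformers port of the published NLLB-200 distilled checkpoint rather than the fairseq training stack discussed above; the model name and the eng_Latn/ast_Latn language codes come from the public model card. Inference like this runs on CPU (slowly); it's training new models where the nvidia/CUDA dependency really bites.

```python
# Sketch: translate a sentence with the NLLB-200 distilled checkpoint via the
# transformers port. Model name and language codes are taken from the public
# model card; this is inference only, not the fairseq training pipeline.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL = "facebook/nllb-200-distilled-600M"   # smallest published checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

# Force the decoder to start in the target language (Asturian here, one of the
# low-resource languages called out in the announcement).
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("ast_Latn"),
    max_length=128,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```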