[15:02:40] * cormacparle waves
[15:03:03] ebernhardson: I sent this to you already, just posting it here too so it gets a wider audience
[15:04:13] cormacparle: hmm, that's quite odd :S We should have the full text of every update sent through the analytics->mjolnir->elastic pipeline. I can poke through that to see if the page id comes up
[15:05:10] ah cool! you're online!
[15:05:15] if anything came through the cirrussearch side though, via the DataSender::sendUpdateWeightedTags
[15:05:20] if anything came through the cirrussearch side though, via the DataSender::sendUpdateWeightedTags then we won't really have logs
[15:05:25] cormacparle: it's 8am, just starting :P
[15:05:35] lucky timing for me ...
[15:05:37] (i start ~8am most days)
[15:06:31] 👍
[15:08:29] hmm, first annoying thing is zgrep claims -r (recursive) is not supported, making this more annoying :P sec
[15:08:46] maybe find and xargs or some such
[15:08:47] oh dear
[15:09:24] I'll post the context here just so everyone else has it:
[15:09:24] SD is writing updates for `weighted_tags` relevant for image suggestions to the hive table `analytics_platform_eng.images_suggestions_search_index_delta` and Erik has imported them. Turns out that some of the data we've written is wrong, but also that data hasn't imported correctly - there is data in the indices that doesn't correspond to what we have
[15:09:42] here's an example - this page has a `recommendation.image/exists|1` in `weighted_tags` https://en.wikipedia.org/wiki/index.php?curid=2430357&action=cirrusDump
[15:09:53] but there's nothing in any of the snapshots for either `image_suggestions_search_index_full` or `image_suggestions_search_index_delta` that has wikiid=enwiki and page_id=2430357
[15:10:21] > if anything came through the cirrussearch side though, via the DataSender::sendUpdateWeightedTags then we won't really have logs
[15:10:27] what might send data that way?
[15:11:15] i'm kinda surprised how many ids there are, so far it's found updates for that page id in warwiki, ruwiki, enwikisource, frwiktionary, nlwiki, and now enwiki
[15:11:19] date=20220102/freq=weekly/part-00180.gz:{"update": {"_index": "enwiki_content", "_type": "page", "_id": 2430357}}
[15:11:44] but i don't know that's weighted tags, these are generic updates. I have to wait for it to finish and then re-check specific things
[15:12:43] cormacparle: if you haven't written any php that calls it then it shouldn't be invoked, i know tg*r has used it for a few things
[15:12:58] grand
[15:13:30] no php involved in our work on this
[15:16:49] tweaked the cmdline a few times, it's now running but i suspect it could take some time. It's running through the /mnt/hdfs which is always a bit slow, and there are ~1k files per week for it to poke through
[15:17:01] for ref: find /mnt/hdfs/wmf/data/discovery/transfer_to_es/date=2022*/freq=weekly/ -type f | xargs zgrep -A 1 '"enwiki_content", "_type": "page", "_id": 2430357'
[15:17:55] oh wait, i'm being silly...you shouldn't have had any weekly ones yet, just manual...that might be faster
[15:21:21] cormacparle: as for clearing them, sadly the best we can probably do is run a scroll through the index to collect the set of matching pages to re-compare against something. A scroll is pretty easy to run through with the python elasticsearch client (for whatever reason the php one makes things more tedious)
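(A rough sketch of that scroll with the python elasticsearch client, using helpers.scan which wraps the scroll API; the endpoint, index name, and the prefix query against weighted_tags are assumptions based on the cirrusDump example above, not a tested query.)

    # collect page ids that currently carry a recommendation.image/exists weighted tag,
    # so they can be re-compared against the hive-side data
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("https://search.svc.codfw.wmnet:9243")  # assumed endpoint
    query = {
        "query": {"prefix": {"weighted_tags": "recommendation.image/exists"}},
        "_source": False,
    }
    page_ids = {int(hit["_id"]) for hit in helpers.scan(es, index="enwiki_content", query=query)}
    print(len(page_ids))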
[15:24:09] hmm ok
[15:25:06] cormacparle: maybe https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/scripts/cleanup_weighted_tags.py could do what you need?
[15:25:18] or maybe not, it's a bit brute force
[15:26:15] hah! looks perfect
[15:27:48] if we cleared out all existing `recommendation.image/exists` values then we could do a full import from one of the hive tables and at least we'd know the data on our end would be correct
[15:28:51] cormacparle: seems reasonable
[15:29:03] specifically you could do the import from `analytics_platform_eng.image_suggestions_search_index_full` with snapshot='2022-06-20'
[15:30:27] also it's grep'd through the updates shipped in june, didn't find anything for that page id from image recommendations :S
[15:30:39] * cormacparle scratches head
[15:30:42] checking may now, but i suspect we might have noticed if it was that long ago
[15:34:22] trying to ponder how else it would get there, but not really sure. There are the apis in cirrus but those are mostly for local testing, we don't use them in prod. The updates themselves are gzip'd and shipped into swift, i imagine if some sort of corruption was happening it would fail decompression instead of flipping random numbers
[15:35:16] I was wondering whether the shape might be wrong somehow, and it's converting a date field to a number or something? Or there's some kind of bigint->int conversion or other quirky thing happening?
[15:36:49] the shape of the data in our hive table I mean
[15:37:22] hmm,
[15:42:04] i don't think so, poking over this update script the page_id is never cast. There is a merge step when multiple tables are provided that is potentially complicated, but that should be skipped here since we've been shipping one-off datasets and not automated ones that merge with others
[15:43:05] there is an update for svwiki_content on that page id, but i can't think of how elastic would end up applying that to the wrong index (all cirrus updates are sent the same way, over _bulk with scripted update)
[15:49:07] There is a potential for overwriting the files on subsequent executions, but airflow claims it's only run convert_to_esbulk, which generates these files, once for each of the three import runs so far
[15:50:56] something massively odd is happening here that I can't currently explain :(
[15:59:46] don't think it could be the update daemon. It does try to parallelize things, but it parallelizes within a single file and a single file always has a single wiki within it, so even if it somehow got mixed up the mixup would be within an index and not across indices...
[16:01:58] (also we don't write that parallelism, we just provide the client with a thread count to use)
[16:06:41] ok
[16:06:45] * cormacparle strokes chin
[16:08:17] I have to exit for the day now anyway ... do you think we should do the restoring-the-data step above (i.e. run your cleanup script, then re-import) now? or do you want to dig into the oddness further just in case a re-import covers up important clues?
[16:08:24] i suppose randomly curious, while svwiki is the only one i'm seeing apply recommendation.image/exists|1 to pageid 2430357, it's not in the index currently: https://sv.wikipedia.org/wiki/index.php?curid=2430357&action=cirrusDump
[16:08:43] cormacparle: sure i can
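(cleanup_weighted_tags.py is the actual tool in play here; purely to illustrate the "brute force" clearing idea, the equivalent could be expressed as an update_by_query with a painless script. The field layout and endpoint are assumptions, and this bypasses the normal CirrusSearch update path, so treat it as a sketch rather than a recipe.)

    # illustrative only: strip recommendation.image/* entries out of weighted_tags
    from elasticsearch import Elasticsearch

    es = Elasticsearch("https://search.svc.codfw.wmnet:9243")  # assumed endpoint
    body = {
        "query": {"prefix": {"weighted_tags": "recommendation.image/"}},
        "script": {
            "lang": "painless",
            # assumes weighted_tags is stored in _source as a list of strings
            "source": (
                "if (ctx._source.containsKey('weighted_tags')"
                " && ctx._source.weighted_tags instanceof List) {"
                " ctx._source.weighted_tags.removeIf(t -> t.startsWith('recommendation.image/')); }"
            ),
        },
    }
    es.update_by_query(index="enwiki_content", body=body, conflicts="proceed", request_timeout=3600)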
[16:09:09] cool - I'll be off, talk to you tomorrow no doubt
[16:13:19] actually svwiki makes sense, found an update that does __DELETE_GROUPING__ in the 2022-06-19 update (which offsets to the 2022-06-13 input)
[17:03:03] dry run of clearing looks plausible, letting it run for real
[17:14:04] hmm, there is no attempt in this script to parallelize anything, so while the dry-run is reasonably quick (< 30m) it looks like it's going to be a few hours to run the clearing operation
[17:56:09] hmm, cloudelastic seems to really be struggling. I'd be tempted to blow away the existing commonswiki_file index
[17:56:14] (since we are restoring the new one anyways)
[17:56:18] could also bump the available heap
[18:02:58] the logs aren't particularly clear, it says shard recovery failed because the data it downloaded from swift was smaller than the Content-Length claimed. Not seeing anything in logstash for thanos-* hosts, maybe they end up somewhere else
[18:07:17] hmm, as soon as i delete the commonswiki_file index in cloudelastic though things are going to start really complaining on the mediawiki side. I suppose i should patch it up to stop sending updates while we figure this out
[19:04:54] ryankemper: lemme know when you're around, i suspect we have to bump cloudelastic-chi heap usage to get the old gc alerts under control and allow the snapshot restore to complete. Patch at https://gerrit.wikimedia.org/r/c/operations/puppet/+/811355
[19:13:57] looking at the network graphs for july 1, it seems taking a snapshot took ~1 hr. I'm going to drop the existing snapshots and start a new one, so that when it restores there is less of a delta that the saneitizer has to fix up
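(Roughly what starting the new snapshot looks like via the python client; the repository name comes from the log, while the snapshot name, endpoint, and index pattern are assumptions.)

    # start a fresh snapshot of commonswiki_file on the source cluster, then poll status
    from elasticsearch import Elasticsearch

    es = Elasticsearch("https://search.svc.codfw.wmnet:9243")  # source cluster, not cloudelastic
    es.snapshot.create(
        repository="elastic_snaps",
        snapshot="snapshot_t309648_new",  # hypothetical name
        body={"indices": "commonswiki_file*", "include_global_state": False},
        wait_for_completion=False,
    )
    status = es.snapshot.status(repository="elastic_snaps", snapshot="snapshot_t309648_new")
    print(status["snapshots"][0]["state"])  # e.g. STARTED, then SUCCESS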
[19:22:44] ebernhardson: around now
[19:24:05] cool! i'm currently surprised that `curl -XDELETE http://localhost:9200/_snapshot/elastic_snaps/snapshot_t309648` has been running for multiple minutes now, for some reason i thought deleting a snapshot would be reasonably quick
[19:24:15] (that one was the failed snapshot)
[19:24:57] general idea right now is bump cloudelastic-chi heap from 45G to 55G, delete both commonswiki indices, take a new snapshot (took ~1hr last time) and then try and restore it
[19:25:31] based on the (unproven) theory that the issues with files downloaded from swift not matching their Content-Length are somehow related to stop-the-world GC's happening constantly
[19:26:40] ok the snapshot delete did finish, i should have wrapped it with `time` as i'm now curious how it took so long :P
[19:27:43] oh also i have a mediawiki-config patch that will stop all writes mediawiki->cloudelastic for commonswiki. Sadly that's not a complete solution, other wikis try to write to commonswiki when local files are modified. I guess i'm going to hope that volume isn't too crazy. Maybe try the snapshot restore without deleting the live commonswiki index after bumping heap
[19:28:22] the mw-config patch will go out with the backport window that runs from 1-2pm
[19:30:01] ebernhardson: the puppet patch can go out now, right? https://gerrit.wikimedia.org/r/c/operations/puppet/+/811355
[19:30:14] ryankemper: yea, and will need to rolling-restart cloudelastic
[19:30:47] ebernhardson: okay and just to sanity check, I'm clear to start the restart (after merging) now that the snapshot delete is done?
[19:31:22] ryankemper: yea should be fine. I suppose maybe it took a while because i ran the delete from the codfw cluster instead of cloudelastic, so it has extra round trip time if it's deleting files from swift one at a time
[19:36:26] i suppose for reference, i'm still expecting we will want to tune that back down once we are on 7.10, elastic moving term dictionaries off-heap should save us ~100G of heap across the cluster (if it works as advertised, who knows :)
[19:38:02] ebernhardson: we're in red status right now on the main cloudelastic cluster, which is going to prevent the cookbook from working properly
[19:38:36] ebernhardson: I could either do the restarts manually, or we could blow away the index entirely (I imagine we need that mw-config patch live first to prevent the index from respawning)
[19:40:39] ryankemper: hmm, yea we'll need to wait for the config patch then, deleting it now would probably emit a significant level of errors
[19:42:29] i had hoped to ship the config a bit earlier, but it was the train deploy window so couldn't really ship it out-of-band
[19:55:12] poked at the code that writes to commonswiki_file from other wikis, no great solution available. But we typically run < 1 job/sec so i'm going to assume the level of errors isn't going to be significant enough to trigger alerts
[20:08:11] hmm, the snapshot failed :S (i haven't deleted the successful one yet, though)
[20:11:49] oh wow, because i'm totally not paying attention today :S Somehow i started the snapshot creation *from* cloudelastic. sigh
[20:12:00] ok, trying again :P
[20:13:07] :P
[20:15:04] it's ok, i was reading through the logbook for facebook's new giant language model, they have a line like 'ABC accidentally fat-fingered the XYZ modification, deleted the whole cluster'
[20:15:09] it's not just me :P
[20:24:21] snapshot completed, claims 1.36TB in size which seems plausible (_cat/indices/commonswiki_file reports 1.3TB)
[20:24:43] config patch is shipping now, just waiting for php-fpm-restarts
[20:25:11] should be done, doing the "dangerous" part and dropping commonswiki_file indices on cloudelastic
[20:25:26] making sure i'm really on cloudelastic by using proper hostnames instead of localhost :)
[20:27:28] maybe should try and avoid using localhost under all circumstances, seems more open to errors
[20:28:37] I should get out of that habit too
[20:29:07] huh, that's a new one: curl -XDELETE https://cloudelastic.wikimedia.org:9243/commonswiki_file_1647920262
[20:29:09] {"acknowledged":false}
[20:30:12] hmm, needs a longer master_timeout
[20:30:33] yup was about to suggest that
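(For reference: acknowledged:false generally means the call gave up waiting for acknowledgement rather than that the delete failed; something like the following retries with the timeouts raised explicitly. The python client is used for consistency with the other sketches here; the endpoint and index name are taken from the curl above.)

    # retry the index delete with longer timeouts
    from elasticsearch import Elasticsearch

    es = Elasticsearch("https://cloudelastic.wikimedia.org:9243")
    es.indices.delete(
        index="commonswiki_file_1647920262",
        master_timeout="10m",   # wait longer for the master to apply the change
        timeout="10m",          # and for nodes to ack the new cluster state
        request_timeout=1200,   # keep the http client itself from giving up first
    )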
[20:31:25] ebernhardson: cluster's green now, any objection to me rolling the restart?
[20:31:27] ryankemper: ok cloudelastic is green now, should be good for a restart
[20:31:27] ebernhardson: see -operations please
[20:32:58] cool, will wait 5 mins just to see the mw errors decrease
[20:45:20] so, it turns out that the cloudelastic GC issues have resulted in a 3+ day backlog of jobs, so they aren't going to silence themselves. I'm going to deploy a patch that recognizes these errors and tells the job queue they worked
[20:46:04] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/811372
[20:56:08] ryankemper: we also might want to turn the saneitizer off (in puppet) again, it's managed to build up its backlog as well. Unfortunately saneitizer isn't isolated across clusters, one job has to query all three clusters + sql so having cloudelastic slow down likely caused it to build its backlog
[20:56:31] but stopping it will only prevent the backlog from growing more, hopefully these changes to cloudelastic will let things start running sanely again
[20:56:37] ebernhardson: sounds reasonable
[20:58:35] i suppose we should have dealt with cloudelastic sooner, was perhaps too optimistic and should have pushed the memory last week
[20:59:48] Yeah, with the benefit of hindsight it's clear that there were second-order effects we weren't fully aware of
[21:03:00] yea for some reason i didn't even think of the backlog ... maybe we need some alerts on the backlogs as well so we're aware when we've built up more than a couple hours
[21:03:15] but those aren't really 'do something now' alerts, they are more like 'be aware this is happening'
[21:05:18] Good candidate for alertmanager then
[21:05:25] Can have it just email us (or post in our IRC or something)
[21:07:01] looks like these metrics are in prometheus, yea an alertmanager alert makes sense
[21:09:11] made a ticket (T312175)
[21:09:12] T312175: Create alerts when cirrus jobs backlog - https://phabricator.wikimedia.org/T312175
[21:14:53] ebernhardson: https://gerrit.wikimedia.org/r/c/operations/puppet/+/811374 for saneitizer
[21:14:56] (pcc running now)
[21:22:32] i suppose i could have thought of it earlier, another solution would have been to create an empty commonswiki_file index and let writes continue until the snapshot was restored
[21:24:23] saneitizer would have loved that i suppose :P
[21:27:24] sigh, actually my patch doesn't look like it resolves it, the error comes out even earlier than where i had it capture the error. Creating the empty index
[21:41:03] brb 20 mins
[21:42:45] didn't need to create the empty index, these will all error and re-queue now instead of making noise in the logging channels
[22:01:36] ryankemper: things should be in a sane-ish state, at least ready for a cloudelastic cluster restart
[22:10:00] Excellent, kicking off restart in a few
[22:11:36] also wow, dropping commonswiki took the terms memory usage from 73 to 52G, or basically 1/3
[22:13:04] ryankemper: i dunno if you can control it, but might start with 1006 because it's struggling the most
[22:13:36] 1005 got happier after dropping the index, and the other nodes aren't hitting the old gc alert, only 1006 right now
[22:14:52] probably not super important though
[22:17:48] i wonder what it is about 1006, it reports terms of 30G, while other hosts are all 9.6-20G
[22:19:31] hmm
[22:20:01] it's all odd, 1006 has the fewest shards (219 vs 230-252) but the most index data (2.4tb vs 1.6-1.7tb) per node
[22:20:12] suggests terrible decisions around what "even" distribution looks like by elastic
[22:21:39] Yeah both CPU and heap are way higher on 1006
[22:21:51] https://www.irccloud.com/pastebin/AKJzU1th/
[22:21:53] i dunno, if it keeps going we might have to come up with some heuristic to move shards and issue cluster-reroutes to try and balance things better. Maybe with the extra memory overhead being added now we could ban the node and let elastic try and redistribute (per https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell)
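(Banning a node amounts to an allocation exclusion on the cluster; a minimal sketch, assuming the node name follows the same hostname-cloudelastic-chi-eqiad pattern that shows up in the restore errors.)

    # exclude cloudelastic1006 from allocation so elastic moves its shards elsewhere;
    # revert by setting the same key back to null once things have settled
    from elasticsearch import Elasticsearch

    es = Elasticsearch("https://cloudelastic.wikimedia.org:9243")
    es.cluster.put_settings(body={
        "transient": {
            "cluster.routing.allocation.exclude._name": "cloudelastic1006-cloudelastic-chi-eqiad"
        }
    })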
[22:22:22] with the way memory was looking it didn't seem plausible we could ban a single node and have things still "work", but maybe with the increased heap it will work
[22:22:49] yea that's the gc churning and using up all the cpu
[22:24:25] The restart itself might shuffle stuff around sufficiently so I'll kick that off now and then we can see if things improve
[22:25:10] sure
[22:34:45] can see 1006 picked up the new heap values. i suppose the thing to worry about when increasing heap is that we then have less file cache, as long as read/write to disk doesn't explode we should be ok
[22:50:55] ryankemper: shouldn't it be 1 node at a time?
[22:51:09] cloudelastic has 1 primary + 1 replica
[22:51:41] ebernhardson: hmm, do we have any row-aware allocation for cloudelastic? answer hinges on that
[22:51:44] checking
[22:52:11] ahh, hmm, from memory i think all of cloudelastic is in the same row
[22:52:20] but maybe that's bad memory, worth looking :)
[22:54:09] ebernhardson: 2 A, 2 B, and 2 C it looks like
[22:54:26] hmm, hopefully will be ok then
[22:54:49] Actually 2 A, 2 B, 1 C, 1 D
[22:54:56] But yeah so it should be fine
[22:55:47] wonder why i remember them all being together, maybe that's something different. There was some thing about how cloud was all in the same row, and to be on their network it had to be in that row? Maybe that changed when we decided to go with cloudelastic on the public internet with local firewalls being the only thing that limits access to cloud
[22:56:20] * ebernhardson is always confused by our networking, should stop pretending i have any idea how it works :P
[22:57:31] I'm not sure off-hand but your guess sounds likely to me:
[22:57:39] > maybe that changed when we decided to go with cloudelastic on the public internet with local firewalls being the only thing that limits access to cloud
[23:21:11] looks like everything has restarted?
[23:22:08] terms dictionaries are surprisingly evenly distributed now, everything is at 12-15G
[23:26:49] started restore
[23:33:26] it's restoring with the replicas set to 2 like the source cluster, updating index.auto_expand_replicas to "0-1"
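(Roughly what the restore looks like via the python client; the index and snapshot names match the failure message just below, and passing index_settings up front is an alternative to flipping auto_expand_replicas after the restore has started.)

    # restore commonswiki_file onto cloudelastic, overriding replica auto-expansion
    # so it doesn't come back with the source cluster's replica count
    from elasticsearch import Elasticsearch

    es = Elasticsearch("https://cloudelastic.wikimedia.org:9243")
    es.snapshot.restore(
        repository="elastic_snaps",
        snapshot="snapshot_t309648_attempt_4",
        body={
            "indices": "commonswiki_file_1647921177",
            "index_settings": {"index.auto_expand_replicas": "0-1"},
        },
        wait_for_completion=False,
    )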
[23:37:09] Excellent
[23:38:08] it's finished a couple shards in 10m and is doing ~15 at a time. Random guess an hour or two and it should be done
[23:39:50] it looks like it's going to restore replicas and primaries from the snapshot repository (as opposed to restoring the primary and creating a new replica from it). I guess it probably doesn't matter that much either way
[23:44:59] hmm, no it's looking like the same failure :S RecoveryFailedException[[commonswiki_file_1647921177][3]: Recovery failed on {cloudelastic1004-cloudelastic-chi-eqiad}{j201lTKrQZiZNanPTTgzIQ}{MSAWXZhGSB2WeG41GMiaXQ}{208.80.155.107}{208.80.155.107:9300}{hostname=cloudelastic1004, rack=D2, fqdn=cloudelastic1004.wikimedia.org, row=D}]; nested: IndexShardRecoveryException[failed recovery]; nested:
[23:44:59] D2: Add .arcconfig for differential/arcanist - https://phabricator.wikimedia.org/D2
[23:45:01] IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [snapshot_t309648_attempt_4/UPvfickaRDaX8rPREn-i0Q]]; nested: IndexShardRestoreFailedException[Failed to recover index]; nested: IOException[Premature end of Content-Length delimited message body (expected: 1069931009; received: 734068736];
[23:45:54] so it's not finishing those shards in 10m, they failed in 10m :(
[23:50:17] hmm
[23:53:22] one suspicious thing, in `curl https://search.svc.codfw.wmnet:9243/_snapshot/elastic_snaps/snapshot_t309648_attempt_4/_status` it reports number_of_files: 3370, and then a little higher up it reports a total file count of 7300, and that it did an incremental snapshot
[23:53:42] maybe it's trying to reuse attempt_2, and there was something wrong with it? maybe we can try turning off incremental
[23:53:49] no clue why that would break things though
[23:54:27] but if the previous snap was bad, and this is an incremental on top of it, i guess that could cause it
[23:57:55] elastic docs say we can cancel the restore by deleting the index. also it says you can't choose incremental or not, it treats the repository as a place that stores files and if it has the file it won't upload it again
[23:58:02] (so we'd have to delete the old snapshot first)
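(In other words, the recovery path before retrying would look something like this: cancel the restore by deleting the half-restored index, then drop the suspect snapshots so the next attempt can't reuse any of their files. Which snapshots to drop, and which cluster to issue the snapshot deletes from, are judgment calls; names and endpoints per the messages above.)

    # cancel the failed restore and clear out the suspect snapshots before retrying
    from elasticsearch import Elasticsearch

    cloudelastic = Elasticsearch("https://cloudelastic.wikimedia.org:9243")
    source = Elasticsearch("https://search.svc.codfw.wmnet:9243")

    # deleting the index is how an in-progress restore gets cancelled
    cloudelastic.indices.delete(index="commonswiki_file_1647921177",
                                master_timeout="10m", timeout="10m")

    # drop the earlier snapshots so nothing suspect in the repository gets reused
    for snap in ("snapshot_t309648_attempt_2", "snapshot_t309648_attempt_4"):
        source.snapshot.delete(repository="elastic_snaps", snapshot=snap,
                               request_timeout=3600)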