[06:54:51] I wanted to draw your attention to this ticket: https://phabricator.wikimedia.org/T306233. The issue appears to be: Caught exception MediaWiki\Extension\Translate\TranslatorInterface\TranslationHelperException: Elastica exception: Elastica\Exception\ResponseException: Fielddata access on the _id field is disallowed [reason: all shards failed] in /srv/mediawiki/php-1.39.0-wmf.7/vendor/ruflin/elastica/lib/Elastica/Transport/Http.php:182
[06:55:47] This should be fixed by 783907: ElasticSearchTTMServer: tie break on wiki+localid | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/783907 but this has not yet been deployed. We're backporting it.
[06:56:39] But I'm wondering why ElasticSearch was updated without the Translate extension changes being deployed too.
[06:58:39] I should probably introduce myself. I work in the Language Engineering Team at the Foundation, specifically on the localisation infrastructure.
[07:18:29] abijeet: thanks for the ping!
[07:19:24] dcausse (who should be around shortly) will have more background than I do. I'm surprised that the upgrade to 6.8 is breaking this. I would have expected breakage with the 7.10 upgrade.
[07:57:23] abijeet: hi, looking
[08:07:48] abijeet: sorry about the mess, I thought I tested Translate with both elastic 6.5.4 and 6.8 with all my patches but I missed this particular one...
[08:09:11] it is also possible that because we haven't re-indexed the "ttm" index recently it did not have the settings it should have, since it was created with elastic version 6.5
[08:10:40] filing a task to make sure we re-index it before we move forward with the elastic 7 upgrade
[09:36:35] elastic1097 has been down for 11 days, but I couldn't find a dc ops task, is that a known hardware issue?
[09:46:06] moritzm: I see it still in staging state in netbox, seems to be one of the new ones, so maybe dcops or cathal might know more about it
[09:52:38] ack
[09:53:55] ah yes, the racking task https://phabricator.wikimedia.org/T299609 was only closed last week, maybe some leftover, I'll ping Chris
[10:25:12] dcausse, thanks for the info.
[10:39:01] lunch
[13:12:58] greetings
[13:13:46] ^^ subscribed to task above
[13:18:37] o/
[13:20:57] Hi folks - please remember to fill in the time tracking spreadsheet - I need the info for the SDAW report due soon. Thanks!
[13:43:07] here's the spreadsheet https://docs.google.com/spreadsheets/d/1Gnxo3Mnde_ipdTzfq9NlSgb9KLMnaqs3tf_pe0pFxQk/edit#gid=0
[13:43:26] and please break out time specifically spent on WCQS, image suggestions, SDAW search or other SDAW work
[14:04:43] ACK
[14:22:38] mpham, gehel could I get an account on asana? we need to reply to T306422 but can't remember how we adjusted the timeline last week
[14:22:39] T306422: Upgrade translatewiki elasticsearch version to 6.8.23 - https://phabricator.wikimedia.org/T306422
[14:26:51] mpham: thanks!
[14:58:47] \o
[15:02:28] o/
[15:05:54] surprisingly, the wbsearchentities optimizer looks to still work completely. The verification process which re-runs the equation gets the same scores as the original explains, and it comes up with weights that score better than prod
[15:06:30] i suppose i need to change the profile in prod next so it generates the new shape of queries when requested
[15:06:37] (new profile)
[15:09:39] errand
[15:13:32] ebernhardson: the rescore query (statement boosts) might change due to negative weights, but IIRC it does not try to optimize this part?
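For context on the shape being discussed: per the messages above and below, the statement-boost rescore is, roughly, a function_score inside a rescore clause where each boosted statement is a term query carrying a fixed weight, some of them negative. A minimal sketch of that fragment of the search body follows; the field name, statement value, score_mode and the -0.4 weight are illustrative placeholders, not copied from the production profile:

    "rescore": {
      "query": {
        "rescore_query": {
          "function_score": {
            "score_mode": "sum",
            "functions": [
              {
                "weight": -0.4,
                "filter": { "term": { "statement_keywords": "P31=Q4167410" } }
              }
            ]
          }
        }
      }
    }

Only the negative-weight-on-a-term-filter shape matters for the discussion that follows.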
[15:14:24] dcausse: i suppose i didn't look, sec
[15:14:55] dcausse: in terms of weights, we set the a/k/weight on satu's, and the query weight and rescore query weight, but i suppose not those de-boosts
[15:15:20] ok
[15:15:25] hmm, but then i'm not sure what goes next :P have to think about it
[15:16:04] i guess i could try and adjust it to pull in those weights, i don't know why it doesn't parse them as tunable (probably they get recorded as constants)
[15:17:24] does it autodetect tunable params?
[15:17:50] sorta, each parser type (bool, match, etc) knows what params are tunable
[15:19:13] hm... if it gets to the satu function it must be inspecting the function score array then
[15:20:42] yea it should, they are just more queries provided to the same function_score query
[15:22:49] hmm, so that part i handle with something called a PassThruExplain, in short that means it writes the explain value into the feature vector and spits it back out at evaluation time
[15:22:58] but maybe not too hard to adjust
[15:25:05] ok, so it should not break if we flip negative weights+term query to positive weights+must_not term query
[15:26:02] yea it should also work fine if we hardcode those in a new shape
[15:29:29] i'm not sure how the final function needs to be composed though, to avoid negative weights the function_score inside rescore needs to change to score_mode: product i would assume, so we can have fractional weights, but then does the top level rescore have to change from sum to product too?
[15:30:05] i'm thinking that to make it possible to de-boost it has to accept x * value with x between 0 and 1
[15:34:34] making it a product won't work well if you want to vary the weights of things like sitelinks
[15:34:52] query_score * w1 * sitelinks
[15:34:58] w1 is pointless here
[15:35:35] hmm, indeed :S
[15:36:21] I tested wrapping the term query in a must_not and it works ok
[15:36:28] also realizing i never had this over-request result hits: we use the default size:8 when tuning, which means there are probably results just out of range that will be pulled in. Maybe i should adjust this to pull the top 30 or something. Thinking that more drastic changes to the equation will bring in more results that might newly end up in top-n
[15:37:07] oh I see
[15:37:28] dcausse: hmm, i'm not following how the must_not would work, do we simply exclude the results?
[15:37:38] prevent them from having a rescore value basically?
[15:38:22] this is to handle negative weights: weight: -0.4, query: term:foo => weight: 0.4, query: must_not: term:foo
[15:38:39] ranking is the same, scores are a bit different
[15:38:43] ahh, boost the inverse
[15:38:53] ok, yea that makes more sense
[15:38:56] perf-wise it's very similar
[15:39:35] that seems quite simple as well
[15:40:13] yes it seems straightforward
[15:41:59] ok, i can work out how to make a new profile option to flip those around and then re-run.
[15:49:25] hello folks! I have reopened https://phabricator.wikimedia.org/T299609 since elastic1097 showed up down in icinga, there seems to be a boot issue
[15:50:53] ebernhardson: I was planning to work with Emmanuel on the cirrus part, basically adjusting some code in cirrus to automatically flip the weight and the term query when it's negative
[15:54:03] dcausse: ok, can wait a few days but i'd like to keep this moving
[15:55:04] sure, should not take long, I'm prepping a unit test for Emmanuel to work through, should be mergeable by thursday hopefully
[15:58:18] ebernhardson: but the new model you've built can be shipped already
[15:58:38] dcausse: indeed, in theory we could ship or test the new weights
[15:59:16] the old metrics seem to still be running, this tuning is built from the WikidataCompletionClicks schema and i think the analysis was the same, will look it over today and see how plausible it is
[16:01:32] ok
[16:06:26] quick workout, back in ~30
[16:31:28] ryankemper, inflatador, ejoseph: kickoff for Search Updater - https://meet.google.com/fna-jbvm-hrf
[16:50:12] gehel I have it on my calendar in 10m?
[16:50:37] nm, wrong mtg...
[16:50:43] inflatador: that's the search / research kickoff
[17:36:03] dinner
[17:36:15] somehow that was a much more understandable description of how to use intent than many other things i've read
[17:42:20] Lunch, back in ~30
[18:12:19] back
[18:27:58] gehel: inflatador: will be ~5 mins late to pairing
[18:28:11] ACK
[18:35:45] the more i look at this prior tuning the more mistakes i notice :P We only consider prefixes of the search term, instead of prefixes of the title clicked, when building the total set of searches to run optimizations over. Probably shouldn't try to fix that now though :P
[19:27:00] lunch
[20:17:50] back
[20:18:21] 14/36 eqiad nodes are on ES 6.8...at this rate, it'll probably be another 2 hr or so
[20:21:40] nice
[20:22:54] I'm noticing cirrus failures jump up to ~300 ops/m in eqiad, is that acceptable or should we look at depooling?
[20:23:55] inflatador: hmm, depends on what they are, looking
[20:24:50] btw ebernhardson, we did increase that recovery timeout as you suggested https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/784310/2/cookbooks/sre/elasticsearch/rolling-operation.py
[20:26:24] inflatador: does it seem to have changed anything?
[20:26:33] in particular with shards shuffling around
[20:27:40] inflatador: errors look to mostly be 'error while communicating with remote cluster [omega]'
[20:27:47] inflatador: from `channel:CirrusSearch` in logstash
[20:28:29] inflatador: so we won't need to shift anything, this error should be transparent to users (except missing extra results). As for why it's happening though, hmm. We had to look into the cross-cluster settings a few times recently
[20:28:34] re: shard shuffling, it's tough to tell from the CLI. Maybe there's something in the APIs we can check after the fact? Maybe compare to the DFW upgrade from last wk
[20:30:01] inflatador: you can kind of get an idea from the /_cat/recovery?active_only output, there is also a graph but it's not the easiest to understand, looking for it in grafana sec
[20:30:37] inflatador: i suppose this one might be best for seeing restart shuffling but it's not all that obvious. I suppose we would be looking for how long the unassigned shards state lasts, which prevents the cluster from going green: https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&refresh=1m&viewPanel=64
[20:31:57] hmm, i suppose we need initializing and unassigned to clear out
[20:32:06] subjectively it does look like we have less relocating
[20:32:40] and it also seems that fewer of the large shards are relocating based on the recovery API
[20:32:50] but that could be confirmation bias ;(
[20:33:06] shrug, if it looks like it's not obviously worse we can keep it for a while i imagine
[20:35:44] yea, comparing the last 3 hours against 3 hours of restarts on 4-7 this looks noticeably better: https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&viewPanel=64&from=1649361600000&to=1649372400000
[20:44:28] I dunno, looks like ~350 unassigned and ~15 initializing on both graphs
[20:44:45] but I could be missing something...subjectively it seems like we had a lot more in relocating state last time
[20:48:01] I'm also not seeing any in unassigned from the health API, seems to conflict with the ES graphs
[20:53:26] inflatador: i'm looking more at the time it takes between restart and the cluster reporting healthy (so the next restart can happen). In the graph from 4/6 the restarts are happening 26-35min apart, in today's graph they are closer
[20:53:40] ah nice
[20:53:49] Here's the DFW upgrade if you wanna drill down there too https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&viewPanel=64&from=1649790900000&to=1649808059000&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[20:54:54] looks like today's restarts are going in 15-22min, shaving a few minutes on each one
[20:55:59] curious, that one is kinda all over the place. In the middle from about 20:15 to 21:30 they are restarting quickly, but others report longer
[20:57:44] you can get a rough timeline from SAL https://sal.toolforge.org/production?p=0&q=elasticsearch&d=2022-04-13
[20:59:37] Looks like I had to run the cookbook twice, that will skew the numbers
[21:00:56] https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&viewPanel=64&from=1649790900000&to=1649800799000&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 <- only the first cookbook run
[21:01:30] inflatador: perhaps an interesting thing in these graphs, highlight just delayed_unassigned_shards, initializing_shards and unassigned shards
[21:01:47] inflatador: the delayed ones are where the cluster should be waiting for the node to come back instead of assigning the shards to a new instance
[21:02:04] but critically, the delayed drops off before it's clearly picking up new shards, because unassigned is still flat
[21:03:08] basically, i wonder if we should be increasing the timeout related to delayed_unassigned_shards so it has enough time to assign them back to the local instance. But elasticsearch could also have magic i'm not aware of (maybe it cancels a relocation if the node comes up?)
[21:03:43] it's supposed to, based on https://www.elastic.co/guide/en/elasticsearch/reference/6.5/delayed-allocation.html#_cancellation_of_shard_relocation
[21:04:44] hmm, i suspect that ends up not quite the same, if it works as described there. In particular it says "and its shards still have the same sync-id as the primary", but since we know writes are coming through the sync-id won't match
[21:04:54] what we are expecting is for it to pick up an old sync-id and bring in the transaction log
[21:05:23] but maybe it does and the docs just aren't clear
[21:05:25] would that be considered a "translog" recovery type?
[21:05:29] yea
[21:06:51] OK, there are some of those happening for sure
[21:08:14] maybe we could scrape that kind of info out of the recovery API? % of translog/local recoveries vs peer recoveries before/after change?
[21:08:50] hmm, yea perhaps we can just look at the number of recoveries that go to restarted instances vs the rest of the cluster
[21:12:49] not sure if this is particularly illuminating, but here's the same shard being recovered 2 different ways at the same time (different source and target hosts as well) https://phabricator.wikimedia.org/P25461
[21:14:10] inflatador: would need the shard # to see if it's the same, i suspect the 17 and 19 in there are the shard numbers since it has 21?
[21:14:44] in that case it's moving different data for the same index. The limits are per-index, in the case of enwiki_general we allow up to three different shards to be on the same host
[21:15:14] i suppose that's saying those both could have come from the same node that was restarted
[21:16:39] oh, hoops!
[21:16:42] or oops
[21:17:29] i'd be tempted to just set it to 10m and see how the restart finishes
[21:17:55] new batch just started and I saw a bunch of translog recoveries flash in and out of my watch cmd
[21:19:41] makes sense, they should go quickly with minimal info to transfer. So it's at least doing some, i suppose i wonder about the recoveries that are starting in this graph prior to the main bulk of ~350 shards being picked back up
[21:20:03] if those recoveries had been delayed, would they have been picked up by the node without a full recovery?
[21:22:20] i don't think it can harm anything, applied to eqiad. will take a minute to process: curl -XPUT localhost:9200/_all/_settings -d '{"settings": {"index.unassigned.node_left.delayed_timeout": "10m"}}'
[21:23:05] keeping them long term would mean adjusting cirrus settings to set those on index creation
[21:23:35] cluster says it finished applying
[21:23:41] interesting
[21:24:06] i suppose mostly i'm curious how it will graph for the next round of servers, hoping there's one left :)
[21:24:23] yeah, we have 13 to go
[21:24:59] ok great
[21:38:03] I think I need an explain API call in one of my tmux windows too
[21:41:08] interesting, so far looks basically the same
[21:45:46] hmm, I guess the 'cat health' API doesn't show delayed unassigned shards
[21:46:51] (more script tweaking needed)
[21:49:05] unfortunately elastic's logs aren't too telling. Looking at the enwiki shard it decided to transfer, it only logs when it became unavailable, nothing about the decision to recover it
[21:49:34] that was 15min between reboots which is short-ish but probably not enough to claim that did anything
[21:59:37] i've just increased the allowed recovery network traffic to 240mb, i can see from the `cluster overview` in grafana that the instances were regularly maxing their 80mb quota. 240mb is a bit arbitrary, but we know they can do 500mb+ from previous instances where we didn't have enough memory
[22:01:32] err, the 500mb+ is for the disks serving up data. i'm mostly assuming network is fine :)
[22:02:39] Cool, that works
[22:03:28] If you want to jot down your changes in https://phabricator.wikimedia.org/T305994 (and any observed benefits/drawbacks) we can get back together and figure out whether to set them everywhere
[22:05:27] sure
[22:06:38] * inflatador carefully avoids saying "circle back" ;)
[22:10:36] lol
[22:12:22] 32/36 done!
[22:13:51] i wish i knew what it was deciding when creating all these re-inits instead of translog recoveries. might have to poke through elastic code to try and remember what they are doing
[22:20:32] affinity **shouldn't** have anything to do with it, but it's the only thing I can think of ATM
[22:24:53] I'm out for the day, but ryankemper is taking the helm on the ES upgrade
[22:26:08] kk, enjoy
[22:26:20] ebernhardson: wrt the network quota, that reminds me that eqiad is ready to flip to 10G networking so we shoulda done that before the upgrade anyway
[22:26:37] although it is nice to keep variables the same so that we didn't falsely attribute a performance improvement to the sleep change or smth
[22:28:15] yea, makes sense
[22:48:19] * ebernhardson is failing to find where exactly in elasticsearch they check (and maybe log) the sync-id in a shard recovery...but will hopefully find something :)
[22:59:46] hmm, so i guess that's all RecoverySourceHandler:phase1, needs trace logging to say much. Could increase it to debug logging
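For reference, a sketch of the cluster knobs used or discussed during this restart window, assuming the 240mb recovery-traffic bump went through indices.recovery.max_bytes_per_sec as a transient cluster setting — the exact command isn't in the log, so both the setting name and the transient/persistent choice here are assumptions; the health and recovery calls are the ones referenced above:

    # how many shards are delayed vs. plain unassigned/initializing
    curl -s 'localhost:9200/_cluster/health?pretty' | grep -E 'unassigned|initializing'

    # active recoveries with their type, stage and source/target nodes
    curl -s 'localhost:9200/_cat/recovery?active_only=true&v'

    # raise the per-node recovery bandwidth cap (graphs showed instances pinned at the 80mb quota)
    curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' \
      -d '{"transient": {"indices.recovery.max_bytes_per_sec": "240mb"}}'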