[08:09:26] No retro for me tomorrow, but will be around immediately after for my next mtg
[10:01:08] Lunch + errand
[10:11:38] lunch
[10:27:31] Hi! Do we have an AVRO schema for page-edit kafka events coming from MediaWiki? AFAIK someone showed me one a couple of days ago. And in case we do have such a thing: do we generate POJOs + (un)marshallers from it? Is there an artifact I can reuse?
[11:00:58] pfischer: we have json schemas: https://schema.wikimedia.org/#!/
[13:28:16] greetings
[13:31:01] Greetings!
[13:31:51] I'll probably miss the retro too, or be rather late, because I have a doctor's appointment
[13:53:05] o/
[14:01:14] pfischer: https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas#Motivation_and_Overview
[15:00:48] sadly no code generation from json-schema. There might be something out there, but i'm not familiar with us using any of them (but we could if there are good options out there)
[15:01:03] Retrospective is starting: https://meet.google.com/eki-rafx-cxi cc pfischer, ejoseph
[15:58:58] we might be using json -> Row from eventutilities and then design our internal model
[16:01:58] \o
[16:07:24] hmm, so wcqs oauth options...we could shrug and put the oauth_token_secret into the cookie as plain text (easiest). I don't think that's too dangerous, the request has to be signed with both the token secret and the consumer secret, the token secret on its own does nothing. We could import something like google tink and do a symmetric encryption of the secret before putting it in the cookie (minorly harder, need to create and deploy extra secrets to each server). Or we could find some database that's not kask to store the token->secret mapping in that's readable from all dc's (more painful, needs ttl management, etc.)
[16:09:31] or we could use java encryption primitives directly instead of tink, but that seems error prone
[16:10:11] * ebernhardson isn't entirely clear why the token has a secret of its own and why the consumer secret isn't enough, but i suppose the oauth1 designers had their reasons
[16:10:38] i suppose somewhat mitigated by these tokens not being able to do anything beyond calling an identify api that tells you who the token belongs to
[16:12:56] and making the jwt token ttl shorter, e.g. a couple of hours, would that force renewing the request token in kask?
[16:13:44] although I'm surprised because mw sessions are pretty long lived (but that's perhaps another storage?)
[16:20:39] dcausse: mw sessions are long lived because they are backed by auto-login through the database. Essentially the session (which contains csrf tokens and such) goes away after 24 hours, but the session token will allow auto-login and creating a new session based on a row in mariadb
[16:21:27] we would only be able to auto-renew within kask if the bot makes at least one call per day. If it breaks on friday and they fix the bot to start making requests again on a monday the session won't be valid
[16:21:51] * ebernhardson had to read a bunch of things yesterday to figure this all out :P
[16:24:27] the initial login cannot be automated?
[16:25:19] well, they would have to put the mediawiki session token into the bot, but if that token is leaked it can do literally everything, they are logged in, unlike oauth tokens, which are restricted
[16:25:54] i don't really want to write docs that say to put the mediawiki session token from the browser into the bot, seems like mistakes would be too painful
[16:27:25] maybe i'm overworrying...those sessions are easily invalidated by logging out and back in
[16:29:51] the general idea is they would have to be able to do a redirect bounce from wcqs to mediawiki and back, and the request to mediawiki would have to include the session token from their browser so it appears logged in. The oauth1 side has no normal api access
[16:34:34] i suppose another option, the oauth_token doesn't change and we use that as the key...We could tell people they have to make requests every ~8 hours or so with their bots, and we would refresh tokens in kask that have > 12h of lifetime left. If their bot starts failing due to session timeout they would have to visit commons-query in their browser to start a fresh session that keeps the old cookie value. Seems hacky and inconsistent though
[16:36:15] i also played a bit with jwe (encrypted jwt's) yesterday, it's a possibility but the apis available are terrible. we use com.auth0 for jwt's right now and their api is nice, it's hard to do the wrong thing. I toyed with nimbus-jose-jwt yesterday and they make it exceptionally hard to do the right thing, and trivial to do it all wrong
[16:36:40] com.auth0 doesn't support jwe's, they said basically jwe's are rare and they don't see value in supporting them
[16:39:02] so the problem is that there's no stable token that can be set in the api client (wcqsSession is going to last for the jwt ttl and wcqsOauth for 1 day in kask)?
[16:40:02] yea, basically. We have no long term storage. The best long term storage we seem to have is actually asking the user to hold it inside their cookies
[16:40:29] we don't really want to start managing an sql database on the side, could in theory stick things in an elastic index but seems awkward
[16:40:40] and jwt cannot encrypt something?
[16:41:00] that's not understandable by the client
[16:41:32] not with com.auth0, the one we use. Can with nimbus-jose-jwt but i really disliked their api. Which leaves something like using google tink (or raw java primitives) to encrypt a value and put it in the existing JWT
[16:43:44] actually it would be a separate jwt i guess, or maybe it doesn't even have to be a jwt. a jwt is about signing the value and ensuring that value was signed by us. We would need the existing JWT for the standard request-auth, but for refreshing the JWT we need the second value of oauth_token + oauth_token_secret
[16:44:17] that could be as simple as concatenating the two strings (they have an expected length) and encrypting the result
[16:44:56] why do we want to encrypt it? if the encrypted version is leaked the damage is the same?
[16:45:20] mostly because the oauth1 spec says oauth_token_secret should be treated as a password
[16:45:47] although i don't fully understand why. To make a request with an oauth_token you have to put the request data together, then sign with both oauth_token_secret and consumer_secret. Basically two passwords
[16:45:52] this oauth_token_secret can only be used to auth against wcqs or something else?
[16:45:55] it seems like making one of the two passwords public does nothing
[16:46:20] it can only be used to sign requests to Special:OAuth/identify
[16:46:31] (those are the restrictions we asked for when signing up as an oauth consumer)
[16:47:07] so basically, worst case is almost nothing. An attacker can find out the username attached to the token
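A side note on the "two passwords" point above: in OAuth 1.0a (RFC 5849, section 3.4.2) the HMAC-SHA1 signing key is the percent-encoded consumer secret and token secret joined with `&`, so either secret on its own really is only half the key. A minimal sketch with openssl, using made-up placeholder values rather than anything from the real WCQS flow:

```bash
# Placeholder secrets, for illustration only.
consumer_secret='example-consumer-secret'
oauth_token_secret='example-token-secret'

# RFC 5849 section 3.4.2: the HMAC-SHA1 key is "<consumer_secret>&<token_secret>"
# (each part percent-encoded). Neither half can produce a valid signature alone.
signing_key="${consumer_secret}&${oauth_token_secret}"

# The signature base string is built from the HTTP method, URL and request
# parameters (its construction is elided here); the result becomes oauth_signature.
base_string='GET&https%3A%2F%2Fcommons-query.wikimedia.org%2F&oauth_token%3Dexample'

printf '%s' "$base_string" \
  | openssl dgst -sha1 -hmac "$signing_key" -binary \
  | base64
```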
[16:47:26] but they would also need our consumer_secret, found in puppet secrets and on each wcqs server, and in the jvm heap i suppose
[16:47:57] I mean the risk is: a bot owner has to set a kind of token in their client
[16:47:59] which is why option 1 was ignore things, simply make the oauth_token_secret part of the cookie
[16:48:21] if this token is leaked because it was pushed to git
[16:48:42] what would be more secure if we encrypted it?
[16:49:13] I'm not entirely sure :P The only reason i'm thinking about encrypting the secret is because the oauth1 spec says to treat that like a password
[16:49:25] but it seems like half a password to me
[16:51:07] (I don't fully understand all this) if this allows logging in somewhere else it's bad, but if it only allows using wcqs the risk is no higher just because it's in plain text
[16:51:59] yea it's not usable anywhere else, the token and token secret are specific to wcqs as an oauth consumer
[16:53:08] hm... so I'd go for sharing all this with the client so that they set up their client with it (maybe we could ask Gergo to confirm it's not a big issue?)
[16:53:41] yea, it seems like encrypting the token secret doesn't really buy much. The final value that the end user gets still does the same things, allows access to wcqs, and nothing can really be done with the secret on its own
[16:54:09] lunch, back in ~1h
[16:55:32] i'm probably putting too much worry into https://datatracker.ietf.org/doc/html/rfc5849 section 4.5: The client shared-secret and token shared-secret function the same way passwords do in traditional authentication systems... Accordingly, it is critical that servers protect these secrets from unauthorized access.
[16:55:54] our use case is narrower than theirs, which probably includes tokens that give administrative access and all kinds of possibilities
[17:10:45] ryankemper, inflatador: I might be late for our meeting. I'm in a meeting for Oscar's school, but it already seems that it is going to be longer than expected!
[17:10:56] Or it might just feel that way
[17:11:14] gehel: ack
[17:11:29] ebernhardson: my reading of 4.5 is that the pluralization implies that one needs both the secrets to do something nasty anyway
[17:11:34] although it's not terribly explicit
[17:11:53] regardless I agree that in our specific case the risk is even lower given that these are just tokens for access to query wcqs
[17:12:14] ryankemper: that's correct, section 3.4.2 covers how they are used, the key used for HMAC is the two secrets joined with `&`
[17:12:26] each value is half of the password, basically
[17:12:47] will go with the easy way then :)
[17:24:13] randomly interesting, mediawiki doesn't directly store the oauth_token_secret in their database. Rather it seems they store a secret value in the db, but that value has to be hmac'd with $wgOAuthSecretKey to get the resulting oauth_token_secret. I suppose that protects against unauthorized db access that doesn't also have access to mw secrets
[18:08:59] sorry, been back
[18:24:12] inflatador, ryankemper: unless you need me, I'll skip our session today. I still haven't had time to get food, so I'll be mostly useless until I'm fed!
[18:26:21] gehel ACK, mangia
[18:40:40] i suppose we should be aware, with queen liz's passing we now have many nodes that poked up into our `nodes with high load` graph. Not seeing thread pool or pool counter rejections though, just increased latency
[18:41:20] full text qps was ~875 vs 500 last week at the same time, now down to ~750
[18:42:23] i suppose good to know we can soak up a temporary 75% increase in traffic
[18:48:07] ebernhardson cool, we're looking at the dashboards too in the pairing session
[19:00:07] We're gonna hold off on increasing the codfw masters from 3->5 until tomorrow. There shouldn't really be any risk of doing so now but we don't want to add any sources of noise given the traffic patterns
[19:08:36] train is rolling forward now, more traffic shifting to codfw
[19:10:58] with the plan to roll the train to all wikis in ~30 minutes if things look happy
[19:11:57] ack
[19:14:30] https://phabricator.wikimedia.org/T313999
[19:26:37] there are some suspicious logs from elastic regarding ttmserver, RemoteTransportException[[elastic2083-production-search-codfw][10.192.32.88:9300][indices:data/read/search[phase/query]]]; nested: QueryPhaseExecutionException[Query Failed [Failed to execute main query]]; nested: TaskCancelledException[cancelled]
[19:26:53] also some other related logs that include the query source, will try the queries manually and see what's up
[19:28:37] query took a long time, 11s, but otherwise completed and gave some results
[19:29:00] i wonder what our timeouts are, i thought they were 30s but maybe mediawiki is hanging up which triggers the TaskCancelled
[19:29:25] maybe ttmserver doesn't have the same timeouts as cirrus either, checking
[19:30:11] ahh, they set a hardcoded 10s timeout, so highly plausible that's what a hangup from the mediawiki side looks like in logs
[19:35:34] i'm perhaps a little surprised that doesn't set the timeout query parameter, can pass timeout=1s and it will give a valid json response with `"timed_out": true` in the response instead of logging errors on the elastic side. I suppose i should update the ttmserver and check what cirrus does
[19:36:21] but i kind of would have expected calling Elastica\Connection::setTimeout would have done that
[19:36:53] anyways, all traffic is to codfw now. graphs are ramping up
[19:42:22] oop. you're ain't kiddin
[19:42:57] (or however people with brains say it)
[19:45:06] two nodes are very unhappy, 2052 and 2045. we probably need to re-balance shards away from them
[19:45:37] some level of search requests are being rejected, but the rate of "EsRejectedExecutionException" is declining which suggests it should end up ok
[19:50:19] ebernhardson LMK if you want to ban 'em, happy to do that whenever
[19:51:56] inflatador: yea go ahead, i've also asked train to roll back and asked for ~an hour to move these shards around
[19:53:15] ebernhardson ACK, starting now
[19:53:30] inflatador: maybe only 2045, it's the only one with a bunch of rejections
[19:53:39] ebernhardson ACK
[19:54:19] inflatador: 2052 is high up in the nodes graph, but didn't reject things from the threadpool
[19:56:57] 2045 is banned from the main cluster, LMK if you want me to kick it out of omega too
[19:57:43] inflatador: hmm, i'm mostly guessing based off thread pool rejections and it only rejected things in the main cluster (115k times ;) so probably only the main cluster is fine
[20:01:44] better? Looks like the pool counter rejections are dropping
[20:03:00] But I also didn't see any shards move when I banned it, did you?
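For the ttmserver timeouts above: passing the `timeout` search parameter makes Elasticsearch return a normal JSON response with `"timed_out": true` (and whatever partial results it gathered) instead of the client hanging up at its own hardcoded 10s limit and leaving TaskCancelledException noise in the server logs. A rough sketch; the index name, field and query are placeholders, and jq is only used for readability:

```bash
# A slow query with a 1s server-side timeout still returns valid JSON; look at
# the "timed_out" flag rather than an exception on the elastic side.
curl -s -XPOST 'https://search.svc.codfw.wmnet:9243/some_ttm_index/_search?timeout=1s' \
  -H 'Content-Type: application/json' \
  -d '{"query": {"match": {"text": "example slow query"}}}' \
  | jq '{took: .took, timed_out: .timed_out, hits: .hits.total}'
```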
[20:03:22] inflatador: hmm, i'm also not seeing any shards move
[20:04:50] i'm not clear why yet, checking es7 docs
[20:05:26] I excluded it by name, which used to work...i guess I can try IP
[20:05:34] yea can't hurt
[20:07:38] re "better? Looks like the pool counter rejections are dropping" that's because the train rolled back, traffic is in eqiad mostly right now
[20:08:21] sadly i'm not sure of any way to test what will happen with load other than moving all the traffic and seeing what happens
[20:10:10] i wonder if elasticsearch merges transient and persistent, or if transient overrides persistent
[20:10:17] maybe we should null transient.cluster.routing
[20:11:00] i suppose even though we aren't sure it does anything useful, we should move the use_adaptive_replica_selection setting into persistent
[20:11:00] ebernhardson yeah, I've been using persistent since transient is supposedly deprecated. Just banned via IP and that didn't seem to do anything either. Will try nulling transient
[20:11:23] (if the adaptive replica selection worked, i would have expected it to stop sending queries to 2045 and send them to other replicas hosting the same data until it was happy)
[20:12:22] inflatador: same re transient, maybe as a cleanup step of this transition we should empty out transient on all clusters
[20:12:28] (but not just this moment ;)
[20:13:55] ebernhardson transient does appear empty based on the response of my last call...which shouldn't have happened
[20:14:00] https://phabricator.wikimedia.org/P34316 calls I've made so far
[20:14:40] inflatador: hmm, on codfw:9243 i'm still seeing transient full of settings, including cluster.routing
[20:15:10] guess I'm just confused
[20:15:17] response from API was "{"acknowledged":true,"persistent":{"cluster":{"routing":{"allocation":{"exclude":{"_ip":"10.192.32.128"}}}}},"transient":{}}"
[20:16:08] inflatador: ahh, i think the response only includes the bits that changed
[20:16:30] i'd try sending {"transient":{"cluster.routing": null}}
[20:18:58] hmm, setting not recognized, even though I can see it in a GET
[20:19:32] hmm,
[20:21:04] inflatador: trying a few requests out on my end, it looks like we have to null using the full path to each setting, can't null a whole group. I was able to null cluster.routing.allocation.exclude._ip, but not any of the higher-level paths
[20:21:25] inflatador: right after nulling the _ip setting i now see a bunch of shards moving
[20:22:04] also regarding order of overrides, docs agree about precedence. It goes: 1. transient, 2. persistent, 3. elasticsearch.yml, 4. default value
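For reference, the ban/unban calls being debugged here go through the cluster-settings API's allocation exclusion filters; the IP and endpoint below are the ones from the log, the rest is a sketch. It also shows the two details worked out above: a setting is cleared by nulling its full path (not a parent group), and the PUT response only echoes what changed.

```bash
ES='https://search.svc.codfw.wmnet:9243'

# Ban a node: shards are moved off any node matching the exclusion filter.
curl -s -XPUT "$ES/_cluster/settings" -H 'Content-Type: application/json' -d '{
  "persistent": { "cluster.routing.allocation.exclude._ip": "10.192.32.128" }
}'

# Unban: clear the setting by sending null for the *full* path.
# Nulling a parent group such as "cluster.routing" is rejected.
curl -s -XPUT "$ES/_cluster/settings" -H 'Content-Type: application/json' -d '{
  "persistent": { "cluster.routing.allocation.exclude._ip": null }
}'

# See everything that is actually set, since the PUT response only shows the delta.
curl -s "$ES/_cluster/settings?flat_settings=true"
```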
[20:23:06] I guess a null value in transient overrides an extant value in persistent then
[20:23:23] null should remove the setting, making it not exist anymore
[20:23:34] allowing persistent to then become the one that is used
[20:24:04] wow, 2045 already down to 13 shards
[20:24:09] things moving faster :)
[20:27:41] down to 3, just the 50GB shards
[20:34:58] at gym rn btw but will be back in ~50 mins to help w the train stuff
[20:37:57] inflatador: looks empty, we can unban 2045 now
[20:38:20] hope it makes better decisions this time :)
[20:39:48] randomly fun info, the query "the be to of and a in that have I" gets 352.5M results across all indices (it's a query we used before to get elasticsearch to pull some data from disk into memory before a switchover, should have remembered to use it earlier)
[20:40:32] :)
[20:40:37] ebernhardson OK, 2045 back in the cluster
[20:41:12] Stopwords: Their Greatest Hits
[20:44:48] hmm, curiously not seeing shards move around much yet
[20:45:16] I see a couple getting shuffled back to 2045
[20:45:49] yea just the two for commonswiki, i suppose mostly i'm wondering how long to wait before telling the train they can roll forward again
[20:48:18] hmm, we have 4 concurrent node recoveries set, i suppose i would expect it to be sending 4 shards at a time onto the node.
[20:51:06] it should pause itself anyways, but i'm going to stop mjolnir-msearch on search-loader2001, it just started up its weekly run
[20:51:08] I guess I didn't realize ES would start sending shards its way immediately
[20:53:19] done
[20:54:47] If the rejections crop back up again we may want to try manually rerouting rather than the ban/unban
[20:55:14] Like find the node doing the least work and send the heaviest shard on the offending node over
[20:56:05] yea the cluster is very unbalanced, many nodes were at 50% cpu, many at 35% cpu, and then one at 95% and one at 100%
[20:56:11] I wonder if this is an indication we should increase the primary shard count on the heavy indices up a bit too?
[20:56:19] (Not today but soon)
[20:56:32] hard to say, i don't understand why elastic is so bad at balancing traffic :(
[20:58:17] 64 enwiki_content shards, 60 nodes. i guess 4 nodes do get lucky. we are going to be decomming some though, right? 60 sounds higher than we were expecting but i keep forgetting what the expected node count is now
[20:59:08] yeah, we're decomming 36 and below
[20:59:47] we have a number of shards started on 2045, guessing it would be ok to send traffic now?
[21:00:08] 50 is the expected node count
[21:00:13] someday we'll bring the deployed date and rack/row info netbox and have all that stuff at our fingertips ;)
[21:00:14] After all decoms
[21:01:12] inflatador: you mean bring it out of netbox? or wdym
[21:02:51] ryankemper errr..."from netbox"...scrape the netbox API, in other words.
[21:03:15] gave the go-ahead to retry the train
[21:03:50] we could probably add that to our script at some point. Better source of truth for rack/row info, plus easier to check which boxes need to be replaced
[21:12:13] separately wonder why sometimes i can `curl ... | (head -n 1 && sort)` to get the header on top and sort the rest, but sometimes it doesn't work
[21:14:35] 2052 is still pretty high up there but not breaking (yet)
[21:20:20] everything looks happy, 2052 is still an outlier though. Will perhaps ban and un-ban it later today. I'll have to go do a school run in about 10 minutes though, this all looks good enough for now
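On the `head -n 1 && sort` mystery a few lines up: `head` reads its input in large blocks, and a pipe can't be seeked backwards, so whatever `head` slurps past the first line never reaches `sort`. Whether the one-liner works depends on how much data happened to be sitting in the pipe during head's first read. The shell builtin `read` consumes exactly one line, so a variant like the sketch below is reliable (the URL and columns are just an illustration):

```bash
# Keep the header row on top and sort the rest; `read` takes only line 1,
# unlike `head`, which may swallow a whole buffer of the piped output.
curl -s 'https://search.svc.codfw.wmnet:9243/_cat/nodes?v&h=name,load_15m' \
  | (IFS= read -r header; printf '%s\n' "$header"; sort -k2 -nr)
```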
[21:21:26] search-loader2001 is still paused, but it would be idle anyways and search-loader1001 will now pick things up now that eqiad is idle. Just have to remember to not leave puppet disabled (note to self :)
[21:23:43] ryankemper: you could try manually rerouting something from 2052 if you want, see if moving a single high volume shard is enough to tip the balance. Suspect it might be
[21:25:03] separately wonder if the new nodes simply have stronger cpu's (the passage of time and new process nodes and all that), from the cluster overview graph it seems like everything < 2055 is busier than everything > 2055
[21:25:37] I'm down to do that. Need another 25 mins before I'm back home so I'll try it then if the coast still looks clear
[21:25:48] sure, no rush, the cluster is doing reasonable things right now :)
[21:25:56] i'm going to run now too, back in 30 min or so
[22:01:01] back
[22:09:22] same
[22:09:26] taking a look now
[22:09:29] {◕ ◡ ◕}
[22:14:27] Taking a look at the shards on `10.192.48.128 53 99 76 37.36 36.31 37.93 di - elastic2052-production-search-codfw` since this is the node with the highest 15m load average
[22:16:28] https://www.irccloud.com/pastebin/NyQQenUV/ryankemper%40cumin2002%3A~%24%20curl%20-s%20'https%3A%2F%2Fsearch.svc.codfw.wmnet%3A9243%2F_cat%2Fshards%3Fs%3Dstore'%20%7C%20grep%20elastic2052-production-search-codfw
[22:16:57] So presumably we want to choose an `enwiki` shard since that's what I'd expect to be getting slammed extra hard given the excess traffic
[22:17:53] Least taxed node is `10.192.48.56 75 100 23 9.92 10.80 10.85 dir - elastic2084-production-search-codfw`
[22:18:16] So let's try rerouting `enwiki_general_1658852412` from `elastic2052-production-search-codfw` to `elastic2084-production-search-codfw`
[22:19:19] are we seeing any other signs of poor performance beyond the high load avg?
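The reroute command that gets fired a little further down is only visible behind a pastebin link; based on the cluster-reroute API doc cited there, a move of that shard presumably looks something like the sketch below. The shard number is a placeholder: the real command would name whichever copy of `enwiki_general_1658852412` is sitting on 2052 (visible in `_cat/shards`).

```bash
# Move one shard copy off the hot node onto the least-loaded one. Add ?dry_run=true
# to preview the decision. Note the response echoes the entire cluster routing
# table, which is what blows out the tmux buffer mentioned below.
curl -s -XPOST 'https://search.svc.codfw.wmnet:9243/_cluster/reroute' \
  -H 'Content-Type: application/json' -d '{
  "commands": [
    { "move": {
        "index": "enwiki_general_1658852412",
        "shard": 0,
        "from_node": "elastic2052-production-search-codfw",
        "to_node": "elastic2084-production-search-codfw"
    } }
  ]
}'
```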
[22:19:36] you can look at the per-node percentiles dashboard
[22:20:20] inflatador: Not seeing any pool counter rejections which is the main thing I'd look at: `https://grafana-rw.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&forceLogin&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=65&from=1662642577449&to=1662675598037` haven't looked at logs specifically
[22:20:52] but the tl;dr is no, doing this more as an experiment to inform how we handle this in future (I suspect rerouting is going to be a better approach than ban/unban since we can selectively route from the heaviest to the lightest loaded node)
[22:21:20] Ah OK, thanks for the explanation, just wondering how worried I should be ;)
[22:21:43] on https://grafana.wikimedia.org/d/000000486/elasticsearch-per-node-percentiles?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-bucket=full_text&var-interval=1m&viewPanel=2 you can see 2052 is also the highest for avg 95th percentile latency in the cluster
[22:21:48] usually cpu usage is directly correlated to latency
[22:23:54] annoyingly those metrics don't seem to record the cluster name, so we only get the instance parameter which has the prometheus collector's port as the only way to distinguish clusters (other than the obvious bit that 9109 is high, and the others aren't)
[22:24:15] weird
[22:24:51] well, i wrote the metrics collection both in the elasticsearch plugin and the prometheus collector, so guess who you can blame :P
[22:25:12] that also explains at least of of the latency differences
[22:26:33] off topic but that typo makes me think of the classic "paris in the the spring" :P
[22:27:05] Okay per https://www.elastic.co/guide/en/elasticsearch/reference/7.10/cluster-reroute.html I think the reroute command should be as follows:
[22:27:21] https://www.irccloud.com/pastebin/Op5L6pnW/reroute_enwiki_general_1658852412_from_2052_to_2084
[22:27:58] nice, I gotta get going but I'll hit up the scrollback later
[22:28:23] Firing the torpedo now
[22:29:25] wonder if a cookbook could do that, ought to be able to give it a node to start from and an index name, and it can choose which shard and where to send it
[22:29:49] Well, the response blew out my tmux buffer (could have sworn I had my history set to like 50,000 but apparently I can only fit 23841 in the buffer with current settings)
[22:30:24] ebernhardson: yup I was thinking the same, seems like a good option for the script inflatador and I have been working on (which itself could be made into a proper cookbook in the future if we wish)
[22:31:07] Dammit I think elasticsearch is convinced it's smarter than we are, the shard is moving away but it's sticking another one immediately in its place...
[22:31:11] is https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Search_totals_capped_at_10,000 something new?
[22:31:21] legoktm: nope, it's years old
[22:31:22] https://www.irccloud.com/pastebin/uCxN1dlE/
[22:31:42] ebernhardson: do you know why they're saying it used to report 44k?
[22:31:42] legoktm: oh, but that's a different thing
[22:32:16] legoktm: so, that is a new change with elastic 7, they now early-stop queries to save on performance
[22:32:45] legoktm: but you were never able to see beyond the first 10k hits, asking for 10001 would error out. Now it also stops the count. We can make it always count to the end, it just costs more cpu time
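The cap described here is Elasticsearch 7's `track_total_hits`, which defaults to 10,000: once the count passes that, the response reports `"total": {"value": 10000, "relation": "gte"}` and stops counting. Setting it to `true` restores an exact count at the cost of visiting every match. A sketch (index, field and query are placeholders, jq only for readability):

```bash
# Default ES 7 behaviour: counting stops at 10,000 and "relation" becomes "gte".
curl -s -XPOST 'https://search.svc.codfw.wmnet:9243/enwiki_content/_search' \
  -H 'Content-Type: application/json' \
  -d '{"size": 0, "query": {"match": {"text": "example"}}}' | jq '.hits.total'

# Opting back into exact totals per request, i.e. what "turn it back on" would mean:
curl -s -XPOST 'https://search.svc.codfw.wmnet:9243/enwiki_content/_search' \
  -H 'Content-Type: application/json' \
  -d '{"size": 0, "track_total_hits": true, "query": {"match": {"text": "example"}}}' \
  | jq '.hits.total'
```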
[22:33:25] Seems like the count itself is useful to their workflow
[22:33:59] yea, we can probably turn it back on (it's a boolean in the search query), and evaluate if it makes sense to save perf / latency in specific situations
[22:34:11] Whether that use case is worth the extra CPU for us to change the behavior back is up for debate, but might not be a bad idea to flip that back so we're not surprising users with the behavior change
[22:34:15] makes sense, thanks for the explanation
[22:34:55] should I file a phab task? I have no clue whether it's a good feature to have or not, but a lot of people have different search workflows, so I'd be surprised if this doesn't come up again
[22:35:41] legoktm: I think that'd be helpful, even if we decide not to flip the count behavior back that seems like a logical place for us to document the decision
[22:36:19] (Personally I'm leaning towards keeping the old count behavior but I definitely don't have a perfect grasp of how relevant the perf implications are)
[22:36:43] probably not worse than what we had yesterday without the early-stop :)
[22:36:44] will do, ty for the very quick responses!
[22:43:20] legoktm: thanks for relaying that to us! (and making the ticket)
[22:43:31] :))
[22:45:56] Also TIL about that `{{tracked|T317374}}` template, neat!
[22:45:56] T317374: Search now caps total results count at 10k because of elasticsearch 7 upgrade - https://phabricator.wikimedia.org/T317374
[22:54:27] looks like shard moves are complete, 2052 has now dipped back down into the grouping of all the other nodes for per-node latency, but it has been on its way down there for the last few hours as we approach the less busy time of day
[22:55:21] ebernhardson: so I'm fairly convinced that the reroute approach isn't super likely to succeed given elasticsearch will probably often just immediately stick another comparable shard on the node in its place, like we just saw
[22:55:38] so yeah I agree that it seems like latency went down just because traffic has been going down
[22:55:59] ryankemper: seems likely, elasticsearch's shard rerouting has never been a fun thing :)
[22:56:58] Yeah turns out "smart" systems don't like being told what to do :P
[22:57:12] This would be a great use case for adaptive replica selection if that feature seemed to actually do what we thought it would, which it seems it doesn't
[22:57:43] perhaps randomly interesting, there are two very distinct groupings of nodes in the per-node latency graphs: one group bunched up from 150-200 and the other from 220-300
[22:58:00] with a small but obvious gap between them
[22:58:14] * ryankemper wonders how well those match up to the instance age (wrt what you were saying earlier about faster/better CPUs)
[22:58:26] yea i suspect that's what it is, newer more powerful cpu's
[22:58:54] for the same reason as the reroute stuff I'm getting less bullish on the thought of increasing the primary shard count
[22:59:41] we will want to re-do the numbers to get almost-perfectly-even shard numbers again (since we last did the math for our previous cluster size of ~36) but that's about it
[23:00:14] spot checking random instances, 2040 is 40 cores @ 2.2ghz, 2070 is 48 cores @ 2.4ghz, so that's >20% more cpu without even accounting for IPC improvements (which exist but aren't as amazing as they used to be)
[23:00:38] makes sense
[23:00:41] ebernhardson: just to sanity check, any given search query is going to fan out over all available read replicas, right? so for example there's not really a reason that replica shard #12 is going to get hit harder than shard #13
[23:01:00] ryankemper: right, if there are 16 shards then elastic will do 16 shard queries
[23:01:45] in theory fewer shards and more replicas is higher performance, which is another option
[23:04:20] ebernhardson: and when fanning out to the multiple replicas for a given primary shard #, it's going to do the smart thing and say if there's 3 replicas, search the first 1/3 of one, second 1/3 of another and third 1/3 of the last one, right?
[23:05:03] ryankemper: in theory, adaptive replica selection will look at historical latency from a given server, along with estimates of how full its queues are, and choose the "best" server. But if that worked it would have stopped sending queries to 2045 earlier today :P
[23:05:36] I wonder what's wrong with the algo? like maybe it's caring too much about what the queue says and not enough about the actual latency numbers or something
[23:05:47] or possibly even vice versa but the former seems more likely at first glance
[23:06:09] i suspect part of it is they use https://en.wikipedia.org/wiki/Little%27s_law but that assumes all requests are the same
[23:06:12] cause naively I don't really care about the queue except insofar as it offers predictive value towards the latency
[23:06:22] and we have requests that take 5ms, and requests that take 200ms, and requests that take 5s
[23:07:05] ebernhardson: hmm so that makes sense except for one thing: in what sense is it "looking at latency" if it's treating all requests as equivalent?
[23:07:30] or is the thinking that it *is* routing requests away but it happens to be routing mostly the <=5ms requests away so that effectively it's not making a difference
[23:08:05] * ryankemper wants his money back, he was told that distributed systems were easy to reason about and always do what you expect :P
[23:08:55] ryankemper: it's hard to say, i remember reading when they announced this feature and being hopeful, but then never seeing it work in the situations i would estimate it's most necessary
[23:09:18] feels like the great OOMkiller algo debate all over again :P
[23:09:31] I'm gonna read https://www.elastic.co/blog/improving-response-latency-in-elasticsearch-with-adaptive-replica-selection, definitely need to do some remedial reading
[23:09:33] ryankemper: by looking at latency i mean it says "I (elastic20nn) have seen an average of 25ms response time for the last n requests sent to elastic2045"
[23:09:45] basically each node keeps track of the average response latency to shard queries it's sent out
[23:09:53] but not on a per-index basis, just on a per-node basis
[23:10:46] Ah I think I get it now, so the idea is the impact of the expensive requests is massively underestimated because they get averaged out by the much more plentiful fast queries
[23:10:47] can be seen in curl https://search.svc.codfw.wmnet:9243/_nodes/elastic2052-production-search-codfw/stats/adaptive_selection
[23:10:53] yea
[23:11:50] i have some suspicion it would work better if we could push completion suggester off into its own cluster, would be a reasonable thing to move into some sort of virtualization/k8s/nomad/something
[23:12:06] also because completion suggester doesn't get updates, it gets built once a day
[23:14:56] * ebernhardson wishes that adaptive_selection would report with meaningful names instead of the internal node id's that are gibberish
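On wishing the adaptive_selection stats used meaningful names: the keys in that output are internal node ids, but `_cat/nodes` can print the full id next to the node name, which makes them easy enough to cross-reference. A small sketch:

```bash
ES='https://search.svc.codfw.wmnet:9243'

# Full (untruncated) internal node id alongside the human-readable name.
curl -s "$ES/_cat/nodes?full_id=true&h=id,name" | sort -k 2

# The adaptive replica selection stats are keyed by those same internal ids;
# per target node they track outgoing searches, avg queue size, and service/response times.
curl -s "$ES/_nodes/elastic2052-production-search-codfw/stats/adaptive_selection" \
  | jq '.nodes[].adaptive_selection'
```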
"routing shards". the value is a bit not obvious, but playing with it in desmos i think it's basically 1024 for all sane values of number of shards [23:36:54] https://github.com/elastic/elasticsearch/blob/v7.10.2/server/src/main/java/org/elasticsearch/cluster/metadata/MetadataCreateIndexService.java#L1221-L1225 [23:41:53] my first graph was wrong :P This should be the number_of_routing_shards default values: https://www.desmos.com/calculator/upa7nr20o1 [23:43:23] anyways, the point is in elastic 7 if we have 16 shards it should set number_of_routing_shards to 1024, and then we can split the index into more shards, any number evenly divisible into 1024 iiuc. But i guess that requires making the index read-only, so not that useful [23:50:46] (i suppose read-only will be easier to implement in cirrus-streaming-updater is we wanted to)