[10:02:19] lunch + errand, back soon-ish
[10:14:52] lunch 2
[10:52:48] lunch
[13:30:36] inflatador: I'll be 2' late for our 1:1
[13:30:42] gehel ACK
[14:48:44] trying to catch the last SRE offsite session, will not make weds mtg
[15:58:46] workout, back in ~30-45
[16:35:24] aaand back
[17:20:46] curiosity, someone is hammering the codfw cluster through mediawiki. only linking mjolnir-msearch because it happens to show both clusters on one dashboard, mjolnir is unrelated: https://grafana.wikimedia.org/d/000000616/elasticsearch-mjolnir-msearch?orgId=1&refresh=5m&from=now-2d&to=now
[17:53:21] dinner
[18:01:58] lunch, back in ~1 h
[18:08:33] weird. more_like_this doesn't seem to vary at all with different track_total_hits values. Variations of true, false, and 5 all return somewhere in the 350-500ms range for the same page repeatedly queried
[18:11:42] taking a brief look at the profile, next_doc is invoked ~100M times with track_total_hits: false, and 170M times for true on the same shard (probably not the same replica of the same shard though)
[18:11:51] but the resulting timing is similar
[18:12:06] oh, i'm totally misreading. that's 100M ns
[18:12:45] next_doc_count is ~60k for false, 65k for true on the same shard
[18:12:51] anyways, doesn't look worth investigating much more :P
[18:53:49] ryankemper: I might be 5' late for our 1:1, sorry
[18:58:52] gehel: ack
[18:59:10] actually, I'm there!
[19:06:12] back
[19:35:07] meh, figures. I tried to reindex commonswiki again; on one cluster it's working away, on the other it immediately failed with the previous error (the create worked, but subsequent fetching of settings failed). So i wrote a little python script that creates ebernhardson_test with the expected settings and immediately fetches, trying to reproduce a race of some sort. Of course it's created and deleted
[19:35:09] the index 20 times now without issue :P
[19:39:22] booo
[19:49:03] ebernhardson here's the phab task we talked about at unmtg: https://phabricator.wikimedia.org/T318270 . Feel free to add/edit/change.
[20:02:58] lgtm :)
[20:14:36] ebernhardson: inflatador: hmm so do we really see a bunch of shards for a big index packed onto a single host? that seems at odds with the settings we have (see https://phabricator.wikimedia.org/T318270#8251669)
[20:19:28] ryankemper: what we see is a couple of instances overloading, and generally a huge imbalance. Check for example the per-host cpu utilization heatmap here: https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=codfw&var-cluster=elasticsearch&var-instance=All&var-datasource=thanos
[20:21:20] ryankemper: most of the cluster is about the same, then a few hosts are struggling. Can correlate those with https://grafana.wikimedia.org/d/000000486/elasticsearch-per-node-percentiles?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-bucket=full_text&var-interval=1m to see that 2047 and 2052, both outliers on the cluster overview, are also outliers on max latency
[20:21:50] (unrelated question of why codfw is receiving more search requests, through mediawiki, than eqiad right now)
[20:22:02] ebernhardson: right but to clarify, when we see the instances overloading do they actually have *more* total commonswiki_file, enwiki, etc shards? (since that's the problem the ticket is describing)
[20:22:05] seems like a bot pushing hard that routes to codfw
[20:22:42] I'll try comparing 2048 to 2047 and see if the # of big shards is similar or not
[20:22:45] ryankemper: that's part of what (i assume) the ticket is to figure out. My assumption has been that the reason banning the node and letting it back into the cluster to get a new set of shards fixes the issue is that it had more of the "busy" shards
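
A minimal sketch of the kind of per-node comparison being discussed (2047 vs 2048, etc.), pulling shard sizes from the _cat/shards API and keeping only the _content/_file indices. The endpoint URL, the substring match on node names, and the use of the requests library are assumptions for illustration; this is not the script that produced the pastes linked further down.

    #!/usr/bin/env python3
    # Rough per-node comparison of the biggest _content/_file shards.
    import collections
    import requests

    ES = "http://localhost:9200"  # placeholder; any coordinating node of the cluster
    WATCH = ("2047", "2048", "2052", "2054")  # hosts being compared above

    resp = requests.get(
        f"{ES}/_cat/shards",
        params={"format": "json", "bytes": "b", "h": "index,shard,node,store"},
        timeout=30,
    )
    resp.raise_for_status()

    per_node = collections.defaultdict(list)
    for row in resp.json():
        index, node = row["index"], row["node"] or ""
        # _general indices are large but rarely busy; only keep _content/_file.
        if not index.endswith(("_content", "_file")):
            continue
        per_node[node].append((int(row["store"] or 0), index, row["shard"]))

    # Top 10 biggest qualifying shards per watched node, largest first.
    for node, shards in sorted(per_node.items()):
        if not any(h in node for h in WATCH):
            continue
        print(node)
        for size, index, shard in sorted(shards, reverse=True)[:10]:
            print(f"  {size / 2**30:7.1f} GiB  {index} [{shard}]")

Size is only a proxy here; as noted in the next few messages, "busy" matters more than "big" but is harder to get out of the data.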
[20:23:15] busy is probably more important than big, but perhaps harder to eke out of the data
[20:23:27] yeah, that makes sense
[20:24:07] busy would have to be estimated by calling the stats endpoint multiple times and comparing how the total counts increase. I think we have that data in prometheus, but only for the top-10 indices by size or something like that
[20:25:44] we could probably expand to top 20 or 30 if it seems like the data would generally help
[20:30:15] Would it help to look at the shards present on the lowest performers vs highest performers, maybe?
[20:30:37] I'm looking at lowest vs average performers
[20:30:56] Not sure why but my intuition tells me comparing lowest to highest directly might risk us drawing bad conclusions
[20:32:41] (my intuition could be totally wrong) :P
[20:32:53] here's the top 10 biggest shards for 2047, 2048, 2052, 2054
[20:33:28] Lowest vs avg does sound more useful
[20:33:29] I consider 2047 and 2054 to be average performers (that was just glancing at the cpu graph and not latency, so I didn't check it super rigorously)
[20:33:47] https://www.irccloud.com/pastebin/vP1Jvb2n/top_10_shards_for_4_nodes.log
[20:34:01] I arbitrarily selected the top 10 biggest shards for each
[20:34:55] ryankemper: hmm, maybe only select _content and _file indices, the _general indices are large but rarely busy
[20:35:41] Before I analyzed 2054 it was looking like there might be a correlation with the two commonswiki_file shards on 2047/2052 vs one on 2048, but then 2054 ended up having two as well
[20:38:48] ebernhardson: ack
[20:42:02] https://www.irccloud.com/pastebin/jQk30bFF/top_10_file_or_content_shards_per_node.log
[20:46:58] In any case we never redid our primary shard count numbers based off the new 50-node cluster size, as opposed to the previous 36
[20:50:23] maybe something like this in promql: sum(label_replace(rate(elasticsearch_indices_search_query_total{instance!~"elastic203[17]:[0-9]+"}[1h]), "hostname", "$1", "instance", "(.+):[0-9]+")) by (hostname)
[20:51:09] essentially, that shows that for the indices we monitor, a couple of the nodes are getting 450-500 queries when others are seeing 300 or even 200
[20:51:14] lunch
[20:52:46] I do think we should look into a promQL query to approximate "busy", do you use https://alerts.wikimedia.org to test or what's the best way?
[20:57:49] I guess you can just do it in grafana?
[21:13:05] inflatador: yea, from https://grafana-rw.wikimedia.org/explore
[21:14:17] but there are limitations, elasticsearch_indices_* is only recorded for ~10 indices, which come from production-search-eqiad.indices_to_monitor in profile::elasticsearch::instances hiera
[21:15:00] also, back :)
[21:15:48] i'm not sure why my query had to be awkward though, 2031 and 2037 were massive outliers and i wasn't sure what to do other than exclude them
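
A minimal sketch of running that same PromQL outside Grafana, against a Prometheus-compatible HTTP API via /api/v1/query. The endpoint URL is a placeholder for whatever Thanos/Prometheus backend Grafana Explore is pointed at, and requests is assumed for brevity.

    #!/usr/bin/env python3
    # Run the per-host query-rate PromQL above and print hosts sorted by rate,
    # so the outlier nodes float to the top.
    import requests

    PROM = "http://localhost:9090"  # placeholder Prometheus/Thanos endpoint

    QUERY = (
        'sum(label_replace(rate(elasticsearch_indices_search_query_total'
        '{instance!~"elastic203[17]:[0-9]+"}[1h]), '
        '"hostname", "$1", "instance", "(.+):[0-9]+")) by (hostname)'
    )

    resp = requests.get(f"{PROM}/api/v1/query", params={"query": QUERY}, timeout=30)
    resp.raise_for_status()

    # Instant-query result: list of {"metric": {...}, "value": [ts, "<number>"]}
    rows = sorted(
        (float(r["value"][1]), r["metric"].get("hostname", "?"))
        for r in resp.json()["data"]["result"]
    )
    for rate, host in reversed(rows):
        print(f"{host:20s} {rate:8.2f} queries/s")

As noted above, elasticsearch_indices_search_query_total only covers the handful of monitored indices, so this approximates "busy" rather than measuring it exactly.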
[21:16:00] ah very helpful
[21:16:58] I need to get better at the ol' promQL, haven't touched it as much as I should since I got here
[21:17:11] mostly i google for things and hope to find a reasonable answer :P
[21:18:15] hey, that's MY job ;P
[21:23:26] bah, I need to learn more calculus/statistics
[21:24:16] i mean, we only need the superficial part of calculus. statistics we can't get away from, but at least there's no bayes rule here :P
[21:26:07] i'm sure i make a bunch of wild assumptions that would make a true statistician a bit upset, but that's ok :P
[21:27:35] ebernhardson we're in meet.google.com/bxh-fdwz-zbg if you wanna kick around the query a bit
[21:27:44] sure, sec
[21:27:54] if a study has p=.05, that means there's a 95% chance that the study is true
[21:28:03] okay, I've distracted the statisticians erik, you can make a run for it now