[11:04:03] lunch
[11:53:26] More questions on prefixsearch/opensearch ... prefixsearch seems to find strings with a single typo (same as opensearch), is that expected?
[12:15:51] ... aaaand something else. I'm trying to explain to someone why they can't just use `deepcat` for everything. I wanna say something like this, just wanna make sure that it's technically accurate. Can someone verify for me please?
[12:15:51] > `deepcat` doesn't run on the regular search engine (elasticsearch) - basically it runs a SPARQL query on a graph database (BlazeGraph) that sits on different servers. The search engine is designed to be highly available for zillions of users and serves responses extremely quickly. The graph database is designed for much more flexible querying but that comes at the cost of speed and scalability. `deepcat` as it's
[12:15:51] > built right at the minute is never going to be able to handle the scale we need - if we wanted similar functionality in regular search we'd need to store the multi-category data in the regular search engine (and propagate changes in the category structure to it whenever they happen). I'm not ruling it out, but it would be a rather substantial project
[12:19:28] cormacparle: That's my understanding too, but we should wait for ebernhardson to have a look too.
[13:00:31] cormacparle: same here! Sounds correct, but ebernhardson (and in a month dcausse) will be able to confirm
[13:38:10] thanks folks - will wait for ebernhardson just in case ...
[13:52:35] \o
[13:53:52] cormacparle: roughly right, to do deepcat fully and natively inside the search engine we would have to invert the index, for every page we would have to index every category within a distance of 5 (and keep them updated)
[13:54:44] cormacparle: the current way runs a SPARQL query and uses the first 1000 categories it finds as a filter
[13:55:54] cormacparle: as for prefix, if you are getting fuzziness on prefix you are probably getting the completion suggester, regular prefix has no fuzziness iirc
[13:56:15] regular prefix search instead indexes all possible prefixes of a string as pointing at the string
[13:57:11] oh interesting, maybe `prefixsearch` in the action api now points at the completion suggester
[13:58:04] cormacparle: it's a bit fuzzy, there are api options to choose between them iirc. double checking
[13:58:41] 👍
[13:59:07] cormacparle: so for example the generator=prefixsearch has `gpsprofile=???`, there you can select strict/classic/fuzzy/etc
[16:35:54] ebernhardson: so here's my revised text for answering this user
[16:35:59] https://www.irccloud.com/pastebin/Glzd6bs2/
[16:36:14] is that accurate now?
[16:36:26] looking
[16:37:07] cormacparle: yea that seems reasonable
[16:37:17] cool thank you!
[16:38:05] on the prefixsearch - I'm guessing that 'fuzzy' corresponds to the completion suggester?
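A quick way to see the profile difference discussed above is to compare results across profiles from the action API. The sketch below is illustrative rather than taken from the log: it assumes en.wikipedia.org as the target wiki, assumes the `strict`/`classic`/`fuzzy` names are also exposed as `psprofile` on `list=prefixsearch` (the counterpart of `gpsprofile` on the generator), and uses the `cirrusDumpQuery` debug flag described just below, assuming its output exposes a `path` field as stated there.

```bash
# Illustrative sketch, not from the log: compare prefixsearch profiles and check
# which backend index each one hits. Assumptions: en.wikipedia.org as the wiki,
# psprofile as the list=prefixsearch equivalent of gpsprofile, and a "path"
# field in the cirrusDumpQuery output.
for profile in strict classic fuzzy; do
  echo "== psprofile=${profile} =="
  # top results; the deliberate misspelling should only match on the fuzzier profiles
  curl -s "https://en.wikipedia.org/w/api.php?action=query&list=prefixsearch&pssearch=albert%20einstien&psprofile=${profile}&format=json" \
    | jq -r '.query.prefixsearch[].title'
  # which index the query was sent to (content index vs titlesuggest)
  curl -s "https://en.wikipedia.org/w/api.php?action=query&list=prefixsearch&pssearch=albert%20einstien&psprofile=${profile}&cirrusDumpQuery=1" \
    | jq -r '.. | .path? // empty' | sort -u
done
```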
[16:49:41] cormacparle: yes, fuzziness is a switch from the strict prefix index to the completion suggester
[16:51:05] cormacparle: you should be able to attach `&cirrusDumpQuery` to any request and you will see what it's doing, the prefix search will search against something like `enwiki_content/_search` while the completion suggester will be against `enwiki_titlesuggest/_search` in the `path` field
[16:54:06] it looks like i should be able to move traffic by using confctl from a cirrus host where i have root...going to try shifting traffic in a moment here
[16:54:23] will just depool one of the small clusters from eqiad as a first test
[16:54:50] inflatador: ryankemper: ^ fyi
[16:58:49] * ebernhardson first tries to figure out which graphs would even show the move...
[17:00:32] ebernhardson: ack
[17:06:45] hmm, nope i can't do that from the search servers :P They have enough credentials to pool/depool themselves but not the whole service
[17:07:28] ebernhardson: need me to do the honors?
[17:07:33] ryankemper: sure
[17:07:48] ryankemper: i was trying to depool psi, i have a watch up to see the traffic move away
[17:09:51] i think these are the appropriate commands: https://phabricator.wikimedia.org/T143553#10939312
[17:11:33] ebernhardson: okay, so I gather I should only depool psi and not the others wrt this experiment
[17:11:50] ryankemper: right, first i was going to do just psi. It's probably fine, but seemed a better first test :)
[17:12:05] agreed
[17:12:12] proceeding
[17:12:38] ok ran `sudo confctl --object-type discovery select "dnsdisc=search-psi,name=eqiad" set/pooled=false`
[17:13:13] hmm, query rate doesn't really seem to have shifted
[17:13:57] hmm, curiously `curl https://search-psi.discovery.wmnet:9643/` from deploy1003 is still giving eqiad
[17:15:09] and now it's switched, i guess there is a delay?
[17:15:31] was about to say might be ttl related
[17:15:37] query rate doesn't seem to have fully transitioned, so i guess we should expect it to switch over in the next few minutes? Yea maybe dns TTL
[17:18:19] hmm, still serving ~100req/s in psi-eqiad and ~20/s in psi-codfw.
[17:20:17] i think the ttl is 5 minutes, so a couple more
[17:27:00] hmm, been 6 more minutes, but didn't see traffic move as much as expected :S not sure what that means :P
[17:43:41] ebernhardson: sorry been distracted by dog. should i switch traffic back or is it fine in current state
[17:44:47] ryankemper: hmm, it's probably fine? The query rate on the clusters didn't change as much as i expected, but maybe that's something else. anything i can check directly seems to be querying cross-dc
[17:45:24] ryankemper: i mean, we should either do the rest of the test (move all traffic), or move it back
[17:45:42] okay let's do the rest of the test
[17:45:53] sure
[17:46:51] ok, switched
[17:46:53] https://www.irccloud.com/pastebin/gwMJJEeZ/
[17:47:45] thanks! watching the main graphs now
[17:53:13] hmm, again not seeing the full change in metrics i would expect :S
[17:54:08] although maybe still waiting on ttl.. dig claims another 81s
[17:59:44] yea, something isn't right :S No change to how active the servers are
[18:02:27] switching etcd state back for now
[18:02:37] sure
[18:03:07] we should def dig in further at pairing tomorrow. at the risk of stating the obvious, we must be missing a piece somewhere
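For reference, the two moving parts to double-check here are the conftool/etcd state for the service and the discovery DNS record itself. A minimal sketch, assuming confctl's `get` action is available alongside the `set/pooled=...` form used above (run from a host with the right credentials); the exact output format is an assumption.

```bash
# Hedged sketch: confirm what conftool currently has pooled for the service.
# Assumes the `get` action works alongside the set/pooled=... command shown above.
sudo confctl --object-type discovery select 'dnsdisc=search-psi' get

# The discovery record should follow within the DNS TTL (~5 minutes per the
# discussion above); the TTL column in the answer shows how long caches can lag.
dig +noall +answer search-psi.discovery.wmnet
```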
[18:04:03] indeed :) I'm not sure what, but with no changes to the active search threadpools i'm sure traffic didn't move as expected
[18:04:33] i'll poke around a bit on the mediawiki side i guess, maybe i forgot something...
[18:04:45] but with all the metrics recording against the dnsdisc endpoint, i'm not sure what...
[18:04:55] interestingly this is the current codfw state, wasn't aware we had the main cluster depooled
[18:04:58] https://www.irccloud.com/pastebin/GjAc3CUy/
[18:06:02] oh, that might cause the problem i suppose
[18:06:18] most traffic is on the big cluster, i guess that implies if both are pooled it at least still routes the requests instead of blackholing
[18:06:24] both are depooled i mean
[18:06:30] but it shouldn't have impacted psi at all so the first test still implies something is off right
[18:06:40] ebernhardson: which graph have you been staring at btw? just `overall qps`?
[18:07:35] ryankemper: sadly we don't have great ones, i have a `watch curl ...` command running that sums up the number of search requests the cluster has run and prints it every 2s, then can look at this to see which clusters are busy:
[18:07:38] https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1&from=now-30m&to=now&timezone=utc&var-cluster=elasticsearch&var-exported_cluster=production-search&viewPanel=panel-54
[18:08:25] ebernhardson: should i do one final test where i pool the codfw main cluster and then depool eqiad again?
[18:08:33] ryankemper: yea let's do it
[18:09:03] ok, just pooled codfw main, I will give it 5 mins before depooling eqiad
[18:17:00] ebernhardson: okay, depooled eqiad now
[18:18:21] curious, dashboard now shows traffic moving away from 18:07-18:10, roughly
[18:18:43] around when i repooled codfw-main maybe
[18:18:57] yea, but traffic went the opposite way. Traffic to codfw went down for a few minutes
[18:19:08] oh
[18:19:13] well I am extremely confused
[18:19:13] https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1&from=now-30m&to=now&timezone=utc&var-cluster=elasticsearch&var-exported_cluster=production-search&viewPanel=panel-54
[18:19:22] yea, it doesn't make sense :P
[18:20:12] oh that drop starts closer to 18:05, lemme check when it logged to operations
[18:20:59] that's within 5 minutes after eqiad was re-pooled, so that actually sounds intended
[18:21:37] hmm, i guess maybe
[18:34:09] ebernhardson: well, this test is a success. 99.99% of traffic shifted over
[18:34:45] https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1&from=2025-06-23T18:17:02.415Z&to=2025-06-23T18:33:53.255Z&timezone=utc&var-cluster=elasticsearch&var-exported_cluster=production-search&viewPanel=panel-54
[18:34:51] well this is the threadpool actually but a decent proxy
[18:37:32] oh nice! indeed it's all moved as expected now
[18:37:55] so, will call prior an artifact of both being depooled and repooled? Not 100% sure how that logics out, but seems plausible
[18:40:12] yup exactly
[18:40:33] seems like the remaining traffic we saw was just the main cluster being split 50/50 between DCs
[18:42:46] yea seems plausible
[18:44:22] i suppose we can shift traffic back and call this complete
[18:45:04] should i add those commands somewhere in the search operations docs? alternately, we can refer to the discovery dns docs
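A minimal sketch of the kind of `watch curl ...` counter mentioned at 18:07 above, used here to watch per-cluster query volume during the switch: it sums the `completed` counter of each cluster's search thread pool every 2 seconds. The per-datacenter hostnames are assumptions following the usual `<service>.svc.<dc>.wmnet` pattern; only the discovery name and the psi port 9643 appear in the log.

```bash
# Sketch only, not the exact command from the log: print how many search queries
# each psi cluster has completed, every 2 seconds. Hostnames are assumed per-DC
# service names; only port 9643 is confirmed above.
while true; do
  for host in search-psi.svc.eqiad.wmnet search-psi.svc.codfw.wmnet; do
    completed=$(curl -s "https://${host}:9643/_cat/thread_pool/search?h=completed" \
      | awk '{sum += $1} END {print sum}')
    echo "$(date +%T) ${host}: ${completed} search queries completed"
  done
  sleep 2
done
```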
[18:52:11] yeah, in the search operations docs
[18:52:19] although I thought we already had a blurb, lemme check
[18:52:51] only one mention of confctl in Elasticsearch_Administration
[18:53:02] under adding new nodes
[18:53:12] oh duh ofc we don't since we just added this capability lol
[18:53:27] was thinking of wdqs
[18:53:42] ebernhardson: looks like https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Read_Operations needs to be updated
[18:54:07] (switched traffic back btw)
[18:54:17] thanks! I'll try and update this as appropriate
[19:21:30] meh, constantly finding new issues :P our `ExpectedIndices` needs to be updated to query only direct clusters and not the dnsdisc endpoints
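To illustrate that last point: a check like `ExpectedIndices` wants to talk to one specific cluster regardless of pooling, which means querying the per-datacenter endpoints rather than the discovery name. A hedged sketch of the difference; the `.svc.` hostnames are assumptions following the usual per-DC naming pattern, only the discovery name appears in the log.

```bash
# The discovery name follows confctl pooling, so it can answer from either DC:
curl -s https://search-psi.discovery.wmnet:9643/ | jq -r .cluster_name

# Direct per-DC endpoints (hostnames assumed, following <service>.svc.<dc>.wmnet)
# always hit one specific cluster, which is what a monitoring-style check should query:
curl -s https://search-psi.svc.eqiad.wmnet:9643/ | jq -r .cluster_name
curl -s https://search-psi.svc.codfw.wmnet:9643/ | jq -r .cluster_name
```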