[11:46:15] ebernhardson and Trey314159 ... we never really came to a conclusion on "have we ever thought about using stemmed analysis for autocomplete?" did we? Could we set up a profile that does that and limit it only to adding wikilinks in VE and test it?
[11:46:29] I get what you're saying about false positives Trey ... do we have a way of measuring that?
[11:47:32] (I'm not asking you guys to actually do this btw, just figuring out if there's a way to do it that's not mad complicated)
[13:08:20] \o
[13:35:34] cormacparle: i did ask trey about this yesterday, he had some thoughts. He can probably describe them better than I can
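For concreteness, a minimal sketch of the "stemmed analysis for autocomplete" idea under discussion, assuming opensearch-py; the index, analyzer, and filter names here are invented for illustration and are not CirrusSearch's actual completion profile:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["http://localhost:9200"])

body = {
    "settings": {
        "analysis": {
            "filter": {
                # Hypothetical edge-ngram filter so prefixes match at index time.
                "autocomplete_edge": {"type": "edge_ngram", "min_gram": 1, "max_gram": 20},
            },
            "analyzer": {
                # Index side: stem first, then edge-ngram the stems, so
                # prefix matching happens against stemmed forms.
                "stemmed_autocomplete": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "porter_stem", "autocomplete_edge"],
                },
                # Query side: stem the user's input but do not edge-ngram it.
                "stemmed_search": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "porter_stem"],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "stemmed_autocomplete",
                "search_analyzer": "stemmed_search",
            }
        }
    },
}

client.indices.create(index="stemmed_autocomplete_test", body=body)

# Eyeball what the index analyzer emits -- useful for spotting the kind of
# false positives mentioned above ("running" stems to "run", so the short
# prefix "run" now matches it).
tokens = client.indices.analyze(
    index="stemmed_autocomplete_test",
    body={"analyzer": "stemmed_autocomplete", "text": "running"},
)
print([t["token"] for t in tokens["tokens"]])
```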
[14:58:58] hmm, how terrible would it be to swap the mediawiki metrics that now say "dnsdisc" with the source cluster? Essentially refactor the dashboard to report qps/latency by mediawiki source cluster, instead of by target cluster.
[14:59:33] Sounds a little odd, but that's essentially what info we have right now. Otherwise maybe we could stuff a cluster name into a header and use that on the mediawiki side when reporting stats, but that will probably get messy
[14:59:36] If we have that new metric, I'm good
[15:00:34] i suppose one other oddity there is cloudelastic, not sure how that fits :P
[15:00:45] err....if we have an existing metric to use, that is. We'd also talked about using Envoy instead of nginx for TLS termination, do we need to do that first?
[15:01:09] All metrics currently report what site they came from, eqiad or codfw. So we have that info in the metrics today
[15:01:34] it would be changing the report from "p95 latency of requests to eqiad" to "p95 latency of requests coming from eqiad"
[15:01:51] and we would have to remember if traffic has moved around
[15:02:33] or we could report `source -> dest`, but that might be too many lines
[15:04:03] Interesting. Possibly dumb question, but would that change if we did use Envoy for TLS termination?
[15:05:22] what envoy gives us if we switch, afaict, is metrics between envoy and opensearch over localhost. So similar metrics, qps would be the same but latency would be missing any network effects. Not sure if that matters
[15:05:50] and yea, `source -> dest` is too many lines: https://grafana-rw.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?forceLogin&from=now-7d&orgId=1&timezone=utc&to=now&var-cluster=elasticsearch&var-exported_cluster=production-search&viewPanel=panel-18
[15:10:59] In that case, I think we will need to refactor the dashboard. If you wanna get a ticket started with an example prom query I can take a look
[15:12:50] another thing that would work for some cases would be adjusting the proxy on the opensearch hosts to add the cluster name to a response header
[15:13:06] would require some special sauce on the cirrus side to extract that, but should be plausible
[15:13:18] i just don't know which parts are important, lots of options :P
[15:22:23] Yeah, definitely think it over. I guess I'd prefer to rework the dashboards unless the cirrus changes are useful for a larger audience
[16:39:55] and users already found a problem with my regex changes, missed some edge case :S https://phabricator.wikimedia.org/T399162
[16:47:22] i wonder if it's something in the middle...because when i add their query to the integration test it works as expected...
[17:01:32] hmm, i can reproduce it. Somehow related to quotes
[17:02:15] and the trigram accel, sounds fun :P
[17:18:43] err, that's interesting...apparently in lucene regex you can wrap something in `"` and it becomes a literal string
[17:22:22] so `insource:/"u.a."/` searches for the literal string `u.a.`
[17:23:33] the fix isn't too bad, but we will have to roll a new plugin release
[18:11:11] back
[18:11:33] np, we can work on it at pairing today or tomorrow
[18:12:35] or whenever. I polished up the README for that plugin build playbook, so hopefully it's easier this time ;)
[18:32:07] headed to the doctor shortly, back in 60-90m
[20:31:59] back
[20:32:43] hmm, getting some morelike P95 alerts. Checking...
[20:39:51] it cleared, but the alert verbiage says '(mw@codfw to dnsdisc)'....which goes back to the whole dashboard refactor thing
[20:51:09] latency dropped back down to normal. It's possible T399221 was part of the problem
[20:51:10] T399221: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221
[21:04:57] I set up this repo to play around with the opensearch CLI. I really like the profiles feature for our environments https://gitlab.wikimedia.org/repos/search-platform/searchme/-/tree/main?ref_type=heads
[21:46:50] ebernhardson: I tried to run the image suggestion DAG via airflow-devenv today, but that fails due to a Kerberos issue, which occurs while trying to establish a connection through HiveMetaStoreClient. Does that sound familiar to you? I am not sure if this is a problem with the dev-env; the spark-submitted execution of the script just works.
[21:47:23] https://www.irccloud.com/pastebin/KxZgFWyh
[22:05:06] I started on https://wikitech.wikimedia.org/wiki/Search/OpenSearch/Administration , which will one day replace https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration . Still a long way to go
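On the HiveMetaStoreClient/Kerberos failure above, one way to narrow down whether the devenv itself is missing credentials is a pre-flight ticket check. A minimal sketch, with a hypothetical helper that is not part of airflow-devenv:

```python
import subprocess

def has_valid_ticket() -> bool:
    # Hypothetical helper: `klist -s` exits 0 if the credentials cache
    # exists and holds non-expired tickets, non-zero otherwise.
    return subprocess.run(["klist", "-s"]).returncode == 0

if not has_valid_ticket():
    raise SystemExit(
        "No valid Kerberos ticket; run kinit (or check KRB5CCNAME) "
        "inside the devenv before triggering the DAG"
    )
```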