[02:22:47] hmm
[09:39:43] that sounds like a question for our legal team. My understanding is that we can host under AGPL, which would require us to publish any changes we make (which is already the case)
[09:40:03] So, should we stick with Elasticsearch? Or still migrate to OpenSearch?
[09:41:32] Weekly status published: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2024-08-30
[09:44:07] For a change, I summarized Trey's notes a lot less than usual. I've removed the parts about next steps (this update is about what we've done) and left everything else as-is.
[09:44:31] This saves me a bit of time, and I think it still makes sense for people who don't have the full context.
[09:44:37] Let me know if you think otherwise.
[11:22:03] it seems like it will be useful to understand how teams generally are thinking about this - for observability needs maybe we're happy with the fork for unencumbered features, but dunno...if we're happy there then the question of the fork's sponsors' attention to the project is paramount. just thinking out loud
[11:26:27] i don't mind the non-summarized notes. those do make their way into a cross-unit meeting. the nature of the work is extremely specialized, so a sentence summarizing each difficult paragraph for a layperson could help for copy-pasta to the management meeting, maybe? we should probably all be writing in that fashion to ease the burden on you gehel and any copy-pasters for other status meetings
[13:10:13] Re: Elastic license change, there are a number of paid features in ES (security X-pack, for example) that are free in Opensearch
[13:12:31] I think it would be nice to have 'em, but I'll leave that up to everyone else
[13:44:13] \o
[13:45:08] inflatador: x-pack doesn't exist in 8.x anymore, i haven't looked closely but my understanding was that it was all rolled into core. I also understand there are still license-gated things though
[13:51:52] ebernhardson the docs seem to suggest as much https://www.elastic.co/guide/en/elasticsearch/reference/current/security-basic-setup.html
[14:16:44] reopened T370661 so we can have the Opensearch/Elasticsearch discussion again...apologies if that is not the best ticket for the discussion
[14:16:45] T370661: Decision brief about OpenSearch vs Elasticsearch - https://phabricator.wikimedia.org/T370661
[14:17:10] yea it does seem like we will at least need to make some evaluations...i have no clue how exactly to proceed :P
[14:18:40] I'm excited to at least install ES8 and play around with it...from a political standpoint I'd rather not have a dependency on AMZ
[14:23:36] yea, i do find it amusing how online forums decry how terrible elastic was, but are praising amazon. Amazon's track record on open source is not clearly any better
[14:26:34] No, it's far worse. I know we don't love open-core, but that's the business model for smaller players. Is it worse to rely on open-core or on completely open-source software that is subsidized/would not exist w/out Google/Facebook/AMZ?
[14:36:02] school run, back later. might stop at a coffee shop and try reading what exactly is in elastic 8 and what's license-gated
[14:36:42] cool, keep us posted. I'm currently trying (and failing) to get cgroups to behave as I expect
[15:20:46] iiuc xpack is no longer a separate install, but the configuration all still says xpack.* for backwards compat reasons. A lot of the security features are still behind license gates, but it looks like TLS and auth/RBAC are going to finally be FOSS
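For anyone poking at an ES 8 test install, one quick way to see what the node itself reports as license-gated is the X-Pack info API. A minimal sketch, assuming a plain unsecured localhost instance (adjust host/auth for anything real):

```python
# List which features the node reports as available/enabled under the
# current license. Assumes an unauthenticated local ES 8 test instance.
import requests

BASE = "http://localhost:9200"  # assumption: local test install

info = requests.get(f"{BASE}/_xpack", timeout=10).json()
print("license type:", info.get("license", {}).get("type"))
for name, feature in sorted(info.get("features", {}).items()):
    print(f"{name:25s} available={feature.get('available')} enabled={feature.get('enabled')}")
```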
[15:42:43] back
[15:46:56] moonmoon ACK, that is all good news
[15:47:33] yeah, although in terms of security features I think opensearch wins in an apples-to-apples comparison
[15:48:17] it does, i'm just not sure if we need any of those features. I think the main asks we had were about inter-node TLS, although i suppose splitting read and write access might be nice to have
[15:49:23] yeah for WMF not strictly necessary. I host a number of independent wikis and there the document and field level security features are useful to be able to have them share a multitenant ES cluster without being able to see/search data on the other wikis in there
[15:49:46] i have to read closer, it's not clear if index-based access control is in the free part. Certainly their row-based access control is license limited but we have no use for that
[15:51:00] moonmoon: we have several features that might make that a bit awkward though, for example `action=cirrusdump` tells anyone with read access to the wiki what the search index looks like for a given page
[15:51:12] similarly with the prop=cirrusdoc part of the query api
[15:51:24] hmm, I'll have to check those out to see if they break or not :P
[15:51:56] prop=cirrusbuilddoc does the same kind of thing, but instead of reading it from elasticsearch it generates the same data from the sql databases
[15:53:09] to be clear, I'm only using the document/field level security features in mw_cirrus_metastore since that index is designed to be shared between multiple wikis in the same farm. Otherwise each wiki gets exclusive access to its own index for the actual search bits
[15:53:34] ahh, ok. that makes more sense and fits better with how cirrus exposes things
[16:28:03] workout, back in ~40
[18:05:28] oops..been back, but now lunch! Back in 40
[18:07:35] hmm, apparently wrt many-categories queries, elastic lets us be as dumb as we want by nesting bool queries within each other. The limit is 1024 clauses per bool query, but we can have 1024 bool queries with 1024 categories each nested in another bool query
[18:08:52] gets expensive though, using 1k categories is ~400ms, 4k is 1s, 9k is >2s, 16k is >3s, by 24k it's almost 6s
[18:11:53] it's never quite clear how much we should allow searches to just be expensive if they do what editors need, but clearly we could allow deepcat to expand into thousands of queries if blazegraph can return them in a reasonable time period
[18:15:03] q
[18:18:26] (and that is likely with elastic caching some of the earlier term filters for use in later queries, since i run a query every 1k categories)
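A rough illustration of the nesting trick from 18:07 above, in case it helps picture the query shape: chunk the category list into groups of at most 1024, wrap each chunk in its own bool/should of term clauses, and OR the chunks together in an outer bool. This is only a sketch; the `category` field name is an assumption about the mapping, not a confirmed detail.

```python
# Sketch of a nested bool filter that stays under the 1024-clause-per-bool
# limit by chunking terms into inner bool queries. Field name "category"
# is an assumption; the shape of the query is the point.
import json


def category_filter(categories, chunk_size=1024):
    chunks = [categories[i:i + chunk_size] for i in range(0, len(categories), chunk_size)]
    return {
        "bool": {
            "should": [
                {"bool": {"should": [{"term": {"category": c}} for c in chunk]}}
                for chunk in chunks
            ],
            "minimum_should_match": 1,
        }
    }


cats = [f"Category_{i}" for i in range(5000)]  # fake data for illustration
query = {"query": {"bool": {"filter": [category_filter(cats)]}}, "size": 0}
print(json.dumps(query)[:200], "...")
```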
[18:19:44] maybe we let deepcat be expensive, but stuff it into the regex pool counter (and maybe rename that the expensive-editor-query (tbd :P) pool counter)
[18:26:22] do we have a record of how many expensive queries we actually run at once? Do we ever get to 10 in parallel? I'd be up for counting deepcat as "expensive" along with regex, as a test or just going for it, if it isn't way too much backend work.
[18:27:41] hmm, i'm not sure how many metrics we have for regex. I know it's pretty rare to see the regex poolcounter in the poolcounter rejections graph. Checking
[18:33:01] back
[18:33:25] hard to tell from grafana :P It's decent at rates, but knowing "x event happened y times" it's pretty bad at...maybe from logstash
[18:36:50] ahh much better, looks like regex pool counter rejections come in bursts. On Aug 20 there are 5 buckets of 30s that have rejections, 500+ in each 30s bucket.
[18:37:37] on aug 22 a single 30s bucket with ~800 rejections. same on the 24th
[18:39:46] heh, only glanced at a couple of the failures (wish logstash.wikimedia.org could easily do ad-hoc aggregations over arbitrary fields), but the query in the three examples i glanced at was `insource:/\#\*:/`
[18:41:35] i suppose that kinda exposes an optimization that could be applied to the regex query: that seems to be an explicit trigram lookup, so it could skip the regex and only do the trigram
[18:51:34] we got an alert for morelike in eqiad, but it cleared immediately
[18:52:54] trying to troubleshoot w/the shiny new dashboard https://grafana.wikimedia.org/goto/-Tg5mY3Sg?orgId=1
[18:56:09] hmm, so indeed we are showing increased latency over the last half hour. I wonder if that has anything to do with my simulated deepcat queries running against eqiad...checking
[18:58:13] (they've been running against commonswiki_file, which is everywhere. But it should still only take 1 thread per shard since i'm not doing anything parallel)
[18:58:37] I see some gaps in the job time metrics too...not sure what to make of that
[19:00:53] well i stopped my script (i was letting it run up to 100k categories in a single search, stopped early at 93k) and the cluster p95 and failures seem to be declining
[19:01:19] suggests the deepcat limit shouldn't be 100k :P But hard to say it's what caused it
[19:01:37] curiously the queries themselves were still completing in a plausible timeframe of ~20s per query
[19:01:53] no worries, it never hurts to check
[19:02:41] any ideas why we might have gaps in these job time metrics? https://grafana.wikimedia.org/goto/AP_CiY3Sg?orgId=1
[19:03:52] looks like this is still a graphite metric. Can't remember, is this one that we can't replace yet?
[19:04:41] inflatador: yes that query is against the `MediaWiki.jobqueue.run.cirrus*.mean` metric, which is basically not-us
[19:04:42] I don't see it listed in T359033
[19:04:43] T359033: EPIC: Convert CirrusSearch metrics to statslib - https://phabricator.wikimedia.org/T359033
[19:05:07] we can only convert metrics we collect ourselves
[19:05:07] oh well, not a huge deal then
[19:06:18] poking over the graphs, it's not really clear why latency would be up, or why we were getting low levels of search threadpool rejection
[19:08:37] i guess to be sure i could spin my script back up and see if the problem returns, would at least know where it came from
[19:09:45] letting it start again, but this time starting queries at 75k categories per query
[19:14:40] curiously, the per-node percentiles graph doesn't show any particular increase in latency over the time period. max morelike p95 across all instances stays ~300ms through the time that cirrus-observed morelike p95 increased to 2s
[19:15:25] which i suppose suggests the queries themselves didn't take longer to execute, but they were queued somewhere instead of running, or aggregating the results was taking longer
[19:16:11] i am seeing the p95's going up again since i restarted my script though. Does seem to imply that the cause is from running those large categories searches
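One way to tell queueing apart from slow execution the next time this comes up would be to watch the search thread pools while the script runs; the _cat thread pool API reports per-node active/queue/rejected counts. A minimal sketch, assuming direct access to the cluster's HTTP endpoint (the host is a stand-in):

```python
# Poll per-node search thread pool stats to see whether queries are piling
# up in the queue or being rejected. Endpoint is a stand-in.
import time

import requests

BASE = "http://localhost:9200"  # assumption: reachable cluster endpoint

for _ in range(10):  # take a handful of samples, 30s apart
    resp = requests.get(
        f"{BASE}/_cat/thread_pool/search",
        params={"v": "true", "h": "node_name,active,queue,rejected"},
        timeout=10,
    )
    print(resp.text)
    time.sleep(30)
```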
[19:19:27] * ebernhardson should probably just post the results and wait until david is back next week, he probably has better ideas what happens when we give lucene 90k terms queries
[20:09:19] * ebernhardson is for some reason surprised there are 15M categories in commonswiki
[20:40:11] * dr0ptp4kt 😅
[21:21:50] kicking off again, but this time with smaller batches (1k categories, initially) and with parallel queries (10 at a time) against random samples of the full category set, to get an idea of what might be reasonable to allow cirrus to do
[21:34:11] * ebernhardson notes that the script appears to be limited not by elastic, but by python hitting 100% cpu usage :P Was hoping to keep things simple
[22:18:03] hmm, even 10 parallel requests with 10k categories each is enough to push the latency graphs around :( Might be acceptable though, letting it run...
[22:19:55] have a good (possibly long) weekend y'all
[22:20:40] "suggests the deepcat limit shouldn't be 100k" ... that seems quite reasonable as an absolute upper bound. If you need to process 100k categories, maybe you should use a different tool.
[22:20:42] 10k is still kind of a lot, though. Of course, we could ratchet it up in steps and start with 2k, then 5k, then 10k if everything seems stable in real life.
[22:21:22] it will certainly be less in real life, but i suppose i'm trying to see what it will do if someone codes a silly bot that pushes the limits we decide to enforce
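For what it's worth, a sketch of the kind of limit-probing described above: 10 parallel requests, each with a random sample of 10k categories, timing each response. The endpoint, the category list, and the `category` field are stand-ins, and the nested bool builder is the same shape as the earlier sketch; this is not the actual script, just an illustration of the approach.

```python
# Probe how the cluster behaves under parallel large-category searches.
# Everything here (endpoint, category data) is a stand-in for illustration.
import random
import time
from concurrent.futures import ThreadPoolExecutor

import requests

BASE = "http://localhost:9200"               # assumption: test endpoint
INDEX = "commonswiki_file"                   # index named in the discussion
ALL_CATEGORIES = [f"Category_{i}" for i in range(1_000_000)]  # fake data


def category_filter(cats, chunk_size=1024):
    # same nesting trick as the earlier sketch: <=1024 term clauses per inner bool
    return {"bool": {"should": [
        {"bool": {"should": [{"term": {"category": c}} for c in cats[i:i + chunk_size]]}}
        for i in range(0, len(cats), chunk_size)
    ], "minimum_should_match": 1}}


def one_query(n_categories):
    sample = random.sample(ALL_CATEGORIES, n_categories)
    body = {"query": {"bool": {"filter": [category_filter(sample)]}}, "size": 0}
    start = time.monotonic()
    resp = requests.post(f"{BASE}/{INDEX}/_search", json=body, timeout=120)
    return time.monotonic() - start, resp.status_code


with ThreadPoolExecutor(max_workers=10) as pool:
    for elapsed, status in pool.map(one_query, [10_000] * 10):
        print(f"status={status} took={elapsed:.1f}s")
```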