[07:28:17] I'm looking for someone to present Search at next DPE Staff meeting (Monday August 26). ebernhardson: would you be up for it?
[12:13:20] gehel: present what?
[12:13:32] ebernhardson: yep
[12:13:54] ebernhardson: Present what we've done for the last month: https://docs.google.com/presentation/d/1qujXYbMcNIkqUu62AQIY36QYRU7VkAS16dMc0J0ZAqg/edit#slide=id.g2e747bf5a7e_0_23
[12:14:14] I already have some notes on that slide, but there might be more
[12:14:28] The big thing is probably that we've completed SUP, including private wikis.
[12:15:53] gehel: i guess i can, doesn't seem like a lot to say but hopefully they only want a minute or 3
[12:16:26] Part of the responsibility of presenting is making sure that list is complete, asking the rest of the team if we have more significant stuff that's either completed or coming up next.
[12:16:44] Yes, it should be a fairly short presentation
[13:25:35] is natural_sort_{asc,desc} the right name? I'm now pondering if that's both not specific enough (natural sort of what?) and duplicative (of course a sort order sorts)
[13:28:28] \o
[14:14:11] * ebernhardson notes he is apparently trying to take sabbatical during DPE offsite 2025 :P
[15:04:44] Trey314159: I was able to run the unit test (without fixing the method signatures). There must be no vendor/ folder inside any of the installed extensions and all composer commands must be run in project root.
[15:05:15] I’ll update the instructions (at least for CirrusSearch).
[15:08:29] * ebernhardson sighs at wbsearchentities not autocompleting Special:* (i mean it's obvious why, but as a ui it's annoying :P)
[15:14:46] pfischer: thanks for the update.. I appreciate you looking into it, and updating the docs!
[15:17:28] Trey314159: Sure, I am aware of two places where CirrusSearch installation instructions are listed: https://www.mediawiki.org/wiki/Extension:CirrusSearch and https://www.mediawiki.org/wiki/MediaWiki-Docker/Extension/CirrusSearch both suggest running composer in the extension directory.
[15:19:31] Are there other places?
[15:19:41] let me check my list
[15:21:04] None of the others I used had those instructions.. so that's all that I'm aware of
[16:22:54] lunch, back in ~1h
[17:26:06] ebernhardson ryankemper just a heads-up on T373130 , releng might want to look at using Elastic as the search backend. Nothing urgent
[17:26:06] T373130: Revisit Elasticsearch in Phorge - https://phabricator.wikimedia.org/T373130
[17:26:24] the search backend for Phabricator, that is
[17:28:50] inflatador: hmm, they considered it years ago but upstream pushed back a lot. The problem is out of the box most search engines have poor relevance. Simply swapping in a different engine rarely does great. It needs specific and targeted tuning for use cases
[17:32:37] ebernhardson ACK, I appreciate the context. Sounds like the search quality got worse the last time they tried ES ;( . I was thinking more from a performance standpoint but results are def more important
[17:40:32] I'm way out on a precipice of theoreticals here, but would/should we be able to help w/better search results assuming they were interested in pursuing?
[17:43:35] i dunno, it's plausible but my memory of how the elasticsearch plugin worked for phabricator was that it was quite awkward, in part because phabricator's internal data model is fairly complicated (maybe i just don't understand their data model :P)
[17:43:44] are they having scaling problems with the current sql based search?
[17:44:35] Y, see https://phabricator.wikimedia.org/T353050
[17:45:11] not sure if that ticket leads directly to "use Elasticsearch"...again I'm pretty far out in theoretical-land
[18:03:46] i dunno, i suppose my first guess would be to use a replica database and point search at that one (no clue if directly supported). Would at least limit the blast radius
[18:07:51] in unrelated random curiosities, the wikidata item with the most properties is https://wikidata.org/wiki/Q39790431 (at 8,345)
[18:07:57] s/properties/statements/
[18:08:50] (it is, of course, a scholarly paper :P)
[18:24:11] ryankemper I got wdqs2024 to reimage, turns out that the console freezes didn't actually stop the installer. It was apparently timing out during the puppet steps, so I logged in with `install-console` and did them manually
[19:28:13] hmm, clearly i know nothing about sparql :P Attempted to get an idea of what the cardinality of P18's in commons is so wrote a query and...no results. But i'm not sure if that's because there is nothing or because i wrote the wrong query :P
[19:42:14] What do you mean exactly by their cardinality
[19:46:24] hare: the number of unique values. The context is someone wants us to index P18=??? into the search index, and we have to avoid indexing millions of unique values
[19:48:14] i thought it would be something simple like `select ?x where { ?x wdt:P18 ?z . }`. But maybe not
[19:48:24] err, select ?x and ?z
[19:51:42] oh i'm a dummy..the ticket is about monet paintings, and i was thinking commons. But it's actually wikidata
[19:53:37] Something like https://try.orbopengraph.com/#select%20distinct%20%3Fitem%20%3Fvalue%20where%20%7B%0A%20%20%3Fitem%20wdt%3AP18%20%3Fvalue%20.%0A%7D%20limit%20100 ?
[19:54:33] yea, but sadly i need the unique count which it looks like is going to be quite large
[19:57:41] * ebernhardson might cheat and skip the timeouts on a test server...
[20:02:31] ebernhardson: 5258239
[20:02:38] https://try.orbopengraph.com/#SELECT%20%28COUNT%28DISTINCT%20%3Fitem%29%20AS%20%3Fcount%29%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP18%20%3Fvalue%20.%0A%7D
[20:02:42] hare: nice! thanks
[20:08:21] ebernhardson: inflatador and I are scanning over some of the hiera config in `hieradata/role/common/wdqs/main.yaml` and `hieradata/role/common/wdqs/scholarly.yaml` to double check that everything makes sense. Most everything looks good, except I noticed this line
[20:08:23] `profile::query_service::sparql_query_stream: 'wdqs-external.sparql-query'`
[20:09:02] Since you're my surrogate david at the moment, I could use some help trying to figure out if it's fine for that value to stay how it is, or if there's some main / scholarly specific sparql query stream we should be setting it to
[20:09:34] ryankemper: hmm, i suspect it's perfectly fine to stay as is. I do wonder if we will have an obvious way in that stream to tell the two services apart though.
[20:09:40] * ebernhardson checks schema
[20:10:49] Do you know what the sparql query stream actually is? I found `blazegraph/src/main/java/org/wikidata/query/rdf/blazegraph/filters/QueryEventSenderFilter.java` in the rdf repo, but it's a bit opaque to me. Is it related to the bucketing logic in `blazegraph/src/main/java/org/wikidata/query/rdf/blazegraph/throttling/ThrottlingFilter.java` or is my hunch wrong
[20:10:50] it includes backend_host, which is a little indirect but probably sufficient for now
[20:11:15] ryankemper: it's a log of all queries run. It goes to eventgate and lands in hadoop. The schema is https://github.com/wikimedia/schemas-event-secondary/blob/master/jsonschema/sparql/query/current.yaml
[20:11:48] * ryankemper is convinced that excessively long java directory structure might be singlehandedly responsible for climate change
[20:12:09] at least, it should be. I added a `performer` field to it for wcqs and the intent was that we could figure out who was breaking things from that and be able to contact them
[20:12:38] java directory structure is why i have a bash alias called `up`. `up 7` does `cd ../../../../../../..`
[21:00:00] pfischer: for tomorrow (or never.. feel free to drop it, that's my plan for now).. when I tried your composer fix, the unit tests would run, but I couldn't reindex! I had to composer install elastica, and then the unit tests were broken again. I'm going to stick with my hack for now, alas.
[21:36:16] ebernhardson ROFL
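Note on the `up` helper mentioned at 20:12:38: the log does not show how it is defined, so the following is only a minimal sketch of one way to get the described behaviour (`up 7` acting like `cd ../../../../../../..`). It is written as a shell function rather than an alias, on the assumption that an argument is needed; the `_up_*` variable names and the default of one level are hypothetical choices, not taken from the conversation.

```sh
# Hypothetical reconstruction of an `up` helper; not the definition used in the log.
# Usage: up [levels]   (levels defaults to 1)
up() {
  _up_levels="${1:-1}"
  # Reject anything that is not a non-negative integer.
  case "$_up_levels" in
    ''|*[!0-9]*) echo "usage: up [levels]" >&2; return 1 ;;
  esac
  # Build "../" repeated once per level.
  _up_path=""
  _up_i=0
  while [ "$_up_i" -lt "$_up_levels" ]; do
    _up_path="../$_up_path"
    _up_i=$((_up_i + 1))
  done
  # With zero levels the path is empty, so stay in the current directory.
  cd "${_up_path:-.}" || return 1
}
```

It has to be a function sourced into the interactive shell (for example from `~/.bashrc`), because a standalone script would change directory only in its own process.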