[09:31:34] errand+lunch
[11:34:30] * cormacparle waves
[11:34:37] question about redirects in search results
[11:35:07] does a page that's just a redirect get the content of the page it's redirected to in the search index?
[11:35:43] hmm doesn't look like it e.g. https://commons.wikimedia.org/wiki/Category:Roses?action=cirrusDump
[11:36:11] should it get that content though?
[11:57:09] errands, back in 2h
[12:14:22] cormacparle: Category:Roses is not a real redirect sadly, if it was yes Category:Roses and Category:Rosa would be the same search document
[13:24:34] ah
[13:26:59] So what's the difference? One is just a page with a redirect template, where a real redirect has the magic word `#REDIRECT` instead?
[13:40:04] cormacparle: yes, a real redirect is rarely seen, MW will silently forward you to the target page
[13:50:58] \o
[13:51:20] .o/
[13:54:19] o/
[14:22:06] toyed around with spark-nlp yesterday, it can do things like pos-tagging, dependency parsing, etc. for determining question words...but the problem is like all ML it's so fuzzy. There's 99 different models to choose from for each task, probably requires evaluation to decide what to even use. Interesting, but not sure if it will be worthwhile over a very simple heuristic of identifying
[14:22:09] natural language queries
[14:23:31] yes...
[14:45:30] ebernhardson: if we can have a first iteration with the simplest heuristic, that's probably best
[14:52:59] gehel: yea, i'll work up something simple too, just wanted to look into if the other options would be reasonably easy to go with
[14:53:24] quick review if anyone has a couple minutes: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1189512
[14:53:41] amusingly, i found a suggestion that to do multi-lingual pos-tagging, the trick was "translate everything to english. then identify questions"
[14:53:46] assuming we won't do that :P
[14:53:53] dcausse e👀
[14:54:48] dcausse: +1, seems reasonable to me
[14:54:52] thanks!
[14:55:33] +1 from me as well
[14:55:43] thanks!
[15:03:43] inflatador: retro?
[15:04:36] interesting the cirrus-streaming-updater is in ContainerCreating since 2 days...
[15:04:40] in staging
[15:07:41] weird
[15:10:46] Warning FailedMount Pod/flink-app-producer-6fc5f987c9-zl9qd MountVolume.SetUp failed for volume "pod-template-volume" : configmap "pod-template-flink-app-producer" not found
[15:11:04] hmm, sounds like the operator didn't put something in place?
[15:11:18] or does that come from helm release...i dunno :S
[15:11:31] no clue..
[15:15:44] seems flink related so most probably pushed by the operator
[15:40:11] dumbest natural language detector yet (might still be fine): '\b(who|what|where|when|why|how)\b'
[15:42:25] separately, i don't know how we would ever answer some of this. For example google gets the right answer to: ran for Congress in 2010 lost to a woman who had been a public school teacher Michigan
[15:42:26] :)
[15:42:58] but i have to imagine it would be pure luck if we could answer that query...
[15:43:52] well, to be fair i think only the AI-overview got the right answer, the top normal result looks wrong
[15:43:56] yes... perhaps you could craft a sparql query for this but unsure if wikidata has that level of details
[15:52:15] numbers at least for this are looking quite low, about 1.5% of queries in query_clicks_hourly for enwiki and simplewiki match the regex. Still lots of fuzzy though, like "Girls Who Code women in computer science percentage" matches, but is it natural language? kinda?
[15:53:24] i need to spend some time bot filtering first though, lots of that still in here
[16:13:18] lunch, back in ~1h
[17:14:46] fun with stemmers: What talk pages are not for -> what talk page
[17:19:23] stop words are bad sometimes :)
[17:22:36] i'm unsure how to consider duplications...on one hand it does make sense to see queries from multiple users. On the other hand, only 100k out of >3M norm_queries had a question word, distinct norm_query issued a query with a question word over a week. If we require at least 2 identities it drops to 8.5k out of 500k
[17:22:49] (from 1 week of enwiki query_clicks_hourly)
[17:23:12] * ebernhardson fails at english
[17:25:28] and, perhaps unsurprisingly, the query with the most identities over a week normalizes to .... "what sex"
[17:26:08] s/query/norm query/
[17:35:01] not sure what dedup is bringing? esp. if it's relying on very aggressive normalization
[17:35:59] perhaps just to give more importance to some repeated questions, but for filtering yes seems like you lose quite a lot
[17:36:56] i guess the idea is that queries from a single user can be just noise, queries from multiple users are more likely to be actual things
[17:44:14] mgerlach: for analysis of natural language queries, are you looking for a full writeup (like notebook w/ prose, rendered to pdf), or just some baseline stats in a table?
[17:48:13] I think baseline stats in a table + a notebook with some code and explanation for reproducibility in the future could be sufficient. I will prepare a doc with the literature review where we can add a section to provide some context for the numbers. would that work for you?
[17:49:00] mgerlach: yup, that will work
[17:58:23] doh....realizing that query_clicks_hourly doesn't include mobile web :S
[18:10:55] is this because we disable search satisfaction on mobile?
[18:12:54] yea, we don't deliver the code. IIRC was turned off many years ago as "too much code" but in 30s i couldn't find the patch
[18:14:28] so the "Search Metrics, Web" dashboard you rely on other datasets to infer the autocomplete pageviews?
[18:15:53] dinner
[18:16:07] yea that comes from webrequests...i guess i have to do the same here
[18:16:18] (it's just way more tedious to process, due to volume)
[18:17:30] it was disabled here: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/340670
[18:17:54] but i guess we were already skipping minerva at that point?
[18:18:46] looks like i disabled it in 2015 :P https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/248087
[19:16:52] * ebernhardson notes http_status is a string in webreq...wonder if anything crazy is hiding in there
[20:11:55] * ebernhardson would really like to stop offering `Did you mean: exampled` to the query `example`
[20:45:58] inflatador: working on wdqs lvs teardown with brett, so I won't be at pairing. we'll be in https://meet.google.com/csg-dahy-fuz if you want to drop by
[20:55:16] ebernhardson: does `exampled` actually get shown to the user? I tried it on enwiki and didn't get a DYM suggestion. I searched for `exampled` and the DYM suggestion was `example`.
[20:55:42] Trey314159: i guess i searched for `~example` instead, but the ~ shouldn't change anything
[20:55:59] Trey314159: and yes i still get that now (although i guess i probably land requests in codfw, and you land in eqiad)
[20:56:02] * Trey314159 must fight the urge to change every instance of `exampled` on-wiki to `exemplified`...
[20:56:15] huh
[20:56:15] lol
[20:57:04] it also has 400k results, which i thought meant it didn't get a dym (anything > 10k)
[20:57:08] oh, it's because exampled came from glent
[20:57:27] https://en.wikipedia.org/w/index.php?search=~example&title=Special%3ASearch&cirrusDumpResult shows fallback-1 from glent_production returned that
[20:58:10] ryankemper will stop by in a bit
[21:09:19] taking off a little bit early...need to do some chores before the vaccines I got today totally knock me out ;)
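
Note on the question-word heuristic discussed at 15:40:11 and 17:22:36: the following is a minimal Python sketch of that idea, not the code actually used. It applies the regex quoted in the log and then keeps only normalized queries issued by at least two distinct identities. The function names, the `(norm_query, identity)` row shape, and the case-insensitive flag are assumptions made for illustration; the real query_clicks_hourly schema and normalization are not shown in the log.

```python
import re
from collections import defaultdict

# Pattern quoted in the log. The IGNORECASE flag is an assumption added here,
# so that "Who" and "who" are treated the same.
QUESTION_WORD_RE = re.compile(r"\b(who|what|where|when|why|how)\b", re.IGNORECASE)


def has_question_word(query: str) -> bool:
    """Return True if the query contains a standalone question word."""
    return QUESTION_WORD_RE.search(query) is not None


def question_queries_by_identity(rows, min_identities=2):
    """Count distinct identities per normalized query with a question word.

    `rows` is an iterable of (norm_query, identity) pairs; this shape is
    hypothetical. Queries seen from fewer than `min_identities` distinct
    identities are dropped, on the theory that single-user queries are noise.
    """
    identities = defaultdict(set)
    for norm_query, identity in rows:
        if has_question_word(norm_query):
            identities[norm_query].add(identity)
    return {
        query: len(ids)
        for query, ids in identities.items()
        if len(ids) >= min_identities
    }
```

As the log points out, the match is fuzzy: "Girls Who Code women in computer science percentage" passes the check even though it is arguably not a natural language question, while the word boundaries keep substrings such as "whatever" from matching.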