[10:18:07] lunch
[11:06:09] * cormacparle waves
[11:07:01] Hey folks ... working with CommTech on figuring out how to make it easier for editors to find the right templates, and one of the things the community is asking for is category-based template searching
[11:08:41] AFAIK there's no way to search (by category name) the set of all categories that either contain templates, or that recursively contain categories that contain templates
[11:08:55] Am I right in thinking that?
[12:13:45] cormacparle: we have deepcat for this, if the category tree is not too big it'll find pages in a particular category (including pages in the subcategories), e.g. https://en.wikipedia.org/w/index.php?search=template%3Adeepcat%3A%22Country+data+templates+by+type+of+entity%22&title=Special%3ASearch
[12:21:00] but perhaps I misunderstood your question, if it's about understanding if a category has at least one template as part of its "Pages in category" list I'm not sure we have that
[13:00:16] aiui deepcat isn't really designed to be heavily used, is that right?
[13:01:29] and yeah, what I need is to be able to (for example) type "info" and get all categories with "info" in the title that contain templates (or with descendent categories that contain templates)
[13:01:39] so I don't think this is possible
[13:01:48] At least right now
[13:08:33] cormacparle: deepcat will trigger a sparql query to the categories graph so it is quite costly, so highly depends on what you mean by "heavily used". But, if you have to traverse the category tree it'll be somewhat costly anyways
[13:48:29] yeah ... what I was thinking we could do was maybe a data pipeline where we do all the tree traversal in spark, and then write a weighted tag `is_template_category` (or similar) into the search index
[13:48:45] that'd allow us to do what we need I think? would you guys be ok with that?
[13:49:57] we'd also need to write a CirrusSearch feature to use that weighted tag I guess
[13:55:19] cormacparle: sure, please file a task with the high-level details, regarding the "how" not really sure... spark is possible but we would like to avoid big pushes so a streaming strategy would be my preference
[13:56:42] weighted_tags can be populated from mw as well, for instance we have the pageassessment extension that is pushing tags to the article page when its discussion is tagged with some specific template, could be something similar I suppose
[14:00:26] if you have a markup on a category that it contains a template you can then mix that with deepcat, searching on the category namespace categoryhastemplate:yes deepcat:MyCategory should list all the categories under MyCategory that are referenced by a template
[14:05:22] https://phabricator.wikimedia.org/T392245
[14:08:37] I hear you on "avoiding big pushes" ... let me think about how this might get done via MW
[14:08:53] we'd probably need a big push for the initial load though if we were going to do it that way
[14:09:11] > if you have a markup on a category that it contains a template you can then mix that with deepcat, searching on the category namespace categoryhastemplate:yes deepcat:MyCategory should list all the categories under MyCategory that are referenced by a template
[14:09:38] that's not really what we're looking for - I want to be able to search for "template categories" by title
[14:10:15] maybe the ticket above explains it better ...
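The "template categories by title" idea discussed above can be illustrated with a toy sketch. It is not the proposed pipeline or any CirrusSearch code; it only shows, under assumptions, what "contains templates directly or via a descendant category" means on a small in-memory graph. The function names, the `SUBCATS` and `HAS_TEMPLATES` data, and every category name other than "Infobox templates" are hypothetical.

```python
from collections import defaultdict, deque


def template_categories(subcats, has_templates):
    """Return every category holding a template directly or via a descendant.

    subcats: dict mapping a parent category to its child categories.
    has_templates: set of categories that directly contain a template.
    """
    # Invert the edges so the walk goes from child to parent.
    parents = defaultdict(set)
    for parent, children in subcats.items():
        for child in children:
            parents[child].add(parent)

    tagged = set(has_templates)
    queue = deque(has_templates)
    while queue:
        cat = queue.popleft()
        for parent in parents[cat]:
            # Skipping already-tagged categories also terminates loops.
            if parent not in tagged:
                tagged.add(parent)
                queue.append(parent)
    return tagged


def search_template_categories(query, subcats, has_templates):
    """Case-insensitive title match restricted to template categories."""
    needle = query.lower()
    tagged = template_categories(subcats, has_templates)
    return sorted(cat for cat in tagged if needle in cat.lower())


# Hypothetical toy data, including a deliberate loop in the graph.
SUBCATS = {
    "Infobox templates": ["Religion infobox templates", "Weather infobox templates"],
    "Religion infobox templates": ["Buddhism infobox templates"],
    "Buddhism infobox templates": ["Infobox templates"],
    "Information science": ["Library science"],
}
HAS_TEMPLATES = {"Buddhism infobox templates", "Weather infobox templates"}

print(search_template_categories("info", SUBCATS, HAS_TEMPLATES))
```

On this toy data "Information science" matches the title query but is not returned, because neither it nor any descendant holds a template. On a real wiki the same propagation could tag a very large share of the category graph.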
[14:14:04] \o
[14:15:21] o/
[14:17:22] cormacparle: beware that the category is not really a tree but a graph and I would not be surprised that denormalizing the graph from a spark job and adding a tag to all categories that have a relationship with a category that's used to categorize templates might end up tagging a lot of categories
[14:17:55] ye, category graph is more accurate, and it's a graph with loops
[14:18:42] so anything that involves flattening this graph at index time might yield non-obvious results imo
[14:21:38] i see the very bottom of the ticket recognizes that fact though :)
[14:21:44] (category graph with loops)
[14:22:50] i wonder how expensive it would be to extend deepcat, we could attach the bit about containing templates to the graph, the depth of 5 limit may be changeable based on how much the template filter changes things
[14:23:27] 5 is somewhat arbitrary, i think we ran a sampling of queries through it and chose a value with decent response times?
[14:23:39] I'm not even sure that flattening and removing the cycles from the graph is possible tbh, and the signal/noise ratio might be poor ... but just wanted to check it's not a shit idea before starting to play around with it :)
[14:24:02] getting the info into rdf might at least make it easier to explore possibilities
[14:24:13] (but then someone has to write SPARQL :P)
[14:27:48] indeed if the RDF category had that bit it could easily filter on it I'm sure
[14:28:13] i suspect it wouldn't be that hard to update the rdf export?
[14:28:53] i suppose it depends how complex it is, i guess thats an extra db query per-category, maybe some batching
[14:30:11] there are two dumps for that, the full dump and the incremental one, it's been a while since I looked at them, from memory it's not the most obvious thing but could be doable I suppose?
[14:30:26] the incremental one might be trickier
[14:30:38] i guess i'd have to poke at what the code there is, i'm not sure what it looks like at all, and i forgot there was the incremental one
[14:31:00] would have to detect when the last template leaves the category to remove it
[14:31:19] how often do we load the full dump? Or is it typically incrementals?
[14:31:40] I think sadly never except for the first run
[14:31:50] ahh, then yea that does add some complexity
[14:32:23] realizing, i don't even know where that code lives :P
[14:32:38] but we might probably want to change that, very likely that the graph has drifted over time quite a bit
[14:33:24] i guess it's in core, in maintenance/categoryChangesAsRdf.php
[14:33:30] yes
[14:33:44] and DumpCategoriesAsRdf for the big one
[14:34:11] iirc it has an attribute with the number of articles, could perhaps add another attribute with the number of articles in NS_TEMPLATE
[14:35:22] this sounds promising! it'd definitely need the limits on deepcat to be expanded though I think
[14:35:30] so say for example this works https://en.wikipedia.org/w/index.php?search=deepcat%3AInfobox_templates+weather&title=Special%3ASearch&profile=advanced&fulltext=1&ns14=1
[14:35:55] the limits are changeable, we've pondered adding an option to the keyword for users to specify the depth, but didn't have much of a use case
[14:36:52] even the secondary limit of how many categories to apply is changeable, i've done tests with 100k categories instead of the 500 limit (we can't actually support 100k though :P)
[14:37:22] what worries me is that this a completion search box no?
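The two deepcat limits mentioned above (a depth of 5 and a cap of 500 categories) can be pictured with a minimal sketch of a bounded walk over a category graph that contains loops. This is an illustration only: per the discussion, deepcat itself issues a SPARQL query against the categories graph, and what it does when a limit is hit is not covered here. The function name, the `truncated` flag, and the toy graph `G` are assumptions.

```python
from collections import deque


def bounded_descendants(root, subcats, max_depth=5, max_categories=500):
    """Breadth-first walk from root, bounded by depth and category count.

    Returns (categories, truncated). truncated is True when the category
    cap stopped the walk before all reachable categories were collected.
    """
    seen = {root}
    order = [root]
    queue = deque([(root, 0)])
    while queue:
        cat, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand below the depth limit
        for child in subcats.get(cat, ()):
            if child in seen:
                continue  # already visited, which also breaks loops
            if len(order) >= max_categories:
                return order, True
            seen.add(child)
            order.append(child)
            queue.append((child, depth + 1))
    return order, False


# Hypothetical toy graph: D points back at A, forming a loop.
G = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": ["A"]}

print(bounded_descendants("A", G))
print(bounded_descendants("A", G, max_depth=1))
print(bounded_descendants("A", G, max_categories=3))
```

A category that sits only two levels below the root can still be missed if the cap is reached on broader levels first, which matches the failure mode described for the 500 limit.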
[14:37:36] oh, indeed sparql is not going to be that fast
[14:38:12] yeah the 500 limit means this doesn't work https://en.wikipedia.org/w/index.php?search=deepcat%3AInfobox_templates+buddha&title=Special%3ASearch&profile=advanced&fulltext=1&ns14=1 even though the buddha infobox category is only 2 levels down from the root infobox category
[14:38:22] > what worries me is that this a completion search box no?
[14:38:24] yes
[14:38:44] but what is completed is the template name not the category name?
[14:39:34] no - I want to complete the category name, and then also run a query for templates in the category the user is looking at right now
[14:39:46] (which is just a regular `incategory:` query)
[14:40:17] so indeed we currently include nodes for the # of articles, and the # of subcategories of a category. But those are currently cheap because they come from the category table directly
[14:40:30] :/
[14:40:55] back in 20
[14:41:21] cormacparle: if you use then use incategory: then you must select the category that has template in them not parent categories
[14:41:40] s/use then//
[14:42:54] I'm not clear where the category graph is interesting in this
[14:44:04] ok what I'm imagining is kinda hard to explain without a whiteboard :/
[14:44:23] let me draw some pictures on a piece of paper ...
[14:52:25] holy shit ... `Category:Infobox_templates` contains ~2k templates and 18 subcategories
[14:53:10] :)
[14:53:24] I don't think the way I was imagining this is gonna work with that many templates in a single category
[14:54:08] let me mull it over a bit more, will be back to you next week if I come up with something
[14:54:29] cormacparle: sure!
[15:00:01] back
[15:03:41] * ebernhardson idly wonders if there are any ML-style features developed by reasearch that we might be able to pull into mjolnir
[15:03:59] * ebernhardson is never going to learn how to spell...
[15:28:38] no clue... and not even sure where to look at other than https://gitlab.wikimedia.org/repos/research/
[15:29:19] seeing simple-summaries, wondering if it might be interesting as a replacement for opening_text
[15:29:37] yea, i partly wondered after seeing the bit about the new dumps shared with kiwix
[15:29:48] maybe not kiwix...lemme find that
[15:31:00] opening_text often contains uninteresting text, like not enough citation warning, disamb info...
[15:31:04] it was the enterprise beta release of structured data in french and english: https://enterprise.wikimedia.com/blog/kaggle-dataset/
[15:31:30] "includes high-utility elements such as abstracts, short descriptions, infobox-style key-value data, image links, and clearly segmented article sections "
[15:31:52] interesting
[15:34:16] i've sometimes wondered if we should be separating sections in the text field, if only to prevent phrase matching across boundaries
[15:34:44] but thats probably rarely a problem, i guess snippeting across the boundary might be more awkward
[15:35:46] I thought we did?
[15:36:04] perhaps mixing with something else
[15:36:35] not really, text is just a giant long string. I was thinking if it was ["section 1 ...", "section 2 ...", ...] then it wouldn't snippet across sections due to the position increment
[15:37:05] yes my bad, it's auxiliary_text that we split not text :/
[15:38:28] yes... it's kind of ugly to remove the section title and fold everything as giant string :(
[15:39:25] indeed, i suspect thats part of why when we've suggested to people that they can use cirrus dumps as structured mediawiki data it never really fit use cases other than search
[15:39:35] (and more, but the general unstructuredness of parts)
[15:39:43] yes
[15:57:26] heading out, have a nice long week-end
[18:18:20] well, the mjolnir tests pass..that must mean it's going to work when i run it in the prod cluster...right? :P
[18:19:02] * ebernhardson notes that we don't even have tests for the direct cli argument handlers, other than the fixtures for cli --help output on each one
[18:20:05] but it means i can build the conda env in CI and start trying to run the commands from the airflow fixtures (modified to shrink datasets and write to different db) and see how it actually works
[18:30:14] * ebernhardson wonders what i did differently...gabriele's test-env build has the branch name in the package version, mine says "-work"
[18:31:56] oh, it's because i name my branches work/ebernhardson/, which is suggested in some wikitech doc iirc. It put the full thing in there which causes it to make a path of the branch name...wonder if i should just use normal-ish branch names
[21:25:14] * ebernhardson realizes, after accidentally doing a test run with the old venv, that because i only changed code that runs on the driver i didn't actually need to build and ship a custom venv to hadoop...
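On the earlier point about splitting the text field into sections: the position-increment effect can be shown with a toy positional index. This is a simplified sketch, not CirrusSearch or Elasticsearch code. The tokenizer, the function names, and the gap value of 100 are assumptions chosen only to make the behaviour visible.

```python
POSITION_GAP = 100  # hypothetical gap inserted between sections


def index_positions(sections, gap=POSITION_GAP):
    """Return (term, position) pairs, leaving a position gap between sections."""
    postings = []
    pos = 0
    for i, section in enumerate(sections):
        if i > 0:
            pos += gap  # jump ahead so adjacent sections are never contiguous
        for term in section.lower().split():
            postings.append((term, pos))
            pos += 1
    return postings


def phrase_matches(postings, phrase):
    """True if the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms:
        return False
    positions = {}
    for term, pos in postings:
        positions.setdefault(term, set()).add(pos)
    return any(
        all(start + offset in positions.get(term, set())
            for offset, term in enumerate(terms))
        for start in positions.get(terms[0], set())
    )


one_string = index_positions(["the river bank account opened today"])
per_section = index_positions(["the river bank", "account opened today"])

print(phrase_matches(one_string, "bank account"))   # phrase spans the boundary
print(phrase_matches(per_section, "bank account"))  # gap keeps the terms apart
print(phrase_matches(per_section, "river bank"))    # within one section
```

With the text folded into one string, "bank account" matches across what used to be a section boundary. With one entry per section, the gap keeps the two terms from being adjacent, while phrases inside a single section still match.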