[15:28:16] isaacj: yeah looking forward to the event! mmm I guess I should also at least try to see how easy it might be to port the code to public sources.
[15:30:17] it does run regexes over all content pages and most templates and modules, on all wikis
[15:35:10] AndyRussG: oh yeah, that's going to be harder with public infrastructure just because there is no trivial way to parallelize those queries, and it is a lot of pages
[15:49:20] ah oki hmmm yeah it's more than 2 or 3
[15:50:51] in theory it could also be done with samples instead of the full data. in general I do want to learn how to use the public infrastructure.
[15:57:37] another issue in this case is that I'm not sure it's worth it to do much more work on this particular metric. really it's mostly a better-than-nothing, what-we-can-get-for-now measurement. any significant engineering effort would probably be better spent on new data collection (essentially, hooking into the same parser process used to fill the wbc_entity_usage table, and storing WD
[15:57:39] requests without consolidating entries or throwing away data points like that table does)
[17:45:26] well if you want to explore, I'd start with PAWS (https://wikitech.wikimedia.org/wiki/PAWS). it has local access to the dumps, which you'd need because the wikitext isn't available in the MariaDB replicas. while PAWS doesn't have a ton of resources, it'd let you mock up the code for doing these analyses and test it on the samples or smaller wikis.
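
[editor's note: the following is a minimal sketch of what the PAWS approach suggested at 17:45:26 might look like, i.e. streaming a pages-articles dump for a small wiki and running a regex over the wikitext of content pages, templates, and modules. The dump path, the example Wikidata-access pattern, and the namespace list are assumptions for illustration, not the original script mentioned in the chat.]

```python
"""Sketch: scan a small wiki's pages-articles dump on PAWS for a wikitext regex."""
import bz2
import re
import xml.etree.ElementTree as ET

# Hypothetical path; public dumps are mounted read-only under /public/dumps on PAWS.
DUMP_PATH = "/public/dumps/public/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2"

# Illustrative pattern only: direct Wikidata access via Wikibase parser functions.
WD_PATTERN = re.compile(r"\{\{\s*#(?:property|statements):", re.IGNORECASE)

# Content (0), Template (10), and Module (828) namespaces, per the chat above.
NAMESPACES = {"0", "10", "828"}

def local(tag):
    """Strip the XML namespace so this works across export schema versions."""
    return tag.rsplit("}", 1)[-1]

matches = 0
pages_seen = 0
with bz2.open(DUMP_PATH, "rb") as f:
    ns = title = None
    text = ""
    for _, elem in ET.iterparse(f, events=("end",)):
        name = local(elem.tag)
        if name == "ns":
            ns = elem.text
        elif name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            pages_seen += 1
            if ns in NAMESPACES and WD_PATTERN.search(text):
                matches += 1
            elem.clear()  # keep memory bounded while streaming the dump

print(f"{matches} of {pages_seen} scanned pages match the pattern")
```

[editor's note: scaling this to all wikis would still hit the parallelization problem raised at 15:35:10; the point of the sketch is only that a single small wiki, or a sample of pages, fits comfortably in a PAWS notebook.]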