[21:02:12] Hi everyone. I'm working on a project to generate some statistics on retention, which requires me to collect monthly edit counts for a lot of users across all wikis. I'm currently running this code on PAWS: https://www.codedump.xyz/sql/ZMLaqQPRdnysoAeQ
[21:02:12] I was wondering a) if there is a smarter way to collect these monthly edit counts. Am I overlooking a pre-aggregated dataset? And b) if someone more familiar with the MediaWiki tables sees an obvious way to make my code more efficient. (It's been running for 11 hours and counting on enwiki; dewiki went in 3-4 hours.)
[21:12:37] effeietsanders: I have a tool with user retention graphs for all wikis, https://retention.toolforge.org/ ; I collected the data from the replicas' revision table in batches
[21:15:21] danilo: Thanks, really cool graphs. Unfortunately, I'm looking at a specific set of users that I want to know retention for across wikis, so this tool won't do the trick for me. But if you see a way to share the pre-aggregated data somehow, that would be great
[21:16:37] (I could generate a list of the usernames or user IDs I'm interested in, and would eventually love a table with one row per user and a column per month, across all wikis)
[21:18:15] As background: I'm trying to understand what retention looks like for projects like Wiki Loves Monuments, and how that changed over time. Looking at your graphs, it's pretty obvious I'll need a benchmark though.
[21:18:21] effeietsanders: I have this data in /data/project/retention/data ; anyone on Toolforge has read access to it
[21:23:05] effeietsanders: each line is in the format "u1111 22344 ..." where 1111 is the user ID, 22 is the year (2022), 3 is the month in hexadecimal, and 44 is the number of edits in that month
[21:24:52] Thanks! That does sound like a great pre-aggregated set!
[21:24:58] Is this also on PAWS, or elsewhere?
[21:25:22] no, only on Toolforge
[21:25:51] ah, that explains why I couldn't find it :) What is the easiest way to download/approach it?
[21:26:54] you need a Toolforge account and ssh to access it
[21:27:43] https://wikitech.wikimedia.org/wiki/Help:Toolforge/Quickstart
[21:28:38] I see
[21:30:09] ok, I'll see if I can do that. I see you're also exposing the CSV files one by one through the web interface. I'm guessing the entire batch is too big to offer as a download?
[21:30:59] (I don't want to create work for you though - this is super helpful already! Just wondering if it's already there somehow :) )
[21:31:29] oh, scrap that, those CSVs are very different. Never mind! :)
[21:39:16] danilo: I'm assuming you don't have existing functions to transform this data into something like a dataframe, right? I'll be able to do it myself, just trying to avoid double work.
[21:40:16] what do you mean by dataframe?
[21:41:00] e.g. pandas
[21:42:35] I mean: the format is pretty specific (although straightforward).
[21:43:44] no, my code only transforms that data into those CSVs; I was planning to make an aggregated dataset for all wikis, but I didn't finish the code to do that
[21:47:22] effeietsanders: I created a link to the data directory in ~/www/static, and that makes the directory available for download: https://tools-static.wmflabs.org/retention/data/
[21:49:07] you're a hero :)
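
A minimal sketch of the dataframe step discussed at 21:39-21:43, assuming the file format is exactly as described at 21:23:05 (a "u" prefix followed by the user ID, then one token per active month made of a two-digit year, a single hexadecimal month digit, and the edit count). The function name and the assumption that the edit count is decimal are mine, not taken from the retention tool:

    import pandas as pd

    def parse_retention_file(path):
        """One per-wiki retention file -> DataFrame with one row per user,
        one column per month (YYYY-MM), values are edit counts."""
        rows = {}
        with open(path) as f:
            for line in f:
                parts = line.split()
                if not parts or not parts[0].startswith("u"):
                    continue
                user_id = int(parts[0][1:])       # "u1111" -> 1111
                counts = {}
                for token in parts[1:]:
                    year = 2000 + int(token[:2])  # "22" -> 2022
                    month = int(token[2], 16)     # hex digit, so "c" -> December
                    edits = int(token[3:])        # assumed to be a decimal count
                    counts[f"{year}-{month:02d}"] = edits
                rows[user_id] = counts
        df = pd.DataFrame.from_dict(rows, orient="index").fillna(0).astype(int)
        df.index.name = "user_id"
        return df.reindex(sorted(df.columns), axis=1)

Restricting the result to a specific participant list (the table described at 21:16:37) is then a df.reindex(list_of_user_ids) away, which keeps missing users as rows of NaN rather than raising.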
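
For the benchmark numbers that still have to come from the replicas directly (21:18:15), here is a rough sketch of one way to aggregate edits per user per month from PAWS. The query behind the codedump link is not reproduced here, so the analytics host naming, the PAWS-provided ~/.my.cnf credentials file, and the use of the revision_userindex view (normally faster than plain revision when filtering on a user list) are assumptions about the usual Wiki Replicas setup, not a copy of the original code:

    import os
    import pandas as pd
    import pymysql

    def monthly_edit_counts(wiki_db, user_ids):
        """Monthly edit counts (columns: user_id, ym, edits) for a list of
        user IDs on one wiki, e.g. wiki_db="dewiki"."""
        conn = pymysql.connect(
            host=f"{wiki_db}.analytics.db.svc.wikimedia.cloud",
            database=f"{wiki_db}_p",
            read_default_file=os.path.expanduser("~/.my.cnf"),
        )
        placeholders = ", ".join(["%s"] * len(user_ids))
        query = f"""
            SELECT actor_user AS user_id,
                   LEFT(rev_timestamp, 6) AS ym,   -- YYYYMM
                   COUNT(*) AS edits
            FROM revision_userindex
            JOIN actor ON rev_actor = actor_id
            WHERE actor_user IN ({placeholders})
            GROUP BY actor_user, ym
        """
        try:
            return pd.read_sql(query, conn, params=list(user_ids))
        finally:
            conn.close()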