[12:00:26] Hello, I'm currently working on a project to analyze and identify vandalism and content-integrity violations on frwiki, and to do this I'm trying to set up a Spark cluster with the revision history as a data source. What I currently do is download the latest XML dumps and convert them to Parquet files. This process is very memory and
[12:00:27] bandwidth intensive, so I was wondering if maybe one of you here has already worked on this, or maybe the WMF has a Parquet dataset I haven't heard of?
[12:00:33] I have contacted the analytics team on #wikimedia-analytics, and it seems that this resource is accessible through their data lake. Sadly it is not publicly available, so I was wondering if one of the people working on content integrity would be interested in this and could help me request shell access, as this access requires being in touch with
[12:00:34] someone collaborating with the WMF.
[17:59:21] hey ywats0ns
[18:08:57] Hello dsaez_
[18:16:31] regarding your question, I'm working on this: https://phabricator.wikimedia.org/T314384
[18:17:12] to the best of my knowledge there is no such dataset
[18:17:27] and moving large Parquet files could be difficult...
[18:17:54] but depending on which kind / size of data you are looking for, maybe some of the data we have prepared could be useful for you
[18:29:23] I'm looking first to find users who inserted certain URLs, and then I'd like to try to do some analysis on reverted revisions to improve our edit filters
[20:30:37] ywats0ns: I use the history dumps available on Toolforge, in the /public/dumps/public directory; in my Python script I use the bz2 module, so the script reads and processes the dumps without needing to pre-download or pre-decompress them
[20:34:55] I used it to create this graph https://commons.wikimedia.org/wiki/File:Ptwiki_references_in_articles.png for ptwiki; the script took 7 hours to read all the ptwiki history dumps
[20:40:51] only 7h to read all of the ptwiki revision history and analyze the references?
[20:47:55] yes, at first I ran some tests with part of the dump; once I had fixed the bugs I ran it on the whole dump and came back the next day to see the result
[20:53:21] That's impressive
[20:53:50] I'll try this in the next few days then
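
A minimal sketch of the XML-dump-to-Parquet conversion step mentioned at [12:00:26], assuming the mwxml and pyarrow libraries; the file names, column set, and batch size are illustrative, not the setup actually used in the chat. Writing in fixed-size batches keeps memory bounded instead of materializing the whole history at once:

```python
# Hypothetical sketch: stream an XML history dump into a Parquet file.
# DUMP/OUT paths, the column set, and the batch size are all assumptions.
import bz2

import mwxml
import pyarrow as pa
import pyarrow.parquet as pq

DUMP = "frwiki-latest-pages-meta-history1.xml.bz2"  # hypothetical shard name
OUT = "frwiki-history.parquet"

schema = pa.schema([
    ("page_id", pa.int64()),
    ("page_title", pa.string()),
    ("rev_id", pa.int64()),
    ("timestamp", pa.string()),
    ("user", pa.string()),
    ("sha1", pa.string()),
    ("text", pa.string()),
])

def revision_rows(dump):
    """Yield one flat dict per revision, streaming straight off the dump."""
    for page in dump:
        for rev in page:
            yield {
                "page_id": page.id,
                "page_title": page.title,
                "rev_id": rev.id,
                "timestamp": str(rev.timestamp),
                "user": rev.user.text if rev.user else None,
                "sha1": rev.sha1,
                "text": rev.text or "",
            }

# Write in fixed-size batches so memory use stays bounded instead of
# holding the whole revision history in RAM at once.
with bz2.open(DUMP, "rb") as f, pq.ParquetWriter(OUT, schema) as writer:
    batch = []
    for row in revision_rows(mwxml.Dump.from_file(f)):
        batch.append(row)
        if len(batch) >= 10_000:
            writer.write_table(pa.Table.from_pylist(batch, schema=schema))
            batch = []
    if batch:
        writer.write_table(pa.Table.from_pylist(batch, schema=schema))
```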
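
A sketch of the streaming approach described at [20:30:37], combined with the URL-insertion question from [18:29:23]: read the compressed dump directly with the bz2 module (no pre-download or pre-decompression) and report which user first added a matching URL to each page. Only the /public/dumps/public base path comes from the chat; the dump shard, target URLs, and output format are hypothetical:

```python
# Hypothetical sketch: scan a compressed history dump on Toolforge and
# print the user who first inserted each URL matching a target list.
import bz2
import re

import mwxml

# Hypothetical shard; only the /public/dumps/public base path is from the chat.
DUMP = "/public/dumps/public/frwiki/latest/frwiki-latest-pages-meta-history1.xml.bz2"
TARGET_URLS = ("example.com", "spam-site.example")  # hypothetical targets

url_re = re.compile(r"https?://[^\s\]|<>]+")

with bz2.open(DUMP, "rb") as f:
    dump = mwxml.Dump.from_file(f)
    for page in dump:
        seen = set()  # URLs already present in earlier revisions of this page
        for rev in page:  # revisions of a page appear in order in the dump
            urls = set(url_re.findall(rev.text or ""))
            # URLs appearing for the first time were inserted by this revision.
            for url in urls - seen:
                if any(t in url for t in TARGET_URLS):
                    user = rev.user.text if rev.user else "<deleted>"
                    print(f"{user}\t{rev.id}\t{page.title}\t{url}")
            seen |= urls
```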
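
One possible follow-up for the reverted-revisions analysis mentioned at [18:29:23], once the history is in Parquet: flag identity reverts by matching SHA-1 checksums with PySpark, since a revision whose content hash reappears later in the same page was restored by a revert. The column names follow the conversion sketch above and are assumptions, not a WMF-provided schema; this simple pass only pairs identical content states:

```python
# Hypothetical sketch: find identity reverts via sha1 self-join on the
# converted Parquet dataset. Path and column names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("frwiki-reverts").getOrCreate()

revs = spark.read.parquet("frwiki-history.parquet")  # hypothetical path

# A later revision with the same sha1 as an earlier one on the same page
# restores that earlier content, so everything in between was reverted.
a = revs.select("page_id", F.col("rev_id").alias("restored_rev"), "sha1")
b = revs.select("page_id", F.col("rev_id").alias("reverting_rev"), "sha1")
identity_reverts = (
    a.join(b, ["page_id", "sha1"])
     .where(F.col("reverting_rev") > F.col("restored_rev"))
)
identity_reverts.show(20, truncate=False)
```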