[01:15:50] Join the Adiutor (https://www.mediawiki.org/wiki/Extension:Adiutor) MediaWiki Extension Test!
[01:15:51]
[01:15:53] 🔗 Test Site: https://adiutor.wmcloud.org/
[01:15:54]
[01:15:56] We're calling on our Wikipedia contributors to test the new Adiutor extension, designed to make moderation, triage and maintenance tasks easier. Your feedback is crucial in refining this extension for our community.
[01:15:57]
[01:15:59] *Try out features like:*
[01:16:00]
[01:16:02] Editors can create a speedy deletion request for a page.
[01:16:03] Editors can propose a page for deletion.
[01:16:05] Editors can request page protection.
[01:16:06] Editors can request a page move.
[01:16:08] Editors can tag articles with various maintenance tags.
[01:16:09]
[01:16:11] Share your experience:
[01:16:12] Help us improve Adiutor by providing your insights and suggestions after testing.
[01:16:14]
[01:16:15] Feel free to adjust it further as needed.
[14:09:35] I'm currently building a corpus of sentences for use in usage examples in our lexemes.
[14:09:36] I started with a Swedish source of 600k government documents -> 1 TB of data.
[14:09:38] Additionally, there are the Europarl corpus and the newer Digital Corpus of the European Parliament.
[14:09:39]
[14:09:41] In total this will result in a huge database of sentences, separated by language and analyzed by spaCy, with tokens identified that can then be linked to lexeme forms and used as a source for usage examples.
[14:09:42] I'm currently exploring how to set this up in WMF Cloud using the Trove database; if anyone would like to help, you are very welcome to contact me.
[14:09:44]
[14:09:45] There is a host of use cases for this data. E.g. we might be able to derive Wikifunctions that help generate plural and other forms which are currently missing for e.g. sv and da.
[14:16:49] Hi, in case this is useful: I have been preparing a huge sentence dataset extracted from Wikipedia for 300+ languages. The dataset consists of about 80 million sentences, semi-automatically cleaned and tagged with language codes. https://analytics.wikimedia.org/published/datasets/one-off/santhosh/wikisentences/
[14:16:50] Source code: https://github.com/santhoshtr/wikisentences
[14:16:51] Sometime early next year, I am planning to publish it as a dataset on Hugging Face. (re @dpriskorn: I'm currently building a corpus of sentences for use in usage examples in our lexemes.
[14:16:53] I started with a swedish source of 600k g...)
[14:23:22] This would be nice to include as well, but the sentences from Wikipedia should probably not be used as usage examples on lexemes; they can easily be excluded. (re @sthottingal: Hi, In case this is useful - I have been preparing a huge sentence dataset extracted from Wikipedia for 300+ languages. The data...)
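
For the spaCy step mentioned at 14:09:41, here is a minimal sketch of how sentences could be split and their tokens extracted for later linking to lexeme forms. The model name (sv_core_news_sm) and the record layout are assumptions for illustration, not the pipeline actually being built:

```python
# A minimal sketch, assuming a spaCy Swedish pipeline is installed
# (e.g. `python -m spacy download sv_core_news_sm`); model choice and
# output format are assumptions, not the setup described in the chat.
import spacy

nlp = spacy.load("sv_core_news_sm")  # hypothetical Swedish model

def extract_sentences(text: str, lang: str = "sv"):
    """Split raw text into sentences and collect token lemmas and POS tags,
    so each token can later be matched against lexeme forms."""
    doc = nlp(text)
    for sent in doc.sents:
        tokens = [
            {"form": tok.text, "lemma": tok.lemma_, "pos": tok.pos_}
            for tok in sent
            if not tok.is_punct and not tok.is_space
        ]
        yield {"lang": lang, "sentence": sent.text.strip(), "tokens": tokens}

if __name__ == "__main__":
    sample = "Regeringen föreslår nya regler. Förslaget gäller från 2025."
    for record in extract_sentences(sample):
        print(record["sentence"], [t["form"] for t in record["tokens"]])
```

Records shaped like this could then be stored per language in the planned database (e.g. the Trove-backed setup mentioned above) and queried by lemma or form when looking for usage-example candidates.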