[08:36:59] Nikki, can your bot update data more frequently?
[08:37:11] Lexical coverage data
[10:26:40] no, it uses the lexeme dumps, and those are only produced weekly (re @cvictorovich: Nikki, can your bot update data more frequently?)
[11:11:38] Hell! I would like rapid updates, so I could see what's still missing
[12:08:03] it lists 1000 things, which should be plenty for a single week :P if you've skipped a lot of them because they're not the right language or shouldn't have forms, you can add them to the "Filter" subpage for the language (see the description at the top of https://www.wikidata.org/wiki/Wikidata:Lexicographical_coverage)
[12:08:24] if someone has made a big update to the Filter page, I can manually re-run the bot to replace the words that have been filtered out with new words, but it won't remove words for newly created lexemes until after the next dump
[12:09:08] there's also https://ordia.toolforge.org/text-to-lexemes, where you can paste any text you want
[13:44:16] Lots of items aren't usable
[13:44:35] They should be English words
[13:51:25] It would be even better to support regex in your engine: many things that should be filtered out follow certain patterns
[14:00:21] I added a filter, it should be cleaner tomorrow ;) (re @cvictorovich: Lots of items aren't usable)
[14:05:35] Enumerating isn't an effective way (re @Nicolas: I added a filter, it should be cleaner tomorrow ;))
[14:21:59] well, not efficient, but effective (re @cvictorovich: Enumerating isn't an effective way)
[14:30:26] IMHO regex is an even more powerful and efficient way
[14:33:30] I did use a regex to build the filter ;) (re @cvictorovich: IMHO regex is an even more powerful and efficient way)
[14:39:15] and even if some could be filtered out automatically, that's not the case for all of them, and it's still useful to have an explicit exclusion list to check that there are no false positives
[14:41:01] out of curiosity, what regex would you use?
(re @cvictorovich: IMHO regex is an even more powerful and efficient way)
[14:52:34] ’ (re @Nicolas: out of curiosity, what regex would you use?)
[14:52:50] I'll include this in the regex
[14:53:24] French words with this are compound words
[15:00:54] that's what I did, but:
[15:00:54] - first, there is both ’ and '
[15:00:56] - plus some words in French do have this character and are compounds but also lexemes (presqu'île, bouton-d'or, and a lot of names, patronyms and toponyms)
[15:00:57] thus, we can't just rely on an automatic regex (re @cvictorovich: ’)
[15:01:35] and since the tool can't do regex, the filter is the best solution available
[15:02:48] But enumeration cannot be complete
[15:03:27] indeed (re @cvictorovich: But enumeration cannot be complete)
[15:04:02] but does it need to be?
[15:04:43] Probably (re @Nicolas: but does it need to be?)
[15:06:15] but how can it ever be complete? lexicographers have worked on dictionaries for almost a millennium and none has ever been complete,
[15:06:15] my philosophy is: let's focus on the doable rather than the perfect (re @cvictorovich: Probably)
[15:21:39] Those words with superscripts can be excluded
[15:21:50] For example, 1er
[15:56:51] yes, these probably can, but there is only a limited number of them (and the exclusion filter is probably almost complete for them) (re @cvictorovich: Those words with superscripts can be excluded)
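The approach debated above (regexes for common unwanted patterns, plus an explicit exception list for genuine lexemes the regexes would wrongly catch) can be sketched roughly as follows. This is a hypothetical illustration, not the bot's or Nicolas's actual filter; the names `APOSTROPHE`, `DIGIT`, `EXCEPTIONS`, and `keep` are invented for the example.

```python
import re

# Matches both the curly (’) and straight (') apostrophe,
# covering compounds like "l'arbre" and "d'abord".
APOSTROPHE = re.compile(r"['\u2019]")

# Matches digit-bearing tokens such as the ordinal "1er".
DIGIT = re.compile(r"\d")

# Hand-maintained exceptions: genuine French lexemes that contain
# an apostrophe or hyphen, mirroring the role of the Filter subpage.
EXCEPTIONS = {"presqu'île", "bouton-d'or"}

def keep(word: str) -> bool:
    """Return True if the word should stay in the coverage list."""
    if word in EXCEPTIONS:
        return True
    return not (APOSTROPHE.search(word) or DIGIT.search(word))

candidates = ["l'arbre", "presqu'île", "maison", "1er"]
print([w for w in candidates if keep(w)])  # ["presqu'île", 'maison']
```

This reflects the point made at 15:00: a regex alone over-filters, so an explicit exclusion set is still needed to avoid false positives.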