[00:09:31] I like the framing of “negotiate between the tests and the implementation”, I think :)
[00:31:01] I would argue that rankings are not *inherently* useful, only useful or not useful for some specific purpose. For the newsletter, I'm not sure what other ranking would be more useful in that context, because the ranking doesn't seem relevant there, it's more like a random fact (re @mahir256: >34th language by number of Lexemes [00:31:02] /me rattles off about how that's not a useful ranking)
[00:35:18] (it wouldn't change anything if it were 1st or 1000th, or if you counted senses or forms or something else... given that the newsletter goes on to talk about making imperative forms of verbs, perhaps the most relevant ranking would be number of verbs, but I'm not sure it would be any more *useful*)
[00:58:05] I would roughly file it under random fact, but also showing that it is one of the medium-sized languages in Wikidata. But yeah, I won't be making too big a claim of importance for the ranking :D
[01:01:00] oh, sorry, I had to go to bed and then forgot to reply 😅 I was thinking of the latter, e.g. c'h is a letter in Breton (re @Toby: Agreed, so we really need a few functions to satisfy each different aim? Regarding language-specific, are these punctuation char...)
[01:11:31] Would you mind redoing them on Beta, so we can try it out? Then I'll check whether this works as intended, or not. (re @Toby: I made Z12941 to test the new debug feature, and try to make it available to compositions, but when called from a composition, i...)
[01:18:15] Can you link to the version where that happened? (re @Toby: I helped someone I know to submit a Python implementation, but they accidentally left a print statement in it, which returned a ...)
[01:20:32] MediaWiki always applies NFC normalization. But that's not the problem here, right? It's about sorting? (re @Csisc1994: Unicode Normalization.)
[01:22:17] I think emojis are not punctuation, but a similar function for removing emojis would also make sense (re @Toby: I'd appreciate additional eyes across the tests I've just added to Z11193. For example, when you "remove interpunction" would yo...)
[01:25:05] My understanding is that the okina is a letter in Uzbek, but would be regarded as punctuation in an English text? Maybe? I don't know. What about the apostrophe? Or the exclamation mark in some Khoisan languages, where it represents a click sound, I think? (re @Toby: Agreed, so we really need a few functions to satisfy each different aim? Regarding language-specific, are these punctuation char...)
[06:54:38] It seems that the NFC normalization has a problem in sorting Arabic diacritics. (re @vrandecic: MediaWiki always applies NFC normalization. But that's not the problem here, right? It's about sorting?)
[07:08:00] I already raised that two years ago.
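[Editor's note: a minimal Python sketch of the reordering being raised here, not code from the chat; the codepoints are illustrative. NFC sorts adjacent combining marks by canonical combining class, so a shadda (U+0651, ccc 33) typed before a fatha (U+064E, ccc 30) ends up after it.]

```python
import unicodedata

typed = "\u0631\u0651\u064E"         # RA + SHADDA + FATHA (traditional typing order)
nfc = unicodedata.normalize("NFC", typed)

print([hex(ord(c)) for c in typed])  # ['0x631', '0x651', '0x64e']
print([hex(ord(c)) for c in nfc])    # ['0x631', '0x64e', '0x651'] - marks swapped
print(typed == nfc)                  # False, although the two are canonically equivalent
```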
[07:08:04] https://www.unicode.org/reports/tr53/tr53-3.html
[07:23:15] https://phabricator.wikimedia.org/T23429
[12:54:35] I don't think mediawiki is going to stop applying unicode normalisation (that would probably create more problems than it solves), and I doubt unicode would change the behaviour of the normalisation forms either (that would not be backwards compatible)
[12:56:01] so if you want it to work better, I would suggest focusing on getting fonts to render the sequences properly (including support for the combining grapheme joiner for overriding the order), getting input methods to let you insert the combining grapheme joiner, and getting operating systems/software to do what is described here https://www.unicode.org/reports/tr53/#Other for backspacing
[13:18:26] the underlying sequence of codepoints shouldn't matter to users. turning the input into the right sequence of characters is the job of the input method, e.g. the input method could turn the input shadda kasra into U+0650 U+0651 (kasra shadda) and the input kasra shadda into U+0650 U+034F U+0651 (kasra cgj shadda), and rendering them correctly is the job of the font/rendering engine
[13:26:08] the input sequence not matching the codepoints isn't that unusual either. dead keys, like those found on many European keyboard layouts, work by typing the diacritic and then the letter you want to apply it to, but the output will be either a single precomposed character, or the letter followed by a combining diacritic
[15:09:41] easier said than done but yes I would like to improve the way fonts handle this
[15:37:39] *nod* I don't expect it to be quick or easy, just more likely to result in any changes
[15:47:01] I have enough experience with other scripts to know that unicode can implement things in ways that make things unnecessarily difficult, but we're kinda stuck with it >_<
[15:49:49] for mongolian, people keep coming up with other more logical/easier-to-implement encodings, but it hasn't really improved the situation; now there's a bunch of different encodings and you have to be careful which fonts you select
[15:53:52] I'm sorry, those are a draft specification and a long discussion. Could you write a bug report describing what you're trying to do and what prevents you from doing that? (re @Csisc1994: I already raised that two years ago.)
[16:13:26] (the current non-draft version is https://www.unicode.org/reports/tr53/, by the way)
[17:04:10] وَرَّدَ (to import). (re @vrandecic: I'm sorry, those are a draft specification and a long discussion. Could you write a bug of what you're trying to do and what pre...)
[17:05:21] This verb includes an association of two diacritics: Fatha and Shaddah on R, the second letter.
[17:05:42] Here, the Shaddah should be down and the Fatha should be up.
[17:07:08] When storing this as a Wikidata lexeme or as an input of a Wikifunctions function, the Shaddah will be up and the Fatha will be down.
[17:08:23] This affects how diacritized Arabic texts can be processed using Wikifunctions and how Wikidata lexemes can be accurately represented.
[17:08:35] that's a problem with the font you're using, the two should display the same
[17:09:18] Not only the font. The two diacritics are inverted in Python using that Unicode normalization. (re @Nikki: that's a problem with the font you're using, the two should display the same)
[17:09:46] Looks the same to me? Am I missing sth? : https://tools-static.wmflabs.org/bridgebot/f970e4f2/file_57195.jpg
[17:11:10] The Unicode representation.
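[Editor's note: a small Python sketch of the combining grapheme joiner technique Nikki described at 13:18:26, not code from the chat. CGJ (U+034F) has combining class 0, so NFC cannot reorder marks across it; the base letter and marks here are illustrative.]

```python
import unicodedata

beh = "\u0628"                       # an arbitrary Arabic base letter
plain = beh + "\u0651\u0650"         # SHADDA then KASRA: NFC reorders (ccc 33 vs 32)
joined = beh + "\u0651\u034F\u0650"  # SHADDA + CGJ + KASRA: CGJ blocks reordering

print(unicodedata.normalize("NFC", plain) == plain)    # False: marks get swapped
print(unicodedata.normalize("NFC", joined) == joined)  # True: typed order survives
```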
[17:12:45] Sorry, was this meant to be the answer?
[17:13:54] https://tools-static.wmflabs.org/bridgebot/8aae13a0/file_57196.jpg
[17:14:04] https://tools-static.wmflabs.org/bridgebot/dd7d285c/file_57197.jpg
[17:14:47] They look the same because people fixed the fonts. But Unicode still has the issue.
[17:18:09] If we do not fix the problem, we need to apply several lines of code to the input to rearrange the Arabic diacritics before processing it.
[17:20:41] the order of the diacritics doesn't have any meaning. https://www.unicode.org/reports/tr15/#Canon_Compat_Equivalence says "when correctly displayed should always have the same visual appearance and behavior."
[17:21:25] the canonical order they chose isn't the one that would make the most sense, but it's the one they used, and they're very unlikely to change it
[17:22:18] It will be important if we want to generate IPA from text or do pattern and root generation for Arabic lexemes.
[17:22:53] if you support unicode properly, both orders should produce the same output
[17:23:58] Visually the same. Not the same. (re @Nikki: if you support unicode properly, both orders should produce the same output)
[17:24:38] not just visually. they're supposed to behave the same too
[17:25:16] Not really, because the diacritics are Unicode characters.
[17:25:46] It is the same as converting ABCDE to ABDCE.
[17:26:26] it's not, because unicode doesn't consider C and D canonically equivalent
[17:28:29] Is there a way to restore the order of the diacritics after the normalization shifts them?
[17:37:51] That sounds like a viable solution (re @Csisc1994: If we do not fix the problem, we need to apply several lines of code on the input to rearrange the Arabic diacritics before proc...)
[17:38:38] The functions would need to be aware of that, for equality testing on text and for ordering, I assume, but any text can be generated correctly
[17:38:57] It seems we can solve it within Wikifunctions functions
[17:39:29] as for functions in wikifunctions, there should probably be a function which applies the algorithm in https://www.unicode.org/reports/tr53/ to the input, which can be used for things like generating IPA
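[Editor's note: a drastically simplified Python sketch of the kind of fix-up function proposed at 17:39:29, not code from the chat. It only moves a shadda back in front of the vowel signs after NFC; the real Arabic Mark Transient Reordering Algorithm in https://www.unicode.org/reports/tr53/ defines several mark classes and handles many more cases.]

```python
import unicodedata

SHADDA = "\u0651"

def restore_mark_order(text):
    """Within each run of combining marks, move any shadda to the front,
    restoring the traditional typing order that NFC rearranges."""
    out, run = [], []
    for ch in unicodedata.normalize("NFC", text):
        if unicodedata.combining(ch):  # nonzero canonical combining class
            run.append(ch)
        else:
            out += sorted(run, key=lambda m: m != SHADDA) + [ch]
            run = []
    out += sorted(run, key=lambda m: m != SHADDA)
    return "".join(out)

# NFC stores RA + FATHA + SHADDA; the fix-up restores RA + SHADDA + FATHA
word = unicodedata.normalize("NFC", "\u0631\u0651\u064E")
print([hex(ord(c)) for c in word])                      # ['0x631', '0x64e', '0x651']
print([hex(ord(c)) for c in restore_mark_order(word)])  # ['0x631', '0x651', '0x64e']
```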