[21:39:13] Krinkle: did you see my work yesterday on https://phabricator.wikimedia.org/T292552 ? It turns out to be not as straightforward as we thought
[21:41:22] TimStarling: I noticed the wmf-config patch, but I see that has been abandoned since. Hadn't seen the latest yet.
[21:41:54] changing mbUpperChar LGTM, assuming only used by ucfirst, e.g. not by anything uc() more generally. Codesearch agrees. May be worth clarifying in function doc as possible future trap.
[21:43:12] Did you end up running the script to migrate existing titles to not rely on the ucfirst map we have today?
[21:45:04] * Krinkle notices "Obligatory redundant license notice. Exception to the GPL's … clause hereby granted"
[21:46:15] ah, I see. this isn't about the next step, this is about removing the php72 emulation, and we don't want what native does as-is.
[21:47:40] right so we're changing the target, before running the script. Hm... I guess you'd run the script with the undeployed map since we can't deploy the new map now. Or possibly the new map using php74's TITLE_CASE as base, can be deployed if it is done atomically with the core changes in both branches.
[21:49:24] apart from the entries in the proposed smaller ucfirst map, are there any other differences between php72 and php74-with-TITLE_CASE? It's not obvious whether there are. If there are none, and we're basically changing nothing from user POV, then that seems uncontroversial.
[21:54:21] I can update the rollout plan on the task
[21:54:59] basically 1. https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/842242/
[21:55:07] 2. notify users
[21:55:44] 3. https://gerrit.wikimedia.org/r/c/mediawiki/core/+/842028/ and parent
[21:56:07] 4. script execution
[21:56:23] 5. https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/842243
[21:57:04] not sure how 2. notify users will work, may need a new script to be written for that
[21:58:06] the differences between php72 and php74-with-TITLE_CASE are https://phabricator.wikimedia.org/P35451
[22:01:00] the affected characters are ligatures, greek diacritics (which are semantically ligatures), circled and roman numeral forms, some new latin diacritics, and the new medefaidrin block
[22:02:03] the resulting page move list is https://paste.tstarling.com/p/oWcAGs.html
[22:02:33] this is excluding eszett
[22:08:34] note that some of the ligature changes are reverts of T219279, for example we had Dz and dz becoming DZ but now dz will become Dz
[22:08:35] T219279: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279
[22:11:23] List looks fairly uncontroversial to me, but yeah, I guess Tech News notice would be appropriate
[22:12:55] I'm trying to reproduce the maps to validate my understanding of how they're made and effectively CR them that way
[22:14:13] I was thinking of making a meta page with the conflicting page moves (i.e. the normalized move target already exists), because there aren't many of them and they ideally need human judgement
[22:14:45] yeah
[22:14:57] but before I start talking to the community, I need in-principle approval on the new case mapping
[22:16:45] to make the title case maps you will need https://gerrit.wikimedia.org/r/c/mediawiki/core/+/842030
[22:17:59] run generateUpperCharTable.php to produce JSON files for PHP 7.2, PHP 7.4 and PHP 7.4 title case
[22:18:34] then use generateUcfirstOverrides.php to produce diffs between the JSON files
[22:24:13] Error: You are using an unsupported PHP version (PHP 7.2.34).
[22:24:13] :)
[22:24:38] Probably best check out an older head instead of working around given how quickly things have changed
[22:25:13] heh, yeah I actually uninstalled PHP 7.2 before I realised I would need that map file, so I just used the one I generated previously which was on mwmaint1002
[22:26:51] mwmaint1002.eqiad.wmnet:/home/tstarling/T292552/uc7.2
[22:32:44] ack got it. I got `master@{3 weeks ago}` but my source of php72 came without php-intl, and then the image in question was too old to fetch apt-get without security errors, I'll just take this for now.
[22:36:53] result of `php7.4 maintenance/language/generateUcfirstOverrides.php --override php72-upper.json --with php74-upper-titlecase.json --outfile ucfirst-php72-to-php74title.php`
[22:37:23] is entirely different from https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/842242/
[22:37:41] wrong way around at least
[22:37:47] but also not using multiple --override files as input
[22:39:46] yes, that change preserves PHP 7.2 case mappings so you have to use the 7.2 file as the --with argument
[22:40:06] `/usr/local/opt/php@7.4/bin/php maintenance/language/generateUcfirstOverrides.php --override php74-upper.json --override php74-upper-titlecase.json --with php72-upper.json --outfile ucfirst-php74title-to-php72.php`
[22:40:36] This contains (after eszett) `+ 'ʼn' => 'ʼN',` which looks right
[22:40:46] but your patch doesn't contain that
[22:41:16] or rather, it's not right, that's what we want to do long term, not what the map should undo in the interim.
[22:48:02] I used your uc7.2 file as php72-upper.json and with https://gerrit.wikimedia.org/r/c/mediawiki/core/+/842030/1 checked out, generated php74-upper.json and php74-upper-titlecase.json, and ran generateUcfirstOverrides after that.
[22:49:42] what I have is essentially https://phabricator.wikimedia.org/P35451 plus eszett.
[22:50:11] you are saying that PHP 7.2 maps ʼn to ʼN?
[22:50:20] I'm not getting why it's going toward ʼN
[22:52:29] what are the md5sums of the three JSON files?
[22:54:58] $ md5sum php*
[22:54:58] d7c641bd2f8cd938fd278cc0b38bbab5 php72-upper.json
[22:54:58] 4c03ab06f50a2f318395389f398acea9 php74-upper-titlecase.json
[22:54:58] 207a0aca01365161b9eb1e264a291cbb php74-upper.json
[22:57:32] php > $map = json_decode(file_get_contents('./php72-upper.json'));
[22:57:32] php > echo "\u{02BC}n";
[22:57:32] ʼn
[22:57:32] php > var_dump(@$map["\u{02BC}n"]);
[22:57:32] NULL
[22:57:32] php > var_dump(@$map["\u{02BC}N"]);
[22:57:32] NULL
[23:00:58] php > $x = '\U+00C5\U+0089';
[23:00:59] php > echo $x;
[23:00:59] ʼn
[23:00:59] php > var_dump(@$map[$x]);
[23:00:59] NULL
[23:01:41] there's probably something changing its encoding so this way of checking is probably not gonna work. copying characters is futile.
[23:02:10] there was a problem with my files which I'm fixing, I'll see if that reproduces your results
[23:02:47] ok. wasn't expecting to find an actual bug. let's see.
[23:07:26] so some of my PHP 7.4 files were actually generated on PHP 8.0, and I thought that I had verified that that didn't make a difference
[23:07:58] there are some differences, so I will fix that, but not in the 'ʼn' => 'ʼn' line
[23:11:52] my second line is the single character U+0149 not a composing sequence
[23:12:24] TimStarling: https://phabricator.wikimedia.org/P35484
[23:13:12] whatever version I got a hold of to put in my editor, exists in my 74 files but not 72
[23:13:41] did I mention I use php74 on macos from homebrew?
[23:15:22] probably means I'm using a newer icu that's different from what debian or wmf compiled into php74, but still it's getting a similar entry in the 74 map.
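(Editor's sketch.) Because copying glyphs between terminal, editor and paste proved unreliable here, one way to tell the single character U+0149 apart from the two-character sequence U+02BC followed by `n` is to print code points and bytes instead of the characters themselves. The bytes C5 89 that show up in the REPL attempt above are the UTF-8 encoding of U+0149. The Python below is a minimal illustration; the helper names `code_points` and `php_escape` are hypothetical, and `php_escape` only mimics PHP's `"\u{...}"` double-quoted string syntax for display.

```python
import unicodedata


def code_points(s):
    """List the code points of s in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]


def php_escape(s):
    """Render s using PHP-style \\u{XXXX} escapes (for display only)."""
    return "".join(f"\\u{{{ord(c):04X}}}" for c in s)


precomposed = "\u0149"   # one code point
sequence = "\u02bcn"     # two code points that render almost identically

for label, s in (("precomposed", precomposed), ("sequence", sequence)):
    names = [unicodedata.name(c, "<unnamed>") for c in s]
    print(label, code_points(s), s.encode("utf-8").hex(), names)

# The two strings are different keys as far as any lookup table is concerned.
print(precomposed == sequence)
print(php_escape(precomposed), php_escape(sequence))
```

Looking up a map entry by an escaped key built this way avoids depending on how a terminal or editor normalizes pasted text.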
[23:15:33] I'm guessing it's encoded slightly differently for some reason
[23:16:09] that paste looks fine, at a glance
[23:17:19] when I run the same in mediawiki-docker php74 with wmf's packages, I get the same
[23:17:39] so I guess that rules out most of the macOS/homebrew ICU difference
[23:17:58] the part that's weird is that the char appears missing from the php74 map
[23:18:16] 72* map
[23:18:51] + 'ʼn' => 'ʼN',
[23:18:54] same output as before
[23:19:18] but is that a composing sequence or a single codepoint
[23:20:21] ok, whatever it was, it's gone now. turns out, I had two slightly differently named files and I was still running `git diff` in wmf-config over the old file
[23:20:37] so everything I did wasn't making a difference because I wasn't using the new file
[23:21:12] let me show you the diff I have now
[23:22:33] https://phabricator.wikimedia.org/P35484#147188
[23:22:37] the 'N thing is no longer there
[23:22:54] it's leaving it the way your gerrit patch does
[23:23:02] there's a few other diffs. do those match what you got?
[23:23:14] Or upload yours and I'll re-diff
[23:24:32] it would be easier to debug this if the PHP files used \u escape sequences
[23:30:20] aye. I think for a Toolforge thing I ended up abusing json_encode for this
[23:30:33] and then re-formatting it from its \u form
[23:30:47] php > echo "\u{1F418}";
[23:30:47] 🐘
[23:30:48] php > echo mb_ord("\u{1F418}", 'UTF-8');
[23:30:48] 128024
[23:31:27] php > echo base_convert(128024, 10, 16);
[23:31:27] 1f418
[23:31:44] I'm not sure what that diff is showing, why does it have so many weird mappings on the RHS?
[23:32:17] like 'ᾀ' => 'ἈΙ', that is a PHP 7.4 uppercase mapping that we want to avoid
[23:35:54] https://en.wikipedia.org/wiki/ᾀ
[23:36:03] (Redirected from ᾈ)
[23:36:13] appears to be php72 status quo
[23:36:29] so going back to that I guess is expected right now?
[23:38:24] 'ᾀ' => 'ᾈ'
[23:38:30] this is what we have in prod today
[23:38:40] ah it's a little different
[23:38:59] right, in my output it's uppercasing the second half of that as well
[23:39:03] not sure why
[23:40:08] ok, updated paste at https://phabricator.wikimedia.org/P35484#147188
[23:40:22] I made another typo somewhere between two files
[23:40:22] right, in later unicode the iota subscript is lowercase, so you can't keep the subscript if you uppercase the whole thing
[23:44:43] I updated https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/842242 to fix the PHP 8.0 stuff that crept in
[23:45:16] so now it only adds 12 ligatures
[23:45:20] $ diff -u wmf-config/UcfirstOverrides.php ../../mediawiki/ucfirst-php74title-to-php72.php
[23:45:20] 0
[23:45:21] LGTM
[23:45:45] (that zero is my bash PS1)
[23:46:40] ok, so having reviewed and understood the source of the map somewhat, what do we need for "in-principle approval", or is both of us agreeing sufficient, you think?
[23:47:06] noting that announcing the links on a wiki page and giving time for checking will happen first ofc
[23:49:39] if you are approving then that is good enough for me
[23:51:44] OK :)
[23:52:12] TimStarling: I'll land the maint change now yeah? - https://gerrit.wikimedia.org/r/c/mediawiki/core/+/842030/1
[23:52:55] yes please
[23:53:38] also a +1 on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/842242 would be nice
[23:58:22] then I'll update the task description with the new plan, and start talking to the community about page moves and user renames
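(Editor's sketch.) The contrast discussed in this log, 'ᾀ' => 'ἈΙ' versus 'ᾀ' => 'ᾈ' and dz becoming DZ versus Dz, is the difference between Unicode full uppercase and titlecase mappings for a single character. The Python below shows that difference using Python's built-in Unicode tables, not PHP's mbstring or ICU, so results for newer characters may depend on the Unicode version bundled with the interpreter. The helper name `first_letter_forms` is hypothetical.

```python
def first_letter_forms(ch):
    """Return (full uppercase, titlecase) of one character, per Python's tables."""
    return ch.upper(), ch.title()


def fmt(s):
    """Show a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


samples = [
    "\u01f3",  # dz digraph
    "\u1f80",  # alpha with psili and ypogegrammeni (iota subscript)
    "\u00df",  # eszett
]

for ch in samples:
    upper, title = first_letter_forms(ch)
    print(f"{fmt(ch)}: upper = {fmt(upper)}, title = {fmt(title)}")
```

For the digraph and the Greek letter, the titlecase form stays a single character while the uppercase form either uses the all-caps digraph or expands to two characters, which is why a ucfirst-style operation built on uppercase and one built on titlecase disagree on these inputs.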