
Node.js 10 changes encoding for at least one Georgian character
Open, Needs Triage, Public

Description

The MCS diff tests usually run on Node.js 6.
After temporarily switching to Node.js 10 on one of our development machines, we noticed that the first character in one canonical title pointing to the Georgian wiki (kawiki) is now different. This patch reverts the change that was made accidentally with Node.js 10.

Which version is the correct one? Should we be concerned about that?

Event Timeline

bearND created this task. Feb 7 2019, 5:53 PM
Restricted Application added a subscriber: Aklapper. Feb 7 2019, 5:53 PM
bearND renamed this task from Node.js 10 changes encoding for at least Georgian character to Node.js 10 changes encoding for at least one Georgian character. Feb 7 2019, 5:53 PM
bearND added projects: I18n, Services.
bearND updated the task description.
bearND added a subscriber: mobrovac.
Pchelolo added a subscriber: Pchelolo.

Interesting!

So, %u10DE is GEORGIAN LETTER PAR, which has been supported since Unicode v1.1.0. %u1C9E is GEORGIAN MTAVRULI CAPITAL LETTER PAR, which has only been supported since Unicode v11.0, released in June 2018.

So apparently Node.js has updated the version of Unicode it uses. In Node 6, the first-letter capitalization step of title normalization did nothing to this character; with the newer version of Unicode, Node 10 now capitalizes the letter correctly.
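A quick way to check this on a given Node.js version is sketched below. `capitalizeFirst` is a hypothetical stand-in for the first-letter capitalization step, not the actual MCS code, and `process.versions.unicode` is assumed to be available only on ICU-enabled builds.

```javascript
// Hypothetical helper mimicking "capitalize the first letter" title normalization.
function capitalizeFirst(title) {
  if (title.length === 0) return title;
  // Work on a whole code point so astral characters are not split.
  const first = String.fromCodePoint(title.codePointAt(0));
  return first.toUpperCase() + title.slice(first.length);
}

// Format a string's first code point as U+XXXX for easy comparison.
function firstCodePoint(s) {
  return 'U+' + s.codePointAt(0).toString(16).toUpperCase();
}

const par = '\u10DE'; // GEORGIAN LETTER PAR

// Assumption: this property is missing on builds without ICU.
console.log('Unicode version:', process.versions.unicode || 'unknown');
console.log(firstCodePoint(par), '->', firstCodePoint(capitalizeFirst(par)));
```

Running this under both Node versions should show whether the result stays at U+10DE or moves to U+1C9E, depending on the Unicode data the runtime ships with.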

I'm not quite sure what we can do here. There will always be a disparity between the Unicode versions used by PHP and Node, and I don't think we can do anything about it. Adding exceptions for all the corner cases would not be manageable, so I think we should just not worry and hope these situations are rare enough.

cscott added a subscriber: cscott. Feb 21 2019, 5:31 PM

Probably the unicode version should be considered part of the content version string, so that running in node 10 is considered a different "API version" than running in node 6, even if there were no other code changes?

Yes, we should bump the version number when we switch to Node 10. Do you propose to add something else Unicode-specific there as well?
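As a rough illustration of that idea, the Unicode version could be folded into the version string as below. The function name, the base version, and the `+unicode.` suffix format are all hypothetical, and `process.versions.unicode` is assumed to be present only on ICU-enabled builds.

```javascript
// Hypothetical sketch: append the runtime's Unicode version to a content version.
function contentVersion(baseVersion) {
  // Fall back to a marker when the runtime does not report a Unicode version.
  const unicode = process.versions.unicode || 'unknown';
  return baseVersion + '+unicode.' + unicode;
}

console.log(contentVersion('1.0.0'));
```

With something like this, a change of Unicode data alone would produce a different version string even when no code changed.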

bearND added a comment. Edited Mar 27 2019, 3:17 AM

Seems related: T208139

Alan.H added a subscriber: Alan.H. Mar 28 2019, 7:16 PM

The first one is correct. See the Unicode 11.0.0 changelog:

Casing Issues
Casing behavior for the Georgian script has changed significantly. There is a new set of Mtavruli capital letters (U+1C90..U+1CBA, U+1CBD..U+1CBF) in Unicode 11.0, with case mappings to the existing Mkhedruli letters (U+10D0..U+10FA, U+10FD..U+10FF). In prior versions of the Unicode Standard, Mkhedruli Georgian was considered a monocameral (non-casing) script, and the Mkhedruli Georgian letters were gc=Lo. Starting with Version 11.0, those Mkhedruli Georgian letters are now gc=Ll, and have uppercase mappings to Mtavruli Georgian capital letters. This change will have major implications for Georgian implementations, including changes for input methods, fonts, casing, and string matching. Existing implementations have treated Mtavruli headlines and other uses for textual emphasis as a text style, so there will also be significant issues for document conversion and upgrade.

Another complication for Georgian is that the primary orthography does not use titlecasing, and the Mkhedruli Georgian letters do not have titlecase mappings to Mtavruli letters. This is unique among bicameral systems in the Unicode Standard, so casing implementations should be prepared for this exception.
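One way a casing implementation might prepare for that exception is sketched below: first-letter titlecasing that leaves Mkhedruli letters alone. The helper names are hypothetical, the ranges are taken from the changelog text quoted above, and this is not what MCS or MediaWiki actually does.

```javascript
// Mkhedruli ranges as listed in the quoted changelog:
// U+10D0..U+10FA and U+10FD..U+10FF.
function isMkhedruli(codePoint) {
  return (codePoint >= 0x10D0 && codePoint <= 0x10FA) ||
         (codePoint >= 0x10FD && codePoint <= 0x10FF);
}

// Hypothetical titlecasing of the first letter that skips Mkhedruli,
// since those letters have no titlecase mapping to Mtavruli.
function titlecaseFirst(title) {
  if (title.length === 0) return title;
  const cp = title.codePointAt(0);
  if (isMkhedruli(cp)) return title;
  const first = String.fromCodePoint(cp);
  return first.toUpperCase() + title.slice(first.length);
}

console.log(titlecaseFirst('\u10DE') === '\u10DE'); // Georgian PAR is left unchanged
console.log(titlecaseFirst('node'));
```

Because the Georgian branch never calls `toUpperCase()`, its result does not depend on which Unicode version the runtime ships with.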

Is this ticket actionable? We have to move up to Node 10 soon, regardless.

Ah, I see some comments above about bumping version numbers, at least.