Page MenuHomePhabricator

Migrate MediaWiki extensions away from UtfNormal in MediaWiki core to external UtfNormal library
Open, Needs TriagePublic

Description

MediaWiki's UtfNormal library has been split into a separate wikimedia/utfnormal Composer package in MediaWiki version 1.25. However some extensions still use it:

vagrant/mediawiki/extensions$ ack UtfNormal:: --php
AbuseFilter/extensions/AntiSpoof/AntiSpoof_body.php
344:		$testName = UtfNormal::toNFKD( $testName );

ActiveAbstract/AbstractFilter.php
113:		return UtfNormal::cleanUp( $clipped );
217:			$header = UtfNormal::cleanUp( $stripped );

AntiSpoof/AntiSpoof_body.php
354:		$testName = UtfNormal::toNFKD( $testName );

Foxway/Foxway.body.php
50:		return \UtfNormal::cleanUp($return);
85:		return \UtfNormal::cleanUp($return);

LiquidThreads/classes/Hooks.php
229:		$out .= UtfNormal::cleanUp( Xml::tags( 'DiscussionThreading', null, $threadInfo ) . "\n" );

PdfHandler/PdfHandler.image.php
141:					$pages[$page] = UtfNormal::cleanUp( $pageText );

PhpTags/PhpTags.body.php
74:		return \UtfNormal::cleanUp($return);
137:		return \UtfNormal::cleanUp($return);

Scribunto/common/Hooks.php
136:			return UtfNormal::cleanUp( strval( $result ) );

Scribunto/engines/LuaCommon/lualib/ustring/make-normalization-table.php
72:if ( !UtfNormal::$utfCheckNFC ||
73:	!UtfNormal::$utfCombiningClass ||
74:	!UtfNormal::$utfCanonicalDecomp ||
75:	!UtfNormal::$utfCanonicalComp
79:if ( !UtfNormal::$utfCompatibilityDecomp ) {
98:foreach ( UtfNormal::$utfCheckNFC as $k => $v ) {
99:	if ( isset( UtfNormal::$utfCombiningClass[$k] ) ) {
109:foreach ( UtfNormal::$utfCombiningClass as $k => $v ) {
118:foreach ( UtfNormal::$utfCanonicalDecomp as $k => $v ) {
130:foreach ( UtfNormal::$utfCompatibilityDecomp as $k => $v ) {
131:	if ( isset( UtfNormal::$utfCanonicalDecomp[$k] ) && UtfNormal::$utfCanonicalDecomp[$k] === $v ) {
148:foreach ( UtfNormal::$utfCanonicalComp as $k => $v ) {

Scribunto/engines/LuaCommon/UstringLibrary.php
167:		return array( UtfNormal::toNFC( $s ) );
175:		return array( UtfNormal::toNFD( $s ) );
183:		return array( UtfNormal::toNFKC( $s ) );
191:		return array( UtfNormal::toNFKD( $s ) );

Translate/ffs/SimpleFFS.php
120:		$input = UtfNormal::cleanUp( $input );

TranslationNotifications/extensions/Translate/ffs/SimpleFFS.php
120:		$input = UtfNormal::cleanUp( $input );

Transliterator/Transliterator_body.php
219:			$word = UtfNormal::toNFD( $word );
482:		return UtfNormal::toNFC( $output );

These extensions should all be fixed in order to clean the MediaWiki core code base of fallback code.
With exception of Scribunto/engines/LuaCommon/lualib/ustring/make-normalization-table.php which does some complex incantations with class aliasing, this task is very straightforward (see this example patch).

Event Timeline

MaxSem created this task.Dec 23 2016, 12:58 AM
Restricted Application added subscribers: TerraCodes, Aklapper. · View Herald TranscriptDec 23 2016, 12:58 AM

@MaxSem: Thanks. Do you plan to mentor this in GCI, or have somebody in mind?

Aklapper updated the task description. (Show Details)Dec 31 2016, 8:05 PM
Reedy added a subscriber: Reedy.Jan 3 2017, 5:34 PM

I'll mentor

Aklapper renamed this task from Migrate extensions off core UtfNormal to Migrate MediaWiki extensions away from UtfNormal in MediaWiki core to external UtfNormal library.Jan 4 2017, 2:40 PM
Aklapper updated the task description. (Show Details)Jan 4 2017, 2:47 PM
Aklapper closed this task as Invalid.Jan 4 2017, 2:51 PM
Aklapper updated the task description. (Show Details)
Aklapper reopened this task as Open.Jan 4 2017, 2:55 PM
Aklapper moved this task from Proposed tasks to Imported in GCI Site on the Google-Code-In-2016 board.

Split into two GCI tasks and excluded Scribunto for the time being (we can still open an advanced task for that repo):

Change 330723 had a related patch set uploaded (by MtDu):
Migrate away from UtfNormal in core to external UtfNormal library

https://gerrit.wikimedia.org/r/330723

Change 330725 had a related patch set uploaded (by MtDu):
Migrate away from UtfNormal in core to external UtfNormal library

https://gerrit.wikimedia.org/r/330725

Change 330727 had a related patch set uploaded (by MtDu):
Migrate away from UtfNormal in core to external UtfNormal library

https://gerrit.wikimedia.org/r/330727

Change 330728 had a related patch set uploaded (by MtDu):
Migrate away from UtfNormal in core to external UtfNormal library

https://gerrit.wikimedia.org/r/330728

Change 330723 merged by jenkins-bot:
Migrate away from UtfNormal in core to external UtfNormal library

https://gerrit.wikimedia.org/r/330723

Change 330725 merged by jenkins-bot:
Migrate away from UtfNormal in core to external UtfNormal library

https://gerrit.wikimedia.org/r/330725

Change 330728 merged by jenkins-bot:
Migrate away from UtfNormal in core to external UtfNormal library

https://gerrit.wikimedia.org/r/330728

Change 330727 merged by jenkins-bot:
Migrate away from UtfNormal in core to external UtfNormal library

https://gerrit.wikimedia.org/r/330727

Change 330894 had a related patch set uploaded (by MtDu):
Migrate away from UtfNormal in core to external UtfNormal library

https://gerrit.wikimedia.org/r/330894

Change 330895 had a related patch set uploaded (by MtDu):
Migrate away from UtfNormal in core to external UtfNormal library

https://gerrit.wikimedia.org/r/330895

Change 330897 had a related patch set uploaded (by MtDu):
Migrate away from UtfNormal in core to external UtfNormal library

https://gerrit.wikimedia.org/r/330897

Change 330899 had a related patch set uploaded (by MtDu):
Migrate away from UtfNormal in core to external UtfNormal library

https://gerrit.wikimedia.org/r/330899

Change 330899 merged by jenkins-bot:
Migrate away from UtfNormal in core to external UtfNormal library

https://gerrit.wikimedia.org/r/330899

Change 330894 merged by jenkins-bot:
Migrate away from UtfNormal in core to external UtfNormal library

https://gerrit.wikimedia.org/r/330894

Change 330895 merged by jenkins-bot:
Migrate away from UtfNormal in core to external UtfNormal library

https://gerrit.wikimedia.org/r/330895

Change 330897 merged by jenkins-bot:
[mediawiki/extensions/Transliterator@master] Migrate away from UtfNormal in core to external UtfNormal library

https://gerrit.wikimedia.org/r/330897

Framawiki moved this task from Backlog to Doing on the good first bug board.Dec 2 2017, 1:31 PM
Verdy_p added a subscriber: Verdy_p.EditedFeb 11 2019, 1:54 PM

mw.ustring.toNFD(str) and mw.ustring.toNFKD(str) do not work as expected for all modern Korean Hangul syllables:

  • the algorithmically composed syllables (LVT or LV forms using basic jamos) in range U+AC00..U+D7AF are still not decomposed at all
    • the current implementation only uses the simple decomposition mapping pairs found in the UCD
    • most decomposition mappings are found in the UCD for Korean, except those for Hangul LVT and LV syllables using "modern simple jamos".
    • only the decomposition "legacy jamos" (some of them are of type VV, there are also some LVT or LV forms but using legacy jamos nor part of the Hangul precomposed syllable ranges) are in the UCD !

The decomposition of modern Hangul TLV or TL syllables, to be used by toNFD (and indirectly by toNFKD) is only part of the Unicode standard (chapter 3) and is specified algorithmically :

  • divide the codepoint in this range (U+AC00..U+D7AF) by 28, use the remainder (if it's not zero) for encoding the T jamo
  • then use the quotient by 21, and use the remainder for encoding the V jamo, and use the quotient for encoding the L jamo.

Use the reverse two operations to make the composition to NFC (or indirectly to NFKC).

A basic test shows that in the console:

= #mw.ustring.toNFD('ᄉᄐ') // (L jamo, V jamo)
6 // OK: 6 bytes for 2 codepoints (3 bytes in each codepoint)
= #(mw.ustring.toNFC('ᄉᄐ'))
6 // bad; should be 3 bytes for 1 codepoint (LV syllable)
= #(mw.ustring.toNFC('ᄹ')) // what the previous should have returned
3 // OK, the precomposed LV syllable is kept as 1 codepoint (using 3 bytes each)

Modern Korean is not working as expected. The results are in fact always renormalized unexpectedly in NFD form (the canonical recomposition does not occur for NF*C), so:

  • toNFD() returns the same as toNFC(), and
  • toNFKD() returns the same as toNFKC().

This makes impossible to handle correct handling of modern Korean with these functions in mw.ustring.

Particularly, when searching for Korean text usually envoded in NFC form, or making transforms will return separate sets for texts depending on whever they are in NFD or NFC forms, and we need to check all possible forms for each cluster (this expands to exponentially complex searches for texts longer than 1 syllabic cluster, and it is impractical). Text matching then fails and forgets many cases.

As well indexing (e.g. sorting categories with category keys) may fail as well, with some texts being sorted by their leading jamo, others being sorted on the precomposed clusters.

This is offtopic. Please file a separate bug.

mw.ustring.toNFD(str) and mw.ustring.toNFKD(str) do not work as expected for all modern Korean Hangul syllables:

  • the algorithmically composed syllables (LVT or LV forms using basic jamos) in range U+AC00..U+D7AF are still not decomposed at all
    • the current implementation only uses the simple decomposition mapping pairs found in the UCD
    • most decomposition mappings are found in the UCD for Korean, except those for Hangul LVT and LV syllables using "modern simple jamos".
    • only the decomposition "legacy jamos" (some of them are of type VV, there are also some LVT or LV forms but using legacy jamos nor part of the Hangul precomposed syllable ranges) are in the UCD !

The decomposition of modern Hangul TLV or TL syllables, to be used by toNFD (and indirectly by toNFKD) is only part of the Unicode standard (chapter 3) and is specified algorithmically :

  • divide the codepoint in this range (U+AC00..U+D7AF) by 28, use the remainder (if it's not zero) for encoding the T jamo
  • then use the quotient by 21, and use the remainder for encoding the V jamo, and use the quotient for encoding the L jamo.

Use the reverse two operations to make the composition to NFC (or indirectly to NFKC).
A basic test shows that in the console:

= #mw.ustring.toNFD('ᄉᄐ') // (T jamo, L jamo)
6 // OK: 6 bytes for 2 codepoints (3 bytes in each codepoint)
= #(mw.ustring.toNFC('ᄉᄐ'))
6 // bad; should be 3 bytes for 1 codepoint (TL syllable)
= #(mw.ustring.toNFC('ᄹ')) // what the previous should have returned
3 // OK, the precomposed TL syllable is kept as 1 codepoint (using 3 bytes each)

Modern Korean is not working as expected. The results are in fact always renormalized unexpectedly in NFC form, so:

  • toNFD() returns the same as toNFC(), and
  • toNFKD() returns the same as toNFKC().

This makes impossible to handle correct handling of modern Korean with these functions in mw.ustring.

Verdy_p added a comment.EditedMar 10 2019, 8:09 AM

This is offtopic. Please file a separate bug.

This is perfectly on topic and most probably caused by the change of implementation of "UtfNormal" and a different integration; the new integration makes some internal assumption about the intermediate normalization forms (or the internal UTF encoding used).

Previously the normalization functions were working as expected. This is no longer the case since this work on splitting the library (this bug may have occured earlier).

You still need to add the basic safety checks above to make sure that the canonical algorithmic Hangul (de)composition really occurs (this is specific to Hangul, all other Unicode uses the UCD canonical pairs).

On this topic, you replaced

$word = UtfNormal::toNFD( $word );

by:

$word = Validator::toNFD( $word );

I.e. you switched from one implementation of toNFD to another one.

There's an evidcence that there are (or there has been) two distinct implementations in existing code or these implementations have changed somewhere and one of them is now incorrect.

Anomie added a subscriber: Anomie.Mar 12 2019, 2:25 PM

I.e. you switched from one implementation of toNFD to another one.
There's an evidcence that there are (or there has been) two distinct implementations in existing code or these implementations have changed somewhere and one of them is now incorrect.

No we didn't. We took the implementation that was in MediaWiki core and split it out to a separate library. If you read https://gerrit.wikimedia.org/r/plugins/gitiles/utfnormal/+/master/README.md it says as much at the bottom.

Yes, that did include a change of class name, but the code was the same. It may, of course, have been updated since.

Please direct further discussion of your belief that utfnormal's NFD transformation is incorrect to T217973.