
UtfNormal replacement proposal
Open, Low, Public

Description

Author: ludovic.arnaud

Description:
This is a proposal for replacing the current implementation of UtfNormal with a
faster one. The class is meant as a drop-in replacement; no change is required
to MediaWiki's files except for a test file that relies on private methods
(Utf8Test.php, see below).

The methods implemented are:

  • cleanup
  • toNFC
  • toNFKC
  • toNFD
  • toNFKD

To the best of my knowledge, the only functional difference from the current
implementation is that all the methods check the input for well-formedness and
replace ill-formed UTF-8 with replacement characters in the output (the original
string is always left untouched).
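
For illustration, here is a minimal usage sketch of the drop-in API (the method names come from the list above and the behaviour from the description; the require path is an assumption, not taken from the patch):

  // Sketch only: assumes the class is loaded from includes/normal/UtfNormal.php.
  require_once 'includes/normal/UtfNormal.php';

  // "resume" written with combining acute accents (decomposed "résumé"),
  // followed by an overlong (ill-formed) encoding of "/".
  $input = "re\xCC\x81sume\xCC\x81 \xC0\xAF";

  // cleanup() validates the string and substitutes U+FFFD for ill-formed
  // sequences; the caller's variable is not modified in place.
  $clean = UtfNormal::cleanup( $input );

  // toNFC() returns the canonical composition of the (validated) input.
  $nfc = UtfNormal::toNFC( $input );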

The code, which was developed on PHP 5.1.2 and then tested on PHP 4.4.2, passes
both UtfNormalTest.php's and Utf8Test.php's tests. I don't have access to the
utfnormal PHP extension, so I could only rely on the original comments to
anticipate its behaviour (see "UnicodeString constructor" in the file's comments).
Note that Utf8Test.php needs to be edited in order to work with this new
implementation, so here is an informal diff:

-       $stripped = $line;
-       UtfNormal::quickisNFCVerify( $stripped );
+       $stripped = UtfNormal::toNFC( $line );

The expected performance improvement ranges from 5% (when using the utfnormal PHP
extension) to somewhere above 1000% (without the PHP extension) for some edge
cases involving long ASCII text and a few Unicode characters whose NFC_QC property
value is "Maybe" or "No". For the record, during testing some texts (taken from
a dump of the French Wikipedia) were benchmarked at 80x the original speed. :)
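
For context, most of the gain on long, mostly-ASCII text comes from the NFC quick-check idea: the string is scanned cheaply, and the full normalization only runs when a character whose NFC_QC value is "Maybe" or "No" is actually present. A rough sketch of that fast path (not the patch's actual code; slowToNFC() is a hypothetical full normalizer):

  // $nfcQcMaybeOrNo: lookup table, keyed by UTF-8 character, of code points whose
  // NFC_Quick_Check property is "Maybe" or "No" (from the generated data files).
  function quickToNFC( $string, $nfcQcMaybeOrNo ) {
      // Pure ASCII can never change under NFC: return it untouched.
      if ( !preg_match( '/[^\x00-\x7f]/', $string ) ) {
          return $string;
      }
      // Scan character by character (input assumed already validated by cleanup()).
      foreach ( preg_split( '//u', $string, -1, PREG_SPLIT_NO_EMPTY ) as $char ) {
          if ( isset( $nfcQcMaybeOrNo[$char] ) ) {
              return slowToNFC( $string ); // hypothetical full normalizer
          }
      }
      return $string;
  }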

Note that UtfNormalGenerate.php must be run to regenerate the data files before use.

Details

Reference
bz5303

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 9:10 PM
bzimport set Reference to bz5303.
bzimport added a subscriber: Unknown Object (MLST).

ludovic.arnaud wrote:

UtfNormal.php

attachment UtfNormal.php ignored as obsolete

ludovic.arnaud wrote:

UtfNormalGenerate.php

attachment UtfNormalGenerate.php ignored as obsolete

Lost in the mists of time. :(

Will take another look over this! :D

This code still works (with very minor modifications). I've attached a patch that should apply cleanly on the 1.17 branch (and most probably on trunk too).

The claim of a ~5% improvement in run time also still checks out (I didn't check with the PHP extension disabled, though). Crude KCachegrind screenshots of a profiled run of dumpBackup.php with and without the patch are here [1].

make test passes.

[1]: http://imgur.com/jxDxX&fVF35l

Created attachment 8300
Patch to make it apply cleanly 5 years after original was written

attachment utfnormal.diff ignored as obsolete

Comment on attachment 8300
Patch to make it apply cleanly 5 years after original was written

Contains the autogenerated data file as well.

Created attachment 8308
Smaller, Cleaner Partial Diff

Apply, then copy https://bugzilla.wikimedia.org/attachment.cgi?id=1456 and https://bugzilla.wikimedia.org/attachment.cgi?id=1455 to includes/normal.

attachment cleanDiff.diff ignored as obsolete

apergos (Ariel) ran two medium-sized dumps (~5 hours on WMF clusters) and found a ~3% performance improvement with the new UtfNormal.

Testing without php5-intl on a smallish dump (~300000 revs) gives around 8.5x improvement.

Created attachment 8325
Patch with all defines moved to UtfNormalDefines.php

Removed a bunch of warnings.

attachment newUtfNormal.diff ignored as obsolete

Created attachment 8326
Clean Patch, without generated files.

attachment cleanDiff.patch ignored as obsolete

happy.melon.wiki wrote:

Patch applied in r84706.

happy.melon.wiki wrote:

Reverted in r84842, throws enormous numbers of E_STRICT errors on TranslateWiki. Please see comments on CodeReview for r84706 and r84709.

Created attachment 8342
Patch to fix issues mentioned in CR

Fixes issues mentioned in http://www.mediawiki.org/wiki/Special:Code/MediaWiki/84706

attachment cleanDiff.patch ignored as obsolete

Created attachment 8343
Patch to fix issues mentioned in CR

Forgot to add documentation in previous patch.

attachment cleanDiff.patch ignored as obsolete

Created attachment 8353
Patch to fix issues mentioned in CR

Re-added Brion's copyright header, as per Reedy's suggestion.


sumanah wrote:

Yuvi, I'm sorry no one reviewed your patch in the last several months. Sadly, it no longer applies cleanly to trunk. Do you possibly have time to update it? Since you now have commit access, you could just commit your updated patch into SVN.

@Sumana: I think consensus when this was 'abandoned' was that this was a third-hand patch (someone wrote it originally, I just cleaned it up, and then someone else would have to review/commit this). And that made it too untrustworthy/fragile in terms of quality to be accepted. I doubt that has changed?

Should I WONTFIX this? Or leave it open for some other brave soul?

Krinkle set Security to None.
Krinkle removed a subscriber: Unknown Object (MLST).

Should test it over MySQL (not just MariaDB), as its "utf8" encoding supports only characters in the BMP (encoded with 1 to 3 bytes).
MariaDB databases are set up with "utf8mb4" by default (also supported in MySQL 5.5+), which provides support for supplementary characters (outside the BMP) encoded with up to 4 bytes.
Many MySQL installations use the "utf8" encoding by default (because this is what the installation scripts provided by MediaWiki set for MySQL); not all of them use the "binary" encoding (which does not support collation and correct sort orders in queries, such as sorting categories).

Note that if a database back-end does not support supplementary characters, there should be a method to convert them to "safe" numeric character references (this will work for normal text, but not for page names, such as those in the Gothic Wikipedia: such a wiki would need either the unsafe "binary" encoding, or "utf8mb4" in MySQL 5.5+ or MariaDB installations).
For now there is no configuration variable in MediaWiki that reliably indicates whether the back-end really supports native UTF-8 for characters outside the BMP.
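
Purely as an illustration of such a check (this is not an existing MediaWiki setting; the connection parameters are placeholders, and the authoritative test would be the charset of the actual text columns), the server-side charset can be queried directly:

  // Sketch: detect whether the MySQL/MariaDB database can store non-BMP
  // characters, i.e. whether it uses utf8mb4 rather than the 3-byte "utf8".
  $db = new mysqli( 'localhost', 'wikiuser', 'secret', 'wikidb' );
  $res = $db->query( "SHOW VARIABLES LIKE 'character_set_database'" );
  $row = $res->fetch_assoc();
  $supportsNonBmp = ( $row['Value'] === 'utf8mb4' );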

With MySQL the result is dramatic: pages may be truncated without notice AFTER they are edited and saved, even though the preview looked correct (the preview is not generated from the database store but from the text held in memory by the PHP/HHVM engine, so it shows no damage).

You have two choices: either the database really supports full UTF-8, or, failing that, it must at least use the binary mode.

If we want to preserve at least basic collation (working correctly in the BMP, at least), replacing supplementary characters with numeric character references would be enough. However, page names would need to be stored in "binary" format, would not be collatable, and would be sorted in binary order (e.g. when listing members in categories), unless sorting is performed not by the back-end SQL but by the local PHP code (which could be a problem for overpopulated categories, because the engine would need to read the full content of the category to sort it before generating a more limited page of 200 entries).

Support for supplementary characters is now needed: many people can easily enter emoji on their smartphones, and Chinese users can easily enter characters from the SIP. All those users will cause pages to be truncated wherever they edit them.

So I suggest adding support for converting the wikitext to use numeric character references (however, in page names numeric character references are known to cause problems; e.g. {{PAGENAME}} already returns NCRs for a few ASCII characters, which must be decoded using {{#titleparts}} because some other parser functions and built-in keywords, such as {{PAGESINCATEGORY:}}, do not accept those NCRs).

Or at least provide a fast detector of non-BMP characters that the wiki editor (or the visual editor) could use to flag unsafe characters that cannot be stored properly by the back-end. MediaWiki stores the result and never checks it by reading it back from the database.
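
As an illustration of both suggestions (the NCR conversion and the fast detector), here is a sketch; the function names are invented and this is not existing or proposed MediaWiki code:

  // Fast check: does the text contain any character outside the BMP?
  function containsNonBmp( $text ) {
      return preg_match( '/[\x{10000}-\x{10FFFF}]/u', $text ) === 1;
  }

  // Replace supplementary characters with numeric character references so the
  // text survives a 3-byte "utf8" MySQL column.
  function nonBmpToNcr( $text ) {
      return preg_replace_callback(
          '/[\x{10000}-\x{10FFFF}]/u',
          function ( $m ) {
              // Decode the 4-byte UTF-8 sequence to its code point.
              $b = array_values( unpack( 'C*', $m[0] ) );
              $cp = ( ( $b[0] & 0x07 ) << 18 ) | ( ( $b[1] & 0x3F ) << 12 )
                  | ( ( $b[2] & 0x3F ) << 6 ) | ( $b[3] & 0x3F );
              return '&#' . $cp . ';';
          },
          $text
      );
  }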

I also suggest adding a small test to the MediaWiki startup code to autodetect the engine's capabilities for characters that are known to cause problems: this would feed a global variable and activate some filters that the wiki editors could use. If the database back-end is safe, the filters would be no-op functions: only valid UTF-8 would be checked for correctness. But some filters should still be present to prevent the use of certain forbidden characters (such as most C0 and C1 controls, "non-characters" like U+FFFF, and lone surrogates), and to silently filter/transform some newline controls (converting C0 "CR+LF", isolated C0 "CR", or the C1 NEL to C0 LF).

The idea is to accept only "safe" UTF-8, not just strict UTF-8 (which is still too permissive and dependent on the back-end SQL support), and to provide two kinds of filters:

  • silent replacement (newline controls, or supplementary characters replaced by NCRs)
  • silent removal.

with both returning a flag that forces the page to be previewed again if the filters changed anything, instead of it being saved directly!
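
A sketch of what such a pair of filters might look like (the class and method names are invented, not an existing MediaWiki API); both report whether they changed anything so the caller can force a new preview:

  class SafeUtf8Filter {
      // Silent replacement: normalize newlines (C0 CR+LF, lone C0 CR, C1 NEL -> C0 LF).
      public static function replaceNewlines( $text, &$changed ) {
          $out = str_replace( array( "\r\n", "\r", "\xC2\x85" ), "\n", $text );
          $changed = ( $out !== $text );
          return $out;
      }

      // Silent removal: strip forbidden characters (C0 controls except TAB/LF/CR,
      // DEL, C1 controls except NEL, and the non-characters U+FFFE / U+FFFF).
      // Lone surrogates cannot appear in well-formed UTF-8, so cleanup() already
      // handles them.
      public static function removeForbidden( $text, &$changed ) {
          $out = preg_replace(
              '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]|\xC2[\x80-\x84\x86-\x9F]|\xEF\xBF[\xBE\xBF]/',
              '',
              $text
          );
          $changed = ( $out !== $text );
          return $out;
      }
  }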

For now, MediaWiki has not been properly tested to work with non-BMP characters. The code is full of occurrences of "utf8" as the default in the back-end connector and installation scripts, instead of "utf8mb4" for MySQL 5.5+ and MariaDB.

Note that the current implementation found in Commons does not work!

All the Korean basic Hangul syllables (LV or LVT forms, based on modern jamos only) in the range U+AC00..U+D7AF, which are composed and decomposed only algorithmically according to TUS chapter 3 and are NOT found in the UCD data files, have been forgotten.

This means that mw.ustring.toNFD() and mw.ustring.toNFKD() do not work with modern Korean. These functions always return values incorrectly renormalized to NFC, so:

  • toNFD() unexpectedly returns the same as toNFC().
  • toNFKD() unexpectedly returns the same as toNFKC().

You can see this bug in Wikimedia Commons.
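
For reference, the algorithmic mapping in question (Hangul syllable decomposition, TUS chapter 3) is short; here is a sketch, assuming codepointToUtf8() is the helper from includes/normal/UtfNormalUtil.php:

  // Decompose a precomposed Hangul syllable into its L, V (and optional T) jamos.
  function decomposeHangul( $codepoint ) {
      $sBase = 0xAC00; $lBase = 0x1100; $vBase = 0x1161; $tBase = 0x11A7;
      $tCount = 28; $nCount = 21 * $tCount; $sCount = 19 * $nCount; // 11172 syllables

      $sIndex = $codepoint - $sBase;
      if ( $sIndex < 0 || $sIndex >= $sCount ) {
          return null; // not a precomposed Hangul syllable
      }
      $l = $lBase + intval( $sIndex / $nCount );
      $v = $vBase + intval( ( $sIndex % $nCount ) / $tCount );
      $t = $tBase + $sIndex % $tCount;

      $out = codepointToUtf8( $l ) . codepointToUtf8( $v );
      if ( $t !== $tBase ) {
          $out .= codepointToUtf8( $t ); // LVT syllables have a trailing jamo
      }
      return $out;
  }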

This new implementation of the normalizer now replaces the previous one in MediaWiki core, and the replacement was made before the library was properly tested for conformance.