Author: ludovic.arnaud
Description:
This is a proposal for replacing the current implementation of UtfNormal with a
faster one. The class is meant as a drop-in replacement: no change is required
to MediaWiki's files, except for test files that rely on private methods
(Utf8Test.php, see below).
The methods implemented are:
- cleanup
- toNFC
- toNFKC
- toNFD
- toNFKD
To the best of my knowledge, the only functional difference from the current
implementation is that all methods check the input for well-formedness and
replace ill-formed UTF-8 sequences with replacement characters in the output
(the original string is always left untouched).
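To illustrate the kind of well-formedness check described above, here is a
minimal sketch. It is not the code from the patch: the function name
replaceIllFormedUtf8 is hypothetical, and replacing each stray byte with its
own U+FFFD is an assumption that may differ from what UtfNormal actually emits.

```php
<?php
// Hypothetical helper, not part of UtfNormal: returns a copy of $input in
// which every byte that is not part of a well-formed UTF-8 sequence is
// replaced by U+FFFD. The input string itself is never modified.
function replaceIllFormedUtf8($input)
{
    // Group 1 matches exactly one well-formed UTF-8 sequence (no overlong
    // forms, no surrogates, nothing above U+10FFFF). The trailing "."
    // matches any single remaining byte, which is therefore ill-formed.
    $pattern = '/(
          [\x00-\x7F]
        | [\xC2-\xDF][\x80-\xBF]
        | \xE0[\xA0-\xBF][\x80-\xBF]
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
        | \xED[\x80-\x9F][\x80-\xBF]
        | \xF0[\x90-\xBF][\x80-\xBF]{2}
        | [\xF1-\xF3][\x80-\xBF]{3}
        | \xF4[\x80-\x8F][\x80-\xBF]{2}
        ) | . /xs';

    return preg_replace_callback(
        $pattern,
        function ($m) {
            // Keep well-formed sequences; substitute U+FFFD (EF BF BD) otherwise.
            return (isset($m[1]) && $m[1] !== '') ? $m[1] : "\xEF\xBF\xBD";
        },
        $input
    );
}
```

For example, calling replaceIllFormedUtf8( "a\xC0b" ) returns a new string in
which the invalid byte 0xC0 has become U+FFFD, while well-formed input such as
"caf\xC3\xA9" comes back unchanged.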
The code, which was developed on PHP 5.1.2 and then tested on 4.4.2, passes both
UtfNormalTest.php's and Utf8Test.php's tests. I don't have access to the
utfnormal PHP extension, therefore I could only rely on the original comments to
anticipate its behaviour. (see "UnicodeString constructor" in the file's comments)
Note that Utf8Test.php needs to be edited in order to work with this new
implementation, so here is an informal diff:
- $stripped = $line;
- UtfNormal::quickisNFCVerify( $stripped );
+ $stripped = UtfNormal::toNFC( $line );
The expected performance improvement is between 5% (when using the utfnormal PHP
extension) and somewhere above 1000% (without the PHP extension) for some edge
cases involving long ASCII text and some Unicode characters whose NFC_QC
property value is "Maybe" or "No". For the record, during testing some texts
(taken from a dump of the French Wikipedia) were benchmarked at 80x the original
speed. :)
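For anyone wanting to reproduce this kind of comparison, the following is a
rough timing sketch. It is not the benchmark used for the figures above: the
helpers averageSeconds and speedup are hypothetical, and the identity closure
is only a stand-in so that the snippet runs without MediaWiki loaded.

```php
<?php
// Hypothetical helper: average wall-clock seconds per call of $normalizer
// applied to $text, measured over $iterations runs.
function averageSeconds($normalizer, $text, $iterations = 100)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        call_user_func($normalizer, $text);
    }
    return (microtime(true) - $start) / $iterations;
}

// Hypothetical helper: how many times faster the new timing is than the old.
function speedup($oldSeconds, $newSeconds)
{
    return $newSeconds > 0 ? $oldSeconds / $newSeconds : INF;
}

// Sample resembling the edge case described above: a long ASCII run followed
// by "e" plus U+0301 (COMBINING ACUTE ACCENT), whose NFC_QC value is "Maybe".
$sample = str_repeat('plain ASCII text ', 1000) . "e\xCC\x81";

// Stand-in callable; with MediaWiki loaded one would pass
// array( 'UtfNormal', 'toNFC' ) for each implementation under test.
$identity = function ($s) {
    return $s;
};

$seconds = averageSeconds($identity, $sample, 10);
```

Timing each implementation on the same sample and passing the two averages to
speedup() gives a ratio comparable to the "80x" figure quoted above.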
Note that UtfNormalGenerate.php must be run to regenerate the data files before use.