Inconsistent normalization of Æ and æ
Closed, ResolvedPublic

Description

The letter 'Æ' is currently normalized inconsistently: the upper case form is normalized to 'AE', the lower case form to 'A'.


Version: master
Severity: normal

Event Timeline

bzimport raised the priority of this task from to Low. Nov 22 2014, 1:17 AM
bzimport added a project: AntiSpoof.
bzimport set Reference to bz46531.
bzimport added a subscriber: Unknown Object (MLST).

The root cause is in the equivalence file maintained on our wiki at http://www.mediawiki.org/wiki/AntiSpoof/Equivalence_sets, which is then copied to maintenance/equivset.in.

The file list uses the format:
<hexadecimal codepoint> <character> => [<hexadecimal codepoint>] <character>

The relevant part:

E6 æ => C6 Æ
E6 æ => 41 A
4D4 Ӕ => C6 Æ
4D5 ӕ => C6 Æ

Running maintenance/generateEquivset.php generates a PHP array from the list, keyed by character. The codepoint E6 has two entries; I guess only the second one is taken into account.
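
To illustrate the guess above, here is a minimal Python sketch of what happens when a list with a repeated source character is loaded into a keyed map. It is not the code of generateEquivset.php; the names parse_line and build_map are hypothetical, and the only data used is the four lines quoted above. In a keyed map the later entry silently replaces the earlier one, which would leave æ mapped to A.

```python
# Lines quoted in the task; format: <hex> <char> => [<hex>] <char>
EQUIVSET_LINES = [
    "E6 æ => C6 Æ",
    "E6 æ => 41 A",
    "4D4 Ӕ => C6 Æ",
    "4D5 ӕ => C6 Æ",
]


def parse_line(line):
    """Return (source character, target character) for one list entry."""
    left, right = line.split("=>")
    _src_hex, src_char = left.split()
    # The codepoint on the right-hand side is optional, so the
    # character is always the last token.
    dst_char = right.split()[-1]
    return src_char, dst_char


def build_map(lines):
    """Build a character-keyed map and report conflicting duplicates."""
    mapping = {}
    duplicates = []
    for line in lines:
        src, dst = parse_line(line)
        if src in mapping and mapping[src] != dst:
            # Same key seen before with a different target.
            duplicates.append((src, mapping[src], dst))
        mapping[src] = dst  # a later entry overwrites an earlier one
    return mapping, duplicates


if __name__ == "__main__":
    mapping, duplicates = build_map(EQUIVSET_LINES)
    print(mapping["æ"])
    print(duplicates)
```

A check like the duplicates list in this sketch is one way such conflicting entries could be flagged when the array is generated, instead of being dropped silently.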

Change 373149 had a related patch set uploaded (by Kaldari; owner: Kaldari):
[mediawiki/extensions/AntiSpoof@master] normalization of æ

https://gerrit.wikimedia.org/r/373149

Change 373149 merged by jenkins-bot:
[mediawiki/extensions/AntiSpoof@master] Fix normalization of æ

https://gerrit.wikimedia.org/r/373149

kaldari claimed this task.

Change 374167 had a related patch set uploaded (by Hashar; owner: Hashar):
[mediawiki/extensions/AntiSpoof@master] Revert "test: remove Comæ test which is broken"

https://gerrit.wikimedia.org/r/374167

I had previously removed a failing test. I have restored it with https://gerrit.wikimedia.org/r/#/c/374167/ and it fails with:

00:01:01.856 1) AntiSpoofTest::testCheckUnicodeString with data set #3 ('Comae', 'Comæ')
00:01:01.856 Failed asserting that two strings are equal.
00:01:01.856 --- Expected
00:01:01.856 +++ Actual
00:01:01.856 @@ @@
00:01:01.856 -'v2:COMAE'
00:01:01.856 +'v2:COMÆ'
00:01:01.856

:(

@hashar: "Comæ" currently normalizes to "COMÆ". I honestly have no idea how that test would have ever passed since AntiSpoof only does single character to single character mappings.
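
The point about single-character mappings can be shown with a small Python sketch. This is not AntiSpoof's implementation; the table and the function normalize_one_to_one are hypothetical and cover only the characters of the test string. A per-character substitution always returns a string of the same length as its input, so the four-character "Comæ" cannot come out as the five-character "COMAE".

```python
# Hypothetical one-to-one table covering only the test string's characters.
ONE_TO_ONE = {
    "o": "O",
    "m": "M",
    "æ": "Æ",
}


def normalize_one_to_one(text, table=ONE_TO_ONE):
    """Replace each character by exactly one character from the table."""
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    result = normalize_one_to_one("Comæ")
    print(result, len(result))
    # The expected value in the failing test has one more character.
    print(len("COMAE"))
```

Producing "COMAE" would need a mapping from one character to a two-character sequence, which a table of this shape cannot express.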

Change 374167 abandoned by Hashar:
Revert "test: remove Comæ test which is broken"

https://gerrit.wikimedia.org/r/374167