
MW's Title uppercase first letter code can lead to non-NFC titles
Open, Needs Triage | Public | BUG REPORT

Description

MediaWiki generally assumes everything is NFC. However, in certain cases $wgLang->ucFirst() appears to turn normalized Unicode into unnormalized Unicode, which can cause some weird effects.

First: on pages such as https://en.wikipedia.org/wiki/Greek_Extended, characters like ῷ or ῧ are redlinked despite the fact that if you click on them you get to a real article. It's because Title is uppercasing them incorrectly, but when you go to the page, MW normalizes URL parameters.

Second: if you go to a page like https://en.wikipedia.org/w/index.php?title=%E1%BF%B7&foo, it shows you a missing page; however, if you click create, the page turns out to exist. This is because MediaWiki uppercases %E1%BF%B7 (U+1FF7 Greek Small Letter Omega with Perispomeni and Ypogegrammeni) to %CE%A9%CD%82%CD%85 (Greek Capital Letter Omega + Combining Greek Perispomeni + Combining Greek Ypogegrammeni), but the correct uppercase would be %E1%BF%BC%CD%82 (Greek Capital Letter Omega with Prosgegrammeni + Combining Greek Perispomeni).

Event Timeline

DLynch subscribed.

Enjoyably, with useparsoid=1 a larger set of characters are redlinked.

Screenshots for posterity:

classic parser: CleanShot 2025-07-22 at 10.33.11@2x.png (1×1 px, 238 KB)
parsoid: CleanShot 2025-07-22 at 10.33.49@2x.png (1×1 px, 239 KB)

In both cases, the redlinks are all to created pages, though they are all redirects.

Can confirm that mb_convert_case can return non-NFC characters:

$ psysh
> mb_convert_case("\u{1ff7}", MB_CASE_TITLE)
= "ῼ͂"

> bin2hex(mb_convert_case("\u{1ff7}", MB_CASE_TITLE))
= "cea9cd82cd85"

> mb_strlen(mb_convert_case("\u{1ff7}", MB_CASE_TITLE))
= 3
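
For what it's worth, the same expansion can be cross-checked outside PHP. The sketch below uses Python's standard library rather than anything from MediaWiki, on the assumption that Python's `str.title()` applies the same full Unicode title-case mapping as `mb_convert_case(..., MB_CASE_TITLE)` does here. The `describe` helper is just an illustrative name. It prints the code points and the percent-encoded form of the input, the title-cased string, and the title-cased string after NFC.

```python
import unicodedata
from urllib.parse import quote


def describe(s: str) -> str:
    """Return the code points of s in U+XXXX notation."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


lower = "\u1ff7"  # GREEK SMALL LETTER OMEGA WITH PERISPOMENI AND YPOGEGRAMMENI
titled = lower.title()  # full title-casing; may expand to several code points
nfc = unicodedata.normalize("NFC", titled)  # recompose what can be recomposed

for label, value in [
    ("input", lower),
    ("title-cased", titled),
    ("title-cased + NFC", nfc),
]:
    print(f"{label:18} {describe(value):24} {quote(value)}")

# False when title-casing alone produced a non-NFC string.
print("title-cased is already NFC:", titled == nfc)
```

If the assumption holds, the second and third rows should line up with the two percent-encoded forms quoted in the task description.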

Change #1171712 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Ensure Language::uc/ucfirst/lc/lcfirst/ucwords always result in NFC

https://gerrit.wikimedia.org/r/1171712

Change #1171714 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/services/parsoid@master] Fix Title-Casing of titles

https://gerrit.wikimedia.org/r/1171714

There's something else interesting going on on that page w/r/t the core/Parsoid difference:

Take έ U+03AD "Greek Small Letter Epsilon with Tonos" and its capital Έ U+0388
Compare έ U+1F73 "Greek Small Letter Epsilon with Oxia" and its capital Έ U+1FC9

Here they are next to each other: έέέέέέ See the difference?

It turns out that Validator::toNFC() converts U+1FC9 to U+0388. And the article is at U+0388, despite the link text being written on that page as [[U+1FC9]] (well actually [[U+1F73]] the lowercase version, but same thing).

The HTML live on en.wikipedia looks like <a href="/wiki/U+1FC9" title="U+1FC9">U+1F73</a> which is what I'd expect with unnormalized titles. Locally, with the Parsoid patch above, I get <a href="U+0388">U+1F73</a> when I parse [[U+1F73]].
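
The oxia-to-tonos mapping is a property of the Unicode data rather than of Validator::toNFC() specifically, so it can be checked with any normalizer. A minimal sketch in Python (standard library only, not MediaWiki code; the variable names are illustrative):

```python
import unicodedata

# Both oxia code points carry a one-to-one ("singleton") canonical decomposition
# to their tonos counterparts, so NFC never leaves them in place.
for cp in (0x1F73, 0x1FC9):
    ch = chr(cp)
    normalized = unicodedata.normalize("NFC", ch)
    print(
        f"U+{cp:04X} {unicodedata.name(ch)}: "
        f"decomposition={unicodedata.decomposition(ch)!r}, "
        f"NFC=U+{ord(normalized):04X}"
    )

# Title-casing the link text first and normalizing afterwards lands on U+0388.
link_text = "\u1f73"
target = unicodedata.normalize("NFC", link_text.title())
print(f"U+{ord(target):04X}")
```

Here title-casing yields a single code point, U+1FC9, and it is the NFC step alone that moves the target to U+0388.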

I think the root cause is the same for all of these -- lack of NFC normalization after title-casing, but the legacy parser managed to NFC normalize a few more of them than Parsoid did.
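
To make "NFC normalization after title-casing" concrete, here is a rough sketch of the idea in Python. It is not the code from either Gerrit change, and `ucfirst_nfc` is a made-up name for illustration. It title-cases the first code point and then normalizes the whole string, on the assumption that combining marks produced by the case expansion may need to be reordered or composed with whatever follows.

```python
import unicodedata


def ucfirst_nfc(text: str) -> str:
    """Title-case the first code point, then normalize the whole string to NFC."""
    if not text:
        return text
    # Casing one code point may expand it into several (base letter + combining marks).
    cased = text[0].title() + text[1:]
    # Normalize the full string, not just the first character, so that marks
    # produced by the expansion are composed and ordered canonically.
    return unicodedata.normalize("NFC", cased)


for sample in ("\u1ff7", "\u1f73", "abc"):
    result = ucfirst_nfc(sample)
    print(" ".join(f"U+{ord(ch):04X}" for ch in result))
```

With this ordering, both the multi-code-point case (U+1FF7) and the singleton case (U+1F73) end up at one NFC title, which is the property a title lookup needs.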

The legacy output for ᾷ (U+1FB7) is:

<td title="U+1FB7: GREEK SMALL LETTER ALPHA WITH PERISPOMENI AND YPOGEGRAMMENI"><a href="/w/index.php?title=%CE%91%CD%82%CD%85&amp;action=edit&amp;redlink=1" class="new" title="ᾼ͂ (page does not exist)">ᾷ</a>
</td>

Here again we have ᾼ͂ (U+0391 U+0342 U+0345) as the title case of ᾷ U+1FB7, which is not in NFC form:

> UtfNormal\Validator::toNFC(mb_convert_case("\u{1fB7}",MB_CASE_TITLE))
= "ᾼ͂" U+1FBC U+0342 

> mb_convert_case("\u{1fB7}",MB_CASE_TITLE)
= "ᾼ͂" U+0391 U+0342 U+0345

> mb_convert_case("\u{1fB7}",MB_CASE_UPPER)
= "Α͂Ι" U+0391 U+0342 U+0399

I guess the real mystery is how the legacy parser got U+1F73 correct (the link is normalized to U+0388) while not normalizing U+1FB7 here.
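
One way to see how the two kinds of character differ is to scan the Greek Extended block and split the lowercase letters whose title-cased form is not NFC into two groups: those where casing expands to several code points (like U+1FB7) and those where it yields a single code point that NFC then replaces (like U+1F73). The sketch below does that in Python. It only reflects Unicode data as shipped with the Python interpreter, and it says nothing about why the legacy parser treats the groups differently.

```python
import unicodedata

GREEK_EXTENDED = range(0x1F00, 0x2000)

multi = []   # title-casing expands to several code points that are not NFC
single = []  # title-casing gives one code point that NFC maps elsewhere

for cp in GREEK_EXTENDED:
    ch = chr(cp)
    if unicodedata.category(ch) != "Ll":
        continue  # skip unassigned code points and non-lowercase characters
    titled = ch.title()
    if titled == unicodedata.normalize("NFC", titled):
        continue  # title-cased form is already NFC; nothing to report
    (multi if len(titled) > 1 else single).append(cp)

print("expanding:", " ".join(f"U+{cp:04X}" for cp in multi))
print("singleton:", " ".join(f"U+{cp:04X}" for cp in single))
```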

Change #1171712 merged by jenkins-bot:

[mediawiki/core@master] Ensure NFC from Language::uc/ucfirst/lc/lcfirst/ucwords/ucwordbreaks

https://gerrit.wikimedia.org/r/1171712

Change #1171714 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Fix title-casing and NFC normalization of titles

https://gerrit.wikimedia.org/r/1171714

Change #1181757 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.22.0-a18

https://gerrit.wikimedia.org/r/1181757

Change #1181757 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.22.0-a18

https://gerrit.wikimedia.org/r/1181757

Change #1182661 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Revert "Ensure NFC from Language::uc/ucfirst/lc/lcfirst/ucwords/ucwordbreaks"

https://gerrit.wikimedia.org/r/1182661

Change #1182661 merged by jenkins-bot:

[mediawiki/core@master] Revert "Ensure NFC from Language::uc/ucfirst/lc/lcfirst/ucwords/ucwordbreaks"

https://gerrit.wikimedia.org/r/1182661

Change #1182664 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@wmf/1.45.0-wmf.16] Revert "Ensure NFC from Language::uc/ucfirst/lc/lcfirst/ucwords/ucwordbreaks"

https://gerrit.wikimedia.org/r/1182664

Change #1182664 merged by jenkins-bot:

[mediawiki/core@wmf/1.45.0-wmf.16] Revert "Ensure NFC from Language::uc/ucfirst/lc/lcfirst/ucwords/ucwordbreaks"

https://gerrit.wikimedia.org/r/1182664

Mentioned in SAL (#wikimedia-operations) [2025-08-27T22:56:56Z] <arlolra@deploy1003> Started scap sync-world: Backport for [[gerrit:1182664|Revert "Ensure NFC from Language::uc/ucfirst/lc/lcfirst/ucwords/ucwordbreaks" (T403113 T400057)]]

Mentioned in SAL (#wikimedia-operations) [2025-08-27T23:01:19Z] <arlolra@deploy1003> arlolra, cscott: Backport for [[gerrit:1182664|Revert "Ensure NFC from Language::uc/ucfirst/lc/lcfirst/ucwords/ucwordbreaks" (T403113 T400057)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-08-27T23:08:36Z] <arlolra@deploy1003> Finished scap sync-world: Backport for [[gerrit:1182664|Revert "Ensure NFC from Language::uc/ucfirst/lc/lcfirst/ucwords/ucwordbreaks" (T403113 T400057)]] (duration: 11m 40s)