Two page titles on zhwiktionary inaccessible after ICU upgrade (to 52.1)
Open, Normal · Public

Description

Each ICU update that adds data for new Unicode characters may affect NFC normalization, and thus MediaWiki title normalization. As stated in T86096, Wikimedia sites updated from ICU 4.8.1.1 to 52.1; these correspond to Unicode 6.0 and 6.3 respectively. According to the list-unicodeset utility on unicode.org (which uses ICU 57.1 and Unicode 8.0 data), 52 of the added characters are not "NFC inert", meaning each may decompose or interact with other characters during the normalization process (see ICU documentation and UAX#15, section 9.1 Stable Code Points). ICU 52.1 reports the same.

These 52 characters are, in PCRE regex format: [\x{8e4}-\x{8fe}\x{1bab}\x{1cf4}\x{a674}-\x{a67b}\x{a69f}\x{aaf6}\x{fa2e}\x{fa2f}\x{11100}-\x{11102}\x{11127}\x{11131}-\x{11134}\x{111c0}\x{116b6}\x{116b7}]. This is the same set I mentioned in T86096#2322920, plus U+11131 and U+11132.
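As a quick sanity check on the count, the ranges in the character class above can be expanded and tallied. This is only a sketch: the names `RANGES`, `expand`, and `CODE_POINTS` are illustrative, and the code checks the size of the set, not NFC inertness itself.

```python
# Inclusive code point ranges copied from the PCRE character class above.
RANGES = [
    (0x08E4, 0x08FE),
    (0x1BAB, 0x1BAB),
    (0x1CF4, 0x1CF4),
    (0xA674, 0xA67B),
    (0xA69F, 0xA69F),
    (0xAAF6, 0xAAF6),
    (0xFA2E, 0xFA2F),
    (0x11100, 0x11102),
    (0x11127, 0x11127),
    (0x11131, 0x11134),
    (0x111C0, 0x111C0),
    (0x116B6, 0x116B7),
]


def expand(ranges):
    """Return every code point covered by the inclusive (low, high) pairs."""
    return [cp for low, high in ranges for cp in range(low, high + 1)]


CODE_POINTS = expand(RANGES)
print(len(CODE_POINTS))  # 52
```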

If a page title contains any of these characters, it may change when renormalized under the newer version of Unicode. Because MediaWiki normalizes user input, including titles specified in URLs, a request for the old title is resolved to its renormalized form, so the page stored under the old title can no longer be reached in the normal way.
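The effect can be illustrated with a toy lookup. This is a minimal sketch, not MediaWiki's code: the `pages` dict and the page IDs are made up, and Python's `unicodedata` module stands in for ICU (it ships a newer Unicode version than 6.3, but it also decomposes U+FA2E to U+90DE).

```python
import unicodedata

# Toy stand-in for the page table: stored title -> page ID (IDs are made up).
# The title was stored before the normalizer knew about U+FA2E.
pages = {"\uFA2E": 101}


def lookup(requested_title):
    """Normalize user input to NFC before the lookup, as a wiki would."""
    return pages.get(unicodedata.normalize("NFC", requested_title))


# The request is renormalized to U+90DE, so the stored title is never matched.
print(lookup("\uFA2E"))  # None

# If a page already exists under the normalized title, that page is served instead.
pages["\u90DE"] = 102
print(lookup("\uFA2E"))  # 102
```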

To check for pages that would become inaccessible, I downloaded all the "20160501-all-titles.gz" data dumps (for public wikis). There are 5 wikis that have at least one existing page whose title contains one or more of those 52 characters: enwiktionary, incubatorwiki, jawiktionary, mgwiktionary, and zhwiktionary. Most of these titles are in fact in NFC according to Unicode 6.3 as implemented by ICU 52.1. Two pages on zhwiktionary have titles that are not (see DB query):

In both cases (U+FA2E and U+FA2F), the character is in the CJK Compatibility Ideographs block, the page was created by Sz-iwbot (operated by @Shizhao), and the normalized title already exists as a separate page (U+90DE and U+96B7, respectively). These pages can only be accessed by specifying their IDs. Links in the interface point to the wrong page.
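The title check described above could be sketched roughly as follows. This is an approximation under stated assumptions: the function and helper names are hypothetical, Python's `unicodedata` is used in place of ICU 52.1 (so results for other characters may differ from Unicode 6.3), and the dump layout assumed by `iter_titles` (one title per line, optionally preceded by tab-separated columns) has not been verified against the actual files.

```python
import gzip
import re
import unicodedata

# The 52 candidate characters, written as a Python character class.
CANDIDATES = re.compile(
    "[\u08E4-\u08FE\u1BAB\u1CF4\uA674-\uA67B\uA69F\uAAF6\uFA2E\uFA2F"
    "\U00011100-\U00011102\U00011127\U00011131-\U00011134"
    "\U000111C0\U000116B6\U000116B7]"
)


def find_non_nfc_titles(titles):
    """Return titles that contain a candidate character and are not in NFC."""
    hits = []
    for title in titles:
        if CANDIDATES.search(title) and unicodedata.normalize("NFC", title) != title:
            hits.append(title)
    return hits


def iter_titles(path):
    """Yield titles from a gzipped dump (assumed layout: title is the last tab-separated field)."""
    with gzip.open(path, "rt", encoding="utf-8") as dump:
        for line in dump:
            yield line.rstrip("\n").split("\t")[-1]


# Only the title containing U+FA2E is reported; "a" + U+08E4 is already in NFC.
print(find_non_nfc_titles(["\uFA2E", "\u90DE", "a\u08E4", "plain"]) == ["\uFA2E"])  # True
```

A title that merely contains one of the characters is not necessarily affected, which is why the sketch compares each candidate title against its NFC form instead of reporting every regex match.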

There's a script to fix page titles that are not in NFC or are otherwise invalid: cleanupTitles.php. Currently, however, it has several serious limitations. In particular, only the page table is updated. The script does not update inbound links (which could be fixed by blanking each page containing a broken link to the renamed page, then undoing the edit), outbound links (including category sort keys, which could be fixed by blanking the renamed page, then undoing the edit), the logging table, the archive table, the image/oldimage/filearchive tables (for files), extension tables, the summary and content of each stored revision, or anything outside the main DB. Some of these are also problems Wikimedia encountered with namespaceDupes.php, and they have not been completely addressed even there.

Since the number of pages affected by this update is so small, it may be possible to use API action=move to rename each page to some other title, specifying fromid instead of from (as mentioned in T87645#1039605), or use API action=delete and pageid to delete each page. That would still leave a log entry and corresponding null revision in the database for each old, non-normalized title, and as @tstarling notes in namespaceDupes.php, there may be parts of the code making the bad assumption that the old title is still a valid, normalized one.
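A rough sketch of the request parameters for that approach is shown below. It only builds the POST bodies and sends nothing. The parameter names (`fromid`, `to`, `pageid`, `reason`, `token`) are those of the MediaWiki action API, while the helper names, page ID, target title, and token value are placeholders.

```python
from urllib.parse import urlencode


def move_params(page_id, new_title, token, reason=""):
    """Parameters for API action=move, addressing the page by ID instead of title."""
    return {
        "action": "move",
        "fromid": page_id,  # used instead of "from", which would be renormalized
        "to": new_title,
        "reason": reason,
        "token": token,  # CSRF token obtained separately
        "format": "json",
    }


def delete_params(page_id, token, reason=""):
    """Parameters for API action=delete, addressing the page by ID instead of title."""
    return {
        "action": "delete",
        "pageid": page_id,
        "reason": reason,
        "token": token,
        "format": "json",
    }


# Placeholder values; the encoded string would be sent as a POST body to api.php.
body = urlencode(move_params(12345, "Placeholder title", "PLACEHOLDER_TOKEN"))
print("fromid=12345" in body)  # True
```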

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper. · May 30 2016, 3:34 PM
Aklapper triaged this task as Normal priority. · Jun 9 2016, 3:04 PM
Aklapper added projects: I18n, Regression.

Is this fixed? I can now load those URLs.

Still pseudo-redirected

Liuxinyu970226 added a comment. · Edited · Jun 28 2017, 4:22 AM

@Shizhao Hi, do you really believe that this affects all Wiktionaries? If so, could you please provide similar examples from smaller Wiktionaries? If not, why did you add this tag?

Restoring tag for now per my meta-wiki message from nemo bis

Amire80 moved this task from Untriaged to Unicode support on the I18n board. · Apr 2 2018, 11:59 AM