Two page titles on zhwiktionary inaccessible after ICU upgrade (to 52.1)
Closed, Resolved, Public

Assigned To

None

Authored By

	PleaseStand
	May 30 2016, 3:34 PM

Description

Each ICU update that adds data for new Unicode characters may affect NFC normalization, and thus MediaWiki title normalization. As stated in T86096, Wikimedia sites updated from ICU 4.8.1.1 to 52.1; these correspond to Unicode 6.0 and 6.3 respectively. According to the list-unicodeset utility on unicode.org (which uses ICU 57.1 and Unicode 8.0 data), 52 of the added characters are not "NFC inert", meaning each may decompose or interact with other characters during the normalization process (see ICU documentation and UAX#15, section 9.1 Stable Code Points). ICU 52.1 reports the same.

These 52 characters are, in PCRE regex format: [\x{8e4}-\x{8fe}\x{1bab}\x{1cf4}\x{a674}-\x{a67b}\x{a69f}\x{aaf6}\x{fa2e}\x{fa2f}\x{11100}-\x{11102}\x{11127}\x{11131}-\x{11134}\x{111c0}\x{116b6}\x{116b7}]. This is the same set I mentioned in T86096#2322920, plus U+11131 and U+11132.
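As a quick sanity check on the set above, here is a minimal Python sketch that expands the same ranges and counts them. The names `RANGES`, `AFFECTED`, `AFFECTED_RE` and `contains_affected` are made up for this illustration; only the code points themselves come from the character class above.

```python
import re

# Inclusive code point ranges copied from the PCRE character class above.
RANGES = [
    (0x8E4, 0x8FE), (0x1BAB, 0x1BAB), (0x1CF4, 0x1CF4),
    (0xA674, 0xA67B), (0xA69F, 0xA69F), (0xAAF6, 0xAAF6),
    (0xFA2E, 0xFA2F), (0x11100, 0x11102), (0x11127, 0x11127),
    (0x11131, 0x11134), (0x111C0, 0x111C0), (0x116B6, 0x116B7),
]

# Every individual character covered by the ranges.
AFFECTED = {chr(cp) for lo, hi in RANGES for cp in range(lo, hi + 1)}

# The same set as a Python regex character class.
AFFECTED_RE = re.compile(
    "[" + "".join(
        chr(lo) if lo == hi else f"{chr(lo)}-{chr(hi)}"
        for lo, hi in RANGES
    ) + "]"
)


def contains_affected(title):
    """Return True if the title contains at least one of the listed characters."""
    return AFFECTED_RE.search(title) is not None


print(len(AFFECTED))  # prints 52
```

Counting the expanded ranges gives 52, which matches the figure quoted in the description.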

If a page title has any of these characters, it might change when renormalized under the newer version of Unicode. Because MediaWiki normalizes user input, including titles specified in URLs, the page will thus become inaccessible in the normal way.

To check for pages that would become inaccessible, I downloaded all the "20160501-all-titles.gz" data dumps (for public wikis). There are 5 wikis that have at least one existing page whose title contains one or more of those 52 characters: enwiktionary, incubatorwiki, jawiktionary, mgwiktionary, and zhwiktionary. Most of these titles are in fact in NFC according to Unicode 6.3 as implemented by ICU 52.1. Two pages on zhwiktionary have titles that are not (see DB query):
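A check along these lines could be sketched in Python as follows. This is not the procedure actually used for the task: it assumes each dump line ends with the page title as its last tab-separated field, and it relies on Python's bundled Unicode data rather than ICU 52.1 (Unicode 6.3), so results for other characters may differ. The function names `non_nfc_titles` and `scan_dump` are hypothetical.

```python
import gzip
import re
import unicodedata

# The 52 characters from the description, as a Python character class.
AFFECTED_RE = re.compile(
    "[\u08e4-\u08fe\u1bab\u1cf4\ua674-\ua67b\ua69f\uaaf6\ufa2e\ufa2f"
    "\U00011100-\U00011102\U00011127\U00011131-\U00011134"
    "\U000111c0\U000116b6\U000116b7]"
)


def non_nfc_titles(lines):
    """Yield titles that contain an affected character and are not in NFC.

    Assumption: the title is the last tab-separated field of each line.
    """
    for line in lines:
        title = line.rstrip("\n").split("\t")[-1]
        if AFFECTED_RE.search(title) and unicodedata.normalize("NFC", title) != title:
            yield title


def scan_dump(path):
    """Scan one gzip-compressed titles dump and return the problem titles."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return list(non_nfc_titles(fh))
```

Titles that contain one of the characters but are already in NFC are skipped, which corresponds to the "most of these titles are in fact in NFC" case above.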

In both cases (U+FA2E and U+FA2F), the character is in the CJK Compatibility Ideographs block, the page was created by Sz-iwbot (operated by @Shizhao), and the normalized title already exists as a separate page (U+90DE and U+96B7). It is only possible to access these pages by specifying their IDs. Links in the interface point to the wrong page.
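The collision can be reproduced outside MediaWiki. The sketch below uses Python's `unicodedata` module as a stand-in for ICU, which assumes a Python build whose Unicode data already includes these two characters; `PAIRS` and `normalize_title` are illustrative names, not MediaWiki functions.

```python
import unicodedata

# Compatibility ideograph -> unified ideograph, as listed in the description.
PAIRS = {"\uFA2E": "\u90DE", "\uFA2F": "\u96B7"}


def normalize_title(title):
    """Apply NFC, roughly what happens to a title taken from user input."""
    return unicodedata.normalize("NFC", title)


for compat, unified in PAIRS.items():
    result = normalize_title(compat)
    print(f"U+{ord(compat):04X} -> U+{ord(result):04X}", result == unified)
```

Because the normalized form is a different, already existing title, any request made by title resolves to the other page, and only the page ID still identifies the original.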

There's a script to fix page titles that are not in NFC or otherwise are invalid: cleanupTitles.php. Currently, however, it has several serious limitations. In particular, only the page table is updated. The script does not update inbound links (which could be fixed by blanking each page containing a broken link to the renamed page, then undoing the edit), outbound links (including category sort keys, which could be fixed by blanking the renamed page, then undoing the edit), the logging table, the archive table, the image/oldimage/filearchive tables (for files), extension tables, the summary and content of each stored revision, or anything outside the main DB. Some of these are also problems that Wikimedia encountered with namespaceDupes.php, and they have not been completely addressed even there.

Since the number of pages affected by this update is so small, it may be possible to use API action=move to rename each page to some other title, specifying fromid instead of from (as mentioned in T87645#1039605), or use API action=delete and pageid to delete each page. That would still leave a log entry and corresponding null revision in the database for each old, non-normalized title, and as @tstarling notes in namespaceDupes.php, there may be parts of the code making the bad assumption that the old title is still a valid, normalized one.
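As a rough illustration of that approach, the sketch below only builds the request parameters and does not send anything. The endpoint URL, the token placeholder, the target title, and the helper names are assumptions made for this example; the page ID is the one from a URL quoted in a later comment on this task. A real request would need to be a POST from a logged-in session with a valid CSRF token, and `noredirect` only takes effect for users with the right to suppress redirects.

```python
from urllib.parse import urlencode

# Assumed endpoint for the wiki in question.
API_URL = "https://zh.wiktionary.org/w/api.php"


def move_by_id_params(page_id, new_title, token, reason):
    """Parameters for action=move that address the page by ID, not by title."""
    return {
        "action": "move",
        "fromid": str(page_id),  # used instead of "from", which would be renormalized
        "to": new_title,
        "reason": reason,
        "noredirect": "1",  # avoid leaving a redirect at the non-normalized title
        "token": token,
        "format": "json",
    }


def delete_by_id_params(page_id, token, reason):
    """Parameters for action=delete that address the page by ID."""
    return {
        "action": "delete",
        "pageid": str(page_id),
        "reason": reason,
        "token": token,
        "format": "json",
    }


# Hypothetical target title and placeholder token, for illustration only.
body = urlencode(move_by_id_params(1261527, "Example_target_title", "TOKEN", "T136561"))
print(API_URL, body)
```

Even with this approach, the caveat above about log entries and null revisions for the old title still applies.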

Related Objects

Status	Assigned	Task
Resolved	None	T95277 Erroneous Category:Category:Pages_with_script_errors
Resolved	Steinsplitter	T111594 Commons file stuck in category
Resolved	None	T87645 Existing pages without ability to reach and obviously wrong namespace
Duplicate	None	T109238 Clean up broken namespace pages across Wikimedia sites
Resolved	None	T136561 Two page titles on zhwiktionary inaccessible after ICU upgrade (to 52.1)
Resolved	matmarex	T195546 Run the maintenance script cleanupTitles.php on all wikis to rescue currently-inaccessible pages
Resolved	Pppery	T196088 Get cleanupTitles.php into a good enough state that we could run it in production

Event Timeline

PleaseStand created this task. May 30 2016, 3:34 PM

Restricted Application added subscribers: Zppix, Aklapper. May 30 2016, 3:34 PM

Aklapper triaged this task as Medium priority. Jun 9 2016, 3:04 PM

Aklapper added projects: I18n, Regression.

Liuxinyu970226 added a parent task: T125033: [DO NOT USE] Chinese Wikimedia projects (tracking) [superseded by #Chinese-Sites]. Jun 10 2016, 5:37 AM

Liuxinyu970226 subscribed.

Nemo_bis edited projects, added Jupyter-Hub; removed WMF-General-or-Unknown. Jul 25 2016, 8:42 AM

Nemo_bis edited projects, added Wikimedia-Language-setup; removed Jupyter-Hub.

Is this fixed? I can now load those URLs.

In T136561#2774950, @Nemo_bis wrote:

Is this fixed? I can now load those URLs.

Still pseudo-redirected

Aklapper added a project: Chinese-Sites. Dec 21 2016, 9:58 AM

Aklapper removed a parent task: T125033: [DO NOT USE] Chinese Wikimedia projects (tracking) [superseded by #Chinese-Sites]. Dec 21 2016, 10:03 AM

Shizhao added a project: All-and-every-Wiktionary. Dec 27 2016, 6:57 AM

@Shizhao Hi, are you sure that this affects all of All-and-every-Wiktionary? If so, could you please provide some similar examples from small Wiktionaries? If not, why did you add this tag?

Liuxinyu970226 removed a project: All-and-every-Wiktionary. Jun 28 2017, 4:22 AM

Restoring tag for now per my meta-wiki message from nemo bis

Amire80 moved this task from Untriaged to Unicode support on the I18n board. Apr 2 2018, 11:59 AM

Framawiki moved this task from Backlog to Language specific bug tracking on the All-and-every-Wiktionary board. Jul 15 2018, 9:56 PM

As MediaWiki has switched to ICU 57.1 (which is still way behind the current stable version of ICU), is this still the case?

In T136561#6280740, @VulpesVulpes825 wrote:

As MediaWiki has switched to ICU 57.1 (which is still way behind the current stable version of ICU), is this still the case?

The examples given are still affected (for example, go to https://zh.wiktionary.org/w/index.php?curid=1261527 and click on the 頁面 tab: it takes you to https://zh.wiktionary.org/w/index.php?curid=315030, an entirely different page), so I think that the problem hasn't been resolved yet.

Pppery added a subtask: T195546: Run the maintenance script cleanupTitles.php on all wikis to rescue currently-inaccessible pages. Jul 1 2024, 8:53 PM

Restricted Application added a subscriber: Stang. Jul 1 2024, 8:53 PM

matmarex added a parent task: T109238: Clean up broken namespace pages across Wikimedia sites. Aug 12 2024, 6:22 PM

Pppery closed this task as Resolved. Aug 19 2024, 6:25 PM

Pppery closed subtask T195546: Run the maintenance script cleanupTitles.php on all wikis to rescue currently-inaccessible pages as Resolved.

The previously inaccessible pages can now be found at https://zh.wiktionary.org/wiki/Special:PrefixIndex/T195546/ and should be moved to the correct titles or deleted.

Shizhao moved this task from MediaWiki core to Closed on the Chinese-Sites board. Aug 20 2024, 3:39 AM

Stang unsubscribed. Oct 24 2024, 3:25 AM
