
API is applying normalization that Title doesn't
Closed, DeclinedPublic

Description

Please see bug 38712. After a long debugging session I found out that the API silently normalizes page titles.

This confuses the heck out of things like LQT, which has logic like if (!$title->exists()) { /* call the API internally to create the page */ }

I know that LQT should be calling WikiPage->doEdit directly, but that is orthogonal to this bug.

I see a few options for fixing this:

  1. Do not normalize titles in API
  2. Throw an error if the title does not normalize to the same title as given
  3. Make the Title constructors normalize the title the same way

Given that the non-normalized title cannot be reached through normal web viewing (it resolves to the normalized title), solution 3 looks the most sensible.


Version: 1.21.x
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=27849
https://bugzilla.wikimedia.org/show_bug.cgi?id=33465

Details

Reference
bz45848

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 1:27 AM
bzimport set Reference to bz45848.
bzimport added a subscriber: Unknown Object (MLST).

Created attachment 11893
Script to illustrate the problem

Here is a script I used to test this.

matmarex claimed this task.
matmarex subscribed.

The normalization happens long before the API code is reached. All Unicode text stored by MediaWiki is required to be in Normalization Form C (NFC). The two titles used in your script:

$name1 = 'Test2%E2%80%93_%E0%A6%85%E0%A6%B8%E0%A6%AE%E0%A7%80%E0%A6%AF%E0%A6%BC%E0%A6%BE_%282%29';
$name2 = 'Test2%E2%80%93_%E0%A6%85%E0%A6%B8%E0%A6%AE%E0%A7%80%E0%A7%9F%E0%A6%BE_%282%29';

are actually the same title in different forms. Try putting the decoded forms into http://www.fontspace.com/unicode/analyzer/. (I think Phabricator also normalizes Unicode on input, so pasting the decoded versions here is pointless.)
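The equivalence is easy to verify with a short script. Python's unicodedata module is used here purely for illustration; MediaWiki does the equivalent in PHP via the UtfNormal library.

```python
import unicodedata
from urllib.parse import unquote

# The two percent-encoded titles from the attached script.
name1 = 'Test2%E2%80%93_%E0%A6%85%E0%A6%B8%E0%A6%AE%E0%A7%80%E0%A6%AF%E0%A6%BC%E0%A6%BE_%282%29'
name2 = 'Test2%E2%80%93_%E0%A6%85%E0%A6%B8%E0%A6%AE%E0%A7%80%E0%A7%9F%E0%A6%BE_%282%29'

t1 = unquote(name1)  # decodes the %XX sequences as UTF-8
t2 = unquote(name2)

print(t1 == t2)                                # False: different code points
print(unicodedata.normalize('NFC', t2) == t1)  # True: identical after NFC
```

So the API is not inventing a new title; it is returning the unique NFC form that both inputs map to.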

The first uses:

  • U+09AF Bengali Letter Ya
  • U+09BC Bengali Sign Nukta

The second uses:

  • U+09DF Bengali Letter Yya

The first is in NFC; the second is not in any normal form. According to http://www.scarfboy.com/coding/unicode-tool?s=U%2B09DF, U+09DF becomes U+09AF + U+09BC when converted to NFC.
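This can also be checked directly at the code-point level (again in Python, for illustration). U+09DF is on the Unicode composition-exclusion list, which is why NFC decomposes it and does not recompose it:

```python
import unicodedata

yya = '\u09DF'
print(unicodedata.name(yya))             # BENGALI LETTER YYA

# NFC decomposes U+09DF; composition exclusion prevents recomposition.
nfc = unicodedata.normalize('NFC', yya)
print([f'U+{ord(c):04X}' for c in nfc])  # ['U+09AF', 'U+09BC']
```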

You should always ensure that all input to your code is in NFC. Usually the WebRequest class will take care of this for you, but if you bypass it (evil you) or take the data from somewhere else (like file names from the filesystem), you need to convert it yourself. UtfNormal\Validator::cleanUp() seems to be the method for doing that.

T40712: The issue came from Names.php, which is part of MediaWiki core. It is not my job as an extension developer to normalize what MediaWiki provides to me.

An appropriate thing would be to track the issue in a more specific bug, no?

If non-NFC Unicode text is hardcoded anywhere in MediaWiki, then yes, that's a bug. It seems you already fixed that one there.