
API is applying normalization that Title doesn't
Closed, DeclinedPublic

Description

Please see bug 38712. After a long debugging session I found out that the API silently normalizes page titles.

This confuses the heck out of things like LQT, which has logic like if (!$title->exists()) { /* call the API internally to create the page */ }

I know that LQT should be calling WikiPage->doEdit directly, but that is orthogonal to this bug.

I see a few options for fixing this:

  1. Do not normalize titles in API
  2. Throw an error if the title does not normalize to the same title as given
  3. Make the Title constructors normalize the title the same way

Given that the non-normalized title cannot be reached through normal web viewing (it resolves to the normalized title), solution 3 looks the most sensible.


Version: 1.21.x
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=27849
https://bugzilla.wikimedia.org/show_bug.cgi?id=33465

Details

Reference
bz45848

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 1:27 AM
bzimport set Reference to bz45848.
bzimport added a subscriber: Unknown Object (MLST).

Created attachment 11893
Script to illustrate the problem

Here is a script I used to test this.

matmarex claimed this task.
matmarex subscribed.

The normalization happens long before the API code is reached. All Unicode text stored by MediaWiki is required to be in Normalization Form C (NFC). The two titles used in your script:

$name1 = 'Test2%E2%80%93_%E0%A6%85%E0%A6%B8%E0%A6%AE%E0%A7%80%E0%A6%AF%E0%A6%BC%E0%A6%BE_%282%29';
$name2 = 'Test2%E2%80%93_%E0%A6%85%E0%A6%B8%E0%A6%AE%E0%A7%80%E0%A7%9F%E0%A6%BE_%282%29';

are actually the same title in different forms. Try putting the decoded forms into http://www.fontspace.com/unicode/analyzer/. (I think Phabricator also normalizes Unicode on input, so pasting the decoded versions here is pointless.)
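The equivalence is easy to verify with a short script. Python's unicodedata module is used here purely for illustration; MediaWiki does the equivalent in PHP via the UtfNormal library.

```python
import unicodedata
from urllib.parse import unquote

# The two percent-encoded titles from the attached script.
name1 = 'Test2%E2%80%93_%E0%A6%85%E0%A6%B8%E0%A6%AE%E0%A7%80%E0%A6%AF%E0%A6%BC%E0%A6%BE_%282%29'
name2 = 'Test2%E2%80%93_%E0%A6%85%E0%A6%B8%E0%A6%AE%E0%A7%80%E0%A7%9F%E0%A6%BE_%282%29'

t1 = unquote(name1)  # decodes the %XX sequences as UTF-8
t2 = unquote(name2)

print(t1 == t2)                                # False: different code points
print(unicodedata.normalize('NFC', t2) == t1)  # True: identical after NFC
```

So the API is not inventing a new title; it is returning the unique NFC form that both inputs map to.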

The first uses:

  • U+09AF Bengali Letter Ya
  • U+09BC Bengali Sign Nukta

The second uses:

  • U+09DF Bengali Letter Yya

The first is in NFC; the second is not in any normal form. According to http://www.scarfboy.com/coding/unicode-tool?s=U%2B09DF, U+09DF becomes U+09AF + U+09BC when converted to NFC.
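This can also be checked directly at the code-point level (again in Python, for illustration). U+09DF is on the Unicode composition-exclusion list, which is why NFC decomposes it and does not recompose it:

```python
import unicodedata

yya = '\u09DF'
print(unicodedata.name(yya))             # BENGALI LETTER YYA

# NFC decomposes U+09DF; composition exclusion prevents recomposition.
nfc = unicodedata.normalize('NFC', yya)
print([f'U+{ord(c):04X}' for c in nfc])  # ['U+09AF', 'U+09BC']
```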

You should always ensure that all input to your code is in NFC. Usually the WebRequest class will take care of this for you, but if you bypass it (evil you) or take the data from somewhere else (like file names from the filesystem), you need to convert it yourself. UtfNormal\Validator::cleanUp() seems to be the method for doing that.

T40712: The issue came from Names.php, which is part of MediaWiki core. It is not my job as an extension developer to normalize what MediaWiki provides to me.

An appropriate thing would be to track the issue in a more specific bug, no?

If non-NFC Unicode text is hardcoded anywhere in MediaWiki, then yes, that's a bug. It seems you already fixed that one there.