
U+200B ZERO WIDTH SPACE allowed in page titles
Open, Medium, Public

Description

Related bugs: T5969: Unicode (UTF-8, utf8) compatibility (tracking); T16600, T7732; T4593, T3524 (regarding usernames)

A bug in the pywikipedia framework [1] showed up when editing interwikis. This caused bot wars [2], which have been fixed by removing the U+200B ZERO WIDTH SPACE from the end of the page title where the problems happened [3].

Should this character be allowed in page titles? And, more specifically, at the end of a page title?

Characters from the range U+2000-U+200A are already treated as spaces (and replaced by underscores). Since Unicode 4.0, U+200B is no longer considered whitespace by the Unicode Consortium.
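
The distinction can be checked from Python 3 with the standard `unicodedata` module. This is only a sketch: the results reflect whichever Unicode version the interpreter bundles, not MediaWiki's own title handling.

```python
import unicodedata

# General category of each code point in U+2000..U+200B, as reported by
# the Unicode data bundled with this Python 3 interpreter.
categories = {cp: unicodedata.category(chr(cp)) for cp in range(0x2000, 0x200C)}

for cp, cat in categories.items():
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: category={cat}, isspace={ch.isspace()}")
```

On a current interpreter, U+2000 through U+200A should report the space-separator category `Zs`, while U+200B reports the format category `Cf`.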

To cite Brion Vibber in T16600:

They're not technically illegal, but perhaps should be excluded as they
wouldn't be useful.

and in T3524:

*Invalid* characters (those that are illegal in XML or don't reliably cut and
paste) need to be outright blocked in titles.

Although U+200B ZERO WIDTH SPACE seems to cut-and-paste on Windows, it's not something I'd call 'reliable': when selecting characters from left to right, it's easy to miss the U+200B ZERO WIDTH SPACE at the end of the page title. As such, I think it's reasonable not to allow the character, or to replace it with an underscore.

[1] https://sourceforge.net/tracker/?func=detail&atid=603138&aid=3182761&group_id=93107
[2] http://en.wikipedia.org/w/index.php?title=Podolsk&action=history
[3] http://bo.wikipedia.org/w/index.php?title=%E0%BD%94%E0%BD%BC%E0%BC%8B%E0%BD%91%E0%BD%BC%E0%BD%A3%E0%BC%8B%E0%BD%A6%E0%BD%B2%E0%BD%82&action=history


Version: unspecified
Severity: normal
See Also:
T16600: Illegal Unicode characters are allowed in pages
T3524: Usernames should use unicode whitelist
T4593: Non-printing characters allowed in registration
T7732: MediaWiki allows characters in the U+0080 to U+009F range
T44807: Invisible Unicode characters allowed on pagetitle (\u200E | \uFEFF | \u200B)
T57227: interwiki problems in km wikipedia
T57246: Problem with 0x200B ZERO WIDTH SPACE in page titles
T34717: Question: Bidi overrides and Unicode spaces removal from titles: why not zero-width space and horizontal tab?

Details

Reference
bz27446

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 11:19 PM
bzimport set Reference to bz27446.
bzimport added a subscriber: Unknown Object (MLST).

To clarify: the pywikipedia bug was caused by calling .strip() on the page title. With Unicode data from before 4.0 (Python < 2.7), this strips the U+200B character; with Unicode data newer than 4.0 (Python >= 2.7), it does *not*.
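
A minimal illustration of the newer behaviour, assuming a Python 3 interpreter. The title "Podolsk" is only an illustrative value taken from link [2], not necessarily the title that triggered the bug.

```python
# Trailing ZERO WIDTH SPACE, written as an escape so it is visible in source.
title = "Podolsk\u200b"

# On Python 3, U+200B is not whitespace, so a plain strip() leaves it in place.
stripped = title.strip()

# Naming the character explicitly removes it regardless of Unicode tables.
explicit = title.strip().strip("\u200b")

print(len(title), len(stripped), len(explicit))
```

Code that relies on a bare `.strip()` therefore produces different titles depending on the interpreter, which matches the kind of mismatch described above.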

I don't _think_ it should be legit to have this char at beginning/end of a title, as zero-width space is meant to be used as a separator to disable ligatures etc, and only makes sense within a span of non-whitespace characters.

In the middle of words, it may actually be required for some languages.

Correct behavior is _probably_ to strip this char from beginning/end during title normalization, while preserving it in the middle of words. Once that's done, a run of cleanupTitle & co should pretty transparently correct existing titles.

I would recommend double-checking the definitions and actual usage to make sure these assumptions are correct before changing rules, however.
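
The proposed rule (trim at the ends, keep in the middle) could be sketched as follows. `trim_title` is a hypothetical helper written in Python for illustration; it is not MediaWiki's title normalization code.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE


def trim_title(title: str) -> str:
    """Hypothetical helper: drop whitespace and U+200B from both ends only."""
    previous = None
    # Repeat until stable, so mixed runs such as " \u200b " are fully removed.
    while previous != title:
        previous = title
        title = title.strip().strip(ZWSP)
    return title


# Interior U+200B is preserved; leading and trailing ones are removed.
print(ascii(trim_title("\u200bFoo\u200bBar\u200b ")))
```

The loop is there because whitespace and U+200B can alternate at the ends, and a single pass of each strip would leave some of them behind.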

A user reported the existence of these two pages on Portuguese Wikipedia:
https://pt.wikipedia.org/wiki/Coming_Out_of_the_Dark
https://pt.wikipedia.org/wiki/Coming_Out_%E2%80%8B%E2%80%8Bof_the_Dark
which appear on lists such as
https://pt.wikipedia.org/wiki/Special:PrefixIndex/Coming_Out?uselang=en
as if they were two identically named pages.
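
To see how the two titles differ, the URL paths can be decoded and the invisible characters made visible. This is a small Python 3 sketch; `reveal` is a hypothetical helper, not an existing tool.

```python
import unicodedata
from urllib.parse import unquote

plain = unquote("Coming_Out_of_the_Dark")
hidden = unquote("Coming_Out_%E2%80%8B%E2%80%8Bof_the_Dark")


def reveal(title: str) -> str:
    """Hypothetical helper: show format characters (category Cf) as <U+XXXX>."""
    return "".join(
        f"<U+{ord(ch):04X}>" if unicodedata.category(ch) == "Cf" else ch
        for ch in title
    )


print(reveal(plain))
print(reveal(hidden))
```

The second title contains two consecutive U+200B characters after "Coming_Out_", which is why the titles render identically but are stored as distinct pages.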

Fortunately, this seems to be the only title where this character was used (at least on ptwiki):
http://tools.wmflabs.org/addshore/grep/?pattern=%E2%80%8B%E2%80%8B&lang=pt&wiki=wiki&ns=0

Just had a U+200B in a filename on my third party wiki, and I'd like to prevent that in the future, because it caused a bit of confusion and does not seem to serve any purpose.

But since this seems to be one of those endless-saga bugs that never get addressed, is there an easy way for me, as server admin of said third-party wiki, to exclude this (and other) invisible characters from file names (and page names too)?

edit: To answer my own question - apparently this is addressed by the Title-Blacklist on Wikipedia:

https://en.wikipedia.org/wiki/MediaWiki:Titleblacklist

specifically the line

.*[\x{00A0}\x{1680}\x{180E}\x{2000}-\x{200B}\x{2028}\x{2029}\x{202F}\x{205F}\x{3000}].*
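
To check candidate titles against that character class outside the wiki, the class can be translated into Python's `re` syntax. This is a sketch only: how the extension itself applies the line (flags, anchoring, namespaces) is not covered here, and `is_blocked` is a hypothetical name.

```python
import re

# Python spelling of the character class quoted above (\x{...} becomes \u....).
# The surrounding .* are dropped because search() already matches anywhere.
INVISIBLE_SPACE = re.compile(
    r"[\u00A0\u1680\u180E\u2000-\u200B\u2028\u2029\u202F\u205F\u3000]"
)


def is_blocked(title: str) -> bool:
    """Hypothetical check: True if the title contains any listed character."""
    return INVISIBLE_SPACE.search(title) is not None


print(is_blocked("Example\u200b.png"), is_blocked("Example.png"))
```

Note that the class covers U+200B but not the other characters named in T44807 (U+200E and U+FEFF), so those would need their own entries.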