U+200B ZERO WIDTH SPACE allowed in page titles
Related bugs: T5969: Unicode (UTF-8, utf8) compatibility (tracking); T16600, T7732; T4593, T3524 (regarding usernames)

A bug in the pywikipedia framework [1] showed up when editing interwikis. This caused bot wars [2], which have been fixed by removing the U+200B ZERO WIDTH SPACE from the end of the page title where the problems occurred [3].

Should this character be allowed in page titles? And, more specifically, at the end of a page title?

Characters from the range U+2000-U+200A are already treated as spaces (and replaced by underscores). Since Unicode 4.0, U+200B is no longer considered whitespace by the Unicode Consortium.
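For reference, the difference in classification can be checked from Python 3's bundled Unicode database. This is only a quick sketch; `describe` is a made-up helper name, and the result reflects the Unicode data shipped with the interpreter, not MediaWiki's own title handling.

```python
import unicodedata

def describe(codepoint):
    """Return (general category, str.isspace() result) for a code point."""
    ch = chr(codepoint)
    return unicodedata.category(ch), ch.isspace()

# U+2000..U+200A: the range that is already treated as spaces.
space_range = [describe(cp) for cp in range(0x2000, 0x200B)]

# U+200B ZERO WIDTH SPACE: a format character, not a space separator.
zwsp = describe(0x200B)

print(space_range[0], zwsp)
```

On a current Python 3, the whole U+2000-U+200A range reports category `Zs` (space separator), while U+200B reports `Cf` (format) and is not considered whitespace by `str.isspace()`.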

To cite Brion Vibber in T16600:

They're not technically illegal, but perhaps should be excluded as they
wouldn't be useful.

and in T3524:

*Invalid* characters (those that are illegal in XML or don't reliably cut and
paste) need to be outright blocked in titles.

Although U+200B ZERO WIDTH SPACE seems to cut-and-paste on Windows, it's not something I'd call 'reliable': when selecting characters from left to right, it's easy to miss the U+200B ZERO WIDTH SPACE at the end of the page title. As such, I think it's reasonable to disallow the character, or to replace it with an underscore.


To clarify: the pywikipedia bug was caused by calling .strip() on the page title. With Unicode < 4.0 data (python < 2.7), this strips the U+200B character; with Unicode >= 4.0 data (python >= 2.7), it does *not* strip it.
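A minimal illustration of the newer behaviour, assuming Python 3 (the older behaviour described above is not reproduced here). The title "Example" is made up for the demonstration.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

# A title that looks like "Example" but carries a trailing ZWSP.
title = "Example" + ZWSP

# Default strip() only removes characters Python considers whitespace,
# so the trailing U+200B survives.
stripped_default = title.strip()

# Naming the character explicitly removes it as well.
stripped_explicit = title.strip(" \t\n" + ZWSP)

print(len(title), len(stripped_default), len(stripped_explicit))
```

A bot that compares the stripped title against the stored one will therefore see a mismatch or a match depending on which interpreter it runs under, which is consistent with the bot wars described above.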

I don't _think_ it should be legit to have this char at beginning/end of a title, as zero-width space is meant to be used as a separator to disable ligatures etc, and only makes sense within a span of non-whitespace characters.

In the middle of words, it may actually be required for some languages.

Correct behavior is _probably_ to strip this char from beginning/end during title normalization, while preserving it in the middle of words. Once that's done, a run of cleanupTitle & co should pretty transparently correct existing titles.
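The proposed rule could be sketched roughly as follows. This is Python for illustration only, not MediaWiki's actual normalization code; `normalize_title` and the regular expression are assumptions made for the example.

```python
import re

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

# Match runs of whitespace and/or ZWSP at the very start or very end only.
_EDGE_RE = re.compile(r"^[\s\u200b]+|[\s\u200b]+$")

def normalize_title(title):
    """Hypothetical helper: trim whitespace and U+200B at the edges,
    leaving any U+200B inside the title untouched."""
    return _EDGE_RE.sub("", title)

# Edge occurrences are removed, the interior one is preserved.
print(repr(normalize_title(ZWSP + "Foo" + ZWSP + "Bar" + ZWSP)))
```

With a rule like this, two titles that differ only by a trailing U+200B normalize to the same string, while a U+200B used as a separator inside a word is kept.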

I would recommend double-checking the definitions and actual usage to make sure these assumptions are correct before changing rules, however.

A user reported the existence of these two pages on Portuguese Wikipedia:
which appear on lists such as
as if they were two identically named pages.

Fortunately, this seems to be the only title where this character was used (at least on ptwiki):

Just had a U+200B in a filename on my third party wiki, and I'd like to prevent that in the future, because it caused a bit of confusion and does not seem to serve any purpose.

But since this seems to be one of those endless-saga-bugs that never get addressed, is there an easy way for me, as server admin of said third party wiki, to exclude this (and other) invisible characters from filenames (and page names too)?

edit: To answer my own question - apparently this is addressed by the Title-Blacklist on Wikipedia:

specifically the line