
Username of all whitespaces in German Wikipedia dump file
Closed, Resolved, Public

Description

Author: triddle

Description:
A username consisting of all spaces made its way into the German Wikipedia dump file. The article it
happened on is at http://de.wikipedia.org/w/index.php?title=Negativ-Positiv_Verfahren&action=history

Since the username field is not marked as space-preserving, Parse::MediaWikiDump completely ignored
its contents in this case. I have a feeling that a username consisting only of spaces is not supposed to be allowed to exist.

Tyler


Version: unspecified
Severity: normal
URL: http://de.wikipedia.org/w/index.php?title=Negativ-Positiv_Verfahren&action=history

Details

Reference
bz4312

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 9:00 PM
bzimport set Reference to bz4312.
bzimport added a subscriber: Unknown Object (MLST).

gangleri wrote:

Hello!

If you go to
http://de.wikipedia.org/w/index.php?title=Negativ-Positiv_Verfahren&action=history
and click on the "space" link, you will come to
http://de.wikipedia.org/wiki/Benutzer_Diskussion:%C2%A0
and from there to
http://de.wikipedia.org/wiki/Spezial:Contributions/%C2%A0
Either no email address is specified or emails from other users are disabled.

The problem has been known since August; see
http://de.wikipedia.org/wiki/Benutzer_Diskussion:%C2%A0

The user name contains
Unicode character 'NO-BREAK SPACE' (U+00A0)
http://www.fileformat.info/info/unicode/char/00a0/index.htm
HTML entity: &#160; (decimal), &#xA0; (hex), &nbsp; (named)
UTF-8 (hex): 0xC2 0xA0 (c2a0); URL-encoded: %c2%a0 or %C2%A0
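
The encodings listed above can be checked with a short Python sketch. It uses only the standard library; the helper name `describe` is made up for this illustration and is not part of MediaWiki or the dump tools.

```python
import unicodedata
from urllib.parse import quote, unquote

NBSP = "\u00a0"  # the character the user name consists of


def describe(ch):
    """Return the common encodings of a single character (hypothetical helper)."""
    return {
        "name": unicodedata.name(ch),
        "code_point": f"U+{ord(ch):04X}",
        "utf8_hex": ch.encode("utf-8").hex(),
        "percent_encoded": quote(ch),      # UTF-8 bytes, percent-encoded
        "html_decimal": f"&#{ord(ch)};",
        "html_hex": f"&#x{ord(ch):X};",
    }


info = describe(NBSP)
for key, value in info.items():
    print(f"{key}: {value}")

# Percent-decoding is case-insensitive, so both URL forms name the same page.
print(unquote("%c2%a0") == unquote("%C2%A0") == NBSP)
```

This is why the links in this comment all end in `%C2%A0`: the page title is the single character U+00A0, percent-encoded as its two UTF-8 bytes.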

http://en.wikipedia.org/wiki/User:%C2%A0
is known already from
http://bugzilla.wikimedia.org/show_bug.cgi?id=1524#c9

Changing the name would be an administrative task, either at WP:DE or, better, at
all projects. I do not know the policy on this. Please clarify this at the
local wiki, via a mailing list such as [Wikide-l] or [Wikitech-l], or via IRC at
irc://irc.freenode.net/mediawiki .

Marking this bug as a duplicate of
bug 1524: usernames should use unicode whitelist

http://fr.wikipedia.org/wiki/%C2%A0 is mentioned at
bug 2173 comment 3
bug 2173: Fatal error when removing an article with an whitespace title from the
watchlist

best regards reinhardt [[user:gangleri]]

*** This bug has been marked as a duplicate of 1524 ***

avarab wrote:

This isn't a duplicate of bug 1524. That bug deals with having a whitelist for
registered usernames, but this particular username also happens to break the XML
schema.

gangleri wrote:

Thanks, Ævar! I did not read the second paragraph as carefully as I should
have. Please look at what happens at
http://en.wikipedia.org/wiki/User:%C2%A0
and
http://fr.wikipedia.org/wiki/%C2%A0

Please change the summary to reflect the new / major problem. Thanks in
advance!

I don't understand. Does this really break dumps?

Also wondering. How exactly can one reproduce that it "breaks dumps"?

triddle wrote:

If the XML schema indicates that data is not white-space preserving, then white space is not significant and there is no difference between " ", "  ", "   ", "\t\n\n\n\t\t\t\t\t\t\t\t\t \n\n\n", etc.

If a user name exists in which white space is significant, it becomes impossible to transmit it using a non-space-preserving data type. Thus it's not actually possible to get the user names correctly, and this is rather broken.
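
To make the argument above concrete, here is a minimal Python sketch of XSD-style "collapse" normalization. It assumes the consumer treats the username field as non-space-preserving; the function name `collapse` and the sample list are illustrative and are not taken from Parse::MediaWikiDump or the MediaWiki export code.

```python
import re

# The four characters XML itself defines as white space: space, tab, LF, CR.
XML_WS = re.compile(r"[ \t\n\r]+")


def collapse(value: str) -> str:
    """Replace each run of XML white space with one space, then trim the ends."""
    return XML_WS.sub(" ", value).strip(" ")


samples = [" ", "  ", "   ", "\t\n\n\n\t\t\t\t\t\t\t\t\t \n\n\n"]
collapsed = {collapse(s) for s in samples}
print(collapsed)  # every all-white-space value ends up as the same empty string

# U+00A0 is not one of the four XML white space characters, so this
# normalization leaves it alone...
print(repr(collapse("\u00a0")))
# ...but a broader, Unicode-aware trim (Python's str.strip() here) removes it.
print(repr("\u00a0".strip()))
```

Once values are normalized this way, a name made only of white space cannot be told apart from an empty field. Which definition of white space a given parser applies is an assumption here; the thread does not say how the Perl module classified U+00A0.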

FriedhelmW claimed this task.