
GWToolset should assume non unicode characters are windows-1252 not iso 8859-1
Closed, Invalid (Public)

Description

Not sure if this is really GWToolset's fault, but in any case: I had a file with (apparently) invisible characters, most likely a bad encoding on the GLAM side of the character “œ” in the word “chœur”.

For information, here was the original CSV line:
APMH00004270;MH0004270;Ile-de-France;93;Saint-Denis;93066;Basilique Saint-Denis;;Stalles du chur;;Mieusement, Médéric (photographe);;;;;Négatif;PA00079952;Ministère de la Culture (France) - Médiathèque de l'architecture et du patrimoine - diffusion RMN;http://www.culture.gouv.fr/Wave/image/memoire/0403/sap01_mh004270_v.jpg;http://data.iledefrance.fr/api/datasets/1.0/photographies-serie-monuments-historiques-1851-a-1914/images/c5745a81dcc784be3affbd50dcd5c526/;48.9354612, 2.3598354;http://www.culture.gouv.fr/Wave/image/memoire/0403/sap01_mh004270_p.jpg

And the XML:

<commons_title>Basilique_Saint-Denis_-_Stalles_du_chur_-_Saint-Denis_-_Médiathèque_de_l'architecture_et_du_patrimoine_-_APMH00004270.jpg</commons_title>

GWToolset uploaded the file, with that character in it, under the following title: “File:Basilique Saint-Denis - Stalles du chœur - Saint-Denis - Médiathèque de l'architecture et du patrimoine - APMH00004270.jpg.jpg”

The character « œ » is not displayed by MediaWiki (at least not in my browser/encoding/etc.)

https://commons.wikimedia.org/w/index.php?title=File:Basilique_Saint-Denis_-_Stalles_du_ch%C2%9Cur_-_Saint-Denis_-_M%C3%A9diath%C3%A8que_de_l%27architecture_et_du_patrimoine_-_APMH00004270.jpg.jpg&redirect=no

Maybe GWToolset should intercept that?


Version: unspecified
Severity: normal

Details

Reference
bz68724

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:26 AM
bzimport set Reference to bz68724.
bzimport added a subscriber: Unknown Object (MLST).

Ok. What happened is that the data was originally in a character set called windows-1252. In that character set "œ" is encoded as 0x9C. Somewhere along the line, it got converted to utf-8, but during the conversion process it was assumed that the original data was in a character set called iso-8859-1. That character set uses 0x9C to mean "STRING TERMINATOR", which is an invisible character.

So the end result is that the image had a title in MW containing 0xC2 0x9C, which is the UTF-8 encoding of "STRING TERMINATOR", instead of 0xC5 0x93, which is the UTF-8 encoding of LATIN SMALL LIGATURE OE.
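To make the byte values above concrete, here is a small Python sketch of the two readings of 0x9C. It only illustrates the character-set mismatch; it says nothing about which tool in the pipeline actually did the conversion. The variable names are made up for the example.

```python
# The single byte that windows-1252 uses for "œ".
raw = b"\x9c"

# Intended reading: windows-1252 (Python codec name "cp1252").
as_cp1252 = raw.decode("cp1252")
# Mistaken reading: iso-8859-1 (Python codec name "latin-1").
as_latin1 = raw.decode("latin-1")

# Re-encode both results as UTF-8 and show the resulting bytes in hex.
print(repr(as_cp1252), as_cp1252.encode("utf-8").hex())  # 'œ' c593
print(repr(as_latin1), as_latin1.encode("utf-8").hex())   # '\x9c' c29c
```

The second line of output is the invisible character that ended up in the file title (the %C2%9C in the URL above).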


It's hard to tell at what step the error occurred. If the conversion error happened in the csv->xml transformation then it's not GWToolset's fault. If the error occurred in the xml->upload step, then it would be. Could you maybe upload the relevant csv and xml files as attachments? (Copying and pasting into bugzilla comments messes with the encoding.)

p.s. FWIW, C0 and C1 control characters including "STRING TERMINATOR" are valid title characters (Although on commons they are blacklisted via title blacklist)

Actually I strongly suspect that it would be an issue in the csv->xml conversion and not gwtoolset, since having a raw 0x9C in the xml file would make the xml file invalid.

(In reply to Bawolff (Brian Wolff) from comment #3)

Actually I strongly suspect that it would be an issue in the csv->xml
conversion and not gwtoolset, since having a raw 0x9C in the xml file would
make the xml file invalid.

Indeed, I loaded the CSV as UTF-8 (as I always do), using Python codecs.open(csv_file, 'r', 'utf-8'). Never suspected the CSV might be in windows-1252.
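One defensive option on the CSV-loading side, sketched here under the assumption that the input is either valid UTF-8 or windows-1252, is to try UTF-8 first and fall back. `decode_with_fallback` is a hypothetical helper, not part of GWToolset or of the script used for this batch. It only helps when the file on disk really contains windows-1252 bytes; a file that was already mis-converted upstream is valid UTF-8 and passes through unchanged.

```python
def decode_with_fallback(raw: bytes) -> str:
    """Decode as UTF-8 if the bytes are valid UTF-8, else as windows-1252.

    Hypothetical helper for illustration only.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # A lone 0x9C is not valid UTF-8, so windows-1252 input lands here.
        # Bytes that windows-1252 leaves undefined become U+FFFD.
        return raw.decode("cp1252", errors="replace")


# Usage: read the file in binary mode, then decode.
# with open(csv_file, "rb") as f:
#     text = decode_with_fallback(f.read())
```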

If "STRING TERMINATOR" is valid then I suppose all is fine. :) Marking as INVALID.

(In reply to Bawolff (Brian Wolff) from comment #2)

p.s. FWIW, C0 and C1 control characters including "STRING TERMINATOR" are
valid title characters (Although on commons they are blacklisted via title
blacklist)

You mean GWToolset ignores the title blacklist? That sounds bad.

(I noticed this error because the bot I fired to rename all the images of this batch choked on these 10 files with "STRING TERMINATOR" with an APIError. Not sure if the fault lies with Pywikibot, the MediaWiki API or something else, but such file titles are definitely a problem.)

(In reply to Jean-Fred from comment #5)

Yes. The 0x9C should be blocked by the
.*\p{Cc}.* <casesensitive|errmsg=titleblacklist-custom-hidden-char> # Control characters

rule. While such characters may technically be valid title characters according to MediaWiki, there is really no good reason to ever use them, almost to the point where one might want to assume that things were converted wrong and automatically try to re-convert as if the data were windows-1252.
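As a rough illustration of that idea (not GWToolset code, and not how the title blacklist is implemented), the following Python sketch detects control characters using the same Unicode category that `\p{Cc}` matches, then reinterprets any C1 control character as the windows-1252 character with the same byte value. Both function names are hypothetical.

```python
import unicodedata


def has_control_chars(title: str) -> bool:
    """True if the title contains any character in Unicode category Cc."""
    return any(unicodedata.category(ch) == "Cc" for ch in title)


def repair_c1_as_cp1252(title: str) -> str:
    """Replace C1 controls (U+0080..U+009F) by their windows-1252 meaning."""
    out = []
    for ch in title:
        if "\x80" <= ch <= "\x9f":
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                # A few byte values are undefined in windows-1252;
                # leave those characters untouched.
                pass
        out.append(ch)
    return "".join(out)


bad = "Stalles du ch\x9cur"
print(has_control_chars(bad))          # True
print(repair_c1_as_cp1252(bad))        # Stalles du chœur
```

Working character by character, rather than re-encoding the whole string as iso-8859-1, keeps the sketch safe for titles that also contain characters outside that character set.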

[As an off-topic aside, Commons also blocks all astral characters (mostly dead languages and emoticons, but also a bunch of Chinese-Japanese-Korean characters), which seems a tad restrictive for a multilingual project of the scope that Commons is...]

I kind of changed my mind about this. See bug 69236