Loosen GWToolset's file name restrictions (parentheses, apostrophes, ampersands, etc.)
Closed, Resolved
Public

Description

GWToolset is uploading a lot of files with names like https://commons.wikimedia.org/w/index.php?title=File:The_King_of_Hungary_holding_council_in_his_tent_on_the_battlefield_-_Froissart--39-s_Chronicles_-Volume_IV-_part_2-_-1470-1475--_f.84_-_BL_Harley_MS_4380.jpg&redirect=no . The proper name is [[commons:File:The King of Hungary holding council in his tent on the battlefield - Froissart's Chronicles (Volume IV, part 2) (1470-1475), f.84 - BL Harley MS 4380.jpg]] (and the file has since been renamed to that).

Notice how characters such as ), (, ', and & are being stripped and replaced with '-'. This is wrong: those characters are perfectly valid in a title.

Even worse, characters like the apostrophe (') are being converted to their HTML entity "&#39;", with &, # and ; then being replaced with dashes, resulting in "--39-". This is wrong: HTML entities in titles should be converted to the character they represent, and that character should then be handled as appropriate (as is done for normal titles).
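
The mangling described above can be reproduced with a short Python sketch. This is a hypothetical reconstruction, not GWToolset's code: it assumes the apostrophe is entity-encoded first and that every character on the blacklist quoted later in this thread is then replaced with a dash. The name `buggy_title` is made up for illustration.

```python
# Blacklist as quoted from the GLAMtools email later in this thread.
BLACKLIST = set('#<>[]|{}:¬`!"£$^&*()+=~?,;\'@')


def buggy_title(title: str) -> str:
    """Hypothetical reconstruction of the reported behaviour (not GWToolset code)."""
    # Assumed step 1: the apostrophe is turned into its numeric HTML entity.
    encoded = title.replace("'", "&#39;")
    # Assumed step 2: every blacklisted character is replaced with a dash,
    # which also destroys the '&', '#' and ';' of the entity.
    dashed = "".join("-" if ch in BLACKLIST else ch for ch in encoded)
    # Spaces appear as underscores in the URL form of a title.
    return dashed.replace(" ", "_")


print(buggy_title("Froissart's Chronicles (Volume IV, part 2) (1470-1475), f.84"))
```

Under those assumptions the output matches the corresponding fragment of the uploaded file name, including the "--39-" sequence.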

See: https://commons.wikimedia.org/wiki/Commons:Village_pump/Archive/2014/03#Renaming_multiple_files.3F


Version: unspecified
Severity: normal

Details

bzimport added a subscriber: Unknown Object (MLST).
bzimport set Reference to bz62909.
Bawolff created this task. Mar 21 2014, 2:17 AM

(In reply to MZMcBride from comment #1)

Related:
http://lists.wikimedia.org/pipermail/glamtools/2014-March/000035.html

And from the email:

'#','<','>','[',']','|','{','}',':','¬','`','!','"','£','$','^','&','*','(',')','+','=','~','?',',',';',"'",'@'

Many of these characters are very common in file names (apostrophes, parentheses) and are absolutely allowed, both socially and technically.

I think that GWToolset should simply follow $wgIllegalFileChars and the things that Title::secureAndSplit blocks (to be specific, only blacklist '#','<','>','[',']','|','{','}', and ':'). If there really is a need to blacklist additional characters for social reasons (I'm not convinced there is), then the blacklist should be configurable on-wiki as a MediaWiki: namespace message, since social conventions change over time.
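
To make the size of the difference concrete, here is a small Python sketch comparing the two character sets. The first set is the list from the email above, the second is the nine characters named in this comment. Whether those nine exactly match what MediaWiki rejects is taken from the comment, not verified here.

```python
# Characters GWToolset replaced, per the email quoted above.
GWTOOLSET_BLOCKED = set('#<>[]|{}:¬`!"£$^&*()+=~?,;\'@')

# Characters this comment proposes to keep blocking.
PROPOSED_BLOCKED = set('#<>[]|{}:')

# Characters that would become usable in titles under the proposal.
needlessly_blocked = GWTOOLSET_BLOCKED - PROPOSED_BLOCKED

print(len(GWTOOLSET_BLOCKED), len(PROPOSED_BLOCKED), len(needlessly_blocked))
print("".join(sorted(needlessly_blocked)))
```

The characters freed up include the apostrophe, parentheses, comma and ampersand from the example file name in the description.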

Sorry, to be more specific (because I got questions): GWToolset should use the built-in function wfStripIllegalFilenameChars instead of trying to re-implement the title validation rules in Utils::stripIllegalTitleChars.

This bug is also about html entities, so the full process for normalizing the title should be:

  1. Run through Sanitizer::decodeCharReferences()
  2. Run through wfStripIllegalFilenameChars()
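
The two steps above can be sketched in Python as follows. This is only an illustration of the ordering, not MediaWiki code: `html.unescape` stands in for Sanitizer::decodeCharReferences(), and `strip_illegal_filename_chars` is a hypothetical stand-in for wfStripIllegalFilenameChars() that assumes only the nine characters named earlier in this thread are replaced with '-'.

```python
import html

# Assumption: only the nine characters named earlier in the thread are illegal.
ILLEGAL = set('#<>[]|{}:')


def strip_illegal_filename_chars(name: str) -> str:
    """Hypothetical stand-in for wfStripIllegalFilenameChars()."""
    return "".join("-" if ch in ILLEGAL else ch for ch in name)


def normalize_title(raw: str) -> str:
    # Step 1: decode character references so "&#39;" becomes "'".
    decoded = html.unescape(raw)
    # Step 2: replace only the characters that are actually illegal.
    return strip_illegal_filename_chars(decoded)


print(normalize_title("Froissart&#39;s Chronicles (Volume IV, part 2)"))
```

The order matters in this sketch: stripping first would replace the '#' inside "&#39;", leaving a sequence that is no longer a valid character reference and therefore cannot be decoded.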

Working on a patch.

Change 121094 had a related patch set uploaded by Dan-nl:
relax wiki title restrictions

https://gerrit.wikimedia.org/r/121094

Change 121094 merged by jenkins-bot:
relax wiki title restrictions

https://gerrit.wikimedia.org/r/121094

The fix for this issue is scheduled to be deployed on commons on Tuesday, 8 April 2014

Change 125401 had a related patch set uploaded by Dan-nl:
wfStripIllegalFilenameChars truncates title

https://gerrit.wikimedia.org/r/125401

Change 125401 merged by jenkins-bot:
wfStripIllegalFilenameChars truncates title

https://gerrit.wikimedia.org/r/125401

Things seem to me to work better now, but automatically transforming titles that contain illegal characters is IMO not a good approach. The reason is that there is no way to track these files after the transformation. Why not simply check the titles right after the XML upload and report that something is wrong with them (listing the affected titles)?
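
A minimal Python sketch of that suggestion: instead of rewriting titles, collect the ones that contain illegal characters and report them. The function name `find_problem_titles` and the nine-character set are assumptions made for illustration, and nothing here reflects how GWToolset reads its XML.

```python
# Assumption: the nine characters named earlier in this thread are illegal.
ILLEGAL = set('#<>[]|{}:')


def find_problem_titles(titles):
    """Return (title, offending characters) for each title that would need changing."""
    report = []
    for title in titles:
        offending = sorted(set(title) & ILLEGAL)
        if offending:
            report.append((title, offending))
    return report


for title, chars in find_problem_titles(["Good (fine).jpg", "Bad: title [x].jpg"]):
    print(f"{title!r} contains illegal characters: {' '.join(chars)}")
```

With a report like this the uploader can fix the metadata at the source, so the titles on the wiki stay traceable to the original records.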

(In reply to Kelson [Emmanuel Engelhart] from comment #10)

Things seem to me to work better now, but automatically transforming titles
that contain illegal characters is IMO not a good approach. The reason is
that there is no way to track these files after the transformation. Why not
simply check the titles right after the XML upload and report that something
is wrong with them (listing the affected titles)?

This is moved to bug 65070. Marking as closed.

*** Bug 64843 has been marked as a duplicate of this bug. ***
Gilles moved this task from Untriaged to Done on the Multimedia board. Dec 4 2014, 10:11 AM
Gilles raised the priority of this task from "High" to "Unbreak Now!".
Gilles lowered the priority of this task from "Unbreak Now!" to "High". Dec 4 2014, 11:23 AM
