Loosen GWToolset's file name restrictions (parentheses, apostrophes, ampersands, etc.)
Closed, Resolved
Public

Description

GWToolset is uploading many files with names like https://commons.wikimedia.org/w/index.php?title=File:The_King_of_Hungary_holding_council_in_his_tent_on_the_battlefield_-_Froissart--39-s_Chronicles_-Volume_IV-_part_2-_-1470-1475--_f.84_-_BL_Harley_MS_4380.jpg&redirect=no . The proper name is [[commons:File:The King of Hungary holding council in his tent on the battlefield - Froissart's Chronicles (Volume IV, part 2) (1470-1475), f.84 - BL Harley MS 4380.jpg]] (and the file has since been renamed to that).

Notice how characters such as ), (, ', and & are being stripped and replaced with '-'. This is wrong: those characters are perfectly valid in a title.

Even worse, characters like the apostrophe (') are being converted to their HTML entity ("&#39;"), and then &, # and ; are replaced with dashes, resulting in "--39-". This is wrong: HTML entities in titles should be converted to the character they represent, and that character should then be handled as appropriate (as is done for normal titles).
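
To make the failure concrete, here is a minimal Python sketch that reproduces the mangling described above. It is not GWToolset's code (which is PHP): the function name `mangle_title` is hypothetical, the character list is the one quoted from the glamtools email in the first comment on this task, and the assumption that the apostrophe reaches the filter already encoded as `&#39;` is inferred from the "--39-" in the uploaded name.

```python
# Characters the toolset replaced with '-', as quoted from the glamtools email.
OLD_BLACKLIST = set("#<>[]|{}:¬`!\"£$^&*()+=~?,;'@")


def mangle_title(title: str) -> str:
    """Reproduce the reported behaviour (an assumption, not the real PHP code)."""
    # Assumed step 1: the apostrophe arrives encoded as a numeric HTML entity.
    encoded = title.replace("'", "&#39;")
    # Step 2: every blacklisted character, including & # and ;, becomes a dash.
    return "".join("-" if ch in OLD_BLACKLIST else ch for ch in encoded)


print(mangle_title("Froissart's Chronicles (Volume IV, part 2) (1470-1475), f.84"))
```

With spaces turned into underscores, the printed string matches the corresponding part of the uploaded file name in the URL above.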

See: https://commons.wikimedia.org/wiki/Commons:Village_pump/Archive/2014/03#Renaming_multiple_files.3F


Version: unspecified
Severity: normal

bzimport added a subscriber: wikibugs-l.
bzimport set Reference to bz62909.
Bawolff created this task. Via Legacy, Mar 21 2014, 2:17 AM
Bawolff added a comment. Via Conduit, Mar 21 2014, 5:40 AM

(In reply to MZMcBride from comment #1)

Related:
http://lists.wikimedia.org/pipermail/glamtools/2014-March/000035.html

And from the email:

'#','<','>','[',']','|','{','}',':','¬','`','!','"','£','$','^','&','*','(',')','+','=','~','?',',',';',"'",'@'

Many of these characters are very common in file names (apostrophes, parentheses) and are absolutely allowed, both socially and technically.

I think that GWToolset should simply follow $wgIllegalFileChars and whatever Title::secureAndSplit blocks (to be specific, blacklist only '#','<','>','[',']','|','{','}', and ':'). If there really is a need to blacklist additional characters for social reasons (I'm not convinced there is), then the blacklist should be configurable on-wiki as a MediaWiki: namespace message, since social conventions change over time.

Bawolff added a comment. Via Conduit, Mar 26 2014, 2:01 PM

Sorry, to be more specific (because I got questions): GWToolset should use the built-in function wfStripIllegalFilenameChars instead of trying to re-implement the title validation rules in Utils::stripIllegalTitleChars.

This bug is also about html entities, so the full process for normalizing the title should be:

  1. Run through Sanitizer::decodeCharReferences()
  2. Run through wfStripIllegalFilenameChars()
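
The two steps can be sketched in Python as follows. This is only an approximation under stated assumptions: `html.unescape` stands in for Sanitizer::decodeCharReferences(), and the hypothetical `strip_illegal_chars` replaces just the nine characters listed in the earlier comment rather than reproducing what wfStripIllegalFilenameChars() actually does in MediaWiki.

```python
import html

# The nine characters proposed for blacklisting in the earlier comment.
ILLEGAL_TITLE_CHARS = set("#<>[]|{}:")


def strip_illegal_chars(name: str) -> str:
    """Hypothetical stand-in for step 2: replace illegal characters with '-'."""
    return "".join("-" if ch in ILLEGAL_TITLE_CHARS else ch for ch in name)


def normalize_title(raw: str) -> str:
    # Step 1: turn character references such as &#39; back into characters.
    decoded = html.unescape(raw)
    # Step 2: only then replace what is really illegal in a title.
    return strip_illegal_chars(decoded)


print(normalize_title("Froissart&#39;s Chronicles (Volume IV, part 2)"))
```

The order matters: if the stripping ran first, the # inside &#39; would be replaced with a dash before the entity could be decoded.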

dan-nl added a comment. Via Conduit, Mar 26 2014, 2:47 PM
  • working on a patch
gerritbot added a comment. Via Conduit, Mar 26 2014, 3:54 PM

Change 121094 had a related patch set uploaded by Dan-nl:
relax wiki title restrictions

https://gerrit.wikimedia.org/r/121094

gerritbot added a comment. Via Conduit, Apr 1 2014, 11:35 PM

Change 121094 merged by jenkins-bot:
relax wiki title restrictions

https://gerrit.wikimedia.org/r/121094

Bawolff added a comment. Via Conduit, Apr 1 2014, 11:40 PM

The fix for this issue is scheduled to be deployed on Commons on Tuesday, 8 April 2014.

gerritbot added a comment. Via Conduit, Apr 11 2014, 2:18 PM

Change 125401 had a related patch set uploaded by Dan-nl:
wfStripIllegalFilenameChars truncates title

https://gerrit.wikimedia.org/r/125401

gerritbot added a comment. Via Conduit, Apr 14 2014, 7:57 PM

Change 125401 merged by jenkins-bot:
wfStripIllegalFilenameChars truncates title

https://gerrit.wikimedia.org/r/125401

Kelson added a comment. Via Conduit, Apr 17 2014, 2:59 PM

Things seem to work better now, but automatically transforming titles that contain illegal characters is IMO not a good approach, because there is no way to track these files after the transformation. Why not simply check the titles just after the XML upload and report which ones are invalid (listing them)?
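
A minimal Python sketch of that alternative, under assumptions: the hypothetical `find_invalid_titles` checks against the nine-character blacklist proposed earlier in this thread and returns the offending titles for reporting instead of rewriting them. How GWToolset would obtain the titles from the uploaded XML is not shown.

```python
# The nine characters proposed for blacklisting earlier in this thread.
ILLEGAL_TITLE_CHARS = set("#<>[]|{}:")


def find_invalid_titles(titles):
    """Return (title, sorted illegal characters) pairs; titles are not modified."""
    report = []
    for title in titles:
        bad = sorted(set(title) & ILLEGAL_TITLE_CHARS)
        if bad:
            report.append((title, bad))
    return report


for title, bad in find_invalid_titles(["Map [draft].jpg", "Froissart's Chronicles.jpg"]):
    print(f"{title!r} contains illegal characters: {' '.join(bad)}")
```

Because nothing is renamed, the uploader keeps the mapping between the metadata and the resulting files and can correct the listed titles at the source.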

JeanFred added a comment. Via Conduit, May 9 2014, 2:01 PM

(In reply to Kelson [Emmanuel Engelhart] from comment #10)

Things seem to work better now, but automatically transforming titles
that contain illegal characters is IMO not a good approach, because there
is no way to track these files after the transformation. Why not simply
check the titles just after the XML upload and report which ones are
invalid (listing them)?

This has been moved to bug 65070. Marking as closed.

dan-nl added a comment. Via Conduit, May 13 2014, 9:31 AM
*** Bug 64843 has been marked as a duplicate of this bug. ***
Gilles added a project: Multimedia. Via Web, Dec 4 2014, 9:35 AM
Gilles raised the priority of this task from "High" to "Unbreak Now!". Via Web, Dec 4 2014, 10:11 AM
Gilles moved this task to Closed on the Multimedia workboard.
Gilles lowered the priority of this task from "Unbreak Now!" to "High". Via Conduit, Dec 4 2014, 11:23 AM
