Upload Wizard field input validation
Closed, Resolved · Public

Description

Upload Wizard lets users enter text that is then added to {{Information}} and other templates. Some inputs break the templates and should not be allowed. Some examples:

  1. A "|" (pipe) character in any field breaks the {{Information}} template and should be replaced with "{{!}}". Pipes are sometimes found in source URLs like " http://www.britishmuseum.org/research/collection_online/collection_object_details.aspx?place=33529&plaA=33529-3-1&ILINK|34484,|assetId=106527001&objectId=282816&partId=1 ", or authors add a pipe between their name and the name of an institution they belong to. See the example problem file at File:الشيخ عبد الحليم محمود.jpg
  2. Geographic coordinates entered by the user need to be numeric. Inputs like "50°4´50"" are not valid and should not be allowed; see File:Spartakiáda 1932.jpg for an example from today's uploads. Some inputs can be parsed and converted to numeric values.
  3. Many historical images use the upload date instead of the creation date. If the user picks a license under "The copyright has definitely expired in the USA", "This work was made by the United States government", or "Another reason not mentioned above", then there should be a warning if the user tries to use today's date.
  4. License text typed into the "Another reason not mentioned above" field needs to be checked. The simplest way is to check whether it contains a template (that is, {{...}} brackets); a better way is to check whether the typed text transcludes {{License template tag}}, and to reject anything that does not. Here is an example of an upload from today without a license.
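Items 3 and 4 could be sketched roughly as below. This is only an illustration in Python, not UploadWizard code: the function names are hypothetical, the license group labels are copied from the description above and may not match how the wizard identifies them internally, and the check covers only the "simplest way" (is there a {{...}} at all), not whether the template transcludes {{License template tag}}.

```python
import re
from datetime import date

# Assumption: license groups are identified by the labels quoted in the task.
HISTORICAL_LICENSE_GROUPS = {
    "The copyright has definitely expired in the USA",
    "This work was made by the United States government",
    "Another reason not mentioned above",
}

# One {{...}} transclusion whose name starts with a non-space, non-pipe character.
TEMPLATE_RE = re.compile(r"\{\{\s*[^{}|\s][^{}]*\}\}")


def should_warn_about_date(license_group: str, creation_date: date, today: date) -> bool:
    """Item 3: warn when a 'historical' license is combined with today's date."""
    return license_group in HISTORICAL_LICENSE_GROUPS and creation_date == today


def looks_like_license_template(text: str) -> bool:
    """Item 4, simplest variant: is there at least one {{...}} in the text?"""
    return TEMPLATE_RE.search(text) is not None
```

The stricter variant of item 4 would need to expand the template on the server and look for {{License template tag}} in the result, which a purely textual check like this cannot do.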

Event Timeline

Jarekt raised the priority of this task from to Medium.
Jarekt updated the task description.
Jarekt subscribed.
Restricted Application added subscribers: Steinsplitter, Aklapper.

"|" (pipe) character in any field breaks {{Information}} template and should be replaced with "{{!}}". They are sometimes found in source URLs like " http://www.britishmuseum.org/research/collection_online/collection_object_details.aspx?place=33529&plaA=33529-3-1&ILINK|34484,|assetId=106527001&objectId=282816&partId=1 " or authors add a pipe between their name and the name of some institution they belong to. See example problem file at File:الشيخ عبد الحليم محمود.jpg

I'm afraid that we can't do this, as users expect these fields to allow wikitext (such as links and inline templates). Maybe if, one day, we integrate VisualEditor into the wikitext fields, and allow switching between VE and WT… :)

Maybe we can look for "|" (pipe) characters in URLs, like the example above, and in other text which is not a template or a link to a Wikipedia article. Another common source of broken {{Information}} templates is the "name | institution" format in the author field.
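A rough sketch of that idea, in Python for illustration only: replace a pipe with "{{!}}" only when it sits outside any {{...}} template or [[...]] link, so that ordinary wikitext keeps working. The function name is hypothetical, and the scanner is deliberately naive (it does not handle <nowiki>, tables, or triple-brace parameters).

```python
def escape_stray_pipes(text: str) -> str:
    """Replace '|' with '{{!}}' unless it is inside {{...}} or [[...]]."""
    out = []
    depth = 0  # combined nesting depth of {{...}} and [[...]]
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            out.append(pair)
            i += 2
        elif pair in ("}}", "]]") and depth > 0:
            depth -= 1
            out.append(pair)
            i += 2
        elif text[i] == "|" and depth == 0:
            out.append("{{!}}")
            i += 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

With this, "John Doe | Some Museum" becomes "John Doe {{!}} Some Museum", while "[[User:Foo|Foo]]" and "{{c|d}}" pass through unchanged. A pipe that is already written as "{{!}}" is left alone because it is itself a template.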

The validation of geographical coordinates would be good to add. Wrongly formatted coordinates generate a lot of problems during massive uploads, such as Wiki Loves Earth/Monuments contests.

@Atsirlin We already validate the coordinates in UploadWizard (and we only accept them in decimal form, T135599 is about changing that). Can you give an example?

@matmarex: is it a recent change in the Upload Wizard? I am here only because of the discussion at
https://commons.wikimedia.org/wiki/Commons_talk:Wiki_Loves_Earth_2016#Category:Media_with_erroneous_locations

T135599 seems to solve this problem. Then my comment is moot.

It is fairly recent (eb1da2a0fb4d30496e0c678605cce61248d0025f, 2016-05-04). Previously we would only reject coordinates that were correctly formatted as numbers, but out of range (e.g. "99" degrees latitude).

On a closer look, I think the current code still accepts values where the prefix is a number (e.g. "42asdf")… That would explain cases like this or this. I think the fix proposed for T135599 should resolve this.
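To illustrate the difference between prefix parsing and strict validation, here is a minimal Python sketch (not the UploadWizard code; the function name and the `limit` parameter are made up for the example). It accepts a value only if the whole string is a decimal number and the result is in range, so "42asdf" and "99" degrees latitude are both rejected.

```python
import re

# Whole-string decimal number: optional sign, digits, optional fraction.
DECIMAL_RE = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")


def parse_decimal_degrees(raw: str, limit: float):
    """Return the value as a float, or None if it is malformed or out of range.

    `limit` is 90 for latitude and 180 for longitude.
    """
    match = DECIMAL_RE.fullmatch(raw.strip())
    if match is None:
        return None  # trailing garbage such as "42asdf" ends up here
    value = float(match.group(0))
    return value if -limit <= value <= limit else None
```

The key point is `fullmatch`: a parser that stops at the first non-numeric character would happily turn "42asdf" into 42.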

I think technically the template could accept pure numbers and strings like "22| 01|11.065|N". However, the second option seems too complicated to explain. Pure numbers should be easy to recognize, either by regexp or by converting them to numbers. I was not aware that any validation was being done, and some validation is better than none. https://commons.wikimedia.org/wiki/Commons_talk:Wiki_Loves_Earth_2016#Category:Media_with_erroneous_locations was mostly empty for the last few years: either someone was cleaning it or the validation worked. However, in the last few months we suddenly got 5-6k files there from several sources. Maybe validation was recently broken?

Change 298277 had a related patch set uploaded (by Matthias Mullie):
Normalize all coordinate input to decimal degrees

https://gerrit.wikimedia.org/r/298277

Change 298277 merged by jenkins-bot:
Normalize all coordinate input to decimal degrees

https://gerrit.wikimedia.org/r/298277

That should resolve most problems with coordinates. Other issues mentioned in this task still stand. We should split off some subtasks to track the progress better.
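For readers unfamiliar with the conversion, normalizing degrees/minutes/seconds input to decimal degrees can be sketched as follows. This is a Python illustration under my own assumptions, not the code from change 298277: the accepted notations, the regular expression, and the function name are all hypothetical.

```python
import re

# Degrees, optional minutes, optional seconds, optional hemisphere letter.
DMS_RE = re.compile(
    r"""^\s*
    (?P<deg>\d+(?:\.\d+)?)\s*°\s*
    (?:(?P<min>\d+(?:\.\d+)?)\s*[′´']\s*)?
    (?:(?P<sec>\d+(?:\.\d+)?)\s*(?:″|"|'')\s*)?
    (?P<hem>[NSEWnsew])?\s*$""",
    re.VERBOSE,
)


def dms_to_decimal(raw: str):
    """Convert input like 50°4´50" to decimal degrees, or return None."""
    m = DMS_RE.match(raw)
    if m is None:
        return None
    minutes = float(m.group("min") or 0)
    seconds = float(m.group("sec") or 0)
    if minutes >= 60 or seconds >= 60:
        return None
    value = float(m.group("deg")) + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative in decimal notation.
    if (m.group("hem") or "").upper() in ("S", "W"):
        value = -value
    return value
```

The input from the task description, 50°4´50", works out to 50 + 4/60 + 50/3600, roughly 50.0806 degrees. A range check (90 for latitude, 180 for longitude) would still be needed on the result.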

All of the subtasks have been fixed!