No proper handling of multivalued fields
Open, Medium, Public

Description

Fields can have multiple values, in several different ways:

  • some file metadata (EXIF etc) fields can have multiple values
  • we parse some data from HTML code of license templates; some images have multiple license templates
  • sometimes the same property can have a value from both the file and the description
  • categories, and any properties based on categories, are in many-to-many relation with images
  • (there are also multi-languaged values which can be multivalued when all languages are requested, but we already deal with that)

Right now we handle this in a very hacky way for some fields (e.g. concatenate categories with "|") and don't handle it at all for most (one of the values is selected by some random aspect of the code). This will be especially problematic if we want to use CommonsMetadata as a helper tool for the Wikidata migration.

Proper multivalue handling should probably be able to:

  • indicate whether or not the given field is multivalued
  • indicate the source (e.g. if one of the values comes from the file, the other from the description, we should be able to somehow tell that)
  • synchronize properties somehow (e.g. a multilicensed image will have multiple license names and multiple license URLs; the user of the API has to be able to match the right name to the right link)

Version: unspecified
Severity: normal

Details

Reference
bz57259

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 2:24 AM
bzimport added a project: CommonsMetadata.
bzimport set Reference to bz57259.

We already use arrays with _type key for multilanguaged arrays (even though it is an ugly hack), so it seems logical to use the same format (_type=ul, see [1]) for multivalued properties.

We currently return an array with 'value' and 'source' fields for a single property; for multivalued properties we could maybe return such an array for each value. That would make it easy to indicate different sources (although it is ugly and not very compact).

For marking which values of multivalued properties belong together, we could maybe use an additional 'group' field (e.g. License, LicenseShortName etc. read from the first license template would have group=1).

[1] https://www.mediawiki.org/wiki/Manual:File_metadata_handling#Format_of_this_merged_metadata
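A rough sketch of what the format proposed above could look like, written in Python rather than the extension's PHP purely for illustration. Only the `_type`, `value`, `source` and `group` keys come from the proposal; the helper names (`make_multivalue`, `is_multivalued`, `values_by_group`), the `LicenseUrl` field name, the `commons-desc-page` source label and the sample license data are assumptions made for the example.

```python
from collections import defaultdict


def make_multivalue(entries):
    """Build a list-like array marked with _type=ul (hypothetical helper).

    Each entry is a (value, source, group) tuple; the numeric keys mimic
    a PHP array that carries an extra '_type' key.
    """
    result = {"_type": "ul"}
    for index, (value, source, group) in enumerate(entries):
        result[index] = {"value": value, "source": source, "group": group}
    return result


def is_multivalued(field):
    """A field is treated as multivalued if it carries the _type=ul marker."""
    return isinstance(field, dict) and field.get("_type") == "ul"


def values_by_group(metadata):
    """Regroup multivalued fields so values sharing a group end up together."""
    groups = defaultdict(dict)
    for name, field in metadata.items():
        if not is_multivalued(field):
            continue
        for key, item in field.items():
            if key == "_type":
                continue
            groups[item["group"]][name] = item["value"]
    return dict(groups)


# Illustrative data for a dual-licensed image (not taken from a real API response).
metadata = {
    "LicenseShortName": make_multivalue([
        ("CC BY-SA 3.0", "commons-desc-page", 1),
        ("GFDL", "commons-desc-page", 2),
    ]),
    "LicenseUrl": make_multivalue([
        ("https://creativecommons.org/licenses/by-sa/3.0", "commons-desc-page", 1),
        ("http://www.gnu.org/copyleft/fdl.html", "commons-desc-page", 2),
    ]),
}

licenses = values_by_group(metadata)
```

With something like this, an API user could match each license name to its URL by looking up the shared group instead of relying on the order of the values.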

  • Bug 64803 has been marked as a duplicate of this bug.
  • Bug 64888 has been marked as a duplicate of this bug.

Copying across a point raised in bug 64888, which makes this issue more severe.

This bug results in dual-licensed material pointing to the wrong license. For example, the text will say "CC BY-SA 3.0" but will link to http://www.gnu.org/copyleft/fdl.html. Apart from being highly confusing, this most likely violates one of the two licenses.

Example: https://sv.wikipedia.org/wiki/Sveriges_l%C3%A4n#mediaviewer/Fil:Greater_coat_of_arms_of_Sweden.svg

Change 135194 had a related patch set uploaded by Gergő Tisza:
[WIP] Handle multiple templates in TemplateParser

https://gerrit.wikimedia.org/r/135194

Change 135194 merged by jenkins-bot:
Handle multiple templates in TemplateParser

https://gerrit.wikimedia.org/r/135194

What’s the current status of this patch?

The issue described in bug 64888 is very problematic. I had not noticed that bug before, but frankly I would have been inclined to consider it a blocker for wide deployment on Wikimedia sites.

(Here is another example if needed: https://commons.wikimedia.org/wiki/File:Silver_crystal.jpg#mediaviewer/File:Silver_crystal.jpg)

That specific issue should be fixed; CommonsMetadata handles multivalued fields correctly internally, but only returns one value due to limitations of the API format. This is done more consistently now.
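To illustrate what "consistently" could mean here, a minimal Python sketch follows. It is not necessarily how CommonsMetadata does it; the `pick_single_values` helper, the `LicenseUrl` field name and the sample data are assumptions. The idea is that when only one value per field can be returned, every field is taken from the same license template instead of being chosen field by field.

```python
def pick_single_values(parsed_templates, fields):
    """Return one value per field, all taken from the same template.

    parsed_templates is a list of dicts, one per license template, in page
    order. The first template that provides any requested field wins; fields
    it lacks come back as None rather than being borrowed from another
    template, so a name is never paired with a different license's URL.
    """
    for template in parsed_templates:
        if any(name in template for name in fields):
            return {name: template.get(name) for name in fields}
    return {name: None for name in fields}


# Illustrative data for a dual-licensed image.
templates = [
    {
        "LicenseShortName": "CC BY-SA 3.0",
        "LicenseUrl": "https://creativecommons.org/licenses/by-sa/3.0",
    },
    {
        "LicenseShortName": "GFDL",
        "LicenseUrl": "http://www.gnu.org/copyleft/fdl.html",
    },
]

single = pick_single_values(templates, ["LicenseShortName", "LicenseUrl"])
```

The trade-off is that a field missing from the chosen template stays empty even if another template has it, which seems preferable to mixing values from different licenses.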

The caching for CommonsMetadata is pretty complicated (there is a memcached layer on both the frontend and backend wiki, plus whatever the API framework uses, plus Varnish), so I am waiting to see if the issue is properly fixed (all the caches involved should wear out in 30 days) or some sort of manual purging will be necessary.

(In reply to Tisza Gergő from comment #8)

> The caching for CommonsMetadata is pretty complicated (there is a memcached
> layer on both the frontend and backend wiki, plus whatever the API framework
> uses, plus Varnish), so I am waiting to see if the issue is properly fixed
> (all the caches involved should wear out in 30 days) or some sort of manual
> purging will be necessary.

Please put more effort into this issue. Acceptance of MultimediaViewer, at least in the German Wikipedia community, is dropping from day to day because of critical bugs like this :-(

All examples from this and duplicate tickets provide correct data now. Is anyone aware of images which are still showing inconsistent licence information?

Asking again before closing this ticket: Is anyone aware of images which are still showing inconsistent licence information?

Setting back to new, since the original issue described in comment 0 still stands. I'll assume the problem with mixing up different licenses is fixed.

...setting back the state to new...

Tgr removed Tgr as the assignee of this task. Dec 23 2014, 11:17 PM
Tgr renamed this task from No proper handling of multivalued fileds to No proper handling of multivalued fields. Jan 16 2016, 11:56 PM
Tgr set Security to None.