Irrelevant part of Author field extracted from image metadata
Open, Needs Triage | Public | BUG REPORT

Description

When querying the metadata of the Wikimedia Commons image [[File:Asian Highways 1 South Korea.jpg]] through the API:
https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&iiprop=extmetadata&titles=File:Asian%20Highways%201%20South%20Korea.jpg
The returned metadata field Artist contains this plain text:
"This photo was taken with Samsung Anycall SPH-W2900"
However, this information comes from a generic, heavily used template {{Taken with}}, and is rather irrelevant in terms of proper attribution.
The actual attribution, which precedes that template in the Author field of the image's Information template and is what I expected to see, is missing from the extracted metadata:
[[:ko:User:쿠도군|쿠도군]] from kowp
I am using the API in a project to automatically provide attribution for images I reuse from Commons, and omissions like this result in the author not being credited properly.
Possibly related to the 10-year-old bug T68606.
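
For reference, here is a minimal Python sketch of the lookup described above. It only builds the query URL and reads the Artist value out of a decoded JSON response; it performs no network request. The function names and the sample response are illustrative assumptions, and `format=json` is added to the parameters from the URL above so that the response can be decoded as JSON.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"


def build_query_url(title):
    """Build the imageinfo/extmetadata query URL for one file title."""
    params = {
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "extmetadata",
        "titles": title,
        "format": "json",  # assumption: request JSON output
    }
    return API_ENDPOINT + "?" + urlencode(params)


def extract_artist(response):
    """Return the first Artist value found in a decoded response, or None."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        for info in page.get("imageinfo", []):
            artist = info.get("extmetadata", {}).get("Artist")
            if artist is not None:
                return artist.get("value")
    return None


# Hypothetical, trimmed-down response mirroring the value reported above.
sample_response = {
    "query": {
        "pages": {
            "1": {
                "title": "File:Asian Highways 1 South Korea.jpg",
                "imageinfo": [
                    {
                        "extmetadata": {
                            "Artist": {
                                "value": "This photo was taken with "
                                         "Samsung Anycall SPH-W2900"
                            }
                        }
                    }
                ],
            }
        }
    }
}

url = build_query_url("File:Asian Highways 1 South Korea.jpg")
artist = extract_artist(sample_response)
```

With the sample data, `artist` holds the device note rather than the uploader's name, which is the behaviour this report is about.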

Event Timeline

Tgr subscribed.

I'd say the solution is to not use templates unrelated to authorship in the Author field.

Possibly related to the 10-year-old bug T68606: Media viewer fails to give credit to all people in specific circumstances.

In a way, but that task is about the more thorny issue of multiple authorship-related templates being present in the Author field.

@Tgr Thanks for your reaction, but it does not solve the problem that I and other users of the tool have.

By extracting metadata from Commons image pages, the tool aims to assist with attribution, i.e. to make sure that Wikimedians get credit for their work (and that reusers fulfill the license conditions). This is very useful when the reuser is not familiar with free licenses, or when attribution needs to be performed programmatically (e.g. in my project, I am looking up illustrations for dictionary entries using Wikidata). After all, this is exactly why this tool has been implemented in Media Viewer and even on the plain image page on Wikimedia Commons.

Someone has since fixed the linked example. I would not have complained about a random item, but this file showed up in the very first place in my list of thousands of looked-up files, so it is most probably not the only file with this issue. Yes, one can find and fix all of them, but who will do that (and how? and how will we prevent this from recurring in future uploads?). I know the syntax was somewhat wrong, but come on, this is real-world data and the author still deserves credit even if he put some extra info into the Author field, doesn't he? I think we Wikimedians should always at least attempt to accommodate the tool to erroneous data, rather than insisting that all erroneous data be fixed while the tool stays as is. After all, isn't it this tool's purpose to try to resolve the mess of free-text metadata? If all data were perfect, there would be no need for this tool in the first place.

I can see three options, other than using the Author field to actually put the author there:

  • Just take whatever text is in the field (i.e. disable the current heuristic which discards some of the text, maybe because of the presence of a description class in the device template, which usually indicates multilingual text?), so the author would be displayed as 쿠도군 from kowpThis photo was taken with Samsung Anycall SPH-W2900. Not that great.
  • Keep a list of all possible templates which someone might dump into an unrelated metadata field, and filter them out. The Taken_with template doesn't contain any identifiable class whatsoever, but I guess someone could fix that. How many different templates could conceivably end up being misplaced, though?
  • Offer some mechanism to exclude a chunk of HTML from parsing, so e.g. the template's outermost div would have class="not-metadata". More maintainable than the previous option because you wouldn't have to update the source code every time a new template is found. Still feels like a weird approach to metadata parsing.
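
To make the third option concrete, here is a rough Python sketch of the idea, not how CommonsMetadata is actually implemented. It collects the text of an Author field's HTML while skipping any element that carries the not-metadata class. The class and function names and the sample HTML are assumptions made for illustration; the real rendered output of the Taken with template will differ.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class MetadataTextExtractor(HTMLParser):
    """Collect text, ignoring subtrees marked with an exclusion class."""

    def __init__(self, excluded_class="not-metadata"):
        super().__init__(convert_charrefs=True)
        self.excluded_class = excluded_class
        self.skip_depth = 0  # > 0 while inside an excluded subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1  # nested element inside an excluded subtree
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.excluded_class in classes:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def extract_author_text(html):
    """Return the visible text of the field with excluded chunks removed."""
    parser = MetadataTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by removed markup.
    return " ".join("".join(parser.chunks).split())


# Hypothetical rendering of the Author field from this report.
sample_html = (
    '<a href="https://ko.wikipedia.org/wiki/User:쿠도군">쿠도군</a> from kowp'
    '<div class="not-metadata">This photo was taken with '
    '<b>Samsung Anycall SPH-W2900</b></div>'
)

author = extract_author_text(sample_html)
```

The depth counter assumes well-formed markup in which every non-void element is closed; parser output would normally satisfy that, but hand-written HTML with unclosed tags would need a more forgiving approach.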

(Ideally, we'd use the structured data for this, which didn't yet exist at the time of CommonsMetadata's creation, but there hasn't been much progress on that front.)