
SPIKE: Research and document how Commons retrieves Attribution Information
Closed, Resolved · Public · 3 Estimated Story Points · Spike

Description

Wikimedia Commons displays attribution details for media files, including licensing, author, and uploader information.
For example, visiting this Commons file page shows a Licensing section. Selecting “Use this file” opens a modal displaying attribution details (Author and License):

[Screenshot: image.png (478×874 px, 308 KB)]

Goals:

  • Set up a local instance of a “Commons-like” wiki environment.
  • Identify the mechanism used to fetch attribution information.
  • Determine whether retrieving the data requires parsing the full article content.
  • Evaluate whether attribution logic should be centralized in WikimediaCustomizations or can be implemented directly within the Attribution module (in other words, how much logic would need to be duplicated in both Commons and WikimediaCustomizations).

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". · Tue, Feb 17, 5:11 PM
BPirkle set the point value for this task to 3. · Thu, Feb 19, 4:18 PM

Mechanisms Commons uses to fetch attribution information

  • Uploader-provided information. Commons uses the UploadWizard extension, which requires the uploader to fill in attribution fields such as Author, License, and Source at upload time. This data is stored on the file's page and rendered when the page is viewed.
  • Structured data (Wikibase). In some cases Commons uses structured data built on Wikibase. Compared with the uploader-provided information, this stores attribution details as machine-readable statements, which makes the attribution searchable and reusable via API.
  • MediaWiki Action API. For the screenshot provided for this task, viewing the file's page on Commons triggers, in addition to loading the file's saved information, calls to MediaWiki's Action API, specifically the imageinfo module. There are also Wikibase-specific Action API calls with action=wbformatvalue, but I have not found those responses to be significant for attribution.

TL;DR: Commons displays what uploaders and later editors have entered (and edited) in the file's description fields and, where available, structured metadata linked to the file. The Commons Action API's imageinfo module with the extmetadata property can be used to access this programmatically.
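To make the imageinfo/extmetadata route concrete, here is a minimal Python sketch. It only builds the request URL and picks attribution fields out of an already-decoded JSON response; it does not perform the HTTP call. The helper names (build_imageinfo_url, extract_attribution) are hypothetical, and the list of extmetadata keys is an assumption about which fields matter for attribution, not a confirmed requirement from this spike.

```python
from urllib.parse import urlencode

API_URL = "https://commons.wikimedia.org/w/api.php"

# Assumption: these extmetadata keys are the ones relevant to attribution.
ATTRIBUTION_FIELDS = (
    "Artist",
    "Credit",
    "LicenseShortName",
    "LicenseUrl",
    "AttributionRequired",
)


def build_imageinfo_url(file_title: str) -> str:
    """Build an Action API URL that requests extmetadata for one file."""
    params = {
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "extmetadata",
        "titles": file_title,
        "format": "json",
        "formatversion": "2",  # pages are returned as a list
    }
    return f"{API_URL}?{urlencode(params)}"


def extract_attribution(response: dict) -> dict:
    """Return the attribution fields present in a decoded API response."""
    pages = response.get("query", {}).get("pages", [])
    if not pages:
        return {}
    imageinfo = pages[0].get("imageinfo") or [{}]
    meta = imageinfo[0].get("extmetadata", {})
    # Each extmetadata entry is an object; the display value is under "value".
    return {
        name: meta[name]["value"]
        for name in ATTRIBUTION_FIELDS
        if name in meta
    }
```

Usage would be to fetch build_imageinfo_url("File:Example.jpg") with any HTTP client, decode the JSON, and pass it to extract_attribution. Values such as Artist may contain HTML, so callers would likely need to sanitize or strip markup before display.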

Notes from demo:

  • wbformatvalue might be a red herring; there is Commons structured data within a custom Wikibase instance. However, it's unclear how well supported/covered the structured data is.
    • Potential follow-up: What % of images have structured data? What % of images have the author and license info that we care about? (Probably small?)

Next steps

  • Consolidate learnings with other research spike tickets: https://phabricator.wikimedia.org/T417585
  • Determine actual % coverage for author and license within structured data, so that we can make a better-informed decision about next steps (@pmiazga)
  • Create follow-up tickets as needed

After a quick check on commons-query.wikimedia.org, we found that:

  • ~44K files have the author property (P50)
  • ~52K files have the performer property (P175)

Therefore, structured data on Commons carries author/performer information for less than 1% of files.

But there is an interesting bit: P170 (creator) is quite popular. Searching for haswbstatement:P170 returns ~77M results - https://commons.wikimedia.org/w/index.php?search=haswbstatement%3AP170&title=Special%3ASearch&profile=advanced&fulltext=1&ns0=1&ns6=1&ns12=1&ns14=1&ns100=1&ns106=1 - and this could most likely serve the purpose, as it covers about 57% of all files.

Licensing fares a bit better: ~91M files have the P275 (license) property.
Commons currently has ~135M files, so license coverage is good at ~67%.

Note: Counts differ by a couple of percent between SPARQL and CirrusSearch; treat them as approximations, not exact totals.
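For reference, the coverage percentages can be reproduced from the approximate counts quoted in this comment. This is only a sketch of the arithmetic; the counts are the rounded figures above, not fresh query results, and the coverage_percent helper is a hypothetical name.

```python
# Approximate counts quoted in this comment (not exact totals).
TOTAL_FILES = 135_000_000

STATEMENT_COUNTS = {
    "P50 (author)": 44_000,
    "P175 (performer)": 52_000,
    "P170 (creator)": 77_000_000,
    "P275 (license)": 91_000_000,
}


def coverage_percent(count: int, total: int = TOTAL_FILES) -> float:
    """Share of files carrying a statement, as a percentage of all files."""
    return 100 * count / total


if __name__ == "__main__":
    for label, count in STATEMENT_COUNTS.items():
        print(f"{label}: {coverage_percent(count):.2f}%")
```

With these inputs, creator comes out at roughly 57% and license at roughly 67%, while author and performer each stay well below 0.1%.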

Note to myself: to query Commons structured data we need to run SPARQL queries on https://commons-query.wikimedia.org/, not on the service Commons links to (e.g. Wikidata). SPARQL queries quite often time out, though, so we can use CirrusSearch instead, simply searching for strings like haswbstatement:P170.

Also, big thanks to @EBernhardson for helping with queries.