
Create a Commons equivalent of the wikidata_entity table in the Data Lake
Closed, Resolved · Public

Description

I recently learned that the Data Lake has a table, wmf.wikidata_entity, that contains all Wikidata entities; it's documented here on Wikitech.

Having a similar table for Commons would be beneficial, as it would allow us to query the Structured Data on Commons from the Data Lake. This could, for instance, drive dashboards tracking SDC development over time (T252443). There might be other use cases as well.
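
To make the intent concrete, here is a minimal sketch, in PySpark, of the kind of query such a table could support. The table name (structured_data.commons_entity) and the statements/snapshot fields are assumptions modeled on wmf.wikidata_entity, not a confirmed schema.

```python
# Minimal sketch, assuming a commons_entity table modeled on wmf.wikidata_entity.
# Table and field names (structured_data.commons_entity, statements, snapshot)
# are assumptions, not a confirmed schema.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sdc-dashboard-sketch").getOrCreate()

# Count Commons files carrying at least one structured-data statement per
# snapshot, e.g. to feed an SDC-growth dashboard (T252443).
sdc_growth = spark.sql("""
    SELECT snapshot,
           COUNT(*) AS files_with_statements
    FROM structured_data.commons_entity
    WHERE size(statements) > 0
    GROUP BY snapshot
    ORDER BY snapshot
""")
sdc_growth.show()
```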

Event Timeline

From my understanding of the documentation of wmf.wikidata_entity, it's based on the JSON dump of the data. For Commons, we currently have RDF data dumps available (per T221917). It looks like we might need a subtask to create JSON dumps for Commons, similar to what was done for Wikidata in T56369?

@Miriam: If I remember correctly, this kind of table would be useful for your work. Could you add some use cases to the task description so we know more about how it would be used?

Putting this in Radar until the JSON dump is created for Commons.

Thanks @Morten for opening this task!
A few use cases below:

  • In our work on image recommendations for unillustrated articles (T256081), we discover a set of potential image candidates based on page and image links. Here, structured data annotations can help enrich the pool of image candidates for a given page, given its corresponding Wikidata ID (see the sketch after this list).
  • For our projects on Commons-based image classifiers for object recognition, structured data annotations can be used to expand the training set or for validation purposes (T228441).
  • In our research on readers' engagement with images, structured data annotations can help us understand the role of a picture's content in that engagement.
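
For the image-recommendation case above, here is a hedged PySpark sketch of extracting "depicts" (P180) statements that could then be matched against an article's Wikidata ID. Field names (statements, mainSnak, dataValue) are assumed to mirror wmf.wikidata_entity and may differ from the schema that ends up being created; the snapshot value is a placeholder.

```python
# Hedged sketch: pull "depicts" (P180) statements so they can be joined against
# a page's Wikidata item ID to enrich the pool of image candidates.
# Field names are assumptions modeled on wmf.wikidata_entity.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sdc-image-candidates-sketch").getOrCreate()

depicts = spark.sql("""
    SELECT id AS media_entity_id,
           statement.mainSnak.dataValue.value AS depicted_value
    FROM structured_data.commons_entity
    LATERAL VIEW explode(statements) exploded AS statement
    WHERE statement.mainSnak.property = 'P180'
      AND snapshot = '2021-01-04'
""")

# depicted_value would then be matched against the article's Wikidata item ID
# (e.g. via wmf.wikidata_item_page_link) to surface candidate images.
depicts.show()
```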

Thanks!

LGoto triaged this task as Medium priority.
LGoto moved this task from Triage to Needs Investigation on the Product-Analytics board.
LGoto lowered the priority of this task from Medium to Low. Aug 17 2020, 4:42 PM
LGoto moved this task from Needs Investigation to Backlog on the Product-Analytics board.
nettrom_WMF raised the priority of this task from Low to Needs Triage. Mar 18 2021, 11:20 PM
nettrom_WMF edited projects, added Analytics; removed Analytics-Radar.
nettrom_WMF moved this task from Backlog to Tracking on the Product-Analytics board.

Moving this back to Analytics now that the dump exists, and changing the priority so the team can triage it as they see fit.

Milimetric triaged this task as Medium priority. Mar 22 2021, 3:23 PM
Milimetric moved this task from Incoming to Datasets on the Analytics board.
JAllemandou added subscribers: cchen, JAllemandou.

Moving back to incoming as there is demand from @cchen to prioritize.

JAllemandou raised the priority of this task from Medium to Needs Triage. Oct 21 2021, 5:08 PM
Gehel triaged this task as High priority. Nov 1 2021, 3:06 PM
Gehel moved this task from Incoming to Analysis on the Wikidata-Query-Service board.

Change 738874 had a related patch set uploaded (by Joal; author: Joal):

[operations/puppet@production] Import commons mediainfo json dumps to HDFS

https://gerrit.wikimedia.org/r/738874

Change 739129 had a related patch set uploaded (by AKhatun; author: AKhatun):

[analytics/refinery/source@master] Save commons json dumps as a table

https://gerrit.wikimedia.org/r/739129

Change 739129 merged by jenkins-bot:

[analytics/refinery/source@master] Save commons json dumps as a table and add fields for wikidata

https://gerrit.wikimedia.org/r/739129

Change 740590 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery@master] Add structured_data.commons_entity table create

https://gerrit.wikimedia.org/r/740590

Change 747508 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery/source@master] Update structured_data dumps parsing job

https://gerrit.wikimedia.org/r/747508

Code is ready.

What we need after the above is merged and deployed:

  • A new Airflow job for the commons_entity data generation (a rough sketch follows below)
  • A migration of the wikidata_entity Oozie job to Airflow
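
A rough sketch of what the new Airflow job could look like, assuming a weekly schedule and a SparkSubmitOperator wrapping the refinery parsing job. The DAG id, schedule, connection, jar path, class name, and arguments below are illustrative placeholders, not the actual analytics Airflow configuration.

```python
# Illustrative sketch only: every id, path, class name and argument here is a
# placeholder, not the real refinery/Airflow setup.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="commons_entity_build",
    start_date=datetime(2021, 11, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    # Parse the imported mediainfo JSON dump into structured_data.commons_entity.
    build_commons_entity = SparkSubmitOperator(
        task_id="build_commons_entity",
        conn_id="spark_default",
        application="/path/to/refinery-job.jar",           # placeholder artifact
        java_class="org.example.CommonsJsonDumpParser",    # placeholder class
        application_args=[
            "--input_path", "/wmf/data/raw/commons_mediainfo_dumps/{{ ds }}",  # placeholder
            "--output_table", "structured_data.commons_entity",
            "--snapshot", "{{ ds }}",
        ],
    )
```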

Change 738874 merged by Ottomata:

[operations/puppet@production] Import commons mediainfo json dumps to HDFS

https://gerrit.wikimedia.org/r/738874

Change 747508 merged by jenkins-bot:

[analytics/refinery/source@master] Update structured_data dumps parsing job

https://gerrit.wikimedia.org/r/747508

Change 740590 merged by Joal:

[analytics/refinery@master] Add structured_data.commons_entity table create

https://gerrit.wikimedia.org/r/740590