
Make Wikidata item_page_link table available publicly
Open, Medium, Public

Description

Without access to the private HDFS item_page_link table, it is quite difficult to map Wikidata IDs to article page IDs en masse. Specifically, an outside researcher who wanted to map Wikidata IDs to page IDs would have to either:

  • Make API calls to wbgetentities to get the sitelinks for each QID, then make calls to the respective language APIs to map those page titles to page IDs. In practice this is only efficient for a small number of QIDs.
  • Parse the Wikidata JSON dump (30 GB compressed) to extract the desired sitelinks, then parse each language's page table to get the mapping of page title -> page ID. This would take a very long time and might not be possible at all given resource constraints.
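To make the cost of the dump-based route concrete, here is a minimal Python sketch of it. It assumes the dump is laid out as one large JSON array with one entity per line, and that a title -> page ID lookup has already been built from each wiki's page table. The function names, the lookup's shape, and the sample values are all hypothetical and for illustration only.

```python
import json
from typing import Dict, Iterable, Iterator, Tuple


def iter_sitelinks(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (qid, wiki, title) for every sitelink found in the dump lines.

    Assumes one entity per line, wrapped in "[" and "]" lines, with a
    trailing comma after each entity except the last.
    """
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue  # skip the array brackets and blank lines
        entity = json.loads(line)
        for wiki, link in entity.get("sitelinks", {}).items():
            yield entity["id"], wiki, link["title"]


def join_page_ids(
    sitelinks: Iterable[Tuple[str, str, str]],
    page_ids: Dict[Tuple[str, str], int],
) -> Iterator[Tuple[str, str, int]]:
    """Yield (qid, wiki, page_id) for sitelinks present in the lookup.

    page_ids is a hypothetical {(wiki, title): page_id} mapping that the
    researcher would have to build separately from each wiki's page table.
    """
    for qid, wiki, title in sitelinks:
        page_id = page_ids.get((wiki, title))
        if page_id is not None:
            yield qid, wiki, page_id


# Made-up sample data standing in for the real dump and page tables.
sample_dump = [
    "[",
    '{"id": "Q1", "sitelinks": {"enwiki": {"site": "enwiki", "title": "Example A"}}},',
    '{"id": "Q2", "sitelinks": {}}',
    "]",
]
sample_page_ids = {("enwiki", "Example A"): 101}

sitelinks = list(iter_sitelinks(sample_dump))
mapping = list(join_page_ids(sitelinks, sample_page_ids))
```

On the real dump, the lines would be streamed from the compressed file (for example with gzip.open in text mode) rather than held in a list. The join also needs a page table lookup for every wiki of interest, which is where most of the extra work comes from.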

This is very easy to do with the item_page_link table, though. It contains no private data and could be exported as a relatively small TSV (or a similar format) containing all the information needed to map any Wikidata ID to its corresponding page ID in any language.
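For comparison, consuming such an export could be as simple as the sketch below. The column names (item_id, wiki_db, page_id) and the sample rows are assumptions made for illustration; the actual layout of a published file may differ.

```python
import csv
import io
from typing import Dict, TextIO, Tuple


def load_item_page_links(tsv_file: TextIO) -> Dict[Tuple[str, str], int]:
    """Read a TSV with a header row into {(item_id, wiki_db): page_id}.

    The column names are hypothetical; adjust them to the real export.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return {
        (row["item_id"], row["wiki_db"]): int(row["page_id"])
        for row in reader
    }


# Made-up rows standing in for the exported table.
sample_tsv = io.StringIO(
    "item_id\twiki_db\tpage_id\n"
    "Q1\tenwiki\t101\n"
    "Q1\tfrwiki\t202\n"
)
links = load_item_page_links(sample_tsv)
enwiki_page_id = links[("Q1", "enwiki")]
```

A single pass over one file replaces both the per-QID API calls and the dump-plus-page-table join.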

See https://lists.wikimedia.org/pipermail/wiki-research-l/2020-July/007315.html for a recent request that this dataset would help with.

Event Timeline

Isaac created this task. Jul 21 2020, 5:42 PM
Restricted Application added a subscriber: Aklapper. Jul 21 2020, 5:42 PM
fdans triaged this task as Medium priority. Aug 3 2020, 4:16 PM
fdans moved this task from Incoming to Datasets on the Analytics board.