
Add wikidata ids to data lake tables
Closed, Resolved (Public)

Description

I want to look to Wikidata for some structured information about Wikipedia articles. This kind of task will be more efficient if the mapping between Wikidata ids and Wikipedia page ids is available somewhere in the data lake. Otherwise I'll have to make a lot of queries or do a big join against the page_props tables in the various language edition databases. So it would be a big help to move the page_props tables into the data lake. I (Nate) am willing to work on this to help it happen faster, but I'll need help getting started.

Event Timeline

Request noted. The easiest thing we could do for now would be to sqoop the page_props tables so they are available in Hadoop. We are working on other changes for the April snapshot, so the earliest we could get to this request is a month from now (for the May snapshot). We will update the ticket with more details.

Thank you Nuria. Are you saying that we'll be able to sqoop the page_props tables in May at the earliest? Would it be okay to look up a sizable number of pages in the page_props tables in the meantime? I'm thinking on the order of 20,000 pages per language.

@Groceryheist can you elaborate a bit on what exactly you're trying to accomplish? We're trying to understand exactly what we have to do here.

fdans triaged this task as Medium priority. Apr 29 2019, 3:28 PM
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

Hi @fdans

My ultimate goal is to identify, from a random sample of ~500,000 to ~50,000,000 edits from different language Wikipedias:

  1. Which edits are to biographical articles.
  2. The gender or sex of the subject of the biographical articles.

At this point I'm still looking for the best possible approach to this, and I see Wikidata as one option that could have good enough precision, if limited recall. So technically, I just want to map pages (especially pages as they existed at the time of the edit) to Wikidata ids and then look up the Wikidata entries to see if they have the information I'm looking for.

"map pages to wikidata ids"

Is page_props the table that holds this information?

@Nuria yes. My understanding is that the mapping is in the rows where pp_propname == "wikibase_item".
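
For illustration, once the page_props tables have been sqooped into Hive, pulling that mapping from spark-shell could look roughly like the sketch below. The table name, snapshot partition, and wiki_db value are assumptions made for the example, not a reference to an existing production table.

val pageToItem = spark.sql("""
  SELECT pp_page  AS page_id,
         pp_value AS wikidata_item_id
  FROM wmf_raw.mediawiki_page_props   -- assumed destination of the sqoop
  WHERE snapshot = '2019-05'          -- assumed monthly partition
    AND wiki_db = 'enwiki'
    AND pp_propname = 'wikibase_item'
""")
pageToItem.show(5)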

Thanks for elaborating @Groceryheist :) We have a non-productionized version of the item_id --> wiki_db/page_id dataset built for research:

val wd = spark.read.parquet("/user/joal/wmf/data/wmf/wikidata/item_page_link/20190204")
wd: org.apache.spark.sql.DataFrame = [item_id: string, wiki_db: string ... 4 more fields]

scala> wd.printSchema
root
 |-- item_id: string (nullable = true)
 |-- wiki_db: string (nullable = true)
 |-- page_id: long (nullable = true)
 |-- page_title: string (nullable = true)
 |-- page_namespace: integer (nullable = true)
 |-- title_namespace_localized: string (nullable = true)

The link between item_id and wiki_db/page_id is made through page_title/namespace, using the sitelinks data from the Wikidata dumps.
With a list of Wikidata items (or a query to get them in Spark; I can help with that if needed), the item_page_link data, and the mediawiki_history table, I think you'll get the data you're after :)
The only doubt I have is whether the item_id --> page_id link is the one link you're after, but from the description you gave it seems plausible.
Please ping me (joal) on the analytics channel to talk :)
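
As a rough, untested sketch of the join described above: the mediawiki_history snapshot value and the wantedItems list are placeholders (in practice the item ids would come from a separate Wikidata query), and wd is the item_page_link DataFrame loaded earlier.

import spark.implicits._

// Placeholder list of Wikidata item ids obtained elsewhere (e.g. people with a gender statement)
val wantedItems = Seq("Q42", "Q7259").toDF("item_id")

// Revision events from mediawiki_history (snapshot value is a placeholder)
val history = spark.table("wmf.mediawiki_history")
  .where("snapshot = '2019-04' AND event_entity = 'revision'")
  .select("wiki_db", "page_id", "event_timestamp")

// Attach Wikidata item ids to edits, then keep only the items of interest
val editsWithItems = history
  .join(wd, Seq("wiki_db", "page_id"))
  .join(wantedItems, Seq("item_id"))

editsWithItems.show(5)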

Groceryheist changed the task status from Open to Stalled. May 26 2021, 5:09 PM

I'm not available to work on this. @JAllemandou's data served my purpose, but it seems like there was some interest in maintaining a table like this.

Actually this table is now productionized on the cluster, at the path hdfs:///wmf/data/wmf/wikidata/item_page_link, or as the Hive table wmf.wikidata_item_page_link.
It is released weekly and takes advantage of page creation/deletion/move events to be as precise as possible (we have monthly snapshots of the page table, and get the current month's info from events).

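For anyone landing here later, a minimal spark-shell read of the production table might look like the sketch below. The snapshot value is an assumption; list the table's partitions to find the latest weekly one.

val itemPageLink = spark.sql("""
  SELECT item_id, wiki_db, page_id, page_title, page_namespace
  FROM wmf.wikidata_item_page_link
  WHERE snapshot = '2021-05-24'   -- assumed weekly snapshot, check SHOW PARTITIONS
""")
itemPageLink.where("wiki_db = 'enwiki'").show(5)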

Sounds like this need has actually been satisfied.