I want to look to Wikidata for some structured information about Wikipedia articles. This kind of task will be more efficient if the mapping between Wikidata and Wikipedia page ids is available somewhere in the data lake. Otherwise I'll have to make a lot of queries or do a big join against the page_props tables in the various language edition databases. So it would be a big help to move the page_props tables into the data lake. I (Nate) am willing to work on this to help it happen faster, but I'll need help getting started.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Declined | None | | T212172 Provide feature parity between the wiki replicas and the Analytics Data Lake
Resolved | None | | T221890 Add wikidata ids to data lake tables
Event Timeline
Request noted. The easiest thing we could do for now would be to sqoop the page_props tables so they are available in Hadoop. We are working on other changes for the April snapshot, so the earliest we could get to this request is a month from now (for the May snapshot). We will update the ticket with more details.
Thank you Nuria. Are you saying that we'll be able to sqoop the page_props tables in May at the earliest? Would it be okay to look up a sizable number of pages in those tables in the meantime? I'm thinking on the order of 20,000 pages per language.
@Groceryheist can you elaborate a bit on what exactly you're trying to accomplish? We're trying to understand exactly what we have to do here.
Hi @fdans
My ultimate goal is to identify, from a random sample of ~500,000 to ~50,000,000 edits from different language Wikipedias:
- Which edits are to biographical articles.
- The gender or sex of the subject of the biographical articles.
At this point I'm still looking for the best possible approach to this, and I see Wikidata as one option that could have good enough precision, if limited recall. So technically, I just want to map pages (especially pages as they existed at the time of the edit) to Wikidata ids and then look up the Wikidata entries to see if they have the information I'm looking for.
Thanks for elaborating @Groceryheist :) We have a non-productionized version of the item_id --> wiki_db/page_id dataset built for research:
```
scala> val wd = spark.read.parquet("/user/joal/wmf/data/wmf/wikidata/item_page_link/20190204")
wd: org.apache.spark.sql.DataFrame = [item_id: string, wiki_db: string ... 4 more fields]

scala> wd.printSchema
root
 |-- item_id: string (nullable = true)
 |-- wiki_db: string (nullable = true)
 |-- page_id: long (nullable = true)
 |-- page_title: string (nullable = true)
 |-- page_namespace: integer (nullable = true)
 |-- title_namespace_localized: string (nullable = true)
```
The link between item_id and wiki_db/page_id is made through page_title/namespace using the site_link data of the wikidata dumps.
With a list of wikidata_items (or a query to get them in Spark; I can help with that if needed), plus the item_page_link data and the mediawiki_history table, I think you'll get the data you're after :)
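As a sketch of the join described above, something like the following in spark-shell could work. The item list, the snapshot values, and the mediawiki_history column names used here are assumptions on my part (they follow the commonly documented schema, but please verify against the actual table), so treat this as a starting point rather than a tested recipe:

```scala
import spark.implicits._

// Hypothetical sample of Wikidata item ids to look up.
val items = Seq("Q42", "Q1868").toDF("item_id")

// The non-productionized item_page_link dataset mentioned above.
val links = spark.read.parquet("/user/joal/wmf/data/wmf/wikidata/item_page_link/20190204")

// Revisions from mediawiki_history; the snapshot partition value is assumed.
val history = spark.table("wmf.mediawiki_history")
  .where($"snapshot" === "2019-04" &&
         $"event_entity" === "revision" &&
         $"event_type" === "create")

// Map item_id -> (wiki_db, page_id), then pull the edits on those pages.
val edits = items
  .join(links, Seq("item_id"))
  .join(history, Seq("wiki_db", "page_id"))
  .select("item_id", "wiki_db", "page_id", "event_timestamp")
```

From there the sample of edits could be filtered down to biographical subjects by fetching the relevant properties for each item_id from the Wikidata dumps.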
The only doubt I have is whether the item_id --> page_id link is the one link you're after, but from the description you gave it seems plausible.
Please ping me (joal) on the analytics chan to talk :)
I'm not available to work on this; @JAllemandou's data served my purpose, but it seems like there was some interest in maintaining a table like this.
Actually this table is now production-style on the cluster, at path hdfs:///wmf/data/wmf/wikidata/item_page_link, or hive table wmf.wikidata_item_page_link.
It is released weekly and takes advantage of page creation/deletion/move events to be as precise as possible (we have monthly snapshots of the page table, and get the current month's info from events).
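For anyone landing on this ticket later, a minimal lookup against the productionized table might look like the sketch below. The snapshot partition value and the example wiki_db/page_id are placeholders of mine, not real data; check the table's actual partitions before running:

```scala
import spark.implicits._

// Sketch: find the Wikidata item linked to a known page.
// "2019-06-03" is an assumed weekly snapshot value; substitute a real one
// (e.g. from SHOW PARTITIONS wmf.wikidata_item_page_link).
val link = spark.table("wmf.wikidata_item_page_link")
  .where($"snapshot" === "2019-06-03")
  .where($"wiki_db" === "enwiki" && $"page_id" === 12345L)
  .select("item_id", "page_title")

link.show()
```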