I want to look to Wikidata for some structured information about Wikipedia articles. This kind of task will be more efficient if the mapping between Wikidata and Wikipedia page ids is available somewhere in the data lake. Otherwise I'll have to make a lot of queries or do a big join against the page_props tables in the various language edition databases. So it would be a big help to move the page_props tables into the data lake. I (Nate) am willing to work on this to help it happen faster, but I'll need help getting started.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Declined | None | | T212172 Provide feature parity between the wiki replicas and the Analytics Data Lake
Resolved | None | | T221890 Add wikidata ids to data lake tables
Event Timeline
Request noted. The easiest thing we could do for now would be to sqoop the page_props tables so they are available in Hadoop. We are working on other changes for the April snapshot, so the earliest we could get to this request is a month from now (for the May snapshot). We will update the ticket with more details.
Thank you Nuria. Are you saying that we'll be able to sqoop the page_props tables in May at the earliest? Would it be okay to look up a sizable number of pages in those tables in the meantime? I'm thinking on the order of 20,000 pages per language.
@Groceryheist can you elaborate a bit on what exactly you're trying to accomplish? We're trying to understand exactly what we have to do here.
Hi @fdans
My ultimate goal is to identify, from a random sample of ~500,000 to ~50,000,000 edits from different language Wikipedias:
- Which edits are to biographical articles.
- The gender or sex of the subject of the biographical articles.
At this point I'm still looking for the best possible approach to this, and I see Wikidata as one option that could have good enough precision, if limited recall. So technically, I just want to map pages (especially pages as they existed at the time of the edit) to Wikidata ids and then look up the Wikidata entries to see if they have the information I'm looking for.
Thanks for elaborating @Groceryheist :) We have a non-productionized version of the item_id --> wiki_db/page_id dataset built for research:
```
scala> val wd = spark.read.parquet("/user/joal/wmf/data/wmf/wikidata/item_page_link/20190204")
wd: org.apache.spark.sql.DataFrame = [item_id: string, wiki_db: string ... 4 more fields]

scala> wd.printSchema
root
 |-- item_id: string (nullable = true)
 |-- wiki_db: string (nullable = true)
 |-- page_id: long (nullable = true)
 |-- page_title: string (nullable = true)
 |-- page_namespace: integer (nullable = true)
 |-- title_namespace_localized: string (nullable = true)
```
The link between item_id and wiki_db/page_id is made through page_title/namespace using the site_link data of the wikidata dumps.
With a list of wikidata_items (or a query to get them in Spark; I can help with that if needed), plus the item_page_link data and the mediawiki_history table, I think you'll get the data you're after :)
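As a sketch of the join described above, something like the following in spark-shell could work. The item list, the snapshot values, and the mediawiki_history column names used here are assumptions on my part (they follow the commonly documented schema, but please verify against the actual table), so treat this as a starting point rather than a tested recipe:

```scala
import spark.implicits._

// Hypothetical sample of Wikidata item ids to look up.
val items = Seq("Q42", "Q1868").toDF("item_id")

// The non-productionized item_page_link dataset mentioned above.
val links = spark.read.parquet("/user/joal/wmf/data/wmf/wikidata/item_page_link/20190204")

// Revisions from mediawiki_history; the snapshot partition value is assumed.
val history = spark.table("wmf.mediawiki_history")
  .where($"snapshot" === "2019-04" &&
         $"event_entity" === "revision" &&
         $"event_type" === "create")

// Map item_id -> (wiki_db, page_id), then pull the edits on those pages.
val edits = items
  .join(links, Seq("item_id"))
  .join(history, Seq("wiki_db", "page_id"))
  .select("item_id", "wiki_db", "page_id", "event_timestamp")
```

From there the sample of edits could be filtered down to biographical subjects by fetching the relevant properties for each item_id from the Wikidata dumps.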
The only doubt I have is whether the item_id --> page_id link is the one link you're after, but from the description you gave it seems plausible.
Please ping me (joal) on the analytics chan to talk :)
I'm not available to work on this; @JAllemandou's data served my purpose, but it seems like there was some interest in maintaining a table like this.
Actually this table is now production-style on the cluster, at path hdfs:///wmf/data/wmf/wikidata/item_page_link, or hive table wmf.wikidata_item_page_link.
It is released weekly and takes advantage of page creation/deletion/move events to be as precise as possible (we have monthly snapshots of the page table, and get the current month's info from events).
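For anyone landing on this ticket later, a minimal lookup against the productionized table might look like the sketch below. The snapshot partition value and the example wiki_db/page_id are placeholders of mine, not real data; check the table's actual partitions before running:

```scala
import spark.implicits._

// Sketch: find the Wikidata item linked to a known page.
// "2019-06-03" is an assumed weekly snapshot value; substitute a real one
// (e.g. from SHOW PARTITIONS wmf.wikidata_item_page_link).
val link = spark.table("wmf.wikidata_item_page_link")
  .where($"snapshot" === "2019-06-03")
  .where($"wiki_db" === "enwiki" && $"page_id" === 12345L)
  .select("item_id", "page_title")

link.show()
```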