
Copy Wikidata dumps to HDFS + parquet
Closed, Resolved. Public. 5 Estimated Story Points.

Description

Now that T202489: Copy monthly XML files from public-dumps to HDFS is done, we'd love to see Wikidata dumps in HDFS. We already have one-off dumps in /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20180108. This task is about creating dumps periodically in an automated way.
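For anyone who wants to poke at the existing one-off data, a minimal PySpark sketch for reading it (this is just an illustration; inspect the schema with printSchema() before relying on any column names):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wikidata-parquet-explore").getOrCreate()

# Read the one-off parquet dump mentioned above
entities = spark.read.parquet(
    "/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20180108")

entities.printSchema()   # check the actual schema first
print(entities.count())  # number of Wikidata entities in this snapshot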

One immediate use case is generating recommendations for article creation. We already have recommendations based on the dumps indicated above, but before going to production we'd like to generate a new set of recommendations.

Event Timeline

Not sure when we'll be able to do that. However, there is a more recent dump available: /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20181001 :)

Thanks, @JAllemandou! The more recent dumps are very useful.

fdans triaged this task as Medium priority. Nov 19 2018, 5:18 PM
fdans lowered the priority of this task from Medium to Low.
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

@Nuria we should fit this in somewhere! Maybe a Q3 goal? :D

Having missed most of our goals this quarter due to our MediaWiki woes, I think this might need to be moved to next quarter (Q4?).

@Nuria can your team help us with this task during Q4? Content Translation (currently the biggest user of the translation recommendation API) is aiming to go to production (in at least one language) in Q4: T102107. We want to make sure the service is productionized by the time they move to production.

Most of the complicated pieces already exist for this to work (an equivalent of rsync for HDFS, a Spark job converting the Wikidata JSON dumps to parquet).
I wanted T216160 to be settled before moving on to productionization (having the same date for the various dumps we handle simplifies things quite a bit), and that takes time.
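For context, a rough PySpark sketch of what the JSON-to-parquet conversion involves (this is not the refinery job itself, just the general idea; it assumes the dump's one-entity-per-line layout, and both paths are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wikidata-json-to-parquet").getOrCreate()

# The all-entities JSON dump is one big array, with one entity per line and
# a trailing comma on each line. Read it as text, drop the array brackets,
# strip trailing commas, then let Spark infer the JSON schema.
raw = spark.read.text("/path/to/wikidata-all.json.bz2")  # hypothetical input path
lines = (raw
         .filter(~F.col("value").isin("[", "]"))
         .withColumn("value", F.regexp_replace("value", ",$", "")))

entities = spark.read.json(lines.rdd.map(lambda r: r.value))
entities.write.parquet("/path/to/wikidata_parquet/SNAPSHOT")  # hypothetical output path

The actual job in analytics/refinery/source handles this more robustly; the sketch only shows the general shape of the transformation.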

@abian: This is not happening on a recurring schedule yet.

@JAllemandou Thanks for the recent 20190603 dump copy in HDFS.

@GoranSMilovanovic: You're welcome :) At some point I'll manage to get that productionized ;)

@JAllemandou Would it be possible to have another update of the dump in HDFS (beyond the most recent 20190603)?
I would like to present some of the analytical systems based on this at WikidataCon 2019, and would be very, very grateful if a new copy could appear in HDFS by... say... October 15?
Please let me know. Many thanks!

This is done, @GoranSMilovanovic.
Raw data is at /user/joal/wmf/data/raw/mediawiki/wikidata/all_jsondumps/20190902 and parquet data is at /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20190902

Ottomata renamed this task from Copy Wikidata dumps to HDFs to Copy Wikidata dumps to HDFS. Nov 7 2019, 2:37 PM

@JAllemandou Do you think it would be possible to produce a new version of this dataset?
The latest update seems to be: 2019-10-03 09:29 /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20190902, which you pointed me to in T209655#5543575.
I need to update the Wikidata Quality Report soon (by Dec 15, say), and the code relies on Spark to process the dump. Thanks.

New dataset available, @GoranSMilovanovic. Pinging @Groceryheist as I also generated the items-per-page dataset.

hdfs dfs -ls /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet | tail -1
drwxr-xr-x   - analytics joal          0 2019-12-04 18:31 /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20191202

hdfs dfs -ls /user/joal/wmf/data/wmf/wikidata/item_page_link/ | tail -1
drwxr-xr-x   - joal joal          0 2019-12-04 18:50 /user/joal/wmf/data/wmf/wikidata/item_page_link/20191202
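For reference, a quick PySpark sketch of reading these two snapshots together (it assumes both are stored as parquet and that item_page_link has an item_id column; check the real schemas with printSchema() first):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wikidata-snapshot-read").getOrCreate()

entities = spark.read.parquet(
    "/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20191202")
item_page_link = spark.read.parquet(
    "/user/joal/wmf/data/wmf/wikidata/item_page_link/20191202")

entities.printSchema()
item_page_link.printSchema()

# Example: pages linked per item (assumes an 'item_id' column; adjust to the real schema)
item_page_link.groupBy("item_id").count().orderBy("count", ascending=False).show(10)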

@JAllemandou Thank you - as ever!

+1: these Wikidata parquet dumps (specifically item_page_link) are super useful for us!

Change 567954 had a related patch set uploaded (by Joal; owner: Joal):
[operations/puppet@production] Add profile::analytics::refinery::job::import_wikidata_entites_dumps

https://gerrit.wikimedia.org/r/567954

JAllemandou added a project: Analytics-Kanban.
JAllemandou set the point value for this task to 5.
JAllemandou moved this task from Next Up to In Progress on the Analytics-Kanban board.

Change 569836 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Add oozie job converting wikidata dumps to parquet

https://gerrit.wikimedia.org/r/569836

Change 346726 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Add spark code for wikidata json dumps parsing

https://gerrit.wikimedia.org/r/346726

Change 567954 merged by Elukey:
[operations/puppet@production] Add profile::analytics::refinery::job::import_wikidata_entites_dumps

https://gerrit.wikimedia.org/r/567954

Change 571237 had a related patch set uploaded (by Joal; owner: Joal):
[operations/puppet@production] Correct profile::analytics::refinery::job::import_wikidata_entities_dumps

https://gerrit.wikimedia.org/r/571237

Change 571238 had a related patch set uploaded (by Joal; owner: Joal):
[operations/puppet@production] Correct profile::analytics::refinery::job::import_wikidata_entities_dumps

https://gerrit.wikimedia.org/r/571238

Change 571237 abandoned by Joal:
Correct profile::analytics::refinery::job::import_wikidata_entities_dumps

Reason:
New patch at https://gerrit.wikimedia.org/r/571238

https://gerrit.wikimedia.org/r/571237

Change 571238 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery::job::import_wikidata_entities_dumps: fix typo

https://gerrit.wikimedia.org/r/571238

Change 571253 had a related patch set uploaded (by Joal; owner: Joal):
[operations/puppet@production] profile::analytics::refinery::job::import_wikidata_entities_dumps: cleanup

https://gerrit.wikimedia.org/r/571253

Change 571253 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery::job::import_wikidata_entities_dumps: cleanup

https://gerrit.wikimedia.org/r/571253

Change 346726 merged by jenkins-bot:
[analytics/refinery/source@master] Add spark code for wikidata json dumps parsing

https://gerrit.wikimedia.org/r/346726

JAllemandou renamed this task from Copy Wikidata dumps to HDFS to Copy Wikidata dumps to HDFS + parquet. Feb 18 2020, 11:33 AM

Change 569836 merged by Fdans:
[analytics/refinery@master] Add oozie job converting wikidata dumps to parquet

https://gerrit.wikimedia.org/r/569836