Copy Wikidata dumps to HDFS
Open, Low, Public


Now that T202489: Copy monthly XML files from public-dumps to HDFS is done, we'd love to see Wikidata dumps in HDFS. We already have one-off dumps in /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20180108. This task is about creating dumps periodically in an automated way.

One immediate use case is generating recommendations for article creation. We already have recommendations based on the one-off dumps mentioned above, but before going to production we'd like to generate a new set of recommendations.

Event Timeline

Not sure when we'll be able to do that. However, there is a more recent dump available: /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20181001 :)

Addshore added a subscriber: Addshore.
Addshore moved this task from incoming to monitoring on the Wikidata board. Nov 16 2018, 8:30 AM

Thanks, @JAllemandou! The more recent dumps are very useful.

fdans triaged this task as Medium priority. Nov 19 2018, 5:18 PM
fdans lowered the priority of this task from Medium to Low.
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

@Nuria we should fit this in somewhere! Maybe a Q3 goal? :D

Nuria added a comment. Dec 6 2018, 6:28 PM

Having missed most of our goals this quarter due to our MW woes, I think this might need to be moved to next quarter (Q4?).

leila added a subscriber: leila. Mar 26 2019, 6:51 PM

@Nuria can your team help us with this task during Q4? Content Translation (currently the biggest user of the translation recommendation API) is aiming to go to production (in at least one language) in Q4: T102107. We want to make sure the service is productionized by the time they move to production.

Most of the complicated pieces needed for this already exist (an equivalent of rsync for HDFS, and a Spark job converting Wikidata JSON dumps to Parquet).
I wanted T216160 to be settled before moving to productionization (having the same date for the various dumps we handle simplifies things quite a bit), and that takes time.
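
For readers unfamiliar with the dump layout, here is a minimal sketch of the parsing step such a conversion has to perform. It is not the Spark job mentioned above. It assumes only the published Wikidata JSON dump layout (one JSON array, one entity per line, each line ending in a comma). The function names and the flattened record shape are hypothetical, chosen for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_entities(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one entity dict per line of a Wikidata JSON dump."""
    for line in lines:
        # Each entity line ends with a trailing comma (except the last one).
        line = line.strip().rstrip(",")
        # Skip blanks and the "[" / "]" lines of the enclosing array.
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


def to_row(entity: dict) -> dict:
    """Flatten an entity into a small record (hypothetical schema)."""
    labels = entity.get("labels", {})
    return {
        "id": entity["id"],
        "type": entity.get("type"),
        "label_en": labels.get("en", {}).get("value"),
        # Keep only the site keys, e.g. ["enwiki", "frwiki"].
        "sitelinks": sorted(entity.get("sitelinks", {})),
    }
```

Because every entity sits on its own line, the dump can be read as a plain text file and parsed line by line, without loading the whole array into memory.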

abian added a subscriber: abian. Apr 24 2019, 10:07 PM

@JAllemandou, do you think this is now unblocked?

@abian: this is not happening on a recurring schedule yet.

@JAllemandou Thanks for the recent 20190603 dump copy in HDFS.

@GoranSMilovanovic: You're welcome :) At some point I'll manage to have that productionized ;)

leila edited projects, added Research-Backlog; removed Research. Jul 11 2019, 4:08 PM

@JAllemandou Would it be possible to have another update (beyond the most recent 20190603) of the dump in HDFS?
I would like to present some of the analytical systems based on it at WikidataCon 2019, and would be very, very grateful if a new copy could appear in HDFS by, say, October 15.
Please let me know. Many thanks!

This is done, @GoranSMilovanovic.
Raw data is at /user/joal/wmf/data/raw/mediawiki/wikidata/all_jsondumps/20190902 and Parquet data is at /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20190902

Ottomata renamed this task from Copy Wikidata dumps to HDFs to Copy Wikidata dumps to HDFS. Nov 7 2019, 2:37 PM
leila moved this task from Backlog to Radar on the Research-Backlog board. Nov 20 2019, 12:30 AM

@JAllemandou Do you think it would be possible to produce a new version of this data set?
The latest update seems to be: 2019-10-03 09:29 /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20190902, which you pointed me to in T209655#5543575.
I need to update the Wikidata Quality Report soon (by Dec 15, say), and the code relies on Spark to process the dump. Thanks.

New dataset available @GoranSMilovanovic. Pinging @Groceryheist as I also generated the items per page.

hdfs dfs -ls /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet | tail -1
drwxr-xr-x   - analytics joal          0 2019-12-04 18:31 /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20191202

hdfs dfs -ls /user/joal/wmf/data/wmf/wikidata/item_page_link/ | tail -1
drwxr-xr-x   - joal joal          0 2019-12-04 18:50 /user/joal/wmf/data/wmf/wikidata/item_page_link/20191202
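
Since the dumps are not yet produced on a schedule, downstream code has to find the newest snapshot itself. A minimal sketch of doing that from the `hdfs dfs -ls` output shown above, assuming snapshot directories are always named with a YYYYMMDD suffix; the function name `latest_snapshot` is hypothetical.

```python
import re
from typing import Optional

# A snapshot path ends in an 8-digit date, e.g. ".../wikidata_parquet/20191202".
SNAPSHOT_RE = re.compile(r"/(\d{8})$")


def latest_snapshot(ls_output: str) -> Optional[str]:
    """Return the listed path with the greatest YYYYMMDD suffix, or None."""
    best = None
    for line in ls_output.splitlines():
        parts = line.split()
        if not parts:
            continue
        path = parts[-1]  # the path is the last column of each listing row
        match = SNAPSHOT_RE.search(path)
        # Fixed-width dates compare correctly as strings.
        if match and (best is None or match.group(1) > best[0]):
            best = (match.group(1), path)
    return best[1] if best else None
```

Unlike `tail -1`, this ignores the "Found N items" header and any non-snapshot entries, so it does not depend on the listing order.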

Isaac added a subscriber: Isaac. Tue, Jan 14, 6:46 PM

@JAllemandou Thank you - as ever!

+1: these wikidata parquet (specifically item_page_link) dumps are super useful for us!