
Read wiki dumps in Spark {hawk}
Closed, Resolved · Public

Description

Aaron needs support to work with the wiki dumps in Hadoop.
He currently uses wikihadoop (which does not work with newer versions of Hadoop) and Python through Hadoop streaming.
The idea is to provide a more integrated approach using Spark.
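For illustration, a minimal sketch of what the integrated approach could look like in PySpark, reading the XML dump directly into a DataFrame via the third-party spark-xml package (the package version, dump path, and field names below are assumptions, not the actual setup; this also uses the modern SparkSession API rather than the SQLContext of the time):

```python
from pyspark.sql import SparkSession

# spark-xml must be on the classpath, e.g. launched with
#   --packages com.databricks:spark-xml_2.12:0.17.0
# (package coordinates are an assumption, not the actual setup)
spark = SparkSession.builder.appName("read-wiki-dump").getOrCreate()

# Each <page> element of the dump becomes one row; Spark infers the
# schema from the XML. The path below is hypothetical.
pages = (
    spark.read.format("xml")
    .option("rowTag", "page")
    .load("/wmf/dumps/enwiki-pages-meta-history.xml.bz2")
)

pages.select("title", "id").show(5)
```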

Event Timeline

JAllemandou claimed this task.
JAllemandou raised the priority of this task to Low.
JAllemandou updated the task description. (Show Details)
JAllemandou subscribed.
kevinator raised the priority of this task from Low to High. Mar 12 2015, 3:28 PM
kevinator subscribed.

Reprioritizing to High: prototyping revscoring [0] is a quarterly goal for the Research & Data team [1].

[0] https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service
[1] https://www.mediawiki.org/wiki/Wikimedia_Engineering/2014-15_Goals#Research_and_Data

kevinator renamed this task from Read wiki dumps in Spark to Read wiki dumps in Spark {hawk}. Mar 31 2015, 4:36 AM
kevinator set Security to None.
kevinator lowered the priority of this task from High to Medium. Jun 3 2015, 8:37 PM

I just talked to @JAllemandou, and it looks like he's gone as far as he can with Spark. We're able to read and extract JSON from the XML dumps at high speed, but we're not able to use Spark for the page-level metric extraction.
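For context, a rough sketch of the page-level step that stalled: metrics like this need all revisions of a page brought together, e.g. to compare consecutive revisions. The JSON path and field names (page_id, timestamp) below are assumptions about the extractor's output, not its actual schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("page-metrics").getOrCreate()

# Hypothetical location of the JSON produced by the XML-to-JSON extractor.
revisions = spark.read.json("/user/joal/wiki/revisions.json")

# Page-level extraction requires grouping every revision of a page
# together; this shuffle-heavy step is where the approach hit its limit.
per_page = (
    revisions
    .groupBy("page_id")
    .agg(
        F.count("*").alias("n_revisions"),
        F.min("timestamp").alias("first_edit"),
        F.max("timestamp").alias("last_edit"),
    )
)

per_page.show(5)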

There's a follow-up task (https://phabricator.wikimedia.org/T108684) to implement sorting in the JSON extractor so that we can do that once and be done with it.
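One way to express that "sort once and be done" idea in Spark terms (a sketch only; T108684 may take a different approach, and the paths and field names here are assumptions) is to repartition by page and sort within partitions, so each page's revision history lands contiguously in the output files:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sort-json-dump").getOrCreate()

# Hypothetical input: the unsorted JSON revisions from the extractor.
revisions = spark.read.json("/user/joal/wiki/revisions.json")

# Co-locate all revisions of a page in one partition, then order them by
# page and timestamp, so downstream page-level jobs can stream each
# page's history without a further shuffle.
(
    revisions
    .repartition("page_id")
    .sortWithinPartitions("page_id", "timestamp")
    .write.mode("overwrite")
    .json("/user/joal/wiki/revisions_sorted.json")
)
```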