Aaron needs support to work with the wiki dumps in Hadoop.
He currently uses wikihadoop (which does not work with newer versions of Hadoop) and Python through Hadoop Streaming.
The idea is to provide a more integrated approach using Spark.
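For illustration only, a minimal sketch of what a more integrated Spark approach could look like: once revisions sit on HDFS as JSON records, the Python code runs as ordinary Spark transformations instead of a separate Hadoop Streaming mapper. The path and JSON layout below are assumptions, not an agreed design.

```
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("revision-features").getOrCreate()

# One JSON object per revision, e.g. {"page_id": ..., "revision": {...}},
# as a hypothetical XML-to-JSON extraction step might produce.
revisions = (
    spark.sparkContext.textFile("hdfs:///user/aaron/enwiki-revisions-json")  # hypothetical path
    .map(json.loads)
)

def features(rec):
    """Toy per-revision features; a real job would call revscoring extractors here."""
    rev = rec["revision"]
    text = rev.get("text") or ""
    return {"rev_id": rev.get("id"), "bytes": len(text.encode("utf-8"))}

print(revisions.map(features).take(3))
```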
Event Timeline
Comment Actions
Reprioritizing to high: prototyping revscoring [0] is a quarterly goal for the Research & Data team [1].
[0] https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service
[1] https://www.mediawiki.org/wiki/Wikimedia_Engineering/2014-15_Goals#Research_and_Data
Comment Actions
I just talked to @JAllemandou and it looks like he's gone as far as he can with Spark. We're able to read and extract JSON from the XML dumps at high speed, but we're not able to use Spark to do the page-level metric extraction.
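As a rough illustration of that step (not the actual converter), something along these lines could turn the XML dump into one JSON record per revision, assuming the spark-xml package is on the classpath and its inferred schema exposes the page id and the repeated <revision> elements; the paths are hypothetical.

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("xml-dump-to-json").getOrCreate()

pages = (
    spark.read.format("xml")                       # provided by the spark-xml package
    .option("rowTag", "page")                      # one row per <page> element
    .load("hdfs:///wmf/data/raw/dumps/enwiki-pages-meta-history.xml")  # hypothetical path
)

# Flatten to one record per revision, keeping the page id, and write JSON lines
# that downstream Python tooling can consume without any XML parsing.
revisions = pages.select(col("id").alias("page_id"), explode("revision").alias("revision"))
revisions.toJSON().saveAsTextFile("hdfs:///user/aaron/enwiki-revisions-json")  # hypothetical path
```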
There's a follow-up task (https://phabricator.wikimedia.org/T108684) to implement sorting in the JSON extractor so that we can do that once and be done with it.
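To illustrate why the sort matters: page-level metrics need all of a page's revisions together and in chronological order, e.g. to diff each revision against its predecessor. If the extractor wrote them pre-sorted, this could be a single sequential pass; without it, a shuffle like the sketch below is needed. Field names and paths are assumptions.

```
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("page-level-metrics").getOrCreate()

revisions = (
    spark.sparkContext.textFile("hdfs:///user/aaron/enwiki-revisions-json")  # hypothetical path
    .map(json.loads)
)

def bytes_added(revs):
    """Toy page-level metric: byte delta of each revision against its predecessor."""
    revs = sorted(revs, key=lambda r: r["timestamp"])  # the sort the extractor could pre-do
    prev_len, out = 0, []
    for r in revs:
        cur_len = len((r.get("text") or "").encode("utf-8"))
        out.append({"rev_id": r["id"], "bytes_added": cur_len - prev_len})
        prev_len = cur_len
    return out

per_page = (
    revisions.map(lambda rec: (rec["page_id"], rec["revision"]))
    .groupByKey()                                   # all revisions of a page on one worker
    .mapValues(bytes_added)
)
print(per_page.take(1))
```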