Aaron needs support to work with the wiki dumps in Hadoop.
He currently uses wikihadoop (which does not work with newer versions of Hadoop) and Python through Hadoop Streaming.
The idea is to provide a more integrated approach using Spark.
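For illustration only, a minimal sketch of what a more integrated Spark approach could look like: once revisions sit on HDFS as JSON records, the Python code runs as ordinary Spark transformations instead of a separate Hadoop Streaming mapper. The path and JSON layout below are assumptions, not an agreed design.

```
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("revision-features").getOrCreate()

# One JSON object per revision, e.g. {"page_id": ..., "revision": {...}},
# as a hypothetical XML-to-JSON extraction step might produce.
revisions = (
    spark.sparkContext.textFile("hdfs:///user/aaron/enwiki-revisions-json")  # hypothetical path
    .map(json.loads)
)

def features(rec):
    """Toy per-revision features; a real job would call revscoring extractors here."""
    rev = rec["revision"]
    text = rev.get("text") or ""
    return {"rev_id": rev.get("id"), "bytes": len(text.encode("utf-8"))}

print(revisions.map(features).take(3))
```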
Event Timeline
Comment Actions
Reprioritizing to high: prototyping revscoring [0] is a quarterly goal for the Research & Data team [1].
[0] https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service
[1] https://www.mediawiki.org/wiki/Wikimedia_Engineering/2014-15_Goals#Research_and_Data
Comment Actions
I just talked to @JAllemandou and it looks like he's gone as far as he can with Spark. We're able to read and extract JSON from the XML dumps at high speed, but we're not able to use Spark to do the page-level metric extraction.
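As a rough illustration of that step (not the actual converter), something along these lines could turn the XML dump into one JSON record per revision, assuming the spark-xml package is on the classpath and its inferred schema exposes the page id and the repeated <revision> elements; the paths are hypothetical.

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("xml-dump-to-json").getOrCreate()

pages = (
    spark.read.format("xml")                       # provided by the spark-xml package
    .option("rowTag", "page")                      # one row per <page> element
    .load("hdfs:///wmf/data/raw/dumps/enwiki-pages-meta-history.xml")  # hypothetical path
)

# Flatten to one record per revision, keeping the page id, and write JSON lines
# that downstream Python tooling can consume without any XML parsing.
revisions = pages.select(col("id").alias("page_id"), explode("revision").alias("revision"))
revisions.toJSON().saveAsTextFile("hdfs:///user/aaron/enwiki-revisions-json")  # hypothetical path
```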
There's a follow-up task (https://phabricator.wikimedia.org/T108684) to implement sorting in the JSON extractor so that we can do that once and be done with it.
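To illustrate why the sort matters: page-level metrics need all of a page's revisions together and in chronological order, e.g. to diff each revision against its predecessor. If the extractor wrote them pre-sorted, this could be a single sequential pass; without it, a shuffle like the sketch below is needed. Field names and paths are assumptions.

```
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("page-level-metrics").getOrCreate()

revisions = (
    spark.sparkContext.textFile("hdfs:///user/aaron/enwiki-revisions-json")  # hypothetical path
    .map(json.loads)
)

def bytes_added(revs):
    """Toy page-level metric: byte delta of each revision against its predecessor."""
    revs = sorted(revs, key=lambda r: r["timestamp"])  # the sort the extractor could pre-do
    prev_len, out = 0, []
    for r in revs:
        cur_len = len((r.get("text") or "").encode("utf-8"))
        out.append({"rev_id": r["id"], "bytes_added": cur_len - prev_len})
        prev_len = cur_len
    return out

per_page = (
    revisions.map(lambda rec: (rec["page_id"], rec["revision"]))
    .groupByKey()                                   # all revisions of a page on one worker
    .mapValues(bytes_added)
)
print(per_page.take(1))
```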