
Productionize edit history extraction for all wikis using Sqoop
Closed, Resolved · Public · 8 Estimated Story Points

Description

  • make a list of wikis we're processing
  • for each wiki, sqoop data for each important table
  • test a run and see whether it needs to be throttled at all (sqooping enwiki took about 4 hours)
  • code to be run ad-hoc, not on a cron quite yet
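The per-wiki, per-table loop described above can be sketched as follows. This is a minimal illustration, not the actual script from the Gerrit change: the wiki list, table names, JDBC host, HDFS target paths, and mapper count are all assumptions.

```python
# Hypothetical sketch of the sqoop loop: for each wiki, import each
# important edit-history table into HDFS. All names here are assumed.
import subprocess

WIKIS = ["enwiki", "frwiki"]  # assumed wiki list; the real list is longer
TABLES = ["revision", "archive", "page", "user", "logging"]  # assumed tables

def sqoop_command(wiki, table, target_base="/wmf/data/raw/mediawiki"):
    """Build a sqoop import command for one table of one wiki."""
    return [
        "sqoop", "import",
        "--connect", f"jdbc:mysql://analytics-store/{wiki}",  # assumed host
        "--table", table,
        "--target-dir", f"{target_base}/{table}/wiki_db={wiki}",
        "--num-mappers", "1",  # throttle: one mapper limits load on the DB
    ]

def run_all(dry_run=True):
    """Run ad-hoc (not on a cron yet); dry_run just prints the commands."""
    for wiki in WIKIS:
        for table in TABLES:
            cmd = sqoop_command(wiki, table)
            if dry_run:
                print(" ".join(cmd))
            else:
                subprocess.run(cmd, check=True)
```

Running with `dry_run=True` lets you inspect the generated commands before touching the databases, which fits the "test a run" step above.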

Pointing includes testing; this will likely take about 2 weeks.

Event Timeline

Nuria removed the point value for this task.
JAllemandou renamed this task from Productionize edit history extraction for all wikis to Productionize edit history extraction for all wikis using Sqoop. Jul 28 2016, 5:43 PM

I'll try to do this using the new hotness, the Arbiter oozie workflow generator: https://github.com/etsy/arbiter

Change 303339 had a related patch set uploaded (by Milimetric):
[WIP] Oozify sqoop import of mediawiki tables

https://gerrit.wikimedia.org/r/303339

Milimetric triaged this task as Medium priority. Aug 8 2016, 4:52 PM
Nuria set the point value for this task to 21.

Change 303339 abandoned by Milimetric:
Oozify sqoop import of mediawiki tables

Reason:
Oozie isn't really needed; this will be refactored in puppet as a cron job, perhaps moving the python scripts to /bin?

https://gerrit.wikimedia.org/r/303339

Change 306292 had a related patch set uploaded (by Milimetric):
Script sqooping mediawiki tables into hdfs

https://gerrit.wikimedia.org/r/306292

Milimetric moved this task from Done to Ready to Deploy on the Analytics-Kanban board.
Milimetric moved this task from Ready to Deploy to In Code Review on the Analytics-Kanban board.
Milimetric added a subscriber: mforns.

This is ready for review @Ottomata or @JAllemandou or @mforns

Milimetric changed the point value for this task from 21 to 8. Sep 15 2016, 3:56 PM

Change 306292 merged by Joal:
Script sqooping mediawiki tables into hdfs

https://gerrit.wikimedia.org/r/306292

Nuria updated the task description.