Productionize edit history extraction for all wikis using Sqoop
Closed, Resolved | Public | 8 Estimated Story Points

Description

  • make a list of wikis we're processing
  • for each wiki, sqoop data for each important table
  • test a run and see whether it needs to be throttled at all (sqooping enwiki took about 4 hours)
  • code to be run ad-hoc, not on a cron quite yet
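
The steps above could be sketched roughly as follows. This is a minimal illustration, not the merged script: the wiki list, the table list, the database host, and the HDFS target root are all hypothetical placeholders, and credentials are left out. It only builds the `sqoop import` command lines (one per wiki and table) so they can be inspected before an ad-hoc run; `--num-mappers` is used here as one possible throttle, since it limits how many parallel connections Sqoop opens against the database.

```python
import shlex
from typing import List, Sequence

# Hypothetical placeholders; the task does not list the actual wikis or tables.
WIKIS = ["enwiki", "dewiki"]
TABLES = ["revision", "page"]


def build_sqoop_command(wiki: str, table: str, db_host: str,
                        target_root: str, num_mappers: int = 1) -> List[str]:
    """Build the argument list for a single `sqoop import` of one table."""
    return [
        "sqoop", "import",
        "--connect", f"jdbc:mysql://{db_host}/{wiki}",
        "--table", table,
        "--target-dir", f"{target_root}/{wiki}/{table}",
        # Fewer mappers means fewer parallel database connections.
        "--num-mappers", str(num_mappers),
    ]


def plan_imports(wikis: Sequence[str], tables: Sequence[str], db_host: str,
                 target_root: str, num_mappers: int = 1) -> List[List[str]]:
    """Return one command per (wiki, table) pair, wiki by wiki."""
    return [
        build_sqoop_command(wiki, table, db_host, target_root, num_mappers)
        for wiki in wikis
        for table in tables
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in plan_imports(WIKIS, TABLES, "db.example.org", "/tmp/sqoop_test"):
        print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the plan first fits the ad-hoc usage described above: the commands can be reviewed, and the mapper count adjusted, before anything touches the databases.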

The point estimate includes testing; the work will likely take 2 weeks.

Event Timeline

Nuria removed the point value 0 for this task.
JAllemandou renamed this task from Productionize edit history extraction for all wikis to Productionize edit history extraction for all wikis using Sqoop. Jul 28 2016, 5:43 PM

Change 303339 had a related patch set uploaded (by Milimetric):
[WIP] Oozify sqoop import of mediawiki tables

https://gerrit.wikimedia.org/r/303339

Milimetric triaged this task as Medium priority. Aug 8 2016, 4:52 PM
Nuria set the point value for this task to 21.

Change 303339 abandoned by Milimetric:
Oozify sqoop import of mediawiki tables

Reason:
oozie isn't really needed, will refactor to puppet as a cron job, maybe moving the python scripts to /bin?

https://gerrit.wikimedia.org/r/303339

Change 306292 had a related patch set uploaded (by Milimetric):
Script sqooping mediawiki tables into hdfs

https://gerrit.wikimedia.org/r/306292

Milimetric moved this task from Done to Ready to Deploy on the Analytics-Kanban board.
Milimetric moved this task from Ready to Deploy to In Code Review on the Analytics-Kanban board.
Milimetric added a subscriber: mforns.

This is ready for review @Ottomata or @JAllemandou or @mforns

Milimetric changed the point value for this task from 21 to 8. Sep 15 2016, 3:56 PM

Change 306292 merged by Joal:
Script sqooping mediawiki tables into hdfs

https://gerrit.wikimedia.org/r/306292

Nuria updated the task description.