Productionize edit history extraction for all wikis using Sqoop
Closed, Resolved | Public | 8 Story Points

Description

  • make a list of wikis we're processing
  • for each wiki, sqoop data for each important table
  • test a run and see whether it needs to be throttled at all (it took about 4 hours to sqoop enwiki)
  • code to be run ad-hoc, not on a cron quite yet
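
A minimal sketch of the plan above, assuming a Python script that builds one `sqoop import` command per wiki and table. The wiki list, table names, database host, account, and HDFS paths are placeholders for illustration, not values from this task; `--num-mappers` is shown as one possible throttle. It prints the commands for an ad-hoc run rather than executing them.

```python
import shlex

# Hypothetical inputs: the task does not name the wikis, tables, host, or paths.
WIKIS = ["enwiki", "dewiki"]
TABLES = ["revision", "page"]


def sqoop_command(wiki, table, host="db.example.org", mappers=1,
                  target_root="/tmp/sqoop-sketch"):
    """Build one `sqoop import` argument list for a single wiki table."""
    return [
        "sqoop", "import",
        "--connect", f"jdbc:mysql://{host}/{wiki}",
        "--username", "research",                      # placeholder account
        "--password-file", "/user/example/mysql.pw",   # placeholder path
        "--table", table,
        "--target-dir", f"{target_root}/{wiki}/{table}",
        "--num-mappers", str(mappers),  # fewer mappers = gentler on the database
    ]


def plan(wikis, tables, **kwargs):
    """Return one command per (wiki, table) pair, in wiki-major order."""
    return [sqoop_command(w, t, **kwargs) for w in wikis for t in tables]


if __name__ == "__main__":
    # Print the commands for an ad-hoc run instead of executing them.
    for cmd in plan(WIKIS, TABLES):
        print(shlex.join(cmd))
```

Keeping command construction separate from execution makes it easy to inspect a test run before deciding whether the mapper count needs to be lowered.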

Pointing includes testing; this will likely take 2 weeks.

Event Timeline

Nuria created this task. Jul 27 2016, 7:29 PM
Nuria moved this task from Next Up to Parent Tasks on the Analytics-Kanban board. Jul 28 2016, 5:37 PM
Nuria removed the point value for this task.
JAllemandou renamed this task from Productionize edit history extraction for all wikis to Productionize edit history extraction for all wikis using Sqoop. Jul 28 2016, 5:43 PM

I'll try to do this using the new hotness Oozie generator: https://github.com/etsy/arbiter

Milimetric updated the task description.

Change 303339 had a related patch set uploaded (by Milimetric):
[WIP] Oozify sqoop import of mediawiki tables

https://gerrit.wikimedia.org/r/303339

Milimetric triaged this task as Normal priority. Aug 8 2016, 4:52 PM
Nuria updated the task description. Aug 11 2016, 4:36 PM
Nuria set the point value for this task to 21.

Change 303339 abandoned by Milimetric:
Oozify sqoop import of mediawiki tables

Reason:
Oozie isn't really needed; will refactor to Puppet as a cron job, maybe moving the Python scripts to /bin?

https://gerrit.wikimedia.org/r/303339

Change 306292 had a related patch set uploaded (by Milimetric):
Script sqooping mediawiki tables into hdfs

https://gerrit.wikimedia.org/r/306292

Milimetric moved this task from Done to Ready to Deploy on the Analytics-Kanban board.
Milimetric moved this task from Ready to Deploy to In Code Review on the Analytics-Kanban board.
Milimetric added a subscriber: mforns.

This is ready for review @Ottomata or @JAllemandou or @mforns

Milimetric changed the point value for this task from 21 to 8. Sep 15 2016, 3:56 PM

Change 306292 merged by Joal:
Script sqooping mediawiki tables into hdfs

https://gerrit.wikimedia.org/r/306292

Nuria moved this task from Ready to Deploy to Done on the Analytics-Kanban board. Dec 19 2016, 8:12 PM
Nuria closed this task as Resolved. Dec 20 2016, 6:50 PM
Nuria updated the task description.