
Troubleshoot issues with sqoop of data not working for big tables
Closed, Resolved · Public · 8 Estimated Story Points

Description

enwiki.revision, for example, failed to import properly.

Event Timeline

Nuria created this task. Jul 5 2017, 5:58 PM
Restricted Application removed a project: Patch-For-Review. Jul 5 2017, 5:58 PM
Nuria renamed this task from Troubleshoot issues with scoop of data not working to Troubleshoot issues with scoop of data not working for data lake data with addition of two new tables. Jul 5 2017, 5:58 PM
Nuria reassigned this task from JAllemandou to Milimetric.
Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board. Jul 5 2017, 5:59 PM
Nuria changed the point value for this task from 21 to 8.
Milimetric renamed this task from Troubleshoot issues with scoop of data not working for data lake data with addition of two new tables to Troubleshoot issues with sqoop of data not working for data lake data with addition of two new tables. Jul 5 2017, 9:06 PM

Diagnosis so far:

It looks like Joseph did something close to the following sequence of actions:

  • add two tables to the sqoop script
  • update the ORM jar
  • merge and deploy these changes
  • realize there were FileAlreadyExistsException exceptions in the log
  • fix this by getting sqoop to delete if it had to retry
  • try to run the sqoop manually
  • find the problem described in T169735

So I don't think the error is related to any of the patches; it's simply a timeout error. I'm going to run a test overnight, see how that goes, and take a look at the logs in detail tomorrow.

Milimetric renamed this task from Troubleshoot issues with sqoop of data not working for data lake data with addition of two new tables to Troubleshoot issues with sqoop of data not working for big tables. Jul 5 2017, 9:32 PM
Milimetric updated the task description.
Milimetric added a subscriber: elukey.

More context: it looks like Bob West ran into the same issue a couple of years ago while reading from analytics-store: https://wm-bot.wmflabs.org/logs/%23wikimedia-databases/20150622.txt. In that case it sounded like the server was just exceptionally busy. Based on that, I suppose we can try sqooping again manually and see if we get a better result.

JAllemandou added a comment (edited). Jul 6 2017, 8:36 AM

@Milimetric: Thanks for clarifying the context; I confirm the sequence you reconstructed.

Some more info: I tried a few manual runs against enwiki.revision --> they all failed.
@Milimetric: I'm coming to think we'll have to split the sqoop jobs for big tables, but I don't have a good solution (multiple mappers in sqoop with resource restrictions? A manual split in Python? I don't know ...).

Thanks very much, Joseph, no worries, I got it.

Note to self: check that the cron command set up by puppet is actually formatting the date properly, and not like this:

<code>

select ...

  from revision
 where  (1 = 0) 
   and rev_timestamp <= '\2017\0701000000'

</code>
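
For reference, a minimal sketch of what the boundary value should look like (my assumption about the intended format, not the actual refinery code): rev_timestamp is a plain YYYYMMDDHHMMSS string, so the templated condition should end up containing '20170701000000' with no backslashes.

<code>

# Sketch only (assumption, not the refinery script): build the rev_timestamp
# boundary as a plain YYYYMMDDHHMMSS string so the generated SQL contains
# '20170701000000' rather than a backslash-escaped value.
from datetime import datetime

snapshot_end = datetime(2017, 7, 1)
bound = snapshot_end.strftime('%Y%m%d%H%M%S')
assert bound == '20170701000000'

where_clause = "rev_timestamp <= '{}'".format(bound)

</code>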

Change 363735 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[analytics/refinery@master] Implement sqooping with mappers > 1

https://gerrit.wikimedia.org/r/363735

Milimetric added a subscriber: mforns.

It looks like the split-by parallelization (the easy option) worked!

milimetric@analytics1003:~/refinery$ hdfs dfs -ls wmf/data/raw/mediawiki/tables/revision/snapshot=2017-06.dan.1/wiki_db\=enwiki/
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Found 5 items
-rw-r--r--   3 milimetric milimetric           0 2017-07-06 22:28 wmf/data/raw/mediawiki/tables/revision/snapshot=2017-06.dan.1/wiki_db=enwiki/_SUCCESS
-rw-r--r--   3 milimetric milimetric 11516266876 2017-07-06 22:21 wmf/data/raw/mediawiki/tables/revision/snapshot=2017-06.dan.1/wiki_db=enwiki/part-m-00000.avro
-rw-r--r--   3 milimetric milimetric 12557481065 2017-07-06 22:26 wmf/data/raw/mediawiki/tables/revision/snapshot=2017-06.dan.1/wiki_db=enwiki/part-m-00001.avro
-rw-r--r--   3 milimetric milimetric 12836223771 2017-07-06 22:27 wmf/data/raw/mediawiki/tables/revision/snapshot=2017-06.dan.1/wiki_db=enwiki/part-m-00002.avro
-rw-r--r--   3 milimetric milimetric 12942402964 2017-07-06 22:28 wmf/data/raw/mediawiki/tables/revision/snapshot=2017-06.dan.1/wiki_db=enwiki/part-m-00003.avro
milimetric@analytics1003:~/refinery$

That's what the patch here adds, but it probably needs a closer look. Adding @mforns to review.
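
For context, roughly the kind of invocation this enables — a hedged sketch, not the exact refinery-sqoop command (the JDBC host is a placeholder), just to show the --split-by/--num-mappers combination that produced the four part files above:

<code>

# Illustrative sketch of split-by parallelization (not the exact refinery
# invocation; the MySQL host below is a placeholder). Asking sqoop for 4
# mappers and splitting on rev_id yields four part-m-0000N.avro files.
import subprocess

subprocess.check_call([
    'sqoop', 'import',
    '--connect', 'jdbc:mysql://MYSQL_HOST/enwiki',  # placeholder host
    '--table', 'revision',
    '--split-by', 'rev_id',      # numeric key sqoop partitions the table on
    '--num-mappers', '4',        # 4 parallel map tasks -> 4 output files
    '--as-avrodatafile',
    '--target-dir', 'wmf/data/raw/mediawiki/tables/revision/snapshot=2017-06.dan.1/wiki_db=enwiki',
])

</code>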

Change 363846 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[operations/puppet@production] Use parallelism to sqoop large tables

https://gerrit.wikimedia.org/r/363846

@Milimetric: This is awesome :)
Something to keep in mind (and possibly an option to add to the job): the "number of mappers" you give to sqoop will, in theory, all run in parallel on the cluster. In the example you tried, you asked for 4 mappers, which is small enough not to break the database, but if someone tries 100 it could definitely overload the DB (actually the jobs would fail because of too many connections, but I mention it just for clarity).
We should use "mapreduce.jobtracker.maxtasks.perjob", which defines the maximum number of tasks a job can run in parallel. In our case, given that we already parallelize sqoop jobs in production, we should probably use 1 for production and at most 10 for manual runs.
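
A sketch of how that cap could be passed (assuming it is set per job through sqoop's generic Hadoop -D options; host and target dir are placeholders, not a tested invocation):

<code>

# Sketch only: sqoop forwards generic Hadoop options given as -Dprop=value
# right after the tool name, so the per-job task cap mentioned above could
# be applied like this for the production (serialized) case.
import subprocess

subprocess.check_call([
    'sqoop', 'import',
    '-Dmapreduce.jobtracker.maxtasks.perjob=1',     # cap concurrent tasks per job
    '--connect', 'jdbc:mysql://MYSQL_HOST/enwiki',  # placeholder host
    '--table', 'revision',
    '--split-by', 'rev_id',
    '--num-mappers', '4',        # still 4 splits, but only 1 runs at a time
    '--as-avrodatafile',
    '--target-dir', 'PLACEHOLDER_TARGET_DIR',       # placeholder path
])

</code>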

Change 363735 merged by Milimetric:
[analytics/refinery@master] Implement sqooping with mappers > 1

https://gerrit.wikimedia.org/r/363735

Change 363846 merged by Elukey:
[operations/puppet@production] Use parallelism to sqoop large tables

https://gerrit.wikimedia.org/r/363846

Nuria closed this task as Resolved. Jul 12 2017, 7:16 PM