enwiki.revision, for example, failed to import properly.
| Status | Assignee | Task |
| Resolved | None | T161147 Provide cumulative edit count in Data Lake edit data |
| Resolved | Milimetric | T169782 Troubleshoot issues with sqoop of data not working for big tables |
Diagnosis so far:
It looks like Joseph did something close to the following sequence of actions:
- add two tables to the sqoop script
- update the ORM jar
- merge and deploy these changes
- realize there were FileAlreadyExistsException exceptions in the log
- fix this by getting sqoop to delete the existing target directory when it had to retry
- try to run the sqoop manually
- find the problem described in T169735
So I don't think the error is related to any of those patches; it looks like a simple timeout error. I'm going to run a test overnight, see how that goes, and take a detailed look at the logs tomorrow.
More context: it looks like Bob West ran into the same issue a couple of years ago while reading from analytics-store: https://wm-bot.wmflabs.org/logs/%23wikimedia-databases/20150622.txt. In that case it sounded like the server was just exceptionally busy. Based on that, I suppose we can try sqooping again manually and see if we get a better result.
@Milimetric: Thanks for clarifying the context; I confirm the sequence you reconstructed.
Some more info: I tried a few manual runs against enwiki.revision; they all failed.
@Milimetric: I'm starting to think we'll have to split the sqoop jobs for big tables, but I don't have a good solution (multiple mappers in sqoop with resource restrictions? A manual split in Python? I don't know ...).
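For what the "manual split in Python" option could look like, here is a minimal sketch: divide a table's primary-key range into N chunks and run one bounded sqoop query per chunk. The function name and the `rev_id` bounds are illustrative assumptions, not anything from the actual patch.

```python
# Hypothetical sketch of a manual split: carve [min_id, max_id] into
# num_chunks contiguous ranges, each of which would become one sqoop
# job with a bounded WHERE clause.

def split_ranges(min_id, max_id, num_chunks):
    """Return (lo, hi) boundary pairs covering [min_id, max_id] inclusive."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_chunks)
    ranges = []
    lo = min_id
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)  # spread the remainder
        hi = lo + size - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

# Each pair would feed something like:
#   ... --where "rev_id >= {lo} AND rev_id <= {hi}" ...
print(split_ranges(1, 100, 4))
# -> [(1, 25), (26, 50), (51, 75), (76, 100)]
```

The upside over sqoop's built-in split-by would be full control over how many jobs hit the DB at once; the downside is reimplementing boundary logic sqoop already has.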
Thanks very much, Joseph, no worries, I got it.
Note to self: check that the cron command puppet runs actually formats the date properly, rather than like this:
select ... from revision where (1 = 0) and rev_timestamp <= '\2017\0701000000'
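The backslashes above suggest the date is being escaped rather than formatted. A quick sketch of the intended rendering, assuming (as MediaWiki's rev_timestamp format implies) a 14-character YYYYMMDDHHMMSS cutoff for the 2017-06 snapshot:

```python
from datetime import datetime

# The snapshot boundary should render as a plain 14-digit timestamp
# string, with no stray backslashes from shell/puppet escaping.
cutoff = datetime(2017, 7, 1).strftime('%Y%m%d%H%M%S')
query = (
    "select ... from revision "
    "where (1 = 0) and rev_timestamp <= '{}'".format(cutoff)
)
print(query)
# -> select ... from revision where (1 = 0) and rev_timestamp <= '20170701000000'
```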
It looks like the split-by parallelization (the easy option) worked!
milimetric@analytics1003:~/refinery$ hdfs dfs -ls wmf/data/raw/mediawiki/tables/revision/snapshot=2017-06.dan.1/wiki_db\=enwiki/
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Found 5 items
-rw-r--r--   3 milimetric milimetric           0 2017-07-06 22:28 wmf/data/raw/mediawiki/tables/revision/snapshot=2017-06.dan.1/wiki_db=enwiki/_SUCCESS
-rw-r--r--   3 milimetric milimetric 11516266876 2017-07-06 22:21 wmf/data/raw/mediawiki/tables/revision/snapshot=2017-06.dan.1/wiki_db=enwiki/part-m-00000.avro
-rw-r--r--   3 milimetric milimetric 12557481065 2017-07-06 22:26 wmf/data/raw/mediawiki/tables/revision/snapshot=2017-06.dan.1/wiki_db=enwiki/part-m-00001.avro
-rw-r--r--   3 milimetric milimetric 12836223771 2017-07-06 22:27 wmf/data/raw/mediawiki/tables/revision/snapshot=2017-06.dan.1/wiki_db=enwiki/part-m-00002.avro
-rw-r--r--   3 milimetric milimetric 12942402964 2017-07-06 22:28 wmf/data/raw/mediawiki/tables/revision/snapshot=2017-06.dan.1/wiki_db=enwiki/part-m-00003.avro
milimetric@analytics1003:~/refinery$
That's what's being added in the patch here, but it probably needs a closer look. Adding @mforns to review.
@Milimetric: This is awesome :)
Something to keep in mind (and possibly an option to add to the job): the "number of mappers" you give to sqoop will, in theory, all run in parallel on the cluster. In the example you tried, you asked for 4 mappers, which is small enough not to break the database, but if someone tried with 100 it could definitely overload the DB (in practice the jobs would fail because of too many connections, but this is just for clarity).
We should set "mapreduce.jobtracker.maxtasks.perjob", which defines the maximum number of tasks a job can run in parallel (in our case, since we already parallelize sqoop jobs in production, we should probably use 1 for production runs and at most 10 for manual runs).
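A sketch of how the job wrapper could enforce that cap when it builds the sqoop command line; `build_sqoop_args` is a hypothetical helper, and the 1-for-production / 10-for-manual limits are the values suggested above, not settled policy:

```python
# Hedged sketch: clamp the requested mapper count before passing it to
# sqoop, and set the Hadoop property alongside --num-mappers.

def build_sqoop_args(num_mappers, max_parallel_tasks, production=False):
    # Production already parallelizes sqoop jobs across wikis, so
    # clamp to 1 there; manual runs allow more, up to the cap.
    cap = 1 if production else max_parallel_tasks
    mappers = min(num_mappers, cap)
    return [
        'sqoop', 'import',
        # generic -D Hadoop properties must come right after the tool name
        '-Dmapreduce.jobtracker.maxtasks.perjob={}'.format(mappers),
        '--num-mappers', str(mappers),
    ]

# A manual run asking for 100 mappers gets clamped to 10:
print(build_sqoop_args(num_mappers=100, max_parallel_tasks=10))
```

This way a careless `--num-mappers 100` can never translate into 100 simultaneous DB connections.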