By default, Sqoop queries the database it's connected to in order to generate ORM Java classes before transferring data. It also accepts a parameter to reuse a pre-generated jar instead. If we pass this, our sqoop jobs will run faster because the code-generation step is skipped.
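For illustration, the two-step flow Sqoop supports looks roughly like this (connection string, table name, and paths here are made up; the flags themselves — codegen, --bindir, --jar-file, --class-name — are standard Sqoop options):

```
# Step 1 (once): generate the ORM bindings and package them into a jar.
sqoop codegen \
  --connect jdbc:mysql://dbhost/enwiki \
  --table revision \
  --outdir /tmp/sqoop-gen \
  --bindir /srv/sqoop-jars

# Step 2 (every run): import using the pre-generated jar,
# skipping the per-table column-detection / codegen step.
sqoop import \
  --connect jdbc:mysql://dbhost/enwiki \
  --table revision \
  --jar-file /srv/sqoop-jars/revision.jar \
  --class-name revision \
  --target-dir /wmf/data/raw/revision
```

Note that --jar-file requires --class-name, since Sqoop needs to know which class inside the jar maps to the table.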
Event Timeline
Oh yeah, it's relevant even if we only run sqoop once, because it repeats the column-detection process for every table in every database (so roughly 5000 times per run). I probably shouldn't have skipped this in the first place, but I was afraid I'd find different schemas on different dbs (which is actually the case), so I didn't want to hold up the sqooping task any longer.
- we need a jar with bindings for the MySQL schema. How do we generate this?
- we could run the script with a --generate-jar parameter, or do it automatically when we deploy refinery; the jars would need to live somewhere the script can find them at runtime. The script needs to run on 1002 to be able to generate bindings from MySQL
- change the sqoop job to take a parameter that passes the jar along
- pass the jar to the sqoop job so it no longer has to generate ORM code itself
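The steps above might look like this in practice (the script name and paths are assumptions for illustration; the --generate-jar and --jar-file option names match the refinery patch that was merged for this task):

```
# On a host with MySQL access (e.g. 1002): generate the bindings jar once,
# writing it to a location the sqoop job can read at runtime.
sqoop-mediawiki-tables --generate-jar \
  --jar-file /srv/refinery/mediawiki-tables-sqoop-orm.jar

# On each scheduled run: reuse the pre-generated jar instead of
# re-detecting columns for every table in every database.
sqoop-mediawiki-tables \
  --jar-file /srv/refinery/mediawiki-tables-sqoop-orm.jar
```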
Ping @Milimetric: this is lower priority than our design work, but if you feel you need to grab an item you could take this one.
Change 349723 had a related patch set uploaded (by Milimetric):
[analytics/refinery@master] [WIP] Add just-generate-jar and jar-file options
Change 349723 merged by Ottomata:
[analytics/refinery@master] Add --generate-jar and --jar-file options
Change 351667 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery@master] Add README.mediawiki-tables-sqoop-orm
Change 351857 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[operations/puppet@production] Sqoop using the pre-generated orm jar
Change 351857 merged by Elukey:
[operations/puppet@production] Sqoop using the pre-generated orm jar
Change 351667 merged by Ottomata:
[analytics/refinery@master] Add README.mediawiki-tables-sqoop-orm