The Airflow job should:
- run weekly on Mondays.
- wait for the source data to be available:
  - the source folder has the form hdfs://analytics-hadoop/wmf/data/raw/commons/dumps/mediainfo-json/YYYYMMDD
  - the source folder contains a file named _IMPORTED once the source data has been successfully imported into it
- run a Spark job that reads the source data and writes it to Hive
- the Spark job is packaged in the refinery-job.jar archive, which the Airflow job needs as a dependency
- the Spark job class is org.wikimedia.analytics.refinery.job.structureddata.jsonparse.JsonDumpConverter
- the main parameters of the job are the input folder, the output Hive table, and the snapshot (time partition) being created. The output Hive table will be structured_data.commons_entity and the snapshot will have the form YYYY-MM-DD. See the class for the detailed list of parameters :) A sketch of a possible DAG is given after this list.
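
As a rough illustration of how these requirements could map onto an Airflow 2 DAG, here is a minimal sketch using the stock WebHdfsSensor and SparkSubmitOperator providers. The jar location, connection ids, sensor timings, and the JsonDumpConverter argument names (--input_path, --output_table, --snapshot) are assumptions for illustration only; the real parameter names should be taken from the class itself, and the real DAG should follow the conventions of the surrounding DAG repository.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.hdfs.sensors.web_hdfs import WebHdfsSensor
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Placeholder location of the refinery-job.jar dependency (assumption).
REFINERY_JOB_JAR = "hdfs://analytics-hadoop/path/to/refinery-job.jar"

with DAG(
    dag_id="commons_mediainfo_dumps_to_hive",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 0 * * 1",  # weekly, on Mondays
    catchup=False,
) as dag:

    # Wait for the _IMPORTED flag file in the weekly source folder.
    # {{ ds_nodash }} renders the run's logical date as YYYYMMDD.
    wait_for_dump = WebHdfsSensor(
        task_id="wait_for_mediainfo_dump",
        filepath=(
            "/wmf/data/raw/commons/dumps/mediainfo-json/"
            "{{ ds_nodash }}/_IMPORTED"
        ),
        webhdfs_conn_id="webhdfs_default",  # assumed connection id
        poke_interval=60 * 60,              # check hourly (arbitrary)
        timeout=60 * 60 * 24 * 2,           # give up after two days (arbitrary)
    )

    # Run the JsonDumpConverter Spark job from refinery-job.jar.
    # Argument names below are placeholders; see the class for the real ones.
    convert_dump = SparkSubmitOperator(
        task_id="convert_mediainfo_dump_to_hive",
        application=REFINERY_JOB_JAR,
        java_class=(
            "org.wikimedia.analytics.refinery.job."
            "structureddata.jsonparse.JsonDumpConverter"
        ),
        application_args=[
            "--input_path",
            "hdfs://analytics-hadoop/wmf/data/raw/commons/dumps/"
            "mediainfo-json/{{ ds_nodash }}",
            "--output_table", "structured_data.commons_entity",
            "--snapshot", "{{ ds }}",  # snapshot as YYYY-MM-DD
        ],
        conn_id="spark_default",  # assumed connection id
    )

    wait_for_dump >> convert_dump
```

The sketch only shows the dependency chain "wait for _IMPORTED, then launch the converter"; scheduling the run on Mondays is handled by the cron expression, and the sensor/submit settings would be tuned to whatever the production environment expects.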