We're having trouble with any direct Hadoop integration, so the next approach is to write the output to a plain file and then ingest that file with a Spark Airflow task in the same job.
- Adapt the EventGate output stage to write JSON to a file (a sketch follows this list).
- Pass the destination path into the program as a command-line argument.
- Run the scraper as an Airflow SimpleSkeinOperator task, with the output directory parameter set to a temporary directory that has several GB of space available.
- Write an Airflow SparkSqlOperator task to load this file into our destination table, overwriting the entire partition given by {wiki_dbname, snapshot_date} (see the DAG sketch below).
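
A minimal sketch of what the adapted output stage might look like. `generate_events` is a hypothetical stand-in for the real EventGate stage, and writing one JSON object per line (JSON Lines) is an assumption; it keeps the file splittable and directly readable by Spark's JSON reader:

```python
import argparse
import json


def generate_events():
    # Placeholder for the real scraper / EventGate output stage.
    yield {"wiki_dbname": "enwiki", "snapshot_date": "2024-01-01"}


def main() -> None:
    # The destination path arrives as a CLI argument rather than being
    # hard-coded, so the Airflow task can point it at a temp directory.
    parser = argparse.ArgumentParser(description="Scraper output stage")
    parser.add_argument("output_path", help="file to write JSON events to")
    args = parser.parse_args()

    # One JSON object per line (JSON Lines).
    with open(args.output_path, "w", encoding="utf-8") as out:
        for event in generate_events():
            out.write(json.dumps(event) + "\n")


if __name__ == "__main__":
    main()
```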
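
And a rough shape for the DAG wiring the two tasks together. The SparkSqlOperator import is the one from the apache-spark Airflow provider; the SimpleSkeinOperator import path and its `command` parameter are assumptions about our common Airflow helpers, not the verified signature, and the table name, paths, and scraper CLI are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_sql import SparkSqlOperator

# Assumed import path and parameters for the Skein-based operator.
from wmf_airflow_common.operators.skein import SimpleSkeinOperator

OUTPUT_DIR = "/tmp/scraper_output"  # temp dir with several GB free
WIKI_DBNAME = "enwiki"              # placeholder
SNAPSHOT_DATE = "{{ ds }}"          # Airflow's execution-date macro

with DAG(
    dag_id="scraper_ingest",
    start_date=datetime(2024, 1, 1),
    schedule=None,
) as dag:
    # Step 1: run the scraper, writing JSON to the temp directory.
    scrape = SimpleSkeinOperator(
        task_id="run_scraper",
        command=f"scraper --output {OUTPUT_DIR}/events.json",  # hypothetical CLI
    )

    # Step 2: load the file into the destination table. INSERT OVERWRITE
    # on the partition replaces its contents wholesale, so reruns of the
    # job are idempotent for a given (wiki_dbname, snapshot_date).
    load = SparkSqlOperator(
        task_id="load_to_table",
        sql=f"""
            INSERT OVERWRITE TABLE dest_db.dest_table
            PARTITION (wiki_dbname='{WIKI_DBNAME}', snapshot_date='{SNAPSHOT_DATE}')
            SELECT * FROM json.`{OUTPUT_DIR}/events.json`
        """,
    )

    scrape >> load
```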