Task T335862: Implement job to generate Dump XML files will produce a Spark job as its output. In this task we should implement an Airflow job that:
- Waits on the upstream table `wmf_dumps.wikitext_raw_rc2` to be ready. (We still need to figure out what "ready" means; TBD later.)
- Runs the Spark job.
- Runs the offline script that takes the Spark output files and renames them. The job should generate one file per partition, so there is a 1:1 name correspondence between each partition folder and the single file inside it. (This is done as part of the Spark job itself.)
- Publishes the files. (For now, publishing should be internal, so maybe just a `mv` to a well-known location in HDFS?)