In T310692, we investigated the steps required to migrate the image_suggestions Airflow workflow to the new conventions under the airflow_dags project.
In this task we will do the actual work.
The work required for the migration is as follows:
- Move the image_suggestions code verbatim to airflow_dags as a first step. It would live in a new folder "platform_eng", which would map to their new Airflow instance.
- Modify the DAG from the “Tasks that convert to an Operator” abstraction used in datapipelines to the “operators with defaults” abstraction used in airflow_dags (a sketch of that style follows this list).
- Separate the PySpark code into its own GitLab repo, and use GitLab CI to generate a conda environment out of it (following this example).
- Modify the DAG to use the SparkSubmitOperator.for_virtualenv() abstraction to launch Spark jobs instead of the current use of a BashOperator (a sketch of the BashOperator approach also follows this list).
- I did a quick test of this and it was successful; the operator call is indeed cleaner:

```
# Assumes SparkSubmitOperator and the artifact() helper are imported from
# airflow_dags' shared modules, and that `args` and `dag` are defined earlier
# in the DAG file.
SparkSubmitOperator.for_virtualenv(
    task_id="commons_index",
    # The conda environment published by the GitLab CI pipeline above.
    virtualenv_archive=artifact("image-suggestions.conda.tgz"),
    # Entry point inside the unpacked environment.
    entry_point="lib/python3.7/site-packages/src/commonswiki_file.py",
    launcher="skein",
    application_args=args,
    dag=dag,
)
```
- Wait for a successful run.
- Present the findings and the porting MR to Marco so that he is aware of the changes.
- Spin up a new Airflow server to serve as the new platform_eng instance, following the conventions from airflow_dags (and announce it so that folks potentially affected by this are aware).
- Update the documentation so that it is clear which Airflow instances currently exist and what the development instructions are.
- Return the ownership of the job to Marco’s team.
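For reference, here is a minimal sketch of the “operators with defaults” style mentioned above, using plain Airflow default_args. The exact helpers and module layout in airflow_dags may differ, and all names below (DAG id, owner, schedule) are illustrative only.

```
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Shared defaults declared once; in airflow_dags these would typically come
# from the instance-level configuration rather than being inlined like this.
default_args = {
    "owner": "platform_eng",  # illustrative owner
    "retries": 3,
}

with DAG(
    dag_id="image_suggestions_example",  # illustrative DAG id
    start_date=datetime(2022, 7, 1),
    schedule_interval="@weekly",
    default_args=default_args,
) as dag:
    # Operators pick up the defaults above automatically; only the
    # task-specific arguments need to be spelled out per task.
    sanity_check = BashOperator(
        task_id="sanity_check",
        bash_command="echo ok",
    )
```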
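And a rough sketch of the kind of BashOperator-based Spark launch that for_virtualenv() would replace. This is an assumption about the current shape of the job, not a copy of the existing DAG: the spark-submit options, paths, and script name are made up, and it builds on the `dag` defined in the previous sketch.

```
from airflow.operators.bash import BashOperator

# Hypothetical example of launching the PySpark job by shelling out to
# spark-submit; every path and option here is illustrative.
commons_index = BashOperator(
    task_id="commons_index",
    bash_command=(
        "spark-submit "
        "--master yarn "
        "--archives image-suggestions.conda.tgz#venv "
        "venv/lib/python3.7/site-packages/src/commonswiki_file.py "
        "{{ ds }}"
    ),
    dag=dag,  # assumes the `dag` object from the sketch above
)
```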