
Investigate Migration of Image Suggestion Job to DE Data Pipeline
Closed, Resolved, Public

Description

Overview
The first Image Suggestion Data Pipeline was built on the Platform Engineering Airflow instance; we want to migrate it into the Data Pipeline infrastructure provided by Data Engineering (DE).

Done

  • Review existing Image Suggestion Data Pipeline
  • Identify gaps between Platform and DE systems
  • In concert with DE, capture work required to port Image Suggestion pipeline to Data Pipeline Infrastructure provided by DE

Event Timeline

WDoranWMF renamed this task from Investigate Migration of Image Suggestion Job to DE Data Pipeline to [NEEDS GROOMING]Investigate Migration of Image Suggestion Job to DE Data Pipeline. Jun 15 2022, 11:34 AM
WDoranWMF renamed this task from [NEEDS GROOMING]Investigate Migration of Image Suggestion Job to DE Data Pipeline to [NEEDS GROOMING] Investigate Migration of Image Suggestion Job to DE Data Pipeline.
WDoranWMF moved this task from Backlog to Investigate 🔍 on the Generated Data Platform board.

I had meetings with the stakeholders for this effort.

@mfossati, owner of the pipeline we intend to port.
@gmodena, who put together the scaffolding for the datapipelines git repo.
@mforns and @Ottomata, who put together the airflow_dags git repo.

The work required for the migration is as follows:

  • Move the image_suggestions code verbatim to airflow_dags as a first step. It would live in a new folder "platform_eng", which would map to their new Airflow instance.
  • Modify the DAG to use the “operators with defaults” abstraction from airflow_dags instead of the “task that converts to an Operator” abstraction used in datapipelines.
  • Separate the PySpark code into its own GitLab repo, and use GitLab CI to generate a conda environment from it (following this example).
  • Modify the DAG to use the SparkSubmitOperator.for_virtualenv() abstraction to launch Spark jobs instead of the current BashOperator.
    • I did a quick test for this and it was successful, and the operator call is indeed cleaner:
SparkSubmitOperator.for_virtualenv(
    task_id="commons_index",
    virtualenv_archive=artifact("image-suggestions.conda.tgz"),
    entry_point="lib/python3.7/site-packages/src/commonswiki_file.py",
    launcher="skein",
    application_args=args,
    dag=dag,
)
  • Wait for a successful run.
  • Present the findings and the porting MR to Marco so that he is aware of the changes.
  • Spin up a new Airflow server that would become the platform_eng instance, following the conventions from airflow_dags (and announce it so that folks potentially affected are aware).
  • Return the ownership of the job to Marco’s team.
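The difference between the two DAG abstractions above can be illustrated without Airflow itself. The sketch below is a minimal, hypothetical stand-in for the “operators with defaults” pattern: a factory pre-binds instance-wide defaults so each task declaration only supplies what differs. The names (with_defaults, spark_submit) are illustrative, not the real airflow_dags API.

```python
def with_defaults(operator, **defaults):
    """Return a factory that merges shared defaults into each operator call.

    Per-call keyword arguments override the shared defaults, mirroring how
    instance-wide operator defaults work in the airflow_dags repo.
    """
    def factory(**overrides):
        merged = {**defaults, **overrides}
        return operator(**merged)
    return factory


def spark_submit(**kwargs):
    """Toy 'operator' that just records the kwargs it was built with."""
    return kwargs


# Instance-wide defaults (e.g. the launcher) are declared once...
submit = with_defaults(spark_submit, launcher="skein", queue="default")

# ...so individual task declarations stay short, like the
# SparkSubmitOperator.for_virtualenv() call shown earlier:
task = submit(task_id="commons_index", application_args=["--help"])
print(task["launcher"])  # skein
print(task["task_id"])   # commons_index
```

The per-call overrides win over the defaults, so a single task can still opt out of an instance-wide setting without touching the shared configuration.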
xcollazo renamed this task from [NEEDS GROOMING] Investigate Migration of Image Suggestion Job to DE Data Pipeline to Investigate Migration of Image Suggestion Job to DE Data Pipeline. Jun 27 2022, 2:16 PM
xcollazo updated the task description.

Created T311417 to track the implementation work.

Closing this one as done.

Great summary, thanks @XCollazo-WMF.