Migration of Image Suggestion Job to DE Data Pipeline
Closed, Resolved | Public | 3 Estimated Story Points

Description

In T310692, we investigated the steps required to migrate the image_suggestions Airflow workflow to the new conventions of the airflow_dags project.

In this task we will do the actual work.

The work required for the migration is as follows:

  • Move the image_suggestions code verbatim to airflow_dags as a first step. It would live in a new folder "platform_eng", which would map to their new Airflow instance.
  • Change the DAG from the “Tasks that convert to an Operator” abstraction used in datapipelines to the “operators with defaults” abstraction used in airflow_dags.
  • Separate the PySpark code into its own GitLab repo, and use GitLab CI to generate a conda environment out of it (following this example).
  • Modify the DAG to launch Spark jobs through the SparkSubmitOperator.for_virtualenv() abstraction instead of the current BashOperator.
    • I did a quick test for this and it was successful, and the operator call is indeed cleaner:
      SparkSubmitOperator.for_virtualenv(
          task_id="commons_index",
          virtualenv_archive=artifact("image-suggestions.conda.tgz"),
          entry_point="lib/python3.7/site-packages/src/commonswiki_file.py",
          launcher="skein",
          application_args=args,
          dag=dag,
      )
  • Wait for a successful run.
  • Present the findings and the porting MR to Marco so that he is aware of the changes.
  • Spin up a new Airflow server that would be the new platform_eng instance and that follows the conventions from airflow_dags (and make sure to announce it so that folks potentially affected by this are aware).
  • Update the documentation so that it is clear what the current Airflow instances are and what the dev instructions are.
  • Return the ownership of the job to Marco’s team.
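
The “operators with defaults” idea from the second step can be sketched in plain Python, with no Airflow installed. This is only an illustration of the pattern, not the airflow_dags implementation: `INSTANCE_DEFAULTS`, `SparkTaskSpec`, and `spark_task_with_defaults` are hypothetical names, and the default values are assumed for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Hypothetical instance-wide defaults (assumed values, not taken from airflow_dags).
INSTANCE_DEFAULTS: Dict[str, Any] = {"launcher": "skein", "retries": 1}


@dataclass
class SparkTaskSpec:
    """Stand-in for an operator: a task id plus its resolved parameters."""

    task_id: str
    params: Dict[str, Any] = field(default_factory=dict)


def spark_task_with_defaults(task_id: str, **overrides: Any) -> SparkTaskSpec:
    """Merge the instance defaults with per-task overrides; overrides win."""
    return SparkTaskSpec(task_id=task_id, params={**INSTANCE_DEFAULTS, **overrides})


# The DAG author only states what differs from the instance defaults.
spec = spark_task_with_defaults(
    "commons_index",
    entry_point="lib/python3.7/site-packages/src/commonswiki_file.py",
    retries=3,
)
print(spec.params)
```

With this shape, each DAG file only names what is specific to the task, while settings shared by the whole instance live in one place and are not mutated by individual tasks.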

Event Timeline

(For the GitLab repo containing the business logic, I am following the git instructions at https://stackoverflow.com/questions/1365541/how-to-move-some-files-from-one-git-repo-to-another-not-a-clone-preserving-hi to avoid losing the commit history.)

Making good progress here.

The current draft of the ported Airflow DAG can be seen at: https://gitlab.wikimedia.org/xcollazo/airflow-dags/-/blob/port-image-suggestions-v3/platform_eng/dags/image-suggestions_dag.py. I've done a couple of test runs, and both the Sensors and the Spark jobs are running fine. I have not tried the Cassandra jobs yet, to avoid breaking prod.

The business logic is now on this repo: https://gitlab.wikimedia.org/repos/generated-data-platform/image_suggestions.

Still to do:
  • Fix bug with GitLab CI to generate the conda artifact.
  • With help from @Ottomata, spin up a new airflow instance.

Related to this work, but perhaps not a blocker: try and fix T311646.

> With help from @Ottomata, spin up a new airflow instance.

I think it would be a little (just a little!) bit easier to do as planned and wipe the current platform_eng airflow instance and start from scratch there. If that is not possible, we can make a new instance (but we may need to do some naming conflict tricks in puppet to keep both online with different settings).

@emil @JArguello-WMF: @xcollazo is getting close to ready here, and he needs us to provide him with a new Airflow instance (on ganeti VPS) that conforms to our patterns. Tagging you for sprint planning and prioritization.

EChetty set the point value for this task to 5. (Aug 1 2022, 1:05 PM)
EChetty changed the point value for this task from 5 to 3.
EChetty edited projects, added Data Pipelines (Sprint 00); removed Data Pipelines.

I just transferred the image-suggestions GitLab repo from the generated-data-platform group to its new home at https://gitlab.wikimedia.org/repos/structured-data/image-suggestions. It is now owned by the structured-data folks.

T315633 is moving slower than expected, and in reality it is work that is separate from the main migration. To close out the image-suggestions migration, I'll detach that task and pursue it independently.

With this, we are done with this migration. Closing!