
Productionize the Airflow DAG of section alignment-based suggestions
Closed, Resolved; Public

Description

Research's code repository for image suggestions based on section alignment contains an Airflow DAG. The section-level image suggestions Airflow job should take it as a dependency, i.e., wait for its completion, since its output serves as input for the final data pipeline.

Tasks

  • migrate the DAG to airflow-dags
  • if needed, make it Spark 3 compatible
  • deploy & run in production

Details

Reference | Source Branch | Dest Branch | Author | Title
repos/data-engineering/airflow-dags!228 | T328641-add-section-image-recs-dag | main | xcollazo | T328641 add section image recs dag
repos/structured-data/section-image-recs!1 | use-conda-for-dep-management | master | xcollazo | Bump to Spark3. Use conda for dep management. Use poetry for build and test.

Event Timeline

FYI @xcollazo: I ran Research's repo on a stat machine with Spark 3 as follows:

conda-analytics-clone dev
source conda-analytics-activate dev
pip install mwparserfromhell

These changes made it work:

  • use wmfdata.spark.create_session here
from wmfdata.spark import create_session
spark = create_session(
    type='yarn-large',
    ship_python_env=True,
    extra_settings={ 'spark.sql.shuffle.partitions': 1024 },
)
  • use a dataclass instead of a NamedTuple subclass here
from dataclasses import dataclass

@dataclass
class SectionImages:
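
The class body is cut off in the snippet above. As a rough illustration of the swap, assuming hypothetical field names (`section_title`, `images`) that do not come from the repository, the change might look like this:

```python
from dataclasses import asdict, dataclass
from typing import List, NamedTuple


# Before: a NamedTuple subclass (field names are hypothetical).
class SectionImagesTuple(NamedTuple):
    section_title: str
    images: List[str]


# After: the same record expressed as a dataclass.
@dataclass
class SectionImages:
    section_title: str
    images: List[str]


row = SectionImages(section_title="History", images=["Example.jpg"])

# A dataclass instance is not a tuple, so code that unpacked or indexed
# the NamedTuple has to switch to attribute access or asdict().
as_dict = asdict(row)
```

The comment does not say why the swap was needed for Spark 3, so this sketch only shows the shape of the change.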
xcollazo changed the task status from Open to In Progress. Feb 6 2023, 3:37 PM

xcollazo updated https://gitlab.wikimedia.org/repos/structured-data/section-image-recs/-/merge_requests/1

Bump to Spark3. Use conda for dep management. Use poetry for build and test.

xcollazo merged https://gitlab.wikimedia.org/repos/structured-data/section-image-recs/-/merge_requests/1

Bump to Spark3. Use conda for dep management. Use poetry for build and test.

(this is still waiting for review)

xcollazo updated the task description.

Deployed to prod.

Will do follow up work via T330667.