
Write Airflow DAG to move the webrequest load job to Airflow
Open, Needs Triage · Public · 3 Estimated Story Points


Write the Airflow DAG to migrate the webrequest load jobs to Airflow.

Job Details:

Raw JSON | Hive | Hive + Table Tests

Success Criteria:

  • Have the 2 Jobs Migrated (SLA 5 Hours)
  • This job includes archiving of results. Maybe we need to adapt the existing Airflow custom ArchiveOperator to match this job's format.
  • Job needs to be rewritten - TBD how.

Here is a list of Oozie jobs using the webrequest Oozie-datasets.

Jobs already migrated to Airflow:

  • pageview actor
  • aqs hourly
  • mobile_apps/session_metrics

Jobs defined in Oozie, but it looks like the Oozie schedule is not started:

  • apis
  • wikidata/specialentitydata_metrics
  • wikidata/articleplaceholder_metrics
  • wikidata/reliability_metrics
  • webrequest/subset
  • mobile_apps/uniques/daily
  • mediarequest/hourly

And we should make sure those jobs are still working after the migration of refine webrequest to Airflow:

  • mobile_apps/uniques/monthly
  • webrequest/druid/hourly & daily
  • learning/features/actor/hourly
  • banner activity Druid daily
  • mediacounts/load

Conclusion: We will add a SUCCESS file at the end of the Airflow DAG, so that the remaining Oozie jobs are still triggered properly.
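A minimal sketch of the success-flag idea, assuming a local directory stands in for the output partition and that the flag is an empty file named `_SUCCESS` (the task only says "SUCCESS file"; the exact file name and the `write_success_flag` helper are hypothetical). In an Airflow DAG, something like this would presumably run as the last task, so the flag only appears once everything upstream has succeeded:

```python
from pathlib import Path


def write_success_flag(partition_dir: str, flag_name: str = "_SUCCESS") -> Path:
    """Create an empty done-flag file in an already-written partition directory.

    `flag_name` is an assumption; the task description only mentions a "SUCCESS file".
    """
    directory = Path(partition_dir)
    if not directory.is_dir():
        # Refuse to flag a partition that was never written.
        raise FileNotFoundError(f"partition directory does not exist: {directory}")
    flag = directory / flag_name
    flag.touch(exist_ok=True)  # idempotent, so a task retry does not fail
    return flag
```

Downstream jobs that wait on the flag would then see the partition as complete only after this step has run.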

Subtasks to do:

  • Create HQL file to refine
  • Create DAG scoped on the refine process
  • Create DAG for the test cluster (this specific source is test_text)
  • Add new DAG task test fixtures
  • Add back from Oozie the data quality mechanism (sequence statistics tables + emails) in the Airflow DAG
  • Add back from Oozie the data quality mechanism (sequence statistics tables + emails) in the HQL folder
  • Manual tests of HQL files with Spark 3 (get optimal parameters for distribution)
  • Manual tests of the Airflow DAGs on statbox
  • Review all Oozie code and salvage comments
  • Review datasets doc on Datahub & Wikitech
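
The data quality mechanism in the subtasks above relies on sequence statistics. A rough sketch of the idea, assuming each request carries a host name and a per-host increasing sequence number; the function name, the output field names, and the loss formula are illustrative assumptions, not the actual Oozie or HQL logic:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def sequence_stats(rows: Iterable[Tuple[str, int]]) -> Dict[str, dict]:
    """Per host, compare the sequence numbers seen with the range expected."""
    seqs = defaultdict(list)
    for host, seq in rows:
        seqs[host].append(seq)

    stats = {}
    for host, values in seqs.items():
        # If nothing was lost, every number between min and max appears once.
        expected = max(values) - min(values) + 1
        actual = len(values)
        distinct = len(set(values))
        missing = expected - distinct
        stats[host] = {
            "count_actual": actual,
            "count_expected": expected,
            "count_duplicate": actual - distinct,
            "count_missing": missing,
            "percent_missing": 100.0 * missing / expected,
        }
    return stats
```

An alerting step could then send an email when `percent_missing` or `count_duplicate` crosses a chosen threshold; the thresholds themselves are not stated in this task.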


Reference | Source Branch | Dest Branch | Author | Title
repos/data-engineering/airflow-dags!260 | T327073_migrate_refine_webrequest_to_airflow | main | aqu | [Draft] Migrate refine webrequest job to Airflow

Event Timeline

EChetty set the point value for this task to 3. (Jan 17 2023, 11:48 AM)

In the description, I've added a list of jobs that look like dependencies.

Change 894661 had a related patch set uploaded (by Aqu; author: Aqu):

[analytics/refinery@master] Migrate refine webrequest to Airflow