
Write Airflow DAG to move the webrequest load job to airflow.
Closed, Resolved | Public | 1 Estimated Story Points

Description

Goal:
Write the airflow DAG to migrate the webrequest load jobs to Airflow

Job Details:

Input    | Processing | Output
Raw JSON | Hive       | Hive + Table Tests

Success Criteria:

  • Have the 2 jobs migrated (SLA: 5 hours)

Gotchas

  • This job includes archiving of results. We may need to adapt the existing Airflow custom ArchiveOperator to match this job's format.
  • The job needs to be rewritten; how is still to be determined.

Dependencies

Here is a list of Oozie jobs using the webrequest Oozie-datasets.

Jobs already migrated to Airflow:

  • pageview actor
  • aqs hourly
  • mobile_apps/session_metrics

Jobs defined in Oozie, but it looks like the Oozie schedule is not started:

  • apis
  • wikidata/specialentitydata_metrics
  • wikidata/articleplaceholder_metrics
  • wikidata/reliability_metrics
  • webrequest/subset
  • mobile_apps/uniques/daily
  • mediarequest/hourly

And we should make sure those jobs are still working after the migration of refine webrequest to Airflow:

  • mobile_apps/uniques/monthly
  • webrequest/druid/hourly & daily
  • learning/features/actor/hourly
  • banner activity Druid daily
  • mediacounts/load

Conclusion: We will write a _SUCCESS file at the end of the Airflow DAG, so that the remaining Oozie jobs, which wait on that file, are still triggered properly.
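
As a rough illustration of the conclusion above, the last step of the DAG could drop an empty marker file into the freshly refined partition. This is only a sketch on a local filesystem: the real job writes to HDFS, and the partition layout, the base directory, and the helper names `partition_dir` and `write_success_flag` are assumptions made for the example, not taken from the patch.

```python
from datetime import datetime
from pathlib import Path


def partition_dir(base: Path, source: str, hour: datetime) -> Path:
    """Build a Hive-style hourly partition directory (layout is an assumption)."""
    return (
        base
        / f"webrequest_source={source}"
        / f"year={hour.year}"
        / f"month={hour.month}"
        / f"day={hour.day}"
        / f"hour={hour.hour}"
    )


def write_success_flag(partition: Path) -> Path:
    """Create an empty success marker once the partition is fully written."""
    partition.mkdir(parents=True, exist_ok=True)
    flag = partition / "_SUCCESS"
    flag.touch()  # Empty file: downstream jobs only check that it exists.
    return flag
```

Writing the flag as the very last task means a downstream job that polls for it never sees a half-written partition.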

Subtasks to do:

  • Create HQL file to refine
  • Create DAG scoped on the refine process
  • Create DAG for the test cluster (this specific source is test_text)
  • Add new DAG task test fixtures
  • Add back from Oozie the data quality mechanism (sequence statistics tables + emails) in the Airflow DAG
  • Add back from Oozie the data quality mechanism (sequence statistics tables + emails) in the HQL folder
  • Manual tests of HQL files with Spark 3 (get optimal parameters for distribution)
  • Manual tests of the Airflow DAGs on statbox
  • Review all Oozie code and salvage comments
  • Create 1 DAG per webrequest_source
  • Review datasets doc on Datahub & Wikitech
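
The "1 DAG per webrequest_source" subtask could be approached by generating one configuration per source and building a DAG from each. The sketch below is plain Python with no Airflow dependency. The `RefineDagConfig` class, the `build_configs` helper, the DAG id pattern, and the production source names `text` and `upload` are all assumptions for illustration; only `test_text` and the 5-hour SLA come from this task.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class RefineDagConfig:
    """Per-source settings for one refine DAG (hypothetical structure)."""
    dag_id: str
    webrequest_source: str
    sla: timedelta


def build_configs(sources: Iterable[str], sla_hours: int = 5) -> List[RefineDagConfig]:
    """Return one config per webrequest source, all sharing the same SLA."""
    return [
        RefineDagConfig(
            dag_id=f"refine_webrequest_hourly_{source}",  # Naming pattern is an assumption.
            webrequest_source=source,
            sla=timedelta(hours=sla_hours),
        )
        for source in sources
    ]


# Assumed source names; only test_text is named in this task.
PRODUCTION_SOURCES = ("text", "upload")
TEST_SOURCES = ("test_text",)
```

In a DAG file, each config would then be passed to whatever factory builds the actual Airflow DAG, so the production and test clusters differ only in the list of sources.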

Event Timeline

EChetty set the point value for this task to 3. (Jan 17 2023, 11:48 AM)

In the description, I've added a list of jobs that look like dependencies.

Change 894661 had a related patch set uploaded (by Aqu; author: Aqu):

[analytics/refinery@master] Migrate refine webrequest to Airflow

https://gerrit.wikimedia.org/r/894661

JArguello-WMF changed the point value for this task from 3 to 1. (Apr 3 2023, 4:50 PM)

Change 894661 merged by Aqu:

[analytics/refinery@master] Migrate refine webrequest to Airflow

https://gerrit.wikimedia.org/r/894661

Change 908529 had a related patch set uploaded (by Aqu; author: Aqu):

[operations/puppet@production] Remove extra check on webrequest _SUCCESS files on HDFS

https://gerrit.wikimedia.org/r/908529

Change 908533 had a related patch set uploaded (by Aqu; author: Aqu):

[operations/puppet@production] Prepare removal of systemd_timer check_webrequest_partitions

https://gerrit.wikimedia.org/r/908533

Change 908533 merged by Elukey:

[operations/puppet@production] analytics: Prepare removal of systemd_timer check_webrequest_partitions

https://gerrit.wikimedia.org/r/908533

Change 908529 merged by Ottomata:

[operations/puppet@production] analytics: Remove extra check on webrequest _SUCCESS files on HDFS

https://gerrit.wikimedia.org/r/908529