Goal:
Write the Airflow DAG to migrate the webrequest load jobs to Airflow
Job Details:
| Input | Processing | Output |
| --- | --- | --- |
| Raw JSON | Hive | Hive + table tests |
Success Criteria:
- Have the 2 jobs migrated (SLA: 5 hours)
Gotchas:
- This job includes archiving of results; we may need to adapt the existing custom Airflow ArchiveOperator to match this job's output format (see the sketch after this list).
- The job needs to be rewritten; exactly how is still TBD.
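A minimal sketch of what that adaptation could look like. Everything here is hypothetical (class name, constructor arguments, placeholder execute body); the real ArchiveOperator's interface may differ:

```python
from airflow.models.baseoperator import BaseOperator


class WebrequestArchiveOperator(BaseOperator):
    """Hypothetical adaptation: archive refined webrequest output using
    this job's directory layout and done flag."""

    def __init__(self, source_dir: str, archive_dir: str,
                 done_flag: str = "_SUCCESS", **kwargs):
        super().__init__(**kwargs)
        self.source_dir = source_dir
        self.archive_dir = archive_dir
        self.done_flag = done_flag

    def execute(self, context):
        # Placeholder body: the real operator would move the files on HDFS
        # and then create the done flag in the archive directory.
        self.log.info("Archiving %s -> %s (flag: %s)",
                      self.source_dir, self.archive_dir, self.done_flag)
```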
Dependencies:
Below is a list of the Oozie jobs that consume the webrequest Oozie datasets.
Jobs already migrated to Airflow:
- pageview actor
- aqs hourly
- mobile_apps/session_metrics
Jobs defined in Oozie whose Oozie schedule does not appear to be started:
- apis
- wikidata/specialentitydata_metrics
- wikidata/articleplaceholder_metrics
- wikidata/reliability_metrics
- webrequest/subset
- mobile_apps/uniques/daily
- mediarequest/hourly
Jobs we should verify are still working after the webrequest refine job is migrated to Airflow:
- mobile_apps/uniques/monthly
- webrequest/druid/hourly & daily
- learning/features/actor/hourly
- banner activity Druid daily
- mediacounts/load
Conclusion: we will write a SUCCESS flag file at the end of the Airflow DAG, so that the remaining Oozie jobs keep being triggered properly.
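A sketch of that final task, assuming a plain BashOperator touching the flag on HDFS. The partition path template is illustrative; the flag name and location must match what the Oozie dataset definitions poll for:

```python
from airflow.operators.bash import BashOperator

# Illustrative path; the real partition layout comes from the Oozie
# dataset definitions, which poll for the done flag (usually _SUCCESS).
write_success_flag = BashOperator(
    task_id="write_success_flag",
    bash_command=(
        "hdfs dfs -touchz "
        "/wmf/data/wmf/webrequest/webrequest_source={{ params.source }}/"
        "year={{ execution_date.year }}/month={{ execution_date.month }}/"
        "day={{ execution_date.day }}/hour={{ execution_date.hour }}/_SUCCESS"
    ),
    params={"source": "text"},
)
```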
Subtasks to do:
- Create the HQL file for the refine step
- Create a DAG scoped to the refine process (see the DAG sketch after this list)
- Create a DAG for the test cluster (the specific source there is test_text)
- Add test fixtures for the new DAG tasks (see the test sketch after this list)
- Add back the data quality mechanism from Oozie (sequence statistics tables + emails) in the Airflow DAG
- Add back the data quality mechanism from Oozie (sequence statistics tables + emails) in the HQL folder
- Manually test the HQL files with Spark 3 (find optimal distribution parameters)
- Manually test the Airflow DAGs on statbox
- Review all the Oozie code and salvage comments
- Review the dataset docs on DataHub & Wikitech
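A minimal sketch of the refine-scoped DAG, assuming the refine HQL is submitted through the stock Spark provider's SparkSqlOperator (which may not be the operator we end up using). The dag_id, file name, schedule, and default_args are placeholders:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_sql import SparkSqlOperator

default_args = {
    # Mirrors the success criterion above: alert when an hourly run
    # has not completed within 5 hours.
    "sla": timedelta(hours=5),
    "retries": 3,
}

with DAG(
    dag_id="webrequest_load_text",  # placeholder dag_id
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",
    default_args=default_args,
    catchup=False,
) as dag:
    SparkSqlOperator(
        task_id="refine_webrequest",
        # Placeholder file name for the HQL created in the first subtask;
        # templated params would select the hourly partition to refine.
        sql="refine_webrequest.sql",
    )
```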
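And a sketch of a DAG-loading test for the test-fixtures subtask, using pytest and Airflow's DagBag; the dag_id and task_id match the placeholders in the DAG sketch above:

```python
import pytest
from airflow.models import DagBag


@pytest.fixture(scope="session")
def dagbag():
    # Load only our own DAG files, not Airflow's bundled examples.
    return DagBag(include_examples=False)


def test_webrequest_load_dag_loads(dagbag):
    assert dagbag.import_errors == {}
    dag = dagbag.get_dag("webrequest_load_text")  # placeholder dag_id
    assert dag is not None
    # At minimum the refine task should be present.
    assert "refine_webrequest" in dag.task_ids
```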