**Goal:**
Write the airflow DAG to migrate the webrequest load jobs to Airflow
**Job Details:**
| Input | Processing | Output |
|---|---|---|
| Raw JSON | Hive | Hive + Table Tests |
**Success Criteria:**
- Have the 2 Jobs Migrated (SLA 5 Hours)
=====Gotchas
* This job includes archiving of results. We may need to adapt the existing custom Airflow ArchiveOperator to match this job's output format.
* The job needs to be rewritten; how is still TBD.
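To make the ArchiveOperator discussion concrete, here is a minimal sketch of the core logic such an operator might wrap: moving a completed partition's single output file to its archive location. The function name, the paths, and the one-data-file-per-partition assumption are all hypothetical; only the general pattern (validate, create parent dirs, move) reflects what an adapted operator would need to do.

```python
import shutil
from pathlib import Path


def archive_partition(source_dir: str, archive_file: str) -> str:
    """Move the single data file of a completed partition to its archive path.

    Hypothetical sketch of the logic an adapted ArchiveOperator could wrap;
    in production this would target HDFS rather than a local filesystem.
    """
    # Ignore the _SUCCESS flag file; expect exactly one data file.
    files = [p for p in Path(source_dir).iterdir() if p.name != "_SUCCESS"]
    if len(files) != 1:
        raise ValueError(f"expected exactly one data file, found {len(files)}")

    dest = Path(archive_file)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(files[0]), str(dest))
    return str(dest)
```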
=====Dependencies
Here is a list of Oozie jobs that use the webrequest Oozie datasets.
Jobs already migrated to Airflow:
[X] pageview actor
[X] aqs hourly
[X] mobile_apps/session_metrics
Jobs defined in Oozie whose Oozie schedule does not appear to be started:
[X] apis
[X] wikidata/specialentitydata_metrics
[X] wikidata/articleplaceholder_metrics
[X] wikidata/reliability_metrics
[X] webrequest/subset
[X] mobile_apps/uniques/daily
[X] mediarequest/hourly
We should make sure the following jobs still work after refine webrequest is migrated to Airflow:
[X] mobile_apps/uniques/monthly
[ ] webrequest/druid/hourly & daily
[X] learning/features/actor/hourly
[ ] banner activity Druid daily
[ ] mediacounts/load
Conclusion: we will add a SUCCESS flag file at the end of the Airflow DAG so that the remaining Oozie jobs keep being triggered properly.
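The final DAG task only needs to write an empty `_SUCCESS` flag file into the partition directory, which is what the downstream Oozie datasets poll for. A minimal sketch of that logic as a plain function (a local path stands in for the HDFS partition path, and the function name is an assumption; in the DAG this would run via a PythonOperator or an HDFS client):

```python
from pathlib import Path


def write_success_flag(partition_dir: str) -> str:
    """Write an empty _SUCCESS flag file into the given partition directory.

    Sketch of the DAG's final task: downstream Oozie jobs treat the presence
    of this file as the signal that the partition is complete.
    """
    flag = Path(partition_dir) / "_SUCCESS"
    flag.parent.mkdir(parents=True, exist_ok=True)
    flag.touch()
    return str(flag)
```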
Subtasks to do:
[X] create HQL file to refine
[X] create DAG scoped on the refine process
[X] create DAG for the test cluster (this specific source is `test_text`)
[X] add test fixtures for the new DAG tasks
[X] Port the data quality mechanism from Oozie (sequence statistics tables + emails) into the Airflow DAG
[ ] Port the data quality mechanism from Oozie (sequence statistics tables + emails) into the HQL folder
[X] Manual tests of HQL files with Spark 3 (get optimal parameters for distribution)
[ ] Manual tests of the Airflow DAGs on statbox
[ ] Review all Oozie code and salvage comments
[ ] Review datasets doc on Datahub & Wikitech
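For the data quality subtasks above, the sequence-statistics check boils down to comparing, per host, the number of requests actually received against the range of sequence numbers seen, to estimate loss or duplication. A hedged sketch of that computation (field names and the `(host, sequence)` row shape are assumptions; the real mechanism runs as HQL over the webrequest table):

```python
from collections import defaultdict


def sequence_stats(rows):
    """Per-host sequence statistics in the spirit of the Oozie data-quality step.

    rows: iterable of (host, sequence_number) pairs.
    Returns, per host, the expected count implied by the sequence-number range,
    the actual count received, and the difference (positive => missing rows).
    """
    by_host = defaultdict(list)
    for host, seq in rows:
        by_host[host].append(seq)

    stats = {}
    for host, seqs in by_host.items():
        expected = max(seqs) - min(seqs) + 1  # inclusive sequence range
        actual = len(seqs)
        stats[host] = {
            "expected": expected,
            "actual": actual,
            "missing": expected - actual,
        }
    return stats
```

In the real pipeline the per-host results would be written to the statistics table, with an alert email sent when `missing` exceeds a threshold.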