**Goal:**
Write the airflow DAG to migrate the webrequest load jobs to Airflow
**Job Details:**
| Input | Processing | Output |
|---|---|---|
| Raw JSON | Hive | Hive + Table Tests |
**Success Criteria:**
- Have the 2 Jobs Migrated (SLA 5 Hours)
=====Gotchas
* This job includes archiving of results. We may need to adapt the existing custom Airflow ArchiveOperator to match this job's output format.
* The job needs to be rewritten; how is still TBD.
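To make the ArchiveOperator discussion concrete, here is a minimal sketch of the core logic such an operator might wrap: moving a completed partition's single output file to its archive location. The function name, the paths, and the one-data-file-per-partition assumption are all hypothetical; only the general pattern (validate, create parent dirs, move) reflects what an adapted operator would need to do.

```python
import shutil
from pathlib import Path


def archive_partition(source_dir: str, archive_file: str) -> str:
    """Move the single data file of a completed partition to its archive path.

    Hypothetical sketch of the logic an adapted ArchiveOperator could wrap;
    in production this would target HDFS rather than a local filesystem.
    """
    # Ignore the _SUCCESS flag file; expect exactly one data file.
    files = [p for p in Path(source_dir).iterdir() if p.name != "_SUCCESS"]
    if len(files) != 1:
        raise ValueError(f"expected exactly one data file, found {len(files)}")

    dest = Path(archive_file)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(files[0]), str(dest))
    return str(dest)
```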
=====Dependencies
Here is a list of Oozie jobs that use the webrequest Oozie datasets.
Jobs already migrated to Airflow:
[X] pageview actor
[X] aqs hourly
[X] mobile_apps/session_metrics
Jobs defined in Oozie whose Oozie schedule does not appear to be started:
[X] apis
[X] wikidata/specialentitydata_metrics
[X] wikidata/articleplaceholder_metrics
[X] wikidata/reliability_metrics
[X] webrequest/subset
[X] mobile_apps/uniques/daily
[X] mediarequest/hourly
We should make sure the following jobs still work after refine webrequest is migrated to Airflow:
[X] mobile_apps/uniques/monthly
[ ] webrequest/druid/hourly & daily
[X] learning/features/actor/hourly
[ ] banner activity Druid daily
[ ] mediacounts/load
Conclusion: we will add a SUCCESS flag file at the end of the Airflow DAG so that the remaining Oozie jobs keep being triggered properly.
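The final DAG task only needs to write an empty `_SUCCESS` flag file into the partition directory, which is what the downstream Oozie datasets poll for. A minimal sketch of that logic as a plain function (a local path stands in for the HDFS partition path, and the function name is an assumption; in the DAG this would run via a PythonOperator or an HDFS client):

```python
from pathlib import Path


def write_success_flag(partition_dir: str) -> str:
    """Write an empty _SUCCESS flag file into the given partition directory.

    Sketch of the DAG's final task: downstream Oozie jobs treat the presence
    of this file as the signal that the partition is complete.
    """
    flag = Path(partition_dir) / "_SUCCESS"
    flag.parent.mkdir(parents=True, exist_ok=True)
    flag.touch()
    return str(flag)
```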
Subtasks to do:
[X] create HQL file to refine
[X] create DAG scoped on the refine process
[X] create DAG for the test cluster (this specific source is `test_text`)
[X] add test fixtures for the new DAG tasks
[X] Port the data quality mechanism from Oozie (sequence statistics tables + emails) into the Airflow DAG
[ ] Port the data quality mechanism from Oozie (sequence statistics tables + emails) into the HQL folder
[X] Manual tests of HQL files with Spark 3 (get optimal parameters for distribution)
[ ] Manual tests of the Airflow DAGs on statbox
[ ] Review all Oozie code and salvage comments
[ ] Review datasets doc on Datahub & Wikitech
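For the data quality subtasks above, the sequence-statistics check boils down to comparing, per host, the number of requests actually received against the range of sequence numbers seen, to estimate loss or duplication. A hedged sketch of that computation (field names and the `(host, sequence)` row shape are assumptions; the real mechanism runs as HQL over the webrequest table):

```python
from collections import defaultdict


def sequence_stats(rows):
    """Per-host sequence statistics in the spirit of the Oozie data-quality step.

    rows: iterable of (host, sequence_number) pairs.
    Returns, per host, the expected count implied by the sequence-number range,
    the actual count received, and the difference (positive => missing rows).
    """
    by_host = defaultdict(list)
    for host, seq in rows:
        by_host[host].append(seq)

    stats = {}
    for host, seqs in by_host.items():
        expected = max(seqs) - min(seqs) + 1  # inclusive sequence range
        actual = len(seqs)
        stats[host] = {
            "expected": expected,
            "actual": actual,
            "missing": expected - actual,
        }
    return stats
```

In the real pipeline the per-host results would be written to the statistics table, with an alert email sent when `missing` exceeds a threshold.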