We need a version of refine_webrequest_hourly_dag.py that processes haproxykafka logs and generates the corresponding webrequest table.
AC:
- create staging databases for haproxykafka webrequest.
- update DDL for creating haproxykafka raw tables (and data loss metrics). Patch 1012656
- deploy a webrequest refinement airflow dag on dev
- deploy a webrequest refinement airflow dag on staging
To be defined:
- Can we run this dag on airflow-test, but with access to prod YARN and HDFS?
- That did not work out because there were too many moving parts. @Antoine_Quhen advised sticking to dev instances and analytics.
- Do we need to instrument the staging table with DQ?
- We are column-by-column compatible with the old webrequest table, so we can simply reuse the current job and point it to wmf.webrequest_frontend once it is available.
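
The column-by-column compatibility claim could be checked mechanically before pointing the existing job at the new table. Below is a minimal sketch, assuming each schema is available as an ordered list of (column name, column type) pairs; the helper `schema_diff` and the sample columns are hypothetical and are not taken from the actual webrequest DDL.

```python
from typing import List, Tuple

# A column is described as (name, type). This representation is an assumption.
Column = Tuple[str, str]


def schema_diff(old: List[Column], new: List[Column]) -> List[str]:
    """Return human-readable differences between two ordered schemas.

    An empty list means the schemas match column by column
    (same count, same order, same names, same types).
    """
    problems = []
    if len(old) != len(new):
        problems.append(f"column count differs: {len(old)} vs {len(new)}")
    # Compare the columns position by position over the shared prefix.
    for position, (old_col, new_col) in enumerate(zip(old, new)):
        if old_col != new_col:
            problems.append(f"position {position}: {old_col} vs {new_col}")
    return problems


# Hypothetical sample schemas, for illustration only.
old_schema = [("hostname", "string"), ("http_status", "string")]
new_schema = [("hostname", "string"), ("http_status", "int")]

for line in schema_diff(old_schema, new_schema):
    print(line)
```

In practice the two lists would be filled from the table definitions of the old and new webrequest tables; how they are fetched is left open here.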