[HAProxy transition] Deploy a staging airflow dag for webrequest refinement
Open, Needs Triage, Public

Description

We need a version of refine_webrequest_hourly_dag.py that will process haproxykafka logs and generate the corresponding webrequest table.

AC:

  • create staging databases for haproxykafka webrequest.
  • update DDL for creating haproxykafka raw tables (and data loss metrics). Patch 1012656
  • deploy a webrequest refinement airflow dag on dev
  • deploy a webrequest refinement airflow dag on staging

To be defined:

  • Can we run this dag on airflow-test, but with access to prod YARN and HDFS?
    • That did not work out because there were too many moving parts. @Antoine_Quhen advised sticking to dev instances and analytics.
  • Do we need to instrument the staging table with DQ?
    • We are column-by-column compatible with the old webrequest table, so we can simply reuse the current job and point it to wmf.webrequest_frontend once available.
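
The column-by-column compatibility mentioned above can be checked mechanically. This is a minimal sketch, assuming each table schema has been exported as an ordered list of (column name, type) pairs, for example from DESCRIBE output. The function name schema_diff and the sample schemas are hypothetical and are not taken from the refinery code.

```python
from itertools import zip_longest


def schema_diff(old, new):
    """Return positional mismatches between two schemas.

    Each schema is an ordered list of (column_name, column_type) pairs.
    An empty result means the schemas match column by column.
    """
    mismatches = []
    for position, (old_col, new_col) in enumerate(zip_longest(old, new)):
        # zip_longest pads the shorter schema with None, so a missing
        # column is reported as a mismatch as well.
        if old_col != new_col:
            mismatches.append((position, old_col, new_col))
    return mismatches


# Hypothetical sample schemas, for illustration only.
old_schema = [("hostname", "string"), ("sequence", "bigint"), ("dt", "string")]
new_schema = [("hostname", "string"), ("sequence", "bigint"), ("dt", "string")]
changed_schema = [("hostname", "string"), ("sequence", "int")]

print(schema_diff(old_schema, new_schema))      # no mismatches
print(schema_diff(old_schema, changed_schema))  # type change and missing column
```

An empty list would support reusing the existing DQ job unchanged; any reported position points at a column that needs attention first.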

Event Timeline

Change #1012656 had a related patch set uploaded (by Gmodena; author: Gmodena):

[analytics/refinery@master] hql: webrequest: add webrequest_frontend.

https://gerrit.wikimedia.org/r/1012656

Ahoelzl renamed this task from Deploy a staging airflow dag for webrequest refinement to [HAProxy transition] Deploy a staging airflow dag for webrequest refinement. Nov 8 2024, 4:59 PM

gmodena opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/892

Draft: add a staging airflow dag for webrequest_frontend refinement

This dag is now running on a dev airflow instance (stat1008:8282), and is processing haproxykafka data loaded by Gobblin in /wmf/data/raw/webrequest_frontend/.

Raw and refined data are available via Superset in gmodena.webrequest_frontend_raw and gmodena.webrequest_frontend respectively.
So far, the records look good. No obvious issues or regressions have been found. I would like to keep the dev instance running over the weekend and then roll it out to analytics, writing to production paths sometime next week.

Change #1012656 merged by Gmodena:

[analytics/refinery@master] hql: webrequest: add webrequest_frontend.

https://gerrit.wikimedia.org/r/1012656

The webrequest_frontend dag is now deployed on the Airflow analytics instance, producing data to /wmf/data on an hourly schedule. You can access haproxykafka datasets via Superset by querying:

  • wmf_staging.webrequest_frontend: raw JSON records imported with Gobblin. This table is the equivalent of wmf_raw.webrequest.
  • wmf_staging.webrequest: canonical, "refined" webrequest data. This table is the equivalent of wmf.webrequest.
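
As a quick sanity check on the two tables above, one could compare row counts for a single hour. This is a minimal sketch that only builds the SQL text to paste into Superset. It assumes both tables are partitioned by year, month, day and hour, which this task does not confirm. The helper name hourly_count_query and the chosen date are arbitrary.

```python
def hourly_count_query(table, year, month, day, hour):
    """Build a SQL query that counts the rows of one hourly partition.

    Assumption: the table has integer partition columns named
    year, month, day and hour.
    """
    return (
        f"SELECT COUNT(*) AS row_count\n"
        f"FROM {table}\n"
        f"WHERE year = {int(year)} AND month = {int(month)}\n"
        f"  AND day = {int(day)} AND hour = {int(hour)}"
    )


# Arbitrary example hour; any hour within the retained data would do.
for table in ("wmf_staging.webrequest_frontend", "wmf_staging.webrequest"):
    print(hourly_count_query(table, 2024, 11, 8, 0))
    print()
```

The raw and refined counts are not expected to match exactly if refinement drops or deduplicates records, so a difference is a prompt for investigation rather than proof of data loss.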

Data is currently available from 2024-01-01 onwards, but we need to finalize a data retention policy. That discussion is happening in T379024: Implement a data retention policy for webrequest_frontend datasets