Raw, refined, and data loss partitions should be automatically purged according to a data retention policy.
AC:
- A retention policy is agreed upon and documented.
- The retention policy is enforced for webrequest_frontend (haproxykafka) data.
Status | Subtype | Assigned | Task
---|---|---|---
Open | | gmodena | T354694 [Maintenance] Safeguard VarnishKafka to HAProxy analytics transition
Open | | gmodena | T379024 Implement a data retention policy for webrequest_frontend datasets
gmodena opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/940
wmf_airflow_common: enforce data retention.
RFC for extending DataRegistry to support data retention policies for HiveDatasets: wmf_airflow_common Datasets retention policy
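For context on that RFC direction, a minimal sketch of how a DataRegistry entry might declare a retention policy follows. The class names, fields, and the 90-day window are all assumptions for illustration, not the actual wmf_airflow_common API:

```python
# Illustrative sketch of the RFC direction only: class names and fields
# are assumptions about how a DataRegistry entry could declare a
# retention policy, not the actual wmf_airflow_common API.
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class RetentionPolicy:
    keep_for: timedelta  # partitions older than this are eligible for purging


@dataclass
class HiveDataset:
    table: str  # fully qualified Hive table name
    retention: Optional[RetentionPolicy] = None  # None means keep indefinitely


# Hypothetical registration with a 90-day window; the agreed policy may differ.
webrequest_frontend_raw = HiveDataset(
    table="wmf_raw.webrequest_frontend",
    retention=RetentionPolicy(keep_for=timedelta(days=90)),
)
```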
gmodena updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/940
wmf_airflow_common: add drop_older_than utility method.
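The MR itself isn't reproduced here, but a minimal sketch of what a drop_older_than-style helper could look like is below. The signature, the year/month/day partition layout, and the generated DDL are assumptions:

```python
# Minimal sketch of a drop_older_than-style helper. The real method's
# signature and behavior in wmf_airflow_common are not shown in this
# task, so everything below is an assumption for illustration.
from datetime import datetime, timedelta, timezone
from typing import Optional


def drop_older_than(table: str, days: int, now: Optional[datetime] = None) -> str:
    """Build Hive DDL that drops partitions older than `days` days.

    Assumes the table uses year/month/day partitioning, a common layout
    for webrequest-style Hive tables.
    """
    cutoff = (now or datetime.now(timezone.utc)) - timedelta(days=days)
    # Hive allows comparators in partition specs and multiple PARTITION
    # clauses in a single DROP, so one statement covers the whole cutoff.
    return (
        f"ALTER TABLE {table} DROP IF EXISTS "
        f"PARTITION (year < {cutoff.year}), "
        f"PARTITION (year = {cutoff.year}, month < {cutoff.month}), "
        f"PARTITION (year = {cutoff.year}, month = {cutoff.month}, day < {cutoff.day})"
    )
```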
A lot of discussion about this task happened in Slack / OTR. While we don't yet have a standard way to enforce data retention for HiveDatasets in Airflow, we currently have a couple of refinery scripts in place, orchestrated via systemd timers. Search has a wrapper for these scripts and runs them on Airflow via a BashOperator (a rough sketch of that pattern follows below).
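For reference, the BashOperator pattern could look roughly like this; the DAG id, script name, and flags are illustrative rather than the actual Search wrapper:

```python
# Sketch of the BashOperator pattern described above: invoking a refinery
# drop script from an Airflow task instead of a systemd timer. The DAG id,
# script name, and flags are illustrative; check the refinery repo for the
# actual CLI.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="refinery_retention_wrapper",  # hypothetical id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    drop_old_partitions = BashOperator(
        task_id="drop_old_partitions",
        bash_command=(
            "refinery-drop-older-than "  # flag names are assumptions
            "--database=wmf "
            "--tables='webrequest_frontend' "
            "--older-than-days=90"
        ),
    )
```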
I have two work streams that aim to generalize data retention policies in Airflow:
gmodena updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/941
Draft: analytics: webrequest_frontend: implement data retention.
gmodena merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/940
wmf_airflow_common: add drop_older_than utility method.
gmodena merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/941
analytics: webrequest_frontend: implement data retention.
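The merged DAG isn't reproduced here; a hedged sketch of the overall shape, mirroring the drop_older_than logic above, might look like this. The DAG id, table names, partition layout, and 90-day window are assumptions:

```python
# Hedged sketch of the overall shape of a webrequest_frontend retention
# DAG; the merged DAG's real ids, tables, and window are not reproduced
# here and may differ.
from datetime import datetime, timedelta, timezone

from airflow import DAG
from airflow.providers.apache.hive.operators.hive import HiveOperator

RETENTION_DAYS = 90  # hypothetical window; use the documented policy

# Note: a production DAG would derive the cutoff from the templated
# execution date; a parse-time now() is used here only to keep the
# sketch short.
CUTOFF = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)

with DAG(
    dag_id="webrequest_frontend_data_retention",  # illustrative id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # One purge task per dataset named in the task description, assuming
    # year/month/day Hive partitioning for each table.
    for dataset in ("raw", "refined", "data_loss"):
        HiveOperator(
            task_id=f"purge_{dataset}",
            hql=(
                f"ALTER TABLE wmf.webrequest_frontend_{dataset} DROP IF EXISTS "
                f"PARTITION (year < {CUTOFF.year}), "
                f"PARTITION (year = {CUTOFF.year}, month < {CUTOFF.month}), "
                f"PARTITION (year = {CUTOFF.year}, month = {CUTOFF.month}, day < {CUTOFF.day})"
            ),
        )
```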