Implement a data retention policy for webrequest_frontend datasets
Open, Needs Triage (Public)

Description

Raw, refined, and data-loss partitions should be automatically purged according to a data retention policy.

AC:

  • a retention policy is agreed upon and documented.
  • a retention policy is enforced for webrequest_frontend (haproxykafka) data.
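
To make "purged according to a data retention policy" concrete, here is a minimal Python sketch of selecting the hourly partitions that fall outside a retention window. It assumes hourly partitions identified by a datetime. The retention values and the names `RETENTION_DAYS` and `partitions_to_purge` are hypothetical placeholders, not the agreed policy or an existing refinery or wmf_airflow_common API.

```python
from datetime import datetime, timedelta

# Hypothetical retention windows in days, keyed by dataset kind.
# The real values are exactly what this task asks to agree on and document.
RETENTION_DAYS = {"raw": 7, "refined": 90, "data_loss": 90}


def partitions_to_purge(kind, partition_hours, now):
    """Return the partition hours strictly older than the retention cutoff.

    kind: one of the keys of RETENTION_DAYS (assumed naming).
    partition_hours: iterable of datetime objects, one per hourly partition.
    now: the reference time used to compute the cutoff.
    """
    cutoff = now - timedelta(days=RETENTION_DAYS[kind])
    # A partition exactly at the cutoff is still inside the window and is kept.
    return sorted(hour for hour in partition_hours if hour < cutoff)


if __name__ == "__main__":
    now = datetime(2024, 6, 30, 0)
    hours = [
        datetime(2024, 6, 1, 0),
        datetime(2024, 6, 22, 23),
        datetime(2024, 6, 23, 0),
        datetime(2024, 6, 29, 12),
    ]
    for hour in partitions_to_purge("raw", hours, now):
        print("would drop partition for", hour.isoformat())
```

The sketch only computes which partitions are eligible. Actually dropping Hive partitions and deleting the underlying files would be left to whatever mechanism is chosen to enforce the policy.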

Event Timeline

RFC for extending DataRegistry to support data retention policies for HiveDatasets: wmf_airflow_common Datasets retention policy

Much of the discussion about this task happened on Slack / OTR. While we don't have a standard way to enforce data retention for HiveDatasets in Airflow, we do have a couple of refinery scripts in place, currently orchestrated via systemd timers. Search has a wrapper for these scripts and runs them on Airflow via BashOperator.

I have two work streams that aim to generalize data retention policies in Airflow: