Auto clean /wmf/data/raw/webrequests_data_loss
Closed, Resolved | Public | 3 Estimated Story Points

Description

Clean up the HDFS report files generated by the checks that run before the webrequest refine step.

Those very small files contain the result of a Hive query describing data losses in wmf_raw.webrequest. When not empty, they are joined into an alert email, and later kept on HDFS as an archive.

The webrequest-sequence-stats are already cleaned up with this entry point: https://github.com/wikimedia/operations-puppet/blob/1d8ae311fe92d98def621771d4bdfa6e4f83f233/modules/profile/manifests/analytics/refinery/job/data_purge.pp#L63

Now we need a similar process to clean up this HDFS directory. Note that it would differ slightly from what already exists in data_purge, as the directories are not represented by a Hive table.
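
Since there is no Hive table to drop partitions from, one option is to pick files by their HDFS modification time. The following is only a minimal sketch of that idea, not the job that was eventually merged: it parses the text output of `hdfs dfs -ls -R` and returns the files older than a retention window. The function names, the 90-day default, and the example file layout are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def parse_ls_line(line):
    """Parse one line of `hdfs dfs -ls -R` output.

    Returns (is_dir, modified, path), or None for lines that are not
    entries (for example the "Found N items" header).
    """
    parts = line.split(None, 7)
    if len(parts) < 8:
        return None
    try:
        # Columns 6 and 7 hold the modification date and time.
        modified = datetime.strptime(parts[5] + " " + parts[6], "%Y-%m-%d %H:%M")
    except ValueError:
        return None
    return parts[0].startswith("d"), modified, parts[7]


def select_expired_files(ls_output, now, retention_days=90):
    """Return paths of files last modified before now - retention_days.

    retention_days=90 is a hypothetical default, not a value from this task.
    Directories are skipped; only files are selected.
    """
    cutoff = now - timedelta(days=retention_days)
    expired = []
    for line in ls_output.splitlines():
        parsed = parse_ls_line(line)
        if parsed is None:
            continue
        is_dir, modified, path = parsed
        if not is_dir and modified < cutoff:
            expired.append(path)
    return expired
```

The selected paths could then be handed to `hdfs dfs -rm`, ideally after a dry run that only prints them. Empty directories left behind would need a separate pass.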

The process generating those files is currently being migrated by: T327073
https://github.com/wikimedia/analytics-refinery/blob/e6382e4d684a23094c6253de83823b72ce18dde8/oozie/webrequest/load/bundle.properties#L79

The steps:

Event Timeline

(Note to self: we should leverage the work from this task and apply it to the new non-Hive HDFS data being generated for Section-Topics. CC @Cparle @mfossati)

(Also, this task is related to T326826.)

JArguello-WMF set the point value for this task to 3. Apr 3 2023, 4:18 PM

Instead of using Airflow, could we reuse the existing systemd-timer scheme already in place?
We know we want to revamp data purging to use Airflow, but I don't think that should be done or started as part of this task.

OK to separate the migration from this task.

Here is what I've suggested before modifying the task:

  • Add docopt to conda-analytics
  • Isolate analytics/refinery/python into its own GitLab repo
  • Extract unit tests from code & run them through CI
  • Version & archive the GitLab repo
  • Create a Skein simple operator to call the code from Airflow

Change 908776 had a related patch set uploaded (by Aqu; author: Aqu):

[analytics/refinery@master] Add purge job for webrequest data loss reports

https://gerrit.wikimedia.org/r/908776

Change 908777 had a related patch set uploaded (by Aqu; author: Aqu):

[operations/puppet@production] analytics: Add purge job for webrequest data loss reports

https://gerrit.wikimedia.org/r/908777

Change 908776 merged by Aqu:

[analytics/refinery@master] Add unit tests on raw webrequest data loss reports job

https://gerrit.wikimedia.org/r/908776

Change 908777 merged by Btullis:

[operations/puppet@production] analytics: Add purge job for webrequest data loss reports

https://gerrit.wikimedia.org/r/908777

Antoine_Quhen moved this task from Ready to Deploy to Done on the Data Pipelines (Sprint 12) board.

I've checked the result on HDFS. It performs as expected.