Clean up the HDFS report files generated by the checks that run before webrequest refinement.
These very small files contain the result of a Hive query describing data losses in wmf_raw.webrequest. When not empty, they are included in an alert email, and they are later kept on HDFS as an archive.
The webrequest-sequence-stats files are already cleaned up via this entry point: https://github.com/wikimedia/operations-puppet/blob/1d8ae311fe92d98def621771d4bdfa6e4f83f233/modules/profile/manifests/analytics/refinery/job/data_purge.pp#L63
Now we need a similar process to clean up this HDFS directory. Note that it would be slightly different from what already exists in data_purge, as the directories are not backed by a Hive table.
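Since there is no Hive table to drive the purge, the cleanup has to select candidates purely from HDFS path names. A minimal sketch of that idea, using plain Python in place of real HDFS listing/deletion, and assuming a hypothetical dated layout `.../year=YYYY/month=MM/day=DD` (the actual report directory structure must be checked against production):

```python
import re
from datetime import datetime, timedelta

# Hypothetical partition-style layout; the real data-loss report
# paths may be named differently.
PATH_RE = re.compile(r"year=(\d{4})/month=(\d{1,2})/day=(\d{1,2})$")

def directories_to_drop(paths, older_than_days, now=None):
    """Return the subset of dated paths older than the retention threshold."""
    now = now or datetime.utcnow()
    threshold = now - timedelta(days=older_than_days)
    to_drop = []
    for path in paths:
        match = PATH_RE.search(path)
        if not match:
            continue  # skip paths that do not look like dated directories
        year, month, day = (int(g) for g in match.groups())
        if datetime(year, month, day) < threshold:
            to_drop.append(path)
    return to_drop
```

The real job would feed this selection into an HDFS recursive delete; the point is only that, without a Hive table, the date has to be parsed from the path itself.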
The process generating those files is currently being migrated in T327073:
https://github.com/wikimedia/analytics-refinery/blob/e6382e4d684a23094c6253de83823b72ce18dde8/oozie/webrequest/load/bundle.properties#L79
The steps:
- Add a systemd timer in puppet: https://github.com/wikimedia/operations-puppet/blob/1d8ae311fe92d98def621771d4bdfa6e4f83f233/modules/profile/manifests/analytics/refinery/job/data_purge.pp
- Add a test covering the specific path format used by webrequests_data_loss to the specs of refinery-drop-older-than
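For the second step, the new spec would exercise the path-format matching against the data-loss directory names. A hedged sketch of such a check, where both the regex and the example paths are assumptions (the real `--path-format` value has to be taken from the actual directory layout, and the test would live alongside the existing refinery-drop-older-than specs):

```python
import re

# Assumed dated layout YYYY/MM/DD/HH for report directories; this is an
# illustration only, not the production path format.
PATH_FORMAT = re.compile(r"(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/(?P<hour>\d{2})$")

def test_path_format_matches_dated_report_directories():
    match = PATH_FORMAT.search("webrequest_data_loss/2023/02/15/08")
    assert match is not None
    assert match.group("year") == "2023"
    assert match.group("hour") == "08"

def test_path_format_rejects_non_dated_entries():
    # Marker files and malformed names must never be selected for deletion.
    assert PATH_FORMAT.search("webrequest_data_loss/_SUCCESS") is None
    assert PATH_FORMAT.search("webrequest_data_loss/2023/2/15/08") is None
```

Having a spec like this guards against the purge job silently matching nothing (files accumulate) or matching too much (data deleted early) when the format changes.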