
Clean up the /wmf/data/discovery/transfer_to_es folder in HDFS
Closed, Resolved · Public

Description

The HDFS path /wmf/data/discovery/transfer_to_es is populated by the transfer_to_es job, which can run hourly and can therefore create many folders and files. The oldest snapshot is 20200105.
It might make sense to have an automated cleanup process for this dataset; the retention period is yet to be defined (60 days?).
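
For illustration only, a minimal sketch of how a retention-based cleanup could pick the snapshots to delete. This is not the script from the linked merge request. The `date=YYYYMMDD` directory naming, the function names and the 60-day default are all assumptions made for the example. It only builds `hdfs dfs -rm -r` command lines and does not execute anything.

```python
from datetime import date, datetime, timedelta
import re

# Assumption: each snapshot directory name contains an 8-digit YYYYMMDD date,
# e.g. "date=20200105". The real layout may differ.
SNAPSHOT_RE = re.compile(r"(\d{8})")


def expired_snapshots(names, today, retention_days=60):
    """Return the directory names whose snapshot date is older than the cutoff."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in names:
        match = SNAPSHOT_RE.search(name)
        if match is None:
            continue  # not a snapshot directory (e.g. a marker file), leave it alone
        try:
            snapshot_date = datetime.strptime(match.group(1), "%Y%m%d").date()
        except ValueError:
            continue  # eight digits that are not a valid date, leave it alone
        if snapshot_date < cutoff:
            expired.append(name)
    return expired


def removal_commands(base_path, names):
    """Build (but do not run) one recursive HDFS delete command per directory."""
    return [["hdfs", "dfs", "-rm", "-r", f"{base_path}/{name}"] for name in names]


if __name__ == "__main__":
    base = "/wmf/data/discovery/transfer_to_es"
    listing = ["date=20200105", "date=20230101", "date=20230220", "_SUCCESS"]
    for cmd in removal_commands(base, expired_snapshots(listing, date(2023, 3, 1))):
        print(" ".join(cmd))
```

Keeping the selection logic separate from the deletion makes it easy to do a dry run and review the list before anything is removed.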

AC:

  • /wmf/data/discovery/transfer_to_es is cleaned up regularly

Details

Title: Script to cleanup transfer_to_es folder in hdfs
Reference: repos/search-platform/discolytics!6
Author: ebernhardson
Source Branch: work/ebernhardson/cleanup-transfer-to-es
Dest Branch: main

Event Timeline

Restricted Application added a subscriber: Aklapper.
MPhamWMF moved this task from needs triage to ML & Data Pipeline on the Discovery-Search board.

I have this written up and have run it manually, so the existing data has been cleaned up. For automated runs, I'm delaying until we have the new Airflow instance up and running, with the intent of scheduling it in Airflow 2.

Waiting on an Airflow instance to be deployed (T327970).