Keep the last 6 snapshots of datasets stored in the following HDFS directories:
- /user/analytics-platform-eng/structured-data/section_topics
- /user/analytics-platform-eng/structured-data/section-alignment-suggestions/article_images
- /user/analytics-platform-eng/structured-data/section-alignment-suggestions/suggestions
- /user/analytics-platform-eng/structured-data/seal/alignments
- /user/analytics-platform-eng/structured-data/seal/embeddings
- /user/analytics-platform-eng/structured-data/seal/features
- /user/analytics-platform-eng/structured-data/seal/models
- /user/analytics-platform-eng/structured-data/seal/sections
Only the YYYY-MM-DD sub-directories are to be deleted. All of them contain datasets stored as parquet files, except seal/models/YYYY-MM-DD, which contains pickle and CSV files.
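The retention rule above can be sketched as follows. This is a minimal illustration, not an existing job: the function name `snapshots_to_delete` and the constant `KEEP` are hypothetical, and it assumes the sub-directory names of one dataset directory have already been listed (for example with `hdfs dfs -ls`). Only names that are valid YYYY-MM-DD dates are considered, so anything else is left untouched.

```python
import re
from datetime import date

KEEP = 6  # number of most recent snapshots to retain (hypothetical constant)

# Strict YYYY-MM-DD shape; any other sub-directory name is ignored.
SNAPSHOT_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def snapshots_to_delete(subdirs, keep=KEEP):
    """Return the dated sub-directory names that fall outside the newest `keep`."""
    dated = []
    for name in subdirs:
        if not SNAPSHOT_RE.fullmatch(name):
            continue  # not a snapshot directory, never a deletion candidate
        try:
            date.fromisoformat(name)  # reject impossible dates such as 2023-13-40
        except ValueError:
            continue
        dated.append(name)
    dated.sort()  # ISO dates sort chronologically as plain strings
    if keep <= 0:
        return dated
    return dated[:-keep]  # everything except the newest `keep` entries
```

Calling this once per dataset directory yields the candidates for removal. When a directory holds 6 or fewer dated snapshots, the result is empty.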
Exceptions
The following paths shouldn’t be deleted until these tasks are resolved: T339129 ([L] Periodically regenerate various variable data sets/files) and T325316 ([XL] Productionize section alignment model training).
- /user/analytics-platform-eng/structured-data/section_topics/2022-10_ptwiki_bad
- /user/analytics-platform-eng/structured-data/section_topics/20230301_target_wikis_tables
- /user/analytics-platform-eng/structured-data/section-alignment-suggestions/aligned_sections_subset_9.0_2022-02.parquet (update: moved to trash)
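One way to enforce these exceptions is an explicit guard that any cleanup step checks before removing a path. The sketch below is an assumption about how that could look; `PROTECTED` and `is_protected` are hypothetical names. The path already moved to trash is left out of the tuple, since there is nothing left to protect.

```python
# Paths that must survive cleanup until the tasks named above are resolved.
PROTECTED = (
    "/user/analytics-platform-eng/structured-data/section_topics/2022-10_ptwiki_bad",
    "/user/analytics-platform-eng/structured-data/section_topics/20230301_target_wikis_tables",
)


def is_protected(path, protected=PROTECTED):
    """Return True if `path` is a protected path or lies underneath one."""
    normalized = path.rstrip("/")
    return any(
        normalized == root or normalized.startswith(root + "/")
        for root in protected
    )
```

Comparing against `root + "/"` rather than the bare prefix avoids treating a sibling directory whose name merely starts with a protected name as protected.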