[Maintenance] Set up deletion jobs for Structured Data's data pipelines
Open, Needs Triage, Public

Description

Keep the last 6 snapshots of datasets stored in the following HDFS directories:

  • /user/analytics-platform-eng/structured-data/section_topics
  • /user/analytics-platform-eng/structured-data/section-alignment-suggestions/article_images
  • /user/analytics-platform-eng/structured-data/section-alignment-suggestions/suggestions
  • /user/analytics-platform-eng/structured-data/seal/alignments
  • /user/analytics-platform-eng/structured-data/seal/embeddings
  • /user/analytics-platform-eng/structured-data/seal/features
  • /user/analytics-platform-eng/structured-data/seal/models
  • /user/analytics-platform-eng/structured-data/seal/sections

The YYYY-MM-DD sub-directories are the ones to delete. All of them except seal/models/YYYY-MM-DD contain datasets stored as Parquet files; seal/models/YYYY-MM-DD contains pickle and CSV files.

Exceptions

The following paths shouldn’t be deleted until T339129: [L] Periodically regenerate various variable data sets/files and T325316: [XL] Productionize section alignment model training are resolved:

  • /user/analytics-platform-eng/structured-data/section_topics/2022-10_ptwiki_bad
  • /user/analytics-platform-eng/structured-data/section_topics/20230301_target_wikis_tables
  • /user/analytics-platform-eng/structured-data/section-alignment-suggestions/aligned_sections_subset_9.0_2022-02.parquet - Update: moved to trash
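As a rough illustration of the retention rule above, the following Python sketch picks the sub-directories that would be deleted from a plain list of names. The function name `snapshots_to_delete` and its `keep` parameter are hypothetical and not part of any existing pipeline. It assumes that only names that are exactly a YYYY-MM-DD date count as snapshots, so the exception paths listed above are never selected.

```python
import re
from datetime import datetime

# Only names that are exactly YYYY-MM-DD are treated as snapshots (assumption).
SNAPSHOT_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def snapshots_to_delete(names, keep=6):
    """Return snapshot names older than the `keep` most recent ones."""
    dated = []
    for name in names:
        if not SNAPSHOT_RE.fullmatch(name):
            continue  # e.g. 2022-10_ptwiki_bad is left alone
        try:
            datetime.strptime(name, "%Y-%m-%d")  # reject impossible dates
        except ValueError:
            continue
        dated.append(name)
    dated.sort()  # ISO dates sort chronologically as strings
    return dated[:-keep] if keep > 0 else dated


if __name__ == "__main__":
    listing = [
        "2022-10_ptwiki_bad",
        "2023-10-30", "2023-11-06", "2023-11-13", "2023-11-20",
        "2023-11-27", "2023-12-04", "2023-12-11", "2023-12-18",
        "20230301_target_wikis_tables",
    ]
    print(snapshots_to_delete(listing))
```

With fewer than `keep` dated sub-directories the function returns an empty list, so a young dataset would not lose any snapshot.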

Event Timeline

@mfossati thanks for submitting this deletion request. Do you have a need-by date?

Hey @VirginiaPoundstone, I don't think there's any deadline on our side. However, please note that this was initially raised as one of the causes that put the Hadoop cluster under pressure. CC @JAllemandou.

Ahoelzl renamed this task from Set up deletion jobs for Structured Data's data pipelines to [Maintenance] Set up deletion jobs for Structured Data's data pipelines. Oct 20 2023, 4:55 PM

Hi @VirginiaPoundstone, @JAllemandou: friendly note that snapshots are accumulating. For instance:

mfossati@stat1008:~$ hdfs dfs -count -v /user/analytics-platform-eng/structured-data/section_topics/*
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
   DIR_COUNT   FILE_COUNT       CONTENT_SIZE PATHNAME
           1            3            4330029 /user/analytics-platform-eng/structured-data/section_topics/2022-10_ptwiki_bad
           1         4097        12031399662 /user/analytics-platform-eng/structured-data/section_topics/2023-08-07
           1         4097        12031172825 /user/analytics-platform-eng/structured-data/section_topics/2023-08-14
           1         4097        12031210711 /user/analytics-platform-eng/structured-data/section_topics/2023-08-21
           1         4097        12095932268 /user/analytics-platform-eng/structured-data/section_topics/2023-08-28
           1         4097        12098362269 /user/analytics-platform-eng/structured-data/section_topics/2023-09-04
           1         4097        12877206250 /user/analytics-platform-eng/structured-data/section_topics/2023-09-11
           1         4097        12876737412 /user/analytics-platform-eng/structured-data/section_topics/2023-09-18
           1         4097        12940746465 /user/analytics-platform-eng/structured-data/section_topics/2023-09-25
           1         4097        12941751363 /user/analytics-platform-eng/structured-data/section_topics/2023-10-02
           1         4097        12942323394 /user/analytics-platform-eng/structured-data/section_topics/2023-10-09
           1         4097        12942153568 /user/analytics-platform-eng/structured-data/section_topics/2023-10-16
           1         4097        12942204298 /user/analytics-platform-eng/structured-data/section_topics/2023-10-23
           1         4097        13017014595 /user/analytics-platform-eng/structured-data/section_topics/2023-10-30
           1         4097        13017218280 /user/analytics-platform-eng/structured-data/section_topics/2023-11-06
           1         4097        13017965206 /user/analytics-platform-eng/structured-data/section_topics/2023-11-13
           1         4097        13017750522 /user/analytics-platform-eng/structured-data/section_topics/2023-11-20
           1         4097        13088798677 /user/analytics-platform-eng/structured-data/section_topics/2023-11-27
           1         4097        13090007870 /user/analytics-platform-eng/structured-data/section_topics/2023-12-04
           1         4097        13090992956 /user/analytics-platform-eng/structured-data/section_topics/2023-12-11
           1         4097        13093305227 /user/analytics-platform-eng/structured-data/section_topics/2023-12-18
           1            3           39125826 /user/analytics-platform-eng/structured-data/section_topics/20230301_target_wikis_tables

Not a big deal, although it would be great if you could tackle this ticket, the sooner the better 😄 .
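To get a feel for how much space a cleanup would free, a small Python sketch along these lines could parse the `hdfs dfs -count -v` output shown above and sum the CONTENT_SIZE of every dated snapshot except the most recent ones. The function name `reclaimable_bytes` and the `keep` parameter are hypothetical, and the parsing assumes the four-column layout of the listing above.

```python
import re

# Only directory names that are exactly YYYY-MM-DD count as snapshots (assumption).
DATED = re.compile(r"\d{4}-\d{2}-\d{2}")


def reclaimable_bytes(count_output, keep=6):
    """Sum CONTENT_SIZE over dated snapshots older than the `keep` newest."""
    sizes = {}
    for line in count_output.splitlines():
        parts = line.split()
        # Expect: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME; skip headers and noise.
        if len(parts) != 4 or not parts[2].isdigit():
            continue
        name = parts[3].rstrip("/").rsplit("/", 1)[-1]
        if DATED.fullmatch(name):
            sizes[name] = int(parts[2])
    ordered = sorted(sizes)  # ISO dates sort chronologically as strings
    old = ordered[:-keep] if keep > 0 else ordered
    return sum(sizes[name] for name in old)
```

Passing the captured command output as a string returns a byte count; non-dated directories such as 2022-10_ptwiki_bad are ignored.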

@lbowmaker @JAllemandou , I was thinking that perhaps we could implement these deletion jobs as tasks in our DAGs. Something like a Bash operator that merely deletes relevant HDFS files.
Not sure how to keep the last 6 snapshots, though: maybe with some Airflow template?
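One possible shape for this, as an untested sketch: build a shell pipeline in Python and hand the resulting string to a Bash operator. The helper name `build_cleanup_command` is hypothetical. The pipeline assumes GNU `head` and `xargs` are available on the worker, and that `hdfs dfs -ls -C` prints one path per line.

```python
import shlex


def build_cleanup_command(hdfs_dir, keep=6):
    """Build a bash pipeline that removes all but the `keep` newest snapshots."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    quoted_dir = shlex.quote(hdfs_dir.rstrip("/"))
    return (
        # List child paths only, one per line.
        f"hdfs dfs -ls -C {quoted_dir} "
        # Keep only paths ending in a YYYY-MM-DD directory name.
        "| grep -E '/[0-9]{4}-[0-9]{2}-[0-9]{2}$' "
        # ISO dates sort chronologically as plain strings.
        "| sort "
        # Drop the newest `keep` entries from the deletion list (GNU head).
        f"| head -n -{keep} "
        # Delete what is left; -r makes xargs a no-op on empty input.
        "| xargs -r hdfs dfs -rm -r"
    )


if __name__ == "__main__":
    print(build_cleanup_command(
        "/user/analytics-platform-eng/structured-data/section_topics"
    ))
```

The returned string could then be passed as the `bash_command` of an Airflow `BashOperator`. Since the `grep` pattern only matches names ending in a full date, paths like 2022-10_ptwiki_bad would be skipped, but this would need checking against the real directories before any scheduled run.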

Anyway, please let me know if I can help get this ticket done.

Thanks a lot for not forgetting about this ticket, @mfossati :)
The Data Engineering team is working toward providing you with (hopefully) an easy enough way to configure data deletion for your datasets.
In the meantime, manual deletion every now and then should be enough.
I don't think it's worth investing time on this before the new system comes in (probably a few months).
Is that ok for you?

@JAllemandou, sure!
Is there a relevant ticket I can subscribe to?