Set up a rolling retention window for event_sanitized.netflow
Open, Needs Triage, Public

Description

Background

As part of T417694 (one-time cleanup of event_sanitized), event_sanitized.netflow was identified as the largest dataset in the namespace by file count and one of the top two by size, with no retention policy in place:

  • ~4.8M files (~14% of all files in event_sanitized)
  • ~9,921 GB (~9.7 TB)
  • Actively written, data goes back to 2020

All other concerning datasets in event_sanitized have been cleaned up. This task tracks establishing and implementing a rolling retention window for netflow.

Per-year breakdown

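| Year | Files | Size |
| --- | --- | --- |
| 2020 | 35,792 | 371 GB |
| 2021 | 288,412 | 1,587 GB |
| 2022 | 671,913 | 1,566 GB |
| 2023 | 790,049 | 1,653 GB |
| 2024 | 1,322,985 | 2,051 GB |
| 2025 | 1,484,401 | 2,358 GB |
| 2026 | 399,417 | 678 GB (partial year) |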

Proposed approach

Apply a 3-year rolling window: automatically drop netflow data older than 3 years. Dropping years 2020–2022 alone would immediately free approximately 996K files and ~3.5 TB, with ongoing savings as older data rolls off.
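
The savings figures can be rechecked from the per-year breakdown. The snippet below is a small sketch for that purpose only; the `PER_YEAR` mapping and the `totals_before` helper are illustrative names, and the numbers are copied from the table in this task.

```python
# Per-year (files, size in GB) figures copied from the breakdown in this task.
PER_YEAR = {
    2020: (35_792, 371),
    2021: (288_412, 1_587),
    2022: (671_913, 1_566),
    2023: (790_049, 1_653),
    2024: (1_322_985, 2_051),
    2025: (1_484_401, 2_358),
    2026: (399_417, 678),  # partial year
}


def totals_before(first_kept_year, per_year=PER_YEAR):
    """Sum files and GB for all years strictly before first_kept_year."""
    files = sum(f for year, (f, _) in per_year.items() if year < first_kept_year)
    size_gb = sum(s for year, (_, s) in per_year.items() if year < first_kept_year)
    return files, size_gb


files, size_gb = totals_before(2023)
print(f"{files:,} files, {size_gb:,} GB")  # 996,117 files, 3,524 GB
```

Calling `totals_before` with a different first kept year gives the corresponding figures for other window lengths aligned to whole calendar years.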

Impact on Druid

This would not affect current Druid data: what is already loaded in Druid stays there regardless of what we delete from HDFS/Hive. However, if a Druid backfill is ever needed, it would only be possible within the rolling window, because netflow data older than 3 years would no longer be recoverable from Hive.

Key findings from T231339

From the original netflow sanitization setup discussion in T231339:

  1. The primary consumers of netflow data are SRE/network teams using Druid/Turnilo for traffic engineering and DDoS detection — not direct Hive queries.
  2. When asked whether any periodic jobs consume the HDFS data or Hive table directly, the answer was "Nop" (T231339#1623421).
  3. @ayounsi noted the desire to keep traffic engineering data "for as long as possible" for historical trends, but acknowledged DDoS detection data can be dropped more quickly.

A 3-year window seems like a reasonable middle ground: it preserves meaningful historical coverage for trend analysis while capping growth.

Next steps

  • Check with @ayounsi / SRE network team whether a 3-year rolling window would work for them
  • Implement an automatic purge job (similar to existing data_purge puppet jobs)
  • Perform a one-time deletion of existing data outside the agreed window
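
To make the "rolling window" concrete, here is a minimal sketch of the selection logic such a purge job might apply. It is not the existing data_purge implementation: the function names are hypothetical, and it assumes each partition can be reduced to a calendar date, which is an assumption about the table layout rather than something stated in this task.

```python
from datetime import date


def cutoff_date(today, years=3):
    """Return the date `years` before `today` (Feb 29 falls back to Feb 28)."""
    try:
        return today.replace(year=today.year - years)
    except ValueError:
        # today is Feb 29 and the target year is not a leap year
        return today.replace(year=today.year - years, day=28)


def partitions_to_drop(partition_dates, today, years=3):
    """Return the partition dates strictly older than the rolling window."""
    cutoff = cutoff_date(today, years)
    return sorted(d for d in partition_dates if d < cutoff)


# Hypothetical partition dates and run date, for illustration only.
example = [date(2020, 6, 1), date(2023, 3, 14), date(2023, 3, 15), date(2025, 1, 1)]
print(partitions_to_drop(example, today=date(2026, 3, 15)))
# [datetime.date(2020, 6, 1), datetime.date(2023, 3, 14)]
```

Note that with a cutoff computed this way, a job running in 2026 would also reach into early 2023, so the exact definition of the window (strict 3 years versus whole calendar years) is worth settling together with the window length.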

Event Timeline

It would be best to keep all historical data but aggregate it further as time goes on. Do we already drop/aggregate all the fields mentioned in T231339: Set up automatic deletion/snitization for netflow data set in Hive?

@ayounsi from our point of view, the issue is not the data size but the number of files.

If you need the historical data, that is ok. I think DPE will have to think about this, and perhaps in the long term just move these sanitization pipelines to Iceberg, which will allow us to compact the ~5 million small files from netflow into probably thousands of ~128MB files, giving you long-term access and putting less pressure on our HDFS instance.

Since this is a longer term project, I'm moving this out of our current sprint. No more action needed for now.
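
As a rough check of the compaction estimate in the comment above, the snippet below works out the current average file size and the file count at ~128MB per file. It is a back-of-the-envelope sketch only: the totals come from the Background section, it assumes 1 GB = 1024 MB, and it ignores any change in on-disk size from a different file format or compression.

```python
import math

TOTAL_SIZE_GB = 9_921     # "~9,921 GB" from the Background section
TOTAL_FILES = 4_800_000   # "~4.8M files" from the Background section
TARGET_FILE_MB = 128      # target file size mentioned in the comment

total_mb = TOTAL_SIZE_GB * 1024  # assumption: 1 GB = 1024 MB
avg_file_mb = total_mb / TOTAL_FILES
compacted_files = math.ceil(total_mb / TARGET_FILE_MB)

print(f"average current file size: {avg_file_mb:.1f} MB")        # 2.1 MB
print(f"files at {TARGET_FILE_MB} MB each: {compacted_files:,}")  # 79,368
```

Under these assumptions the dataset would shrink from about 4.8 million files to roughly 80 thousand, about a 60-fold reduction in file count.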