Background
As part of T417694 (one-time cleanup of event_sanitized), event_sanitized.netflow was identified as the largest dataset in the namespace by file count and one of the top two by size, with no retention policy in place:
- ~4.8M files (~14% of all files in event_sanitized)
- ~9,921 GB (~9.7 TB)
- Actively written; data goes back to 2020
All other concerning datasets in event_sanitized have been cleaned up. This task tracks establishing and implementing a rolling retention window for netflow.
Per-year breakdown
| Year | Files | Size |
|---|---|---|
| 2020 | 35,792 | 371 GB |
| 2021 | 288,412 | 1,587 GB |
| 2022 | 671,913 | 1,566 GB |
| 2023 | 790,049 | 1,653 GB |
| 2024 | 1,322,985 | 2,051 GB |
| 2025 | 1,484,401 | 2,358 GB |
| 2026 | 399,417 | 678 GB (partial year) |
Proposed approach
Apply a 3-year rolling window: automatically drop netflow data older than 3 years. Dropping the full years 2020–2022 alone would immediately free approximately 996K files and ~3.5 TB, with ongoing savings as older data rolls off.
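The estimate can be cross-checked against the per-year breakdown. The following is a minimal sketch in Python with the figures copied from the table; the function name is illustrative, and the cutoff is applied at whole-year granularity, which is a simplifying assumption (a true rolling window may also drop part of 2023).

```python
# Per-year (file count, size in GB), copied from the per-year breakdown table.
PER_YEAR = {
    2020: (35_792, 371),
    2021: (288_412, 1_587),
    2022: (671_913, 1_566),
    2023: (790_049, 1_653),
    2024: (1_322_985, 2_051),
    2025: (1_484_401, 2_358),
    2026: (399_417, 678),  # partial year
}


def freed_by_dropping(years):
    """Return (files, size_gb) freed by dropping the given whole years."""
    files = sum(PER_YEAR[y][0] for y in years)
    size_gb = sum(PER_YEAR[y][1] for y in years)
    return files, size_gb


files, size_gb = freed_by_dropping(range(2020, 2023))  # 2020, 2021, 2022
print(f"{files:,} files, {size_gb:,} GB")  # prints "996,117 files, 3,524 GB"
```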
Impact on Druid
This would not affect data currently in Druid: what is already loaded stays there regardless of what is deleted from HDFS/Hive. However, if a Druid backfill is ever needed, it would only be possible for data within the rolling window; netflow data older than 3 years would be unrecoverable from Hive.
Key findings from T231339
From the original netflow sanitization setup discussion in T231339:
- The primary consumers of netflow data are SRE/network teams using Druid/Turnilo for traffic engineering and DDoS detection, rather than users querying Hive directly.
- When asked whether any periodic jobs consume the HDFS data or Hive table directly, the answer was "Nop" (T231339#1623421).
- @ayounsi noted the desire to keep traffic engineering data "for as long as possible" for historical trends, but acknowledged DDoS detection data can be dropped more quickly.
A 3-year window seems like a reasonable middle ground: it preserves meaningful historical coverage for trend analysis while capping growth.
Next steps
- Check with @ayounsi / SRE network team whether a 3-year rolling window would work for them
- Implement an automatic purge job (similar to existing data_purge puppet jobs)
- Perform a one-time deletion of existing data outside the agreed window
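The selection logic for the purge could look roughly like the sketch below. It is only an illustration under stated assumptions: each partition is taken to map to a calendar date, "older than 3 years" is read as strictly before today's date minus three calendar years, and the names `cutoff_date` and `expired` are hypothetical. The existing data_purge jobs may work differently.

```python
from datetime import date

RETENTION_YEARS = 3  # proposed window; still to be confirmed with SRE


def cutoff_date(today, years=RETENTION_YEARS):
    """Return the earliest date that is still inside the retention window."""
    try:
        return today.replace(year=today.year - years)
    except ValueError:
        # today is Feb 29 and the target year is not a leap year
        return today.replace(year=today.year - years, day=28)


def expired(partition_dates, today):
    """Return the partition dates that fall strictly before the cutoff."""
    cutoff = cutoff_date(today)
    return [d for d in partition_dates if d < cutoff]


# Hypothetical partition dates, for illustration only.
partitions = [date(2022, 12, 31), date(2023, 6, 1), date(2023, 6, 2)]
print(expired(partitions, today=date(2026, 6, 1)))
```

With a run date of 2026-06-01 the cutoff is 2023-06-01, so only the 2022-12-31 partition is selected; the partition dated exactly on the cutoff is kept.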