Druid deep storage contains old versions of segments that have been re-indexed. This is problematic because those old versions can contain sensitive data when the re-indexation was done for sanitization purposes.
@JAllemandou Do you have more context to share around how the sanitization works on Druid? I take it we would want to remove the old segments regardless of sensitive data. Is this different from: https://github.com/wikimedia/analytics-refinery/blob/master/bin/refinery-drop-druid-deep-storage-data?
First some details on how data storage works in Druid:
- After being computed, segments are stored on HDFS (deep-storage)
- Depending on loading rules defined in the Druid coordinator, historical nodes load/unload segments stored on HDFS, making them available/unavailable for query
- Unloading data from historical nodes doesn't actually delete it, as deep storage still contains it
- Deleting data from deep storage can only happen if the data is not loaded on historical nodes
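To make the deep-storage side concrete: each segment registered in Druid's metadata store carries a `loadSpec` pointing at its file in deep storage, and re-indexing writes a new `version` while the old version's files stay on HDFS. A rough sketch of such an entry (datasource name, dates, and path are all made up for illustration):

```json
{
  "dataSource": "netflow",
  "interval": "2021-01-01T00:00:00.000Z/2021-01-02T00:00:00.000Z",
  "version": "2021-02-01T00:00:00.000Z",
  "loadSpec": {
    "type": "hdfs",
    "path": "hdfs://namenode/user/druid/deep-storage/netflow/20210101T000000.000Z_20210102T000000.000Z/2021-02-01T00:00:00.000Z/0_index.zip"
  },
  "size": 12345678
}
```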
And now some details on our data-deletion for druid:
- We use coordinator load-rules to keep only the expected amount of data for some datasources (latest 1 month for webrequest, latest 3 months for pageview_hourly and event_navigationtiming)
- We use the deep-storage data-deletion script to delete segments that have already been unloaded (currently only webrequest - we keep 60 days of data even though only 30 are used in the cluster, just in case).
- We overwrite segments with a sanitized version of the data (for data that needs to be overwritten, currently only netflow)
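As a sketch of the first bullet, the coordinator load-rules for a datasource are an ordered JSON list; something roughly like this would keep the latest month loaded and drop everything older (periods and tier/replicant values here are illustrative, not our actual config):

```json
[
  {
    "type": "loadByPeriod",
    "period": "P1M",
    "tieredReplicants": { "_default_tier": 2 }
  },
  { "type": "dropForever" }
]
```

Note that `dropForever` only unloads segments from historical nodes; it does not touch deep storage, which is exactly why the separate deletion script (and this task) exist.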
The problem mentioned in this task is about the last case described above: when we overwrite segments with the sanitized version, the Druid historical nodes correctly pick up the latest version, but deep storage still contains the old versions of the segments.
I think that those old segments, not being used by the coordinator, can be deleted by running a kill-task over the period for which they exist, but this needs to be tested (it's scary, because we're asking Druid to delete data for a time period we want to keep).
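If we go that route, a kill task is a small spec POSTed to the overlord's standard task endpoint (`/druid/indexer/v1/task`); a hedged sketch, with a hypothetical datasource and interval:

```json
{
  "type": "kill",
  "dataSource": "netflow",
  "interval": "2021-01-01/2021-02-01"
}
```

If I read the Druid docs correctly, a kill task only permanently deletes segments that are already marked unused in the metadata store, so the current (sanitized) version within the interval should be safe - but that's precisely the behavior we'd want to verify on a test datasource first.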
Note: the retention of sensitive data due to this issue only applies to netflow, but there are other datasources we re-index and don't clean up, leaving unused files on HDFS - designing a broad cleanup would be great.