Audit session storage to determine max age of un-GC'd sessions
Open, Medium · Public

Description

Session data contains PII and is thus bound by Wikimedia's data retention guidelines. While sessions expire well before the max retention period (currently 90 days), expired data is not immediately removed from storage (Cassandra). It is retained for a minimum period (default of 10 days), and then GC'd as compaction dictates. While it seems highly unlikely that actual retention will be anywhere near 90 days, the exact duration is difficult to reason about because it is a function of many factors (throughput, cardinality, compaction concurrency, etc.). The easiest way to answer how long sessions are retained is to conduct an audit after session storage has been in use for a few weeks.
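To make the timeline above concrete, here is a minimal sketch of the arithmetic. The 10-day minimum and the 90-day limit come from the description; the session TTL of one day, the function name `retention_window`, and the choice to measure the 90 days from write time are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta, timezone

GC_GRACE = timedelta(days=10)       # minimum period expired data is kept (from the description)
MAX_RETENTION = timedelta(days=90)  # retention limit (from the description)


def retention_window(written_at, ttl, gc_grace=GC_GRACE, max_retention=MAX_RETENTION):
    """Return (expires_at, earliest_purge, deadline) for a single session write.

    earliest_purge is only a lower bound: the data becomes eligible for
    removal at that point, but is physically dropped whenever compaction
    next rewrites the files that hold it.
    """
    expires_at = written_at + ttl
    earliest_purge = expires_at + gc_grace
    deadline = written_at + max_retention  # assumption: limit is measured from write time
    return expires_at, earliest_purge, deadline


# Hypothetical example: a session written on Jun 4 2020 with an assumed 1-day TTL.
written = datetime(2020, 6, 4, tzinfo=timezone.utc)
expires_at, earliest_purge, deadline = retention_window(written, ttl=timedelta(days=1))

# Time compaction has to remove the data before the limit is reached.
slack = deadline - earliest_purge
print(f"expires {expires_at:%b %d}, purgeable from {earliest_purge:%b %d}, "
      f"deadline {deadline:%b %d}, slack {slack.days} days")
```

The slack is the part that cannot be derived on paper, since it depends on when compaction actually runs; that is what the audit is meant to measure.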

Event Timeline

Eevans created this task. May 11 2019, 12:32 AM
Restricted Application removed a project: Patch-For-Review. May 11 2019, 12:32 AM
Eevans triaged this task as Medium priority. May 11 2019, 12:33 AM

@Eevans is this a task for you or were you looking for input from @EvanProdromou ?

@kchapman it's a task for me (to follow-up on some time after moving to production)

Anything left to do here?

Eevans added a comment. Jun 4 2020, 4:25 PM

Yes; now that we've got the entire workload on the cluster, we should wait for a period of at least 10 days (though 30 days would be my recommendation), and then audit the dataset. How exactly we go about this audit is something that needs to be sussed out; the goal would be to establish confidence that we do not have tombstones hanging around that would violate our data retention guidelines.

I'd be happy to discuss methodologies for such a test.
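As one possible starting point for that discussion, the check at the end of such an audit could look like the sketch below. It assumes the write timestamps of everything still on disk have already been collected somehow; how to collect them is exactly the open question, and nothing here addresses it. The function name `audit_ages` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAX_RETENTION = timedelta(days=90)  # retention limit from the task description


def audit_ages(write_times, now, max_retention=MAX_RETENTION):
    """Return (oldest_age, violation_count) for data still present in storage.

    write_times: iterable of timezone-aware datetimes, one per record
    (live, expired, or tombstone) found on disk during the audit.
    """
    ages = [now - t for t in write_times]
    oldest = max(ages, default=timedelta(0))
    violations = sum(1 for age in ages if age > max_retention)
    return oldest, violations


# Hypothetical sample: three records found on disk at audit time.
now = datetime(2020, 7, 4, tzinfo=timezone.utc)
sample = [
    now - timedelta(days=2),
    now - timedelta(days=14),
    now - timedelta(days=95),  # would exceed the 90-day limit
]
oldest, violations = audit_ages(sample, now)
print(f"oldest record: {oldest.days} days, over limit: {violations}")
```

Reporting the oldest age as well as the violation count shows how close the cluster gets to the limit, not only whether it has been crossed.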