Session data contains PII and is thus bound by Wikimedia's data retention guidelines. While sessions expire well before the maximum retention period (currently 90 days), expired sessions are not immediately removed from storage (Cassandra). Expired data is retained for a minimum period (10 days by default) and then garbage-collected as compaction dictates. While it seems highly unlikely that actual retention will be anywhere near 90 days, the exact duration is difficult to reason about because it is a function of many factors (throughput, cardinality, compaction concurrency, etc.). The easiest way to answer how long sessions are retained is to conduct an audit after session storage has been in use for a few weeks.
|Open||None||T88445 MediaWiki active/active datacenter investigation and work (tracking)|
|Stalled||Eevans||T206016 Create a service for session storage|
|Open||Krinkle||T270223 FY2020-2021: Enable basic Multi-DC operations for read traffic (tracking)|
|Open||None||T270225 Finish session storage to actually meet multi-DC requirements|
|Open||Eevans||T222990 Audit session storage to determine max age of un-GC'd sessions|
Yes. Now that we've got the entire workload on the cluster, we should wait for a period of at least 10 days (though 30 days would be my recommendation), and then audit the dataset. How exactly we go about this audit is something that needs to be sussed out; the goal would be to establish confidence that we do not have tombstones hanging around that would violate our data retention guidelines.
I'd be happy to discuss methodologies for such a test.
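As a starting point for that discussion, here is a minimal sketch of what the final check of such an audit might look like. It assumes we can somehow extract, for every session still physically on disk, its write time and TTL (for example from an SSTable dump); that extraction step is not shown. The `StoredSession` type, the `audit` function, the one-hour TTL, and the choice to measure age from write time are all hypothetical and only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Maximum retention period from the task description (90 days).
MAX_RETENTION = timedelta(days=90)


@dataclass(frozen=True)
class StoredSession:
    """Hypothetical record for one session still present on disk."""
    key: str
    written_at: datetime
    ttl: timedelta

    @property
    def expires_at(self) -> datetime:
        return self.written_at + self.ttl


def audit(sessions, now, max_retention=MAX_RETENTION):
    """Return (oldest_age, violations) for expired-but-not-yet-purged sessions.

    oldest_age is the age, measured from write time (an assumption), of the
    oldest expired session still on disk; violations lists the keys of
    expired sessions older than max_retention.
    """
    expired = [s for s in sessions if s.expires_at <= now]
    oldest_age = max((now - s.written_at for s in expired), default=timedelta(0))
    violations = [s.key for s in expired if now - s.written_at > max_retention]
    return oldest_age, violations


# Made-up sample data; the one-hour TTL is a placeholder, not the real value.
now = datetime(2019, 6, 1, tzinfo=timezone.utc)
sample = [
    StoredSession("a", now - timedelta(days=100), timedelta(hours=1)),
    StoredSession("b", now - timedelta(days=20), timedelta(hours=1)),
    StoredSession("c", now - timedelta(minutes=5), timedelta(hours=1)),
]
oldest_age, violations = audit(sample, now)
```

With the sample data above, session "c" is still live and is ignored, while "a" is both expired and older than 90 days, so it would be reported. An empty `violations` list on the real dataset, after the waiting period, is the kind of result that would give us confidence in the retention behaviour.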