In conversations with @JAllemandou, we learnt that running an API like the Pageviews API for many years has produced some lessons about data retention. Mainly, keeping the data available forever has caused Cassandra performance issues. Deleting data now, at the database's current size, is impractical. A Cassandra TTL is also applied only when data is written: a table's default TTL can be changed later, but the change does not expire rows that already exist, so in practice it needs to be in place from the start.
Thus we should consider setting TTLs for all tables of Commons Impact Metrics before we launch.
In this ticket we should:
- Wait until we have a final allow-list. Then do a full run of the ETL, all the way to Cassandra, and measure the row count and data size we get for a couple of backfilled months.
- Extrapolate what the size in Cassandra would be for all 14 of our tables over 1, 3, and 5 years.
- With this info, consider whether we'd need to set TTLs, and if so, at what duration.
- If we set no TTL, consider what to do with the data long term: should we delete it? If so, when: after 1, 3, or 5 years? Long term we want T366631.
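
The extrapolation step could be sketched roughly as below. This is only an illustration under stated assumptions: growth is linear (each month adds about the same amount as the backfilled months did), and replication factor, compaction overhead, and tombstones are ignored. The `TableSample` type, the `projected_bytes` helper, and the table names and sizes are all hypothetical placeholders, not measurements.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

MONTHS_PER_YEAR = 12


@dataclass
class TableSample:
    """Average monthly growth of one table, measured from backfilled months."""
    name: str
    bytes_per_month: int


def projected_bytes(
    samples: Iterable[TableSample],
    years: int,
    ttl_months: Optional[int] = None,
) -> int:
    """Linearly project the total size of all tables after `years`.

    If `ttl_months` is given, data older than the TTL is assumed to have
    expired, so each table never holds more than `ttl_months` of data.
    """
    months = years * MONTHS_PER_YEAR
    if ttl_months is not None:
        months = min(months, ttl_months)
    return sum(s.bytes_per_month * months for s in samples)


# Placeholder numbers only; real values would come from the backfill run.
samples = [
    TableSample("hypothetical_table_a", 1_000),
    TableSample("hypothetical_table_b", 500),
]

for years in (1, 3, 5):
    no_ttl = projected_bytes(samples, years)
    with_ttl = projected_bytes(samples, years, ttl_months=36)
    print(f"{years}y: no TTL = {no_ttl} bytes, 3-year TTL = {with_ttl} bytes")
```

With a TTL, the projected size stops growing once the horizon exceeds the TTL, which is the main figure to compare against the no-TTL projection when deciding on a duration.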