In conversations with @JAllemandou, we learned that maintaining an API like the [[ https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews | Pageviews API ]] over many years has taught some lessons about data retention. Mainly, keeping the data available forever has caused Cassandra performance issues. Deleting data now, at that database size, is impractical, and a Cassandra table's default TTL only applies to rows written after it is set, so it cannot expire data that is already there.
Thus we should consider setting TTLs for all tables of Commons Impact Metrics before we launch.
In this ticket we should:
[x] Wait until we have a final allow-list. Do a full run of the ETL all the way to Cassandra, and see how many rows and how much data we get for a couple of backfilled months.
[x] Extrapolate what the size in Cassandra would be for all our 14 tables over 1, 3, 5 years.
[x] With this info, consider whether we'd need to set TTLs, and if so, at what duration.
[] If we decide against TTLs, consider what to do with the data long term: should we delete it? If so, when should that happen? After 1, 3, or 5 years?
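
As a rough illustration of the extrapolation step, here is a minimal sketch assuming growth is linear (the same amount of new data every month). The table names and per-month sizes are hypothetical placeholders, not measurements from the backfill, and `projected_size_gb` and `ttl_seconds` are helper names made up for this example. The only Cassandra-specific fact used is that TTL values are expressed in seconds.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def projected_size_gb(monthly_gb_by_table, years, replication_factor=1):
    """Linear projection of total size in GB after `years` of constant monthly growth."""
    months = years * 12
    return sum(monthly_gb_by_table.values()) * months * replication_factor


def ttl_seconds(years):
    """Convert a retention period in years to seconds, assuming 365-day years."""
    return years * 365 * SECONDS_PER_DAY


# Hypothetical placeholder values, NOT real measurements.
# Replace with the per-table sizes observed for the backfilled months.
monthly_gb_by_table = {
    "example_table_a": 2.0,
    "example_table_b": 0.5,
}

projections = {
    years: projected_size_gb(monthly_gb_by_table, years) for years in (1, 3, 5)
}

for years, size_gb in projections.items():
    print(f"{years} year(s): ~{size_gb:.0f} GB, TTL would be {ttl_seconds(years)} s")
```

With real numbers from the backfill in place of the placeholders, the same loop gives the 1, 3, and 5 year totals. Passing `replication_factor` scales the result to on-disk size across replicas, if that is the figure we care about.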