https://phabricator.wikimedia.org/T102991 highlighted that we had cached objects lasting longer than 30 days, which was both problematic and unexpected. That specific issue has been fixed: https://gerrit.wikimedia.org/r/#/c/229714/ . However, while researching it I realized there are a lot of problems with how we handle cache TTLs, especially across the layers and tiers of caching. The key issues are:
- We're limiting frontends to 120s object lifetimes in the common case. This reduces the frontend hitrate, and it also explains the hitrate anomalies seen when cache size was increased during earlier experimentation: https://phabricator.wikimedia.org/P969
- We're not properly communicating TTLs from Tier1 backends to Tier2 backends, or from either tier's backends to the frontends in general. That is a strong blocker for naively lifting the 120s limitation on the front caches....
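
To make the second point concrete, here is a rough Python sketch of why uncommunicated TTLs can stack across layers. It is only a toy model under stated assumptions, not our actual VCL: the function `worst_case_lifetime` and its parameters are hypothetical, and it assumes the worst case where each layer fetches an object from the layer behind it just before that layer's copy expires.

```python
from typing import Optional

DAY = 86400  # seconds


def worst_case_lifetime(ttl_s: int, layers: int, propagate_age: bool,
                        frontend_cap_s: Optional[int] = None) -> int:
    """Worst-case seconds an object can keep being served after leaving the origin.

    Toy model (assumption): layers are chained, and each one fetches the
    object from the layer behind it at the last moment before expiry there.
    """
    age_s = 0  # time already spent in the layers behind the current one
    for layer in range(layers):
        if propagate_age:
            # The layer learns how old the object is and keeps only what is left.
            local_ttl = max(0, ttl_s - age_s)
        else:
            # The layer restarts the clock with the full TTL.
            local_ttl = ttl_s
        is_frontend = layer == layers - 1
        if is_frontend and frontend_cap_s is not None:
            # Hypothetical stand-in for the 120s frontend limit.
            local_ttl = min(local_ttl, frontend_cap_s)
        age_s += local_ttl
    return age_s


if __name__ == "__main__":
    ttl = 30 * DAY
    for propagate in (False, True):
        for cap in (120, None):
            total = worst_case_lifetime(ttl, 3, propagate, cap)
            print(f"propagate_age={propagate} cap={cap}: {total / DAY:.3f} days")
```

In this model, three layers that do not pass the object's age along can stretch a 30-day TTL to 90 days once the frontend cap is removed, while the 120s cap only bounds the frontend's share of the excess. With the age propagated, the total stays at the original TTL with or without the cap.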