Currently we set a hard cap on object lifetime at 30 days in our VCL for all clusters (in addition to a few tighter restrictions in certain cases). I think we can and should reduce this cap.
Possible Concerns
- Obviously, cache hitrate could be negatively impacted. However, I suspect this isn't a big problem in practice. If we end up reducing some long-lived objects from 30 days to, say, 14 days, the effective hitrate for a very hot object is virtually unchanged. For example, if it's requested once per second and virtually never changes, it misses only once per lifetime, so we've gone from an effective hitrate of 99.9999614% to 99.9999173%. The less hot an object is, the less it matters for overall perf/hitrate averages anyways.
- Long-lived objects help protect us in certain operational corner cases. The principal example is taking a cache cluster offline from live traffic for multiple days (e.g. due to network link risks), and then bringing it back online later without wiping (because the link was never actually down, and purges were flowing fine). In that scenario, the cache will effectively wipe itself anyways if the downtime exceeds the lifetime of most (or all) objects, so a shorter cap shrinks the downtime we can ride out with a warm cache.
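The hitrate figures in the first concern can be reproduced with a short calculation. This is a minimal sketch, assuming a single object requested at a constant rate that never changes and is never purged or evicted, so it misses exactly once per lifetime window. The function name `steady_state_hit_rate` is made up for illustration.

```python
SECONDS_PER_DAY = 86400


def steady_state_hit_rate(requests_per_second: float, max_lifetime_days: float) -> float:
    """Hit rate for one object that is refetched once per lifetime window.

    Assumes a constant request rate and no purges or evictions.
    """
    requests_per_lifetime = requests_per_second * max_lifetime_days * SECONDS_PER_DAY
    # Exactly one request per lifetime window is the miss that refills the cache.
    return 1.0 - 1.0 / requests_per_lifetime


# Compare the current cap with the two proposed caps at one request per second.
for days in (30, 21, 14):
    print(f"{days:>2} days: {steady_state_hit_rate(1.0, days) * 100:.7f}%")
```

Lowering the request rate in the call shows how the gap between caps widens for colder objects, which are also the ones that contribute least to the overall average.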
The upside is that by reducing the maximum cache lifetime, we reduce concerns and headaches related to stale objects (or at least, fears of very-stale objects) from code/asset deployers. In other words, we're able to provide a tighter guarantee of the form "Even if all else goes wrong with invalidation, nothing in this cache can possibly be older than X".
I'd like to propose that we come down first from 30 to 21 days, wait a month to make sure we've seen the effects, and then move down to 14 days, and remain at that value for the foreseeable future.
I've taken a few stats samples so far (single cache host, ~10 minute samples) to get some preliminary ideas. On the upload cluster, responses served with an Age: header >= 86400 (1 day) make up 0.01% of responses. On the text cluster, the distribution looks like:
1s+: 99.70% (age < 1s: 0.30%)
1m+: 90.71% (age < 1m: 9.29%)
1h+: 53.85% (age < 1h: 46.15%)
1d+: 37.33% (age < 1d: 62.67%)
7d+: 12.37% (age < 7d: 87.63%)
14d+: 0.70% (age < 14d: 99.30%)
21d+: 0.67% (age < 21d: 99.33%)
[original figures in description here were flawed, these are more-valid numbers]
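For anyone repeating the measurement, a table in this shape can be computed from a list of sampled Age: header values. This is only a sketch: how the samples above were actually collected isn't covered here, so the input is assumed to be a plain list of Age values in seconds (one per response), and `age_distribution` and `THRESHOLDS` are hypothetical names.

```python
SECONDS_PER_DAY = 86400

# Bucket boundaries matching the table above: (label, threshold in seconds).
THRESHOLDS = [
    ("1s", 1),
    ("1m", 60),
    ("1h", 3600),
    ("1d", SECONDS_PER_DAY),
    ("7d", 7 * SECONDS_PER_DAY),
    ("14d", 14 * SECONDS_PER_DAY),
    ("21d", 21 * SECONDS_PER_DAY),
]


def age_distribution(ages):
    """Return (label, percent of responses with Age >= threshold) per bucket."""
    if not ages:
        raise ValueError("need at least one sampled Age value")
    total = len(ages)
    rows = []
    for label, threshold in THRESHOLDS:
        at_or_above = sum(1 for age in ages if age >= threshold)
        rows.append((label, 100.0 * at_or_above / total))
    return rows


def format_rows(rows):
    """Render rows in the same layout as the table above."""
    return [f"{label}+: {pct:.2f}% (age < {label}: {100.0 - pct:.2f}%)" for label, pct in rows]


# Made-up sample values purely for illustration, not real measurements.
sample_ages = [0, 30, 7200, 90000]
for line in format_rows(age_distribution(sample_ages)):
    print(line)
```

The 21d+ and 14d+ rows are the ones to watch between the two steps of the proposal, since they show the share of responses currently served from objects older than each proposed cap.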