Fix Varnish TTLs across the board
Closed, Resolved · Public

Description

https://phabricator.wikimedia.org/T102991 highlighted that we had cached objects lasting longer than 30 days, which was both problematic and unexpected. That specific issue has been fixed in https://gerrit.wikimedia.org/r/#/c/229714/, but while researching it I realized there are a lot of problems with how we handle cache TTLs, especially across the layers and tiers of caching. The key issues are:

  1. We're limiting frontends to 120s object lifetimes in the common case. This reduces the frontend hit rate, and also explains the hit-rate anomalies seen on cache size increases during earlier experimentation: https://phabricator.wikimedia.org/P969
  2. We're not properly communicating TTLs from Tier1 backends to Tier2 backends, or from either tier's backends to the frontends in general. That is a strong blocker for naively lifting the 120s limitation on the frontend caches....

Event Timeline

BBlack raised the priority of this task to High.
BBlack updated the task description.
BBlack added a project: Traffic.
BBlack subscribed.
Restricted Application added a subscriber: Aklapper.

OK, I've dug into this some (read the Varnish source code to confirm its behavior, re-read our VCL, stared at lots of parsed Varnish logs, etc.) and it's not as bad as I initially thought. Most of it is working pretty sanely, actually, and TTLs are usually correctly transitive across the layers.

The only strange case is responses with no cacheability info, which fall back to default_ttl. That value should probably be synchronized across the layers of a cluster (as in, bump the upload/text frontends to a 30d default). Very few text-cluster requests use the default anyway; most use app-supplied TTLs.

Additionally, mobile should be using the same 30d default_ttl that text/upload use (right now it's using 120s for both layers).
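
To make the default_ttl mismatch concrete, here is a rough Python sketch of how a cache layer might pick a TTL. It is a simplified model, not Varnish's actual logic or our VCL: the function name `derive_ttl` and the header dicts are made up for illustration, and it assumes only Cache-Control (s-maxage, then max-age) and Age matter, ignoring Expires, status codes and any VCL overrides.

```python
import re

THIRTY_DAYS = 30 * 24 * 3600  # 30d expressed in seconds


def derive_ttl(headers, default_ttl):
    """Return the TTL in seconds that one cache layer would assign.

    Hypothetical, simplified model: s-maxage wins over max-age, Age is
    subtracted from whichever is found, and default_ttl applies only when
    the response carries no cacheability info at all.
    """
    cache_control = headers.get("Cache-Control", "")
    age = int(headers.get("Age", "0"))
    for directive in ("s-maxage", "max-age"):
        pattern = r"(?:^|[\s,])" + directive + r"\s*=\s*(\d+)"
        match = re.search(pattern, cache_control, re.IGNORECASE)
        if match:
            # App-supplied TTL: the layer's default_ttl is irrelevant here.
            return max(int(match.group(1)) - age, 0)
    return default_ttl


# App-supplied TTL, already 600s old when it reaches the next layer:
# both layers agree on the remaining lifetime, whatever their defaults are.
app_response = {"Cache-Control": "s-maxage=3600, max-age=0", "Age": "600"}
print(derive_ttl(app_response, default_ttl=120))          # 3000
print(derive_ttl(app_response, default_ttl=THIRTY_DAYS))  # 3000

# No cacheability info: each layer falls back to its own default_ttl,
# so a 120s frontend and a 30d backend disagree.
bare_response = {}
print(derive_ttl(bare_response, default_ttl=120))          # 120
print(derive_ttl(bare_response, default_ttl=THIRTY_DAYS))  # 2592000
```

In this model the layers only diverge in the fallback case, which is why syncing default_ttl across the layers of a cluster is the piece that needs fixing.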

Change 230808 had a related patch set uploaded (by BBlack):
cache_(text|upload): frontend default_ttl => 30d

https://gerrit.wikimedia.org/r/230808

Change 230809 had a related patch set uploaded (by BBlack):
cache_mobile: def_ttl 30d

https://gerrit.wikimedia.org/r/230809

Change 230808 merged by BBlack:
cache_(text|upload): frontend default_ttl => 30d

https://gerrit.wikimedia.org/r/230808

Change 230809 merged by BBlack:
cache_mobile: def_ttl 30d

https://gerrit.wikimedia.org/r/230809

BBlack claimed this task.
BBlack raised the priority of this task from High to Unbreak Now!.
BBlack moved this task from Backlog to Done on the Traffic board.