Add jitter to BagOStuff TTLs
Open, Low, Public

Description

Fixed TTLs (time to live) can lead to thundering herd problems, since groups of application servers are often restarted simultaneously.

We saw this recently with SyntaxHighlight's Pygmentize::getVersion(), which determines the version of Pygments by shelling out to Python via Shellbox (~200ms) and caches the result in APC with a one-hour TTL. This resulted in regular stampedes on Shellbox whenever the key expired for a group of app servers.

Rather than repeatedly rediscover this problem, we can make BagOStuff subtract some slight, random amount of time from TTLs by default. Adding time to TTLs may be unsafe because users of the interface may be relying on cached values never being older than the TTL, but subtraction should be fine, since users already have to contend with the possibility of values falling out of the cache before their expiration due to the cache layer's replacement policy.
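To illustrate the effect, here is a minimal simulation sketch in Python (not MediaWiki code). The function name `expiry_times`, the 10% jitter bound, and the host count are all assumptions made up for this example. It models a group of hosts that are restarted at the same moment and cache a value with a one-hour TTL, with and without subtractive jitter.

```python
import random


def expiry_times(num_hosts, ttl, max_jitter_fraction, rng):
    """Return, per host, how many seconds after a simultaneous restart the key expires.

    Hypothetical helper: each host subtracts a random amount of up to
    max_jitter_fraction * ttl from the requested TTL. Time is never added.
    """
    times = []
    for _ in range(num_hosts):
        jitter = rng.uniform(0, ttl * max_jitter_fraction)
        times.append(ttl - jitter)
    return times


rng = random.Random(42)

# Fixed TTL: every host expires at the same instant, so they all hit the backend together.
fixed = expiry_times(50, 3600, 0.0, rng)

# Subtractive jitter of up to 10%: expiries spread over the last six minutes of the hour.
jittered = expiry_times(50, 3600, 0.1, rng)

print("distinct expiry times, fixed TTL:", len(set(fixed)))
print("distinct expiry times, jittered TTL:", len(set(jittered)))
```

With a fixed TTL all hosts share one expiry time. With jitter, no value is ever cached for longer than the requested TTL, which is the property the description relies on.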

Event Timeline

<Krinkle> WANCache does this by default insofar as it randomly decreases the TTL at "get"-time through pre-emptive regeneration.
<Krinkle> but we don't have something like it for raw BagOStuff usage, such as for php-apcu.
<Krinkle> we could do something like... if the TTL is more than a minimum threshold, decrease it by a small random proportion of the TTL, and only for a small random portion of when we hit the "set" path.

This would create effectively the same effect, though unlike WANCache it would be sync instead of async. That's fine and well within the contract, e.g. as if the value had been evicted earlier than we expected. Afaik we don't expose the current TTL anywhere. We do have changeTTL. I'm wondering whether this could e.g. cause some code to get stuck in a loop for a while trying to increase the TTL. I'd hope no such code exists.
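The set-path idea above could be sketched roughly as follows. This is Python for illustration only, not BagOStuff code; the function `effective_ttl` and the three constants are hypothetical, and their values are placeholders rather than proposed defaults. It assumes a TTL of 0 means "no expiry" and leaves it untouched.

```python
import random

# Hypothetical tuning knobs, chosen only for this example.
MIN_TTL_FOR_JITTER = 60      # TTLs below this (seconds) are left as-is
MAX_JITTER_FRACTION = 0.05   # subtract at most 5% of the TTL
JITTER_PROBABILITY = 0.25    # only adjust this share of "set" calls


def effective_ttl(ttl, rng=random):
    """Return the TTL to actually store for a requested TTL in seconds.

    The result is never larger than the requested TTL.
    """
    # Short TTLs (1s, 2s, ...) and 0 (assumed to mean "no expiry") pass through.
    if ttl < MIN_TTL_FOR_JITTER:
        return ttl
    # Most "set" calls are untouched; only a random portion gets a shorter TTL.
    if rng.random() >= JITTER_PROBABILITY:
        return ttl
    max_reduction = max(1, int(ttl * MAX_JITTER_FRACTION))
    return ttl - rng.randint(1, max_reduction)
```

Because the stored TTL can only shrink, a caller sees the same behaviour as an early eviction, which it already has to tolerate.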

If we want to be conservative, we could start with having it as a feature flag. Not opt-in per-command caller, but for the LocalServerObjectCache service object (e.g. the APCUBagOStuff instance) as a whole.

It could be implemented via metaget...but that's a ways in the future.

@aaron I'm not sure what the connection is with metaget. I believe this task could be solved by having BagOStuff deduct a light mt_rand range from the given $ttl (ensuring not to lower it below a minimum threshold so that things like 1s or 2s TTLs are left as-is). It seems we'd need that regardless for php-apcu which isn't Memcached, and this task is mainly motivated by the php-apcu use case.

How would/could we do this with metaget for memcached?

Metaget lets you get the remaining time-to-live from memcached itself (which could be fudged with a random subtractor). Randomizing the TTL would still make all threads see a key expire at the same time (for apcu, all threads on a host).
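The difference from set-time jitter is that each reader makes its own decision. A rough Python sketch of that read-time check, assuming the cache can report a key's remaining TTL in seconds; `should_treat_as_expired` and `max_fudge` are hypothetical names, and this is not memcached's or BagOStuff's API:

```python
import random


def should_treat_as_expired(remaining_ttl, max_fudge, rng=random):
    """Decide, for one reader, whether to regenerate a value early.

    remaining_ttl: seconds left before the cache itself expires the key.
    max_fudge: largest random amount (seconds) subtracted from that figure.
    """
    fudge = rng.uniform(0, max_fudge)
    return remaining_ttl - fudge <= 0
```

Far from expiry every reader sees a hit. Within the last `max_fudge` seconds, each reader independently has a growing chance of treating the key as a miss, so regeneration is spread out rather than triggered for all threads at once.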

Ack, managing every thread separately could in theory help, but have we seen that become an issue in practice? For particularly risky keys, we have numerous opt-in mechanisms to avoid that, and generally it looks like we're moving toward slightly smaller appservers in terms of threads/workers per php-fpm instance (with k8s but also more generally given the bottlenecks we see).

Formalising what we have and making it work by default for all keys as subtraction seems like a solid step forward with virtually no change in complexity and no change in public API, yet getting almost all the gain there is to be had.

Oh, I missed that this was apcu to begin with. Yeah, that could be done for APCUBagOStuff for medium-high TTL keys easily enough.