
Stop using persistent storage in our backend varnish layers.
Closed, Resolved · Public

Description

Taking the action in the title is relatively trivial. The point of this ticket is to outline the rationale and invite feedback that might change our course of action.

Background

To date, we've always used the persistent storage engine for our important Varnish backend caches. The upside of persistence is obvious: the cache doesn't have to be rebuilt from scratch, at a high miss-rate cost, in any of a number of scenarios where varnishd or the kernel/OS crash or need planned restarts. With that blanket protection against many adverse conditions, for the most part we don't even have to think through the details of these corner cases. However, there is a growing number of downsides to our use of persistent storage, and that warrants exploring its trade-offs in greater depth.
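
For concreteness, here's roughly what the storage configuration difference itself looks like on the varnishd command line. This is an illustrative sketch only: the listen port, VCL path, storage file paths, and sizes are placeholders, not our production values.

```
# Illustrative only: the listen port, VCL path, storage paths, and sizes
# below are placeholders, not our production values.

# Today: the (deprecated) persistent engine; cache contents survive a
# varnishd restart. (Varnish 4.1 spells the engine "deprecated_persistent".)
varnishd -a :3128 -f /etc/varnish/backend.vcl \
  -s persistent,/srv/varnish/persist.bin,300G

# Proposed: the plain file engine; the backing file is just a paging area,
# so cache contents are effectively discarded on every restart.
varnishd -a :3128 -f /etc/varnish/backend.vcl \
  -s file,/srv/varnish/cache.bin,300G
```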

Persistence downsides

  • It's been deprecated upstream. phk wrote about its deprecation a little over two years ago, and nothing has really changed since then: https://www.varnish-cache.org/docs/trunk/phk/persistent.html . Continuing to rely on this deprecated feature is probably a dangerous path going forward. It will be hard to get any real support for any future persistent storage issues we may encounter in Varnish 4.
  • The persistent storage engine has always been the most complex storage engine in Varnish in code terms. All other storage engines are far simpler, and thus likely to have fewer bugs and better performance. For comparison, persistent is ~59K raw lines of code, whereas 'file' is ~14K (so about 4:1).
  • To that point, we do see varnishd backend crashes related to persistence as well, which sometimes require wiping the persistent store anyway. These have gone up in rate in Varnish 4, which I suppose should be expected post-deprecation.
  • Sometimes persistence simply doesn't work: due to some condition at shutdown/crash time, the store comes back inconsistent on restart. That doesn't happen often in normal scenarios, but it can happen, meaning persistence isn't fool-proof to begin with.
  • We're already maintaining multiple varnishd patches for the persistent engine to make it real-world usable for us, which is a road we'd rather not keep going down long-term (e.g. fixing the crashers and keeping up with future development in this deprecated code). If it weren't for these persistence-specific patches, we could be running a completely stock Debian Varnish 4 package today.
  • It doesn't work with xkey unless we significantly hack both xkey and varnishd, because xkey's indices are memory-only.
  • While persistence does work with bans and purges, there's always the issue that a host/varnishd restart can lose purges or bans issued while it's down. We tend to stay aware of this, but it's an issue that simply goes away without persistence: a freshly restarted cache is empty and has nothing stale left to purge or ban.

Failure scenarios

As mentioned earlier, the upsides are about smoother handling of various failure scenarios. To argue about persistence's utility, we have to look at various classes of failure scenario in depth.

  1. Single-machine: It's a given that persistence doesn't matter much in single-machine scenarios. Losing one machine's storage has negligible impact given how robust our layering and tiering is.
  2. Whole-DC: Even losing an entire datacenter is manageable, again thanks to our robust layering and tiering and geodns user remapping. In the worst case (e.g. a power failure in an active primary DC), we simply remap the failed DC's direct users in geodns to the surviving datacenters, which have larger sets of existing backend cache, while the rebooted datacenter refills its storage at reasonable rates from the true misses. We would do that remapping at the start of the failure anyway, so it's just a question of how quickly we can move those users back afterwards.
  3. Global sudden power outage: This is theoretically a problem, but basically the risk of this is so low that it shouldn't be a decision factor. We have one datacenter that's completely failure-separated from the rest in Amsterdam. The three other datacenters in the US have a high degree of failure-diversity: each is in a separate national-level power grid (Eastern/Western Interconnects and ERCOT), they're operated by separate companies (Equinix, CyrusOne, UnitedLayer), and they're physically separated enough (opposite coasts and Texas) that they don't share natural disasters, either. If such a scenario somehow did occur, the fallout for the rest of our stack (and probably most other internet services) would be dramatic enough that we probably wouldn't be too concerned by some additional cache warming time in the aftermath.
  4. Global software crash/restart: This is far more likely than global power loss and contains all the real cases worth arguing about, so it breaks down into sub-cases:
    1. Crash (Organic) - This is the scenario where some hidden bug is accidentally triggered globally on all caches nearly-simultaneously. An example might be a Linux kernel bug that causes all kernels to crash at a certain future date (e.g. a y2036 bug or similar). These scenarios are definitely possible, and impossible to universally prevent. However, they're very rare, and they're going to cause an outage with or without persistence; it's only a question of persistence potentially reducing the duration/impact a little. Any such traffic-induced bug in varnishd (as opposed to the kernel) would fall under the same layer-protection arguments outlined for the exploitative crash case below.
    2. Crash (Exploitative) - Someone finds a varnishd- or kernel-crashing bug that can be triggered by traffic, and uses it to crash all of our cache infrastructure nearly-simultaneously, across all layers/tiers. While this sounds scary, I think it's unlikely. The caches are isolated from the public network; their inbound traffic only comes through designated LVS ports and protocols. Kernel bugs that crash on regular HTTP network traffic are very rare in the long view, especially ones that would crash the cache machines behind LVS without also crashing LVS itself. I wouldn't expect to ever face one, or plan anything around it. A varnishd-crashing request is more likely, but our layers/tiers tend to protect us here. To crash out multiple layers of varnish, the bug would have to be of a very specific nature: the exploitative/buggy request would have to not crash varnish on the inbound side (passing through as a request to the backend), but cause varnishd to crash on the response side. Additionally, it would have to crash after successfully sending the response to that varnishd's "client", not just while receiving it or processing it internally. That's the only way a single request can wipe out all layers in one go. Otherwise only the front-most or back-most layer crashes, and the post-restart misses are covered by other layers or tiers.
    3. Operational (mistake) - We do something very dumb with a salt command or puppet commit, causing all varnishds to restart or all cpNNNN machines to reboot at once. This is definitely a scenario where persistence would help with recovery. However, it's also definitely a global outage event either way; persistence only changes the recovery time. And even with persistence there are still many dangerous ways to make stupid mistakes; we can't prevent them all. To date, this has never happened AFAIK.
    4. Operational (intentional) - We may need to reboot all the cache machines, or restart all the varnishds, in a very short window of time to fix a very critical security flaw. We don't yet have as much hard data as I'd like on the exact timing we could handle without persistence, but we do have some datapoints as hints: we recently restarted all global frontends in ~30 minutes without issue, with their respective backends covering the loss and no user-visible fallout. We also know that if the frontends are up and running with full caches, it's generally safe to wipe all the backend storage at once (the frontend miss rate is relatively low and fills in fairly painlessly thanks to chashing and coalescing). Taken together, that tells us we could manage restarting all varnishd daemon instances over the course of at most a few hours by staggering through the frontends first, assuming the host/kernel stayed up throughout. Kernel-level restarts would have to go a little slower, but I'd still estimate ~6 hours or less if we work through it intelligently one machine at a time, assume ~5 minutes per reboot on average for the critical cache clusters, and start at the outer-most DC tiers. In these scenarios we're not forced to wipe an entire layer in one go; we stagger the miss-refill as machines reboot (a rough sketch of that staggering follows this list).
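
To make the staggering in scenario 4 concrete, here's a rough sketch of a one-machine-at-a-time backend restart loop. The host list, systemd unit name, and settle time are hypothetical placeholders; in practice we'd drive something like this through salt rather than a plain ssh loop, and likely depool/repool each host around its restart.

```
# Hypothetical sketch of a staggered backend restart, outer DC tiers first.
# Hostnames, the systemd unit name, and the settle interval are placeholders.

HOSTS="cp3001 cp3002 cp2001 cp2002 cp1001 cp1002"  # placeholder cpNNNN list
SETTLE=300  # seconds to let chashing/coalescing refill misses before moving on

for host in $HOSTS; do
  echo "Restarting backend varnishd on ${host}..."
  ssh "$host" "sudo systemctl restart varnish.service"  # placeholder unit name
  sleep "$SETTLE"  # only one node's storage is cold at any given time
done
```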

Analysis and Future improvements

Most of the bad scenarios above where persistence might help are statistically rare; in the aggregate, they're the kind of things that might happen once on a scale of years. We have several possible future avenues for improving these scenarios over time which don't involve Varnish's deprecated persistence. As we work through them in our future plans, the odds of these scenarios impacting us will continue to drop. Thus I tend to think that, on balance, we're best off dropping persistence now while switching to Varnish 4. Foremost in my mind for future improvements are these:

  1. Imminent esams link improvements ( T136717 ) - The one scenario that's bitten us the most in the past is the specific case of refilling all disk caches in esams at once, as it tends to be network-intensive over one of our worst (highest-latency, and currently lowest-quality) links. I don't recall whether we actually saturated it temporarily, or just went over our commit and got charged money (possibly both). With the improvements in the pipeline here, esams networking should be less of an issue.
  2. Active-Active support for the whole stack ( T88445 ) - We're still a fairly long way out from this, I think, but as we get there it dramatically reduces the impact of active-primary-DC events. In the final design, the two 'halves' of our cache DC network will essentially be split (ulsfo->codfw->app, esams->eqiad->app) and not share a common path to the applayer, modulo the sideways cache pulls below...
  3. Sideways Only-If-Cached requests between the pair of core DCs ( T142841 ) - This addresses some of the same concerns as the above during failure scenarios, but we'll get there faster than full active:active at the MW level. Once it works between the core DCs, we may look at similar things for the cache-only DCs as well, in the cases where it makes latency sense at all (a minimal illustration of the request semantics follows this list).
  4. Apache Traffic Server ( T96853 ) - This has been on the long-term radar, but untouched, for a long time now. In more recent IRC discussions, I've brought up the idea that we may be able to transition just our backend layers to ATS as a first (and possibly only) move, and evaluate switching frontends away from Varnish at a later date. The idea there is that most of our ridiculous VCL complexity lives only in the frontend (memory) layer: that's where analytics, geoip, request mangling, DoS-prevention, etc. take place. Our backend VCL in isolation is relatively tame (just a raw cache with chashing, plus some minimal applayer-facing request mangling) and easy to port to other solutions. By having ATS in the backend with Varnish in the front, we could get the best of both worlds: ATS has persistent storage that works, and also has better ways to do the sideways cache checks described in (3) above. Also, with two different pieces of software in our layered cache stack, the odds that a single buggy request can wipe out all layers drop to essentially zero.
  5. Asia Cache DC and beyond - As we slowly add more cache DCs, we also become more resilient to all varieties of adverse events.
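
As a minimal illustration of the request semantics behind item 3: "only-if-cached" is a standard Cache-Control request directive (RFC 7234), under which a cache either answers from what it already has stored or returns 504 rather than fetching from its backend. The peer hostname and URL below are made up.

```
# Ask a (hypothetical) peer DC's cache for an object only if it already has
# it cached: a hit is served normally, a miss returns 504 (Gateway Timeout)
# instead of triggering a fetch toward the applayer, letting the requesting
# cache fall back to its own normal miss path.
curl -s -o /dev/null -w '%{http_code}\n' \
  -H 'Cache-Control: only-if-cached' \
  'https://peer-cache.example.org/wiki/Example_article'
```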

Event Timeline


(Added two forgotten items (xkey + purge/ban) in the downsides list)

Note we'd like to make a call on this before converting cache_upload to v4 in T131502 (and presumably switching to file as part of the process). If anyone has any contrarian ideas on why we really need to stick with persistence, even if they're half-formed, please respond here before we start moving forward!

I should add another overall point here about TTLs:

One key thing that will become unblocked post-Varnish 4 (so, early CY2017) is reducing our normal TTLs (i.e. excluding "grace" for outages/incidents) to a 24h maximum in T124954. When that happens (and it looks like it will), it puts a 24h upper bound on the longest we'd have to take to reboot/wipe all caches for a necessary security update without any real impact on traffic stats. I still think in practice we'll be closer to ~6 hours max, but this sets a 24h ceiling on how long it could possibly take if we're wrong.

Change 306180 had a related patch set uploaded (by Ema):
cache_upload: switch to file storage backend on Varnish 4

https://gerrit.wikimedia.org/r/306180

Change 306180 merged by Ema:
cache_upload: switch to file storage backend on Varnish 4

https://gerrit.wikimedia.org/r/306180

BBlack claimed this task.

We're past this decision point now. There are issues with file storage in Varnish 4 as well, but mitigating those with size-classing hacks and periodic restarts (and eventually with upgrades to Varnish 5) seems simpler than wading through the outstanding crashers observed with the persistent engine under upload's traffic levels/patterns on Varnish 4.
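
For context on the "size-classing hacks" mentioned above: the general shape is to split file storage into multiple named stores and steer objects into them by expected size from VCL, so large and small objects don't churn the same arena. The store names, paths, sizes, and split point below are assumptions for illustration, not our actual configuration.

```
# Illustrative size-classed file storage: two named stores instead of one.
# Names, paths, and sizes are assumptions; VCL would pick a store per object
# (e.g. via beresp.storage_hint in Varnish 4) based on expected object size.
varnishd -a :3128 -f /etc/varnish/backend.vcl \
  -s main=file,/srv/sda3/varnish-main.bin,200G \
  -s bigobj=file,/srv/sdb3/varnish-bigobj.bin,500G
```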