
Upgrade to Varnish 5
Closed, Resolved · Public


Varnish 5 is the first version to support HTTP/2, minus TLS termination. The Varnish team advises using their terminator, hitch. This sounds like a saner stack than combining Nginx and Varnish like we do now, and the HTTP/2 protocol would probably be better handled by two pieces of software that were designed to work with each other (technically, Varnish would be in charge of what matters in the protocol: asset prioritization, interleaving, etc.).
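For illustration, a minimal hitch configuration terminating TLS in front of a PROXY-enabled varnishd listener might look roughly like this (ports, paths, and the certificate filename are hypothetical, not anything we actually run):

```
# hitch.conf sketch: terminate TLS on :443, hand decrypted traffic
# to a varnishd started with e.g. "-a :8443,PROXY"
frontend       = "[*]:443"
backend        = "[127.0.0.1]:8443"
pem-file       = "/etc/hitch/example.pem"
write-proxy-v2 = on                  # pass client IPs via PROXY protocol v2
alpn-protos    = "h2, http/1.1"      # advertise HTTP/2 to clients
```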

Unlike the upgrade to Varnish 4, this one looks a lot simpler, with few changes required to the VCL.

Event Timeline

Restricted Application added a subscriber: Aklapper.

We've discussed the V5 upgrade a few times in the past (although not much recently or explicitly), so I'll try to recap here some related thoughts:

  1. General Pain Level is Low - V4 -> V5 does in general seem like a relatively painless upgrade compared to our recent V3 -> V4 transition, but it's not going to be trivial. We can expect there will be undocumented changes to internal algorithms that we'll have to discover and account for, and usually porting our various analytics tooling is a pain point as well.
  2. Possibly helps backend storage scaling issues - There were some notable commits early in the V5 work which may help to alleviate the issues we've been having with the file disk storage engine for the backend varnishes in V4 (the infamous "mailbox lag" alerts). It's possible V5 will fare better here, although probably still not to the level of V3's persist storage engine nor the commercial V4+ mse engine.
  3. V5 VCL Switching is potentially big for us - Varnish 5's VCL-switching capabilities (out of vcl_recv, based on request attributes) are potentially a big win for us on our VCL complexity issues, especially in the frontend daemons. It may also be the only sane way we can fully merge down cache_misc and reduce our cluster count from 3 to 2 for more cost and complexity efficiency at the edge.
  4. Priority conflicts with ATS evaluation - (see also T96853) - We have at multiple times in the past deferred the long-term-view work on evaluating ATS to replace at least some of our Varnish usage, in favor of short term work to improve our Varnish installation. We're hoping in the coming FY to devote more resources to the ATS effort and potentially at least migrate our backend caches to it, which obviates the concerns above about disk storage engines. There has been some forward progress on ATS evaluation over the past few months, and things seem hopeful. If we start a major project to convert all existing V4 daemons to V5 first, it will further delay that work.
  5. Varnish still has long-term issues for us - This brings us around to the reasons we really want to work on evaluating Varnish alternatives: there are still a number of long-term / high-level issues with Varnish as a solution here. The key ones are the lack of TLS termination and a viable large disk storage engine in the open source product. The former has to do with politics and opinions mostly, and the latter has a lot to do with Varnish's "Open Core" strategy which is increasingly leaving the Open Source variant of Varnish at a "hobby" implementation level because their "Open Core" strategy seems to assume only large enterprises that purchase commercial variants should have large-scale needs.
  6. Hitch and TLS in general - Hitch (which is Varnish's takeover and rename of the legacy open source Stud proxy) is also, last I looked, deficient in its TLS implementation features compared to the level of support we have with our current (and admittedly, mildly patched for TLS features) nginx termination. Even if we ignore H/2 issues, switching from nginx to any alternative TLS termination proxy (haproxy is another option often discussed) is a complex transition in itself and involves a lot of testing and work at the boundary layer between the proxy and the cache. Our preference here is that the next TLS termination solution we invest in significantly is one that will serve us for the long term regardless of the Varnish-level issues. We do have a desire to move off of nginx in general (in part because of their own "Open Core" development model issues...), but where we move to next is a harder question. The last round of evaluation I did on all of the truly open source potential TLS terminators that could operate for us at scale concluded that one of our best options was (surprise!) ATS. While our plans at the caching level for ATS mostly revolve around starting at the backend-most caches and only considering replacing our Varnish frontends at a later date when we're more comfortable (as that's a much more complex transition), an ATS daemon configured without caching turns out to be a potential winner as a TLS proxy into varnish in terms of features and scaling, and also blends well into potential future plans to then slowly move our Varnish frontend functionality up into it progressively (starting with the Analytics portions).
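To make point 3 concrete, the V5 VCL-switching mechanism works via labeled VCLs that the top-level vcl_recv can hand off to. A rough sketch (label and hostname are purely illustrative):

```
# varnishadm side: load a per-cluster VCL and attach a label to it
#   vcl.load misc_v1 /etc/varnish/misc.vcl
#   vcl.label misc misc_v1

# Top-level VCL: switch out of vcl_recv based on request attributes
sub vcl_recv {
    if (req.http.Host ~ "\.example\.org$") {
        return (vcl(misc));   # continue processing in the labeled VCL
    }
}
```

This is what would let the cache_misc logic live in its own VCL file instead of being interleaved with everything else in one giant config.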

HTTP/2 implementation differences and such are another area of concern here, but that's already spawned a lengthy conversation on IRC and I'm not sure I can summarize any of that succinctly at this point except to say that the HTTP/2 landscape is still young and evolving.

I'm not yet sure where all of the above leaves us on long-term planning for major projects this next FY, but I'm already in general trying to untangle that mess and make some decisions this week, so more to come later I'm sure.

ema triaged this task as Medium priority. · Jun 22 2017, 10:44 AM

This task hasn't been updated for various IRC/Hangouts discussions since. We did decide to move forward with V5 upgrades. Arzhel has built preliminary packages, and we have a goal this quarter to upgrade at least one production cache cluster to V5.

Is the plan to use it with hitch in front of it rather than nginx? Or just an upgrade for now and we'll see about that part later?

Definitely not looking at Hitch presently. Just swapping out Varnish4 for Varnish5 in the existing software stack for both the frontend and backend cache processes. Forward-looking, some of our next steps in this area might be to replace the varnish backend processes and/or nginx with ATS, but even if we make quick progress on both fronts, we'll likely still have a Varnish frontend cache daemon in the middle for quite some time in the future, which is enough to justify not falling behind on the Varnish revs.

So, packages-wise, I've packaged and uploaded varnish-modules 0.12.1 to Debian unstable. @ayounsi's packaging of varnish5 and libvmod-netmapper looks great and he's currently in the process of pushing his changes to gerrit. Once that's done and the changes are merged we should:

  • build the new varnish 5 packages and upload them to apt.w.o (experimental)
  • backport varnish-modules to jessie-wikimedia (simple changelog bump and rebuild against the new varnish 5 packages)
  • build libvmod-netmapper and varnishkafka against the new v5 packages

Once all that is built and fine we can upload the results to apt.w.o (experimental), upgrade cp1008, see it fail and start with the VCL changes!

pristine-tar and upstream pushed to origin

debian-wmf pushed to gerrit:

  • Imported Upstream version 5.1.3
  • Adapt WMF specific patches for Varnish5
  • Update changelog for Varnish 5.1.3
  • Varnish5: Install devicedetect.vcl in the proper path
  • Varnish5: Update libvarnishapi1.symbols

libvmod-netmapper pushed to gerrit as well.

libvmod-netmapper / varnish pushed to the experimental apt repo.

varnish-modules backported and uploaded to experimental.

In my comment above I forgot to mention libvmod-vslp, which conveniently we no longer need to care about, as it's been added to varnish since version 5.0 as the shard director. VCL-wise, this means dropping the import vslp statements and using directors.shard() instead of vslp.vslp(). The shard director does not provide an init_hashcircle function, so that call needs to be removed as well (or we need to figure out what the shard alternative is).
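Roughly, the change looks like this (be1/be2 stand in for backends defined elsewhere; the director and function names are the ones mentioned above, and reconfigure() is, as far as I can tell, the call that rebuilds the shard ring after backend changes):

```
# Varnish 4, with libvmod-vslp:
import vslp;

sub vcl_init {
    new cluster = vslp.vslp();
    cluster.add_backend(be1);
    cluster.add_backend(be2);
    cluster.init_hashcircle(150);  # no direct shard equivalent; dropped
}

# Varnish 5, with the built-in shard director:
import directors;

sub vcl_init {
    new cluster = directors.shard();
    cluster.add_backend(be1);
    cluster.add_backend(be2);
    cluster.reconfigure();         # (re)build the hash ring after changes
}
```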

Another important change with varnish 5 is the removal of the session_max parameter, which has been a no-op since v4 anyway.
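Concretely, this just means dropping the flag from the daemon invocation before upgrading (command lines below are illustrative, not our actual puppetized ExecStart):

```
# Varnish 4: accepted, but had no effect since v4
varnishd -a :80 -f /etc/varnish/default.vcl -p session_max=200000

# Varnish 5: the parameter no longer exists, so varnishd will error out
# on the unknown -p argument; it has to be removed first
varnishd -a :80 -f /etc/varnish/default.vcl
```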

Oh, and s/return (fetch)/return (miss)/ in vcl_hit. It doesn't look like we ever return fetch in vcl_hit, but it's good to keep this in mind.
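In other words, if we did have something like this anywhere (the grace condition here is just an illustrative example, not taken from our VCL):

```
# Varnish 4 accepted fetch from vcl_hit:
sub vcl_hit {
    if (obj.ttl + obj.grace < 0s) {
        return (fetch);
    }
}

# Varnish 5 replaces it with miss:
sub vcl_hit {
    if (obj.ttl + obj.grace < 0s) {
        return (miss);
    }
}
```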

Change 382420 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish: stop passing session_max

Change 382420 merged by Ema:
[operations/puppet@production] varnish: stop passing session_max

Change 382464 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish: support for version 5

One thing that I noticed just now is the following:

Varnish Cache 5.2-RC1 (2017-09-04)
C APIs (for vmod and utility authors)

The VSM API for accessing the shared memory segment has been totally rewritten. Things should be simpler and more general.

It seems that varnishkafka and all the other python daemons will need to get reviewed (already seeing commits for 5.2, which is nice).

Is there a plan for 5.2? I am asking since I'd need to make some tests and possibly schedule time to update varnishkafka's code :)

A nice and positive thing is that the VUT library is finally a first-class citizen of the codebase, so it should be easier this time to update the code.

We're moving to 5.1.3 with this upgrade. 5.2.0 is a little too bleeding-edge for now :)

all the other python daemons will need to get reviewed (already seeing commits for 5.2, which is nice).

Your comment reminded me that after the transition to cachestats.CacheStatsSender we actually don't use the varnishlog-based python tooling (hence python-varnishapi) anymore! I've confirmed this with cumin 'cp*' 'grep varnishlog /usr/local/bin/*' and by grepping for varnishlog in our puppet repo.

There's gonna be lots of stuff to remove on Monday. :-)

Change 382464 merged by Ema:
[operations/puppet@production] varnish: add support for version 5

Change 383369 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish reload-vcl: avoid using colons in vcl_label

Change 383369 merged by Ema:
[operations/puppet@production] varnish reload-vcl: avoid using colons in vcl_label

Change 383533 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish reload-vcl: ensure VCL name starts with a letter

Change 383533 merged by Ema:
[operations/puppet@production] varnish reload-vcl: ensure VCL name starts with a letter

So, thinking ahead past cache_misc and assuming that's successful, probably the next target should be cache_upload (lower complexity than text, and in more need of potential eviction improvements). With both of the other clusters, but especially upload, we'll face some load issues with the vslp->shard shift as we work through a DC (and the backend misses to remote DCs). With the current MISS2PASS behaviors this is somewhat minimized for cross-DC, though. Probably a sane-ish plan for the bigger clusters is to work from the bottom up, first in the codfw direction and then in the eqiad direction (so: codfw, ulsfo, eqiad, esams). Within each DC, we'll probably want to move through the nodes as quickly as load reasonably allows, to avoid the worst of the effective backend storage reductions due to cross-chashing.

Change 409047 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] wmf-upgrade-varnish: initial release

Change 409047 merged by Ema:
[operations/puppet@production] wmf-upgrade-varnish: initial release

Change 410558 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] wmf-upgrade-varnish: add support for non-interactive upgrades

Change 410558 merged by Ema:
[operations/puppet@production] wmf-upgrade-varnish: add support for non-interactive upgrades

ema claimed this task.