Upgrade to Varnish 4: things to remember
Closed, Resolved, Public

Description

There are a bunch of things to test/check/keep in mind when it comes to the upgrade to Varnish 4. Let's collect them in this subtask.

  • figure out if varnish4 still has the problem that old VCL shared objects continue to consume memory
  • remember to use std.ip() instead of ipcast.ip() while porting VCL code
  • check if varnishlog and varnishncsa are properly started by puppet
  • vmod-tbf sucked at startup time: "at daemon start, everyone's already at the limit". fix the module code and/or our parameter input to make sure we err on the lax side with ratelimits on fresh start
  • figure out how to make sure vmod-tbf deallocates the ip db memory on VCL reload
  • check for threads_failed. Set thread_pool_add_delay (now in floating-point seconds) if it grows too much
  • vslp has no per-backend weighting. We can't assume we won't need differential weighting of hardware generations, so it might be necessary to add backend weighting as a feature to libvmod-vslp
  • do_stream defaults are reversed and do_stream lacks downsides - be sure to make per-cluster conditional do_stream code varnish3-only during VCL conversions, for later removal once we're all-varnish4.
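
For the threads_failed item, a minimal sketch of a check follows. It assumes the Varnish 4 counter name MAIN.threads_failed and the "name value rate description" column layout of `varnishstat -1`; the function names and the threshold argument are hypothetical, not part of our tooling.

```shell
#!/bin/sh
# Sketch only: warn when the thread-creation failure counter is above a threshold.
# Assumes the Varnish 4 counter name MAIN.threads_failed and that the value is
# the second column of `varnishstat -1` output.

# Print the value of MAIN.threads_failed from `varnishstat -1` output on stdin.
# Exits non-zero if the counter is not present in the input.
threads_failed_from_stats() {
    awk '$1 == "MAIN.threads_failed" { print $2; found = 1 }
         END { if (!found) exit 1 }'
}

# $1: threshold (hypothetical knob, defaults to 0).
# Returns 0 if the counter is at or below the threshold, 1 if above, 2 if missing.
check_threads_failed() {
    threshold="${1:-0}"
    value="$(threads_failed_from_stats)" || {
        echo "MAIN.threads_failed not found" >&2
        return 2
    }
    if [ "$value" -gt "$threshold" ]; then
        echo "threads_failed=$value above threshold $threshold; consider thread_pool_add_delay"
        return 1
    fi
    echo "threads_failed=$value ok"
}

# Usage on a cache host (not run here):
#   varnishstat -1 -f MAIN.threads_failed | check_threads_failed 0
```

The parsing is kept separate from the varnishstat call so that it can be tried against captured output without a running varnishd.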

Event Timeline

ema claimed this task.
ema raised the priority of this task to Medium.
ema updated the task description.
ema added projects: SRE, Traffic.
ema added subscribers: gerritbot, ema, Southparkfan and 5 others.

varnishlog.py provides a binding for the VSL API in Python, using ctypes. python-varnishapi looks like a similar effort which supports Varnish 4.

ema set Security to None.

Change 269686 had a related patch set uploaded (by Ema):
Omit thread_pool_add_delay on Varnish 4

https://gerrit.wikimedia.org/r/269686

Change 269686 merged by Ema:
Omit thread_pool_add_delay on Varnish 4

https://gerrit.wikimedia.org/r/269686

I know that you guys have way more knowledge of (upgrading) Varnish than I do (and Wikimedia's needs are totally different compared to mine), but I already use Varnish 4.0 (which isn't that much different to 4.1) for a production wiki farm - and some stuff about my setup is available at https://meta.miraheze.org/wiki/Tech:Varnish (#Configuration for config). Feel free to reuse some of the docs or config if you can find something interesting there.

While doing v3 -> v4 upgrades on the new cache_maps hosts, I noticed that I had to do the following to effect the upgrades:

  1. Disable puppet on affected nodes
  2. Update hieradata for varnish_version4 on them.
  3. ...
echo deb http://apt.wikimedia.org/wikimedia jessie-wikimedia experimental > /etc/apt/sources.list.d/wikimedia-experimental.list;
depool;
service varnish-frontend stop; service varnish stop; apt-get -y remove libvarnishapi1;
rm -f /srv/sd*/varnish*;
apt-get update;
puppet agent --enable;
puppet agent -t; puppet agent -t; puppet agent -t; puppet agent -t;
pool;

This procedure doesn't have to be perfect of course, since this is a one-shot transition as we go through the clusters. Maybe the wikimedia-experimental part can be fixed up, though.
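
The back-to-back `puppet agent -t` runs in the procedure above could also be written as a bounded loop that stops once a run reports no changes. This is only a sketch: it relies on `-t` (`--test`) enabling detailed exit codes, where 0 means no changes and 2 means changes were applied, and the PUPPET_CMD variable and max_runs argument are hypothetical knobs added for illustration.

```shell
#!/bin/sh
# Sketch only: re-run puppet until a run reports "no changes", instead of a
# fixed number of back-to-back runs.
# PUPPET_CMD is a hypothetical override so the loop can be tried without puppet.
PUPPET_CMD="${PUPPET_CMD:-puppet agent -t}"

# $1: maximum number of runs (hypothetical knob, defaults to 4 as in the
# manual procedure).
puppet_until_converged() {
    max_runs="${1:-4}"
    run=1
    while [ "$run" -le "$max_runs" ]; do
        rc=0
        # Intentionally unquoted so the command string splits into words.
        $PUPPET_CMD || rc=$?
        # With detailed exit codes, 0 means the catalog applied with no changes.
        if [ "$rc" -eq 0 ]; then
            echo "converged after $run run(s)"
            return 0
        fi
        # Any non-zero code (changes applied, or failures) triggers another run.
        echo "run $run exited with $rc, retrying"
        run=$((run + 1))
    done
    echo "not converged after $max_runs runs" >&2
    return 1
}

# Usage in place of the four puppet runs (not run here):
#   puppet agent --enable; puppet_until_converged 4 && pool
```

Retrying on failure codes as well as on "changes applied" mirrors the manual procedure, where the first runs after the package swap are expected to be incomplete.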

One more thing to remember when upgrading nodes above:

chmod 644 /var/lib/varnish/*/*.vsm
service ganglia-monitor restart
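
To see whether the chmod above is still needed on a given node, a small sketch that lists .vsm files which are not world-readable. The assumption here is that the chmod exists so that non-root readers such as ganglia-monitor can open the shared memory file; VSM_DIR and the function name are hypothetical.

```shell
#!/bin/sh
# Sketch only: list Varnish shared-memory files that lack the world-read bit.
# VSM_DIR is a hypothetical override; the default matches the path used in the
# chmod command above.
VSM_DIR="${VSM_DIR:-/var/lib/varnish}"

# Prints one path per line; prints nothing when every .vsm file is world-readable.
unreadable_vsm_files() {
    find "$VSM_DIR" -type f -name '*.vsm' ! -perm -o=r 2>/dev/null
}

# Usage (not run here):
#   [ -z "$(unreadable_vsm_files)" ] || chmod 644 /var/lib/varnish/*/*.vsm
```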

Instructions for downgrading nodes to varnish3 (trialed on cache_misc):

  1. Disable puppet on affected nodes
  2. Update hieradata to remove varnish_version4 on them.
  3. ...
depool; sleep 30;
service varnish-frontend stop; service varnish stop;
rm -f /srv/sd*/varnish*;
rm -f /etc/apt/sources.list.d/wikimedia-experimental.list;
apt-get -y remove libvarnishapi1; apt-get -y autoremove; apt-get update;
puppet agent --enable; puppet agent -t; puppet agent -t; puppet agent -t;
service nginx reload; # for proxy_http_version
pool;

Mentioned in SAL [2016-05-17T08:13:43Z] <ema> upgrading eqiad cache_misc to varnish 4 (T126206, T134989)

Mentioned in SAL [2016-05-17T08:31:56Z] <ema> upgrading codfw cache_misc to varnish 4 (T126206, T134989)

Mentioned in SAL [2016-05-17T09:02:23Z] <ema> upgrading esams cache_misc to varnish 4 (T126206, T134989)

Going back over some of the unchecked boxes at the top:

  1. Ratelimiting and general VCL reload memleak issues can be investigated separately in T163233
  2. VSLP per-backend weighting is basically a non-issue. Our current config as well as our known plans for hardware refreshes and software changes over the next few years don't require it.

That leaves pretty much nothing to do here directly.