Upgrade to Varnish 4: things to remember
Closed, Resolved, Public

Assigned To

Authored By

	• ema
	Feb 8 2016, 10:32 AM

Description

There are a bunch of things to test/check/keep in mind when it comes to the upgrade to Varnish 4. Let's collect them in this subtask.

Figure out whether Varnish 4 still has the problem that old VCL shared objects continue to consume memory.
Remember to use std.ip() instead of ipcast.ip() when porting VCL code.
Check that varnishlog and varnishncsa are properly started by puppet.
vmod-tbf behaved badly at startup time: "at daemon start, everyone's already at the limit". Fix the module code and/or our parameter input to make sure we err on the lax side with ratelimits on a fresh start.
Figure out how to make sure vmod-tbf deallocates the IP db memory on VCL reload.
Check for threads_failed. Set thread_pool_add_delay (now in floating-point seconds) if it grows too much.
vslp has no per-backend weighting. We can't assume we won't need differential weighting of hardware generations, so it might be necessary to add backend weighting as a feature to libvmod-vslp.
In Varnish 4 the do_stream default is reversed and do_stream no longer has its Varnish 3 downsides. Be sure to make per-cluster conditional do_stream code varnish3-only during VCL conversions, for later removal once we're all-varnish4.
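
The threads_failed check in the list above can be scripted. This is a minimal sketch, assuming the Varnish 4 counter name MAIN.threads_failed and the whitespace-separated "name value rate description" layout of varnishstat -1. The helper name threads_failed is hypothetical, and the param.set value in the comment is purely illustrative, not a recommendation.

```shell
#!/bin/sh
# Hypothetical helper: reads `varnishstat -1` output on stdin and prints the
# value of MAIN.threads_failed (prints 0 if the counter is not present).
threads_failed() {
    awk '$1 == "MAIN.threads_failed" { v = $2 } END { print v + 0 }'
}

# Possible usage on a cache host (not run here):
#   varnishstat -1 | threads_failed
# If the number keeps growing, the delay can be adjusted at runtime, in
# floating-point seconds on Varnish 4 (0.01 is an illustrative value):
#   varnishadm param.set thread_pool_add_delay 0.01
```

Comparing two readings taken some minutes apart gives a rough idea of whether the counter is still growing.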

Details

	Subject	Repo	Branch	Lines +/-
	Omit thread_pool_add_delay on Varnish 4	operations/puppet	production	+2 -1


Related Objects

Status	Assigned	Task
		Restricted Task
Duplicate	None	T109331 Deleted files sometimes remain visible to non-privileged users if permanently linked
Duplicate	None	T133819 upload-lb.ulsfo.wikimedia.org still allow access to some deleted files
Open	None	T124286 [Epic] Wikidata language support
Open	None	T134592 Allow setting the UI to a language other than English for anonymous users
Duplicate	Nikerabbit	T149419 Interface language selection for unregistered users on Wikimedia projects
Invalid	None	T114662 RFC: Per-language URLs for multilingual wiki pages
Duplicate	BBlack	T119038 Image cache issue when 'over-writing' an image on commons
Resolved	• ema	T133821 Make CDN purges reliable
Resolved	• ema	T108580 HTTPS for internal service traffic
Invalid	None	T109325 Outbound HTTPS for varnish backend instances
Open	None	T122867 Evaluate the feasibility of cache invalidation for the action API
Resolved	• ema	T122881 Install XKey vmod
Declined	None	T142841 Sideways Only-If-Cached on misses at a primary DC
Resolved	• ema	T131499 Upgrade all cache clusters to Varnish 4
Resolved	• ema	T126206 Upgrade to Varnish 4: things to remember
Resolved	• ema	T128788 Port varnishlog.py to new VSL API

Event Timeline

• ema created this task. Feb 8 2016, 10:32 AM

• ema claimed this task.

• ema raised the priority of this task to Medium.

• ema updated the task description.

• ema added projects: SRE, Traffic.

• ema added subscribers: gerritbot, • ema, Southparkfan and 5 others.

Make sure that [[ https://github.com/wikimedia/operations-puppet/blob/24cc170e/modules/varnish/files/varnishlog.py | modules/varnish/files/varnishlog.py ]] (and the metric loggers which depend on it) and varnishkafka are compatible with changes to the VSL API.

varnishlog.py provides a binding for the VSL API in Python, using ctypes. python-varnishapi looks like a similar effort which supports Varnish 4.

• Johsthao closed this task as a duplicate of T126250: <spam>. Feb 8 2016, 6:24 PM

matmarex reopened this task as Open. Feb 8 2016, 6:32 PM

• ema updated the task description. Feb 10 2016, 2:51 PM

• ema set Security to None.

Change 269686 had a related patch set uploaded (by Ema):
Omit thread_pool_add_delay on Varnish 4

https://gerrit.wikimedia.org/r/269686

gerritbot added a project: Patch-For-Review. Feb 10 2016, 3:03 PM

Change 269686 merged by Ema:
Omit thread_pool_add_delay on Varnish 4

https://gerrit.wikimedia.org/r/269686

• ema updated the task description. Feb 17 2016, 11:20 AM

ori added a subtask: T128788: Port varnishlog.py to new VSL API. Mar 3 2016, 8:51 PM

I know that you guys have way more knowledge of (upgrading) Varnish than I do (and Wikimedia's needs are totally different compared to mine), but I already use Varnish 4.0 (which isn't that much different to 4.1) for a production wiki farm - and some stuff about my setup is available at https://meta.miraheze.org/wiki/Tech:Varnish (#Configuration for config). Feel free to reuse some of the docs or config if you can find something interesting there.

• ema closed subtask T128788: Port varnishlog.py to new VSL API as Resolved. Mar 31 2016, 1:51 PM

• ema updated the task description. Apr 1 2016, 12:52 PM

• ema moved this task from Backlog to Traffic team actively servicing on the Traffic board.

• ema added a project: Varnish. Apr 1 2016, 1:37 PM

• ema added a parent task: T131499: Upgrade all cache clusters to Varnish 4. Apr 1 2016, 1:57 PM

• ema removed a parent task: T122880: Evaluate and Test Limited Deployment of Varnish 4.

BBlack mentioned this in T131761: Solve large-object/stream/pass/chunked in upload cluster better. Apr 5 2016, 12:50 PM

BBlack updated the task description.

• ema moved this task from Traffic team actively servicing to Upcoming on the Traffic board. Apr 15 2016, 5:04 PM

While doing v3 -> v4 upgrades on the new cache_maps hosts, I noticed that I had to do the following to effect the upgrades:

Disable puppet on affected nodes.
Update hieradata to set varnish_version4 on them.
...

echo deb http://apt.wikimedia.org/wikimedia jessie-wikimedia experimental > /etc/apt/sources.list.d/wikimedia-experimental.list;
depool;
service varnish-frontend stop; service varnish stop; apt-get -y remove libvarnishapi1;
rm -f /srv/sd*/varnish*;
apt-get update;
puppet agent --enable;
puppet agent -t; puppet agent -t; puppet agent -t; puppet agent -t;
pool;

This procedure doesn't have to be perfect of course, since this is a one-shot transition as we go through the clusters. Maybe the wikimedia-experimental part can be fixed up, though.

• MZMcBride subscribed. May 9 2016, 2:32 PM

One more thing to remember when upgrading nodes above:

chmod 644 /var/lib/varnish/*/*.vsm
service ganglia-monitor restart
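
Before running the chmod, it can help to see which VSM files are affected. This is a rough sketch, assuming the point of the chmod is that a non-root reader (presumably ganglia-monitor) needs read access to the files. The function name unreadable_vsm is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list *.vsm files under the given directory that are
# NOT world-readable, i.e. the ones `chmod 644` would actually change.
unreadable_vsm() {
    # $1: varnish working directory, e.g. /var/lib/varnish
    find "$1" -type f -name '*.vsm' ! -perm -004
}

# Possible usage on a cache host (not run here):
#   unreadable_vsm /var/lib/varnish
```

An empty result after the chmod suggests the permissions are as intended.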

Instructions for downgrading nodes to varnish3 (trialed on cache_misc):

Disable puppet on affected nodes.
Update hieradata to remove varnish_version4 on them.
...

depool; sleep 30;
service varnish-frontend stop; service varnish stop;
rm -f /srv/sd*/varnish*;
rm -f /etc/apt/sources.list.d/wikimedia-experimental.list;
apt-get -y remove libvarnishapi1; apt-get -y autoremove; apt-get update;
puppet agent --enable; puppet agent -t; puppet agent -t; puppet agent -t;
service nginx reload; # for proxy_http_version
pool;

Mentioned in SAL [2016-05-17T08:13:43Z] <ema> upgrading eqiad cache_misc to varnish 4 (T126206, T134989)

Stashbot mentioned this in T134989: WDQS empty response - transfer clsoed with 15042 bytes remaining to read. May 17 2016, 8:13 AM

Mentioned in SAL [2016-05-17T08:31:56Z] <ema> upgrading codfw cache_misc to varnish 4 (T126206, T134989)

Mentioned in SAL [2016-05-17T09:02:23Z] <ema> upgrading esams cache_misc to varnish 4 (T126206, T134989)

BBlack moved this task from Upcoming to Backlog on the Traffic board. Sep 30 2016, 1:19 PM

• ema moved this task from Backlog to Varnish v4 on the Traffic board. Sep 30 2016, 2:43 PM

• ema updated the task description. Nov 24 2016, 3:20 PM

Going back over some of the unchecked boxes at the top:

Ratelimiting and general VCL reload memleak issues can be investigated separately in T163233.
VSLP per-backend weighting is basically a non-issue: neither our current config nor our known plans for hardware refreshes and software changes over the next few years require it.

That pretty much leaves nothing to do here directly.
