
varnish-be returning 503s upon VCL reload
Closed, ResolvedPublic

Description

I rebooted cache_text yesterday as part of T155401 and observed a few 503 spikes, even though the hosts were depooled for the reboots and there was enough spacing between each server. Today I tracked down the issue and found that the errors are reproducible with a simple backend VCL reload performed with /usr/share/varnish/reload-vcl. Note that no actual VCL changes are needed to reproduce the behavior: a noop reload also makes the varnish backend throw 503s for about one second.

The issue is known (https://github.com/varnishcache/varnish-cache/issues/2195) and is due to the dynamic backends introduced in 4.1. Before 4.1, different VCLs shared the same backends, including the health probes and their status. Starting with 4.1, each VCL instead has its own backends, meaning that a new VCL starts with all probes reset. Our probe settings are such that backends are considered sick upon startup, and need one successful poll in order to become healthy.

This might shed some light on T154801 too, as all backends reloading (and thus returning 503s at the same time) could have triggered the vslp error resulting in varnishd crashing.

Event Timeline

Restricted Application added a subscriber: Aklapper.
ema triaged this task as High priority. · Feb 7 2017, 11:13 AM

I've confirmed on cp3040 that the issue is not reproducible by doing either of the following:

Change 336440 had a related patch set uploaded (by Ema):
varnish: warm-up before switching to new VCL in reload script

https://gerrit.wikimedia.org/r/336440

Change 336440 merged by Ema:
varnish: warm-up before switching to new VCL in reload script

https://gerrit.wikimedia.org/r/336440

ema claimed this task.

Tested https://gerrit.wikimedia.org/r/336440 on an esams host; no 503 spikes. Closing.

TL;DR - The current solution is a fixed 2s load->use delay. I think we should do more here at this point, especially in light of eqsin and other planning. The probes should probably be enhanced, and the sleep parameterized to match.

The varnish probe we use for v<->v checks (the only probe we use) currently has these params:

.timeout = 500ms;
.interval = 100ms;
.window = 3;
.threshold = 2;

The bare minimum time for a backend to become initially healthy would be 100ms (the interval between the first two checks) plus the time to execute/fail those first two checks, which in the timeout case could add up to 1000ms more, so sleep 2 should suffice in some basic sense. It wouldn't be shocking, though, if some probes initially fail during the inrush of fresh probe requests at VCL start.
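For illustration, the arithmetic above can be sketched as follows (the variable names are mine, not from the reload script):

```python
# Sketch of the time-to-healthy arithmetic, using the probe params
# shown above (interval=100ms, timeout=500ms, threshold=2 successes).
INTERVAL = 0.1    # seconds between probe checks
TIMEOUT = 0.5     # per-probe timeout in seconds
THRESHOLD = 2     # successes needed for a backend to become healthy

# Best case: the two successful checks are separated by one interval.
best_case = (THRESHOLD - 1) * INTERVAL            # 0.1s

# Timeout case: both initial checks also run up against the timeout,
# adding up to THRESHOLD * TIMEOUT on top.
worst_case = best_case + THRESHOLD * TIMEOUT      # 1.1s

# Hence a fixed 2-second sleep covers the timeout case with headroom.
assert worst_case < 2.0
```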

In general, I suspect we should change our probe definition a bit, as 2 successes out of a window of 3 is rather short and susceptible to minor spiking, and the 500ms timeout is a bit edgy for our furthest-flung edge DC (eqsin->eqiad at ~250ms RTT just for a ping currently).

I randomly observed (without trying too hard!) an eqsin->eqiad probe with a window of 3/8 failures showing in its history under "normal" conditions (also note the high avg response time for probe success, in the ballpark of 2xRTT, which makes sense and also happens to be the fixed probe timeout...):

vcl-root-f067676c-a138-4561-b5a8-ea37013595b4.be_cp1050_eqiad_wmnet probe      Healthy             3/3
  Current states  good:  3 threshold:  2 window:  3
  Average response time of good probes: 0.490244
  Oldest ================================================== Newest
  444444444444444444444444444444444444444444444444444444444444444- Good IPv4
  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX- Good Xmit
  ------------------------------r--r---r-------------------------- Error Recv
  RRRRRRRRRRRRRRRRRRRRRRRRRRRRRR-RR-RRR-RRRRRRRRRRRRRRRRRRRRRRRRR- Good Recv
  HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH-HH-HHH-HHHHHHHHHHHHHHHHHHHHHHHHH- Happy

Another backend is persistently sick from eqsin->eqiad because it consistently falls just past the 500ms cutoff by a small amount... (cp1062; I wonder why it's a few ms slower?)

As an underlying step, we should hiera-configure a per-DC value named something like max_rtt_to_core for "worst-case expected RTT to a core DC". This is our expected worst RTT to the furthest of the two core DCs, using our worst-case network links to get there, probably with ~10% padding added on top for variance.

I'd suggest we should change our probes:

  • 3/5 instead of 2/3 to be a little more reliable and accurate
  • set the be->be probe timeout to (max_rtt_to_core * 3) + 100ms, since we observe ~2xRTT as "normal", and in the case of very low RTTs like eqsin/eqiad this pads 100ms on top just to handle very minor spikes that might be local to either machine (or either varnishd...).
  • set the fe->be probe timeout much smaller, say a fixed 100ms (the above formula with rtt considered to be 0)?
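As a rough sketch of the proposed timeout formula (the function name and the 300ms eqsin value are illustrative assumptions, not deployed config):

```python
def be_probe_timeout(max_rtt_to_core):
    """Proposed be->be probe timeout in seconds: 'normal' observed
    response time is ~2xRTT, so allow 3xRTT plus 100ms of padding."""
    return max_rtt_to_core * 3 + 0.1

# fe->be probes hit a backend on the same host, so RTT is taken as 0,
# giving the fixed 100ms timeout suggested above:
fe_probe_timeout = be_probe_timeout(0)

# With a hypothetical eqsin max_rtt_to_core of 300ms, the be->be
# timeout works out to roughly 1 second:
eqsin_timeout = be_probe_timeout(0.3)
```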

Then set a matching sleep delay (actually, we should set this first before deploying the above):

  • adjust the vcl load->use sleep delay to match on a per-instance basis to cover the whole 5-probe window with maximal timeout (plus 4x100ms intervals) and round up to the next whole second, so basically round_up((max_rtt_to_core * 3 + .1) * 5 + .4). If we assume our eqsin's max_rtt_to_core setting is 300ms, then the vcl load->use sleep delay there would be 6s.
  • While we're at it, probably sanity-check how these new values work out vs the fixed sleeps in the varnish-x-restart scripts, in case they need adjusting as well. Whatever the max sleep works out to be, a restart of a core backend has to assume that as the very minimum (plus etcd delays, plus reasonable connection-drain) between depooling itself and stopping a daemon.
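The sleep-delay formula from the first bullet can be sketched like this, assuming the proposed 3/5 probe (window of 5, 100ms interval); again, the names are illustrative:

```python
import math

def reload_sleep(max_rtt_to_core, window=5, interval=0.1):
    """Load->use sleep in whole seconds: cover the full probe window
    at maximal timeout, plus the (window - 1) intervals between
    checks, rounded up to the next whole second."""
    timeout = max_rtt_to_core * 3 + interval
    return math.ceil(timeout * window + (window - 1) * interval)

# With the assumed eqsin max_rtt_to_core of 300ms, this matches the
# 6s figure given above.
eqsin_sleep = reload_sleep(0.3)
```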

Change 413180 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Add hiera max_core_rtt data

https://gerrit.wikimedia.org/r/413180

Change 415204 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] reload-vcl refactors/improvements

https://gerrit.wikimedia.org/r/415204

Change 415205 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Make inter-varnish probes great again

https://gerrit.wikimedia.org/r/415205

Change 413180 merged by BBlack:
[operations/puppet@production] Add hiera max_core_rtt data

https://gerrit.wikimedia.org/r/413180

Change 415204 merged by BBlack:
[operations/puppet@production] reload-vcl refactors/improvements

https://gerrit.wikimedia.org/r/415204

Change 415205 merged by BBlack:
[operations/puppet@production] Make inter-varnish probes great again

https://gerrit.wikimedia.org/r/415205

Change 415900 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Add hiera max_core_rtt data for labs

https://gerrit.wikimedia.org/r/415900

Change 415900 merged by Ema:
[operations/puppet@production] Add hiera max_core_rtt data for labs

https://gerrit.wikimedia.org/r/415900