Page MenuHomePhabricator

Varnish and ATS health-check improvements
Closed, ResolvedPublic

Description

@CDanis noticed that ats-be is considering all requests to /check as internal health-checks, regardless of what the Host header says. While looking at this problem I've seen a bunch of related issues that should be fixed.

There are three different type of diagnostic requests that Varnish historically used to handle. The first were requests issued by varnish-fe to varnish-be to check the cache backend layer status. They look like this: http://varnishcheck/check. The second are checks coming from PyBal and directed, logically speaking, to varnish-fe. PyBal requests target https://varnishcheck.wikimedia.org/from/pybal: they go through the TLS termination layer (once nginx, now ats-tls) and then hit varnish-fe. The third are Icinga requests targeting varnishcheck/wikimedia-monitoring-test and hitting varnish-fe directly. Confusingly, our VCL handles all these requests as follows in wm_recv_early. This VCL code used to be shared by both Varnish frontends and backends, and worked for both given that the Host header does indeed start with "varnishcheck" in the case of http://varnishcheck/check, https://varnishcheck/wikimedia-monitoring-test and https://varnishcheck.wikimedia.org/from/pybal:

if ( req.http.host ~ "^varnishcheck" ) {
    return (synth(200, "healthcheck"));
}

To make things less confusing:

  • ats-be should only consider requests to //varnishcheck/check as health-checks performed by the frontend layer, ie check the Host header in addition to the request URI
  • varnish-fe should only consider requests to //varnishcheck.wikimedia.org/from/pybal and //varnishcheck.wikimedia.org/wikimedia-monitoring-test, ie ensure that the Host header is exactly varnishcheck.wikimedia.org, and that the request URI is /from/pybal or /wikimedia-monitoring-test

Once the above has been done as proof that we do understand the status quo, we then should make the world more intuitive and self-explanatory:

  • We should transition to specific URIs for healthchecks, with the path component named after the intended recipient of the heathcheck, and a more generic and self-explanatory hostname. For example: http://healthcheck.wikimedia.org/varnishfe, https://healthcheck.wikimedia.org/ats-tls, etc.
    • This makes it much more obvious what's going on, is more future-proof, and allows interesting possibilities like sending a query to ats-tls to check if the stack is selecting healthy varnishfes or ats-bes.
    • These healthcheck queries should probably never produce TLS redirects?
  • Currently, on a healthcheck query ats-be adds bug as the status indicator to X-Cache/X-Cache-Status. We should match historical precedent here and use int (for "internal")

Related, but out of scope for this ticket is T251301.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
ema triaged this task as Medium priority.Jun 10 2020, 12:15 PM

Change 604305 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: handle healthchecks in Lua

https://gerrit.wikimedia.org/r/604305

Change 604364 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish: narrow down healthchecks definition

https://gerrit.wikimedia.org/r/604364

Change 604305 merged by Ema:
[operations/puppet@production] ATS: handle healthchecks in Lua

https://gerrit.wikimedia.org/r/604305

Change 604618 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: use HTTP/1.1 and return in Lua healthchecks

https://gerrit.wikimedia.org/r/604618

Change 604618 merged by Ema:
[operations/puppet@production] ATS: use HTTP/1.1 and return in Lua healthchecks

https://gerrit.wikimedia.org/r/604618

Change 604653 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish: use Host:varnishcheck.wm.org for icinga checks

https://gerrit.wikimedia.org/r/604653

Change 604653 merged by Ema:
[operations/puppet@production] varnish: use Host:varnishcheck.wm.org for icinga checks

https://gerrit.wikimedia.org/r/604653

Change 604364 merged by Ema:
[operations/puppet@production] varnish: narrow down healthchecks definition

https://gerrit.wikimedia.org/r/604364

Change 604710 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: use X-Cache-Status 'int' for responses without lookup

https://gerrit.wikimedia.org/r/604710

Change 604710 merged by Ema:
[operations/puppet@production] ATS: use X-Cache-Status 'int' for responses without lookup

https://gerrit.wikimedia.org/r/604710

Change 609179 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish: update 19-unparseable-host-header.vtc

https://gerrit.wikimedia.org/r/c/operations/puppet/ /609179

Change 609179 merged by Ema:
[operations/puppet@production] varnish: update 19-unparseable-host-header.vtc

https://gerrit.wikimedia.org/r/609179

Change 610041 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: handle backend checks at healthcheck.wm.org/ats-be too

https://gerrit.wikimedia.org/r/610041

Change 610042 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish: send backend probes to healthcheck.wm.org/ats-be

https://gerrit.wikimedia.org/r/610042

Change 610046 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish: handle checks at healthcheck.wm.org/varnish-fe too

https://gerrit.wikimedia.org/r/610046

Change 610047 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] LVS: update Varnish healthcheck endpoint

https://gerrit.wikimedia.org/r/610047

Change 610048 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] icinga: update Varnish healthcheck endpoint

https://gerrit.wikimedia.org/r/610048

Change 610051 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish: stop responding to varnishcheck.wm.org

https://gerrit.wikimedia.org/r/610051

Change 610052 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: stop responding to varnishcheck/status

https://gerrit.wikimedia.org/r/610052

Change 610041 merged by Ema:
[operations/puppet@production] ATS: handle backend checks at healthcheck.wm.org/ats-be too

https://gerrit.wikimedia.org/r/610041

Change 610042 merged by Ema:
[operations/puppet@production] varnish: send backend probes to healthcheck.wm.org/ats-be

https://gerrit.wikimedia.org/r/610042

Change 610046 merged by Ema:
[operations/puppet@production] varnish: handle checks at healthcheck.wm.org/varnish-fe too

https://gerrit.wikimedia.org/r/610046

Change 610048 merged by Ema:
[operations/puppet@production] icinga: update Varnish healthcheck endpoint

https://gerrit.wikimedia.org/r/610048

Change 610047 merged by Ema:
[operations/puppet@production] LVS: update Varnish healthcheck endpoint

https://gerrit.wikimedia.org/r/610047

Mentioned in SAL (#wikimedia-operations) [2020-07-20T13:24:54Z] <ema> lvs4007 (secondary) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 T255015

Mentioned in SAL (#wikimedia-operations) [2020-07-20T13:31:51Z] <ema> lvs400[56] (primaries) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 T255015

Mentioned in SAL (#wikimedia-operations) [2020-07-20T13:42:06Z] <ema> lvs2010 (secondary) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 T255015

Mentioned in SAL (#wikimedia-operations) [2020-07-20T13:44:25Z] <ema> lvs200[78] (primaries) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 T255015

Mentioned in SAL (#wikimedia-operations) [2020-07-20T13:47:41Z] <ema> lvs5003 (secondary) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 T255015

Mentioned in SAL (#wikimedia-operations) [2020-07-20T13:50:31Z] <ema> lvs500[12] (primaries) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 T255015

Mentioned in SAL (#wikimedia-operations) [2020-07-20T13:57:06Z] <ema> lvs3007 (secondary) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 T255015

Mentioned in SAL (#wikimedia-operations) [2020-07-20T13:59:18Z] <ema> lvs300[56] (primaries) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 T255015

Mentioned in SAL (#wikimedia-operations) [2020-07-20T14:07:18Z] <ema> lvs1016 (secondary) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 T255015

Mentioned in SAL (#wikimedia-operations) [2020-07-20T14:08:33Z] <ema> lvs101[34] (primaries) - restart pybal to apply varnish healthcheck changes https://gerrit.wikimedia.org/r/c/operations/puppet/+/610047 T255015

Change 610051 merged by Ema:
[operations/puppet@production] varnish: stop responding to varnishcheck.wm.org

https://gerrit.wikimedia.org/r/610051

Mentioned in SAL (#wikimedia-operations) [2020-07-27T12:23:16Z] <ema> A:cp rolling varnish-frontend restart to actually discard old VCL still pointing at varnishcheck/check T255015 T236754

Change 610052 merged by Ema:
[operations/puppet@production] ATS: stop responding to varnishcheck/status

https://gerrit.wikimedia.org/r/610052

ema claimed this task.
ema updated the task description. (Show Details)

All done!