@CDanis noticed that ats-be is considering all requests to /check as internal health-checks, regardless of what the Host header says. While looking at this problem I've seen a bunch of related issues that should be fixed.
There are three different type of diagnostic requests that Varnish historically used to handle. The first were requests issued by varnish-fe to varnish-be to check the cache backend layer status. They look like this: http://varnishcheck/check. The second are checks coming from PyBal and directed, logically speaking, to varnish-fe. PyBal requests target https://varnishcheck.wikimedia.org/from/pybal: they go through the TLS termination layer (once nginx, now ats-tls) and then hit varnish-fe. The third are Icinga requests targeting varnishcheck/wikimedia-monitoring-test and hitting varnish-fe directly. Confusingly, our VCL handles all these requests as follows in wm_recv_early. This VCL code used to be shared by both Varnish frontends and backends, and worked for both given that the Host header does indeed start with "varnishcheck" in the case of http://varnishcheck/check, https://varnishcheck/wikimedia-monitoring-test and https://varnishcheck.wikimedia.org/from/pybal:
if ( req.http.host ~ "^varnishcheck" ) { return (synth(200, "healthcheck")); }
To make things less confusing:
- ats-be should only consider requests to //varnishcheck/check as health-checks performed by the frontend layer, ie check the Host header in addition to the request URI
- varnish-fe should only consider requests to //varnishcheck.wikimedia.org/from/pybal and //varnishcheck.wikimedia.org/wikimedia-monitoring-test, ie ensure that the Host header is exactly varnishcheck.wikimedia.org, and that the request URI is /from/pybal or /wikimedia-monitoring-test
Once the above has been done as proof that we do understand the status quo, we then should make the world more intuitive and self-explanatory:
- We should transition to specific URIs for healthchecks, with the path component named after the intended recipient of the heathcheck, and a more generic and self-explanatory hostname. For example: http://healthcheck.wikimedia.org/varnishfe, https://healthcheck.wikimedia.org/ats-tls, etc.
- This makes it much more obvious what's going on, is more future-proof, and allows interesting possibilities like sending a query to ats-tls to check if the stack is selecting healthy varnishfes or ats-bes.
- These healthcheck queries should probably never produce TLS redirects?
- Currently, on a healthcheck query ats-be adds bug as the status indicator to X-Cache/X-Cache-Status. We should match historical precedent here and use int (for "internal")
Related, but out of scope for this ticket is T251301.