Per IRC discussion with @GWicke and @mobrovac, services aren't monitored for correctness externally, which means that there is no way to detect Varnish-level problems like T166735: Some portion of map service down: ""ids" or "query" parameter must be given". To address that, every externally accessible service (e.g. https://en.wikipedia.org/api/rest_v1/ or https://maps.wikimedia.org/) should be monitored using its user-facing entry point, based on its spec.yaml.
In order to do that, I want to run a local NRPE check on the cache edge servers that calls the SSL terminator, so that we cover as many logical layers as possible. Sadly, there is a bug in service-checker that I need to fix before this can go live, but apart from that it should be pretty straightforward.
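To make the idea of a spec-driven check concrete, here is a rough sketch in Python. It is not the service-checker code. It assumes the spec.yaml has already been parsed into a dict and that each operation carries an `x-amples` list with a `request.params` mapping and a `response.status` value. The function names `expected_checks` and `evaluate` are made up for illustration, and the actual HTTP request against the local SSL terminator is left out.

```python
# Standard Nagios/NRPE plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def expected_checks(spec):
    """Yield (method, path, expected_status) for every example in the spec.

    Assumption: operations may carry an 'x-amples' list whose entries have
    a 'request' with 'params' and a 'response' with 'status'.
    """
    for path_template, operations in spec.get("paths", {}).items():
        for method, operation in operations.items():
            # Skip path-level keys that are not operations (e.g. 'parameters').
            if not isinstance(operation, dict):
                continue
            for example in operation.get("x-amples", []):
                params = example.get("request", {}).get("params", {})
                path = path_template
                # Fill in the {placeholders} of the path template.
                for name, value in params.items():
                    path = path.replace("{" + name + "}", str(value))
                status = example.get("response", {}).get("status", 200)
                yield method.upper(), path, status


def evaluate(results):
    """Turn (path, expected, actual) tuples into an exit code and message."""
    if not results:
        return UNKNOWN, "UNKNOWN: no checks defined in spec"
    failures = [r for r in results if r[1] != r[2]]
    if failures:
        detail = "; ".join(
            f"{path} returned {actual}, expected {expected}"
            for path, expected, actual in failures
        )
        return CRITICAL, "CRITICAL: " + detail
    return OK, f"OK: {len(results)} endpoint(s) healthy"
```

An NRPE wrapper would request each generated path through the local terminator, collect the observed status codes, print the message from `evaluate`, and exit with the returned code.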
@faidon: at first I was thinking of implementing the checks on the LVS hosts (in the end, the puppetization is mostly the same), but I think NRPE checks on the caches are better because they monitor each cache host individually instead of round-robining across the hosts in a pool. That should also help in spotting problems on individual caches.