Per IRC discussion with @GWicke and @mobrovac, services aren't monitored for correctness externally, which means that there is no way to detect Varnish-level problems like T166735: Some portion of map service down: ""ids" or "query" parameter must be given". To address that, every externally accessible service (e.g. https://en.wikipedia.org/api/rest_v1/ or https://maps.wikimedia.org/) should be monitored using its user-facing entry point, based on its spec.yaml.
I will look into whether there is a possibility of having a generic solution for this.
To do that, I want to run a local NRPE check on the cache edge servers that calls the SSL terminator, so that we cover as many logical layers as possible. Sadly, there is a bug in service-checker that I need to fix before this can go live, but apart from that it should be pretty straightforward.
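For illustration only, a spec-driven check of this kind could look roughly like the sketch below. It is not the service-checker code. It assumes the spec is an OpenAPI-style dictionary whose operations carry an `x-amples` list (with an expected `response.status`) and an optional `x-monitor` flag; the function names `example_requests`, `run_checks` and `http_status` are hypothetical, and path parameters are not handled.

```python
import urllib.error
import urllib.request
from typing import Callable, List, Tuple

# Standard Nagios/NRPE plugin exit codes.
OK, CRITICAL = 0, 2


def example_requests(spec: dict) -> List[Tuple[str, str, int]]:
    """Collect (method, path, expected_status) tuples from the spec.

    Assumption: each operation may carry 'x-monitor' (default True) and a
    list of 'x-amples', each with an optional response.status.
    """
    checks = []
    for path, methods in spec.get("paths", {}).items():
        for method, definition in methods.items():
            if not definition.get("x-monitor", True):
                continue  # endpoint explicitly excluded from monitoring
            for example in definition.get("x-amples", []):
                expected = example.get("response", {}).get("status", 200)
                checks.append((method.upper(), path, expected))
    return checks


def http_status(method: str, url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a request; 4xx/5xx are returned, not raised."""
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def run_checks(base_url: str, spec: dict,
               fetch: Callable[[str, str], int] = http_status) -> Tuple[int, str]:
    """Run every example request against base_url and summarise Nagios-style."""
    checks = example_requests(spec)
    failures = []
    for method, path, expected in checks:
        url = base_url.rstrip("/") + path
        try:
            status = fetch(method, url)
        except OSError as exc:  # connection refused, TLS failure, timeout...
            failures.append(f"{method} {path}: {exc}")
            continue
        if status != expected:
            failures.append(f"{method} {path}: got {status}, expected {expected}")
    if failures:
        return CRITICAL, "CRITICAL: " + "; ".join(failures)
    return OK, f"OK: {len(checks)} endpoints healthy"
```

Pointing `base_url` at the TLS terminator on the local cache host rather than at the backend service is what would make the request cross the same layers a user request does.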
Mentioned in SAL (#wikimedia-operations) [2017-06-08T05:59:34Z] <_joe_> uploading new service-checker version to reprepro, T167048
@faidon at first I was thinking of implementing the checks on the LVS host (in the end, the puppetization is mostly the same), but I thought NRPE checks on the caches would be better, because they would monitor each cache host individually rather than round-robin across the hosts in a pool. They might also help spot problems on individual caches.
Change 357805 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] cache: add monitoring of services at the SSL termination level
Change 358032 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] cache: add monitoring of services at the SSL termination level
Change 358032 merged by Giuseppe Lavagetto:
[operations/puppet@production] cache: add monitoring of services at the SSL termination level
Mentioned in SAL (#wikimedia-operations) [2017-06-09T15:47:21Z] <_joe_> installed python-service-checker 0.1.3 on einsteinium,tegmen T167048
Both maps and restbase are now monitored at the load-balancers of the SSL terminators in all datacenters. Resolving.
Change 357805 abandoned by Giuseppe Lavagetto:
cache: add monitoring of services at the SSL termination level
Reason:
We opted to check at the LVS level instead.