Services need external monitoring
Closed, Resolved (Public)

Description

Per IRC discussion with @GWicke and @mobrovac, services aren't monitored for correctness externally, which means that there is no way to detect Varnish-level problems like T166735: Some portion of map service down: ""ids" or "query" parameter must be given". To address that, every externally accessible service (e.g. https://en.wikipedia.org/api/rest_v1/ or https://maps.wikimedia.org/) should be monitored using its user-facing entry point, based on its spec.yaml.
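As a rough illustration of what "monitored based on its spec" could mean, here is a minimal sketch that turns an already-parsed spec into a list of requests to run against the user-facing entry point. It assumes the spec carries example requests under an `x-amples` key on each operation; that layout, the `build_checks` helper, and the `/example/{name}` path are assumptions made for illustration, not something stated in this task.

```python
# Hypothetical, already-parsed spec.yaml content. The "x-amples" layout and
# the "/example/{name}" path are assumptions used only for this sketch.
EXAMPLE_SPEC = {
    "paths": {
        "/example/{name}": {
            "get": {
                "x-amples": [
                    {
                        "title": "Fetch an example resource",
                        "request": {"params": {"name": "Foobar"}},
                        "response": {"status": 200},
                    }
                ]
            }
        }
    }
}


def build_checks(base_url, spec):
    """Return a list of (method, url, expected_status) tuples derived from the spec."""
    checks = []
    for path, methods in spec.get("paths", {}).items():
        for method, operation in methods.items():
            for example in operation.get("x-amples", []):
                request = example.get("request", {})
                # Substitute path template parameters such as {name}.
                filled = path
                for name, value in request.get("params", {}).items():
                    filled = filled.replace("{" + name + "}", str(value))
                # Default to expecting HTTP 200 when the example names no status.
                expected = example.get("response", {}).get("status", 200)
                checks.append((method.upper(), base_url.rstrip("/") + filled, expected))
    return checks


checks = build_checks("https://en.wikipedia.org/api/rest_v1/", EXAMPLE_SPEC)
```

Building the URLs from the public base URL rather than an internal host name is the point here: the same spec then exercises every layer a real user request passes through.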

Event Timeline

GWicke triaged this task as Medium priority. Jun 5 2017, 6:11 PM
mobrovac edited projects, added User-mobrovac, Services (next); removed Services.

I will look into whether there is a possibility of having a generic solution for this.

Joe subscribed.

I would start monitoring restbase on text-lb and maps on text-upload.

In order to do that, I want to do a local nrpe check on the cache edge servers, calling the SSL terminator, so that we cover as many logical layers as possible. Sadly, there is a bug in service-checker that I need to fix before this can go live, but apart from that it should be pretty straightforward.
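For readers unfamiliar with NRPE-style checks, a minimal sketch of how such a check might report its outcome follows. It relies only on the standard Nagios plugin convention (exit code 0 for OK, 2 for CRITICAL); the `summarize` helper and the check names are hypothetical and do not reflect how service-checker is actually implemented.

```python
# Standard Nagios/NRPE plugin exit codes.
OK = 0
CRITICAL = 2


def summarize(results):
    """Map a list of (check_name, passed) tuples to (exit_code, status_line)."""
    failed = [name for name, passed in results if not passed]
    if failed:
        return CRITICAL, "CRITICAL - failed checks: " + ", ".join(failed)
    return OK, "OK - all %d checks passed" % len(results)


# Hypothetical results from probing the local SSL terminator on one cache host.
code, message = summarize([("restbase", True), ("maps", False)])
```

A wrapper script would print `message` and exit with `code`, which is what lets Icinga attribute a failure to the individual host that ran the check.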

Why not from the Icinga host itself like we do with all high-level LVS checks?

Mentioned in SAL (#wikimedia-operations) [2017-06-08T05:59:34Z] <_joe_> uploading new service-checker version to reprepro, T167048

@faidon at first I was thinking of implementing the checks on the LVS host (in the end, the puppetization is mostly the same), but I thought NRPE checks on the caches would be better, because they would monitor each cache host individually instead of round-robining across the hosts in a pool. It might also help in spotting problems on individual caches.

Change 357805 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] cache: add monitoring of services at the SSL termination level

https://gerrit.wikimedia.org/r/357805

Change 358032 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] cache: add monitoring of services at the SSL termination level

https://gerrit.wikimedia.org/r/358032

Change 358032 merged by Giuseppe Lavagetto:
[operations/puppet@production] cache: add monitoring of services at the SSL termination level

https://gerrit.wikimedia.org/r/358032

Mentioned in SAL (#wikimedia-operations) [2017-06-09T15:47:21Z] <_joe_> installed python-service-checker 0.1.3 on einsteinium,tegmen T167048

Both maps and restbase are now monitored at the load-balancers of the SSL terminators in all datacenters. Resolving.

Change 357805 abandoned by Giuseppe Lavagetto:
cache: add monitoring of services at the SSL termination level

Reason:
We opted to check at the LVS level instead.

https://gerrit.wikimedia.org/r/357805