Services need external monitoring
Closed, Resolved · Public

Description

Per IRC discussion with @GWicke and @mobrovac, services aren't monitored for correctness externally, which means that there is no way to detect Varnish-level problems like T166735: Some portion of map service down: ""ids" or "query" parameter must be given". To address that, every externally accessible service (e.g. https://en.wikipedia.org/api/rest_v1/ or https://maps.wikimedia.org/) should be monitored using its user-facing entry point, based on its spec.yaml.
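
A minimal sketch of what spec-driven checks through the user-facing entry point could look like. It assumes the spec lists example requests under an `x-amples` key on each operation and that path templates contain only simple `{name}` placeholders; both are assumptions for illustration, as are the names `plan_checks`, `run_checks`, and `fetch_status`. This is not service-checker's actual implementation. The HTTP call is injected so the logic can be exercised without network access.

```python
from typing import Callable, List, Tuple

Check = Tuple[str, str, int]  # (HTTP method, full URL, expected status)


def plan_checks(spec: dict, base_url: str) -> List[Check]:
    """Turn the example requests in a parsed spec into concrete checks."""
    checks: List[Check] = []
    for path, methods in spec.get("paths", {}).items():
        for method, operation in methods.items():
            # "x-amples" is assumed to hold example request/response pairs.
            for example in operation.get("x-amples", []):
                params = example.get("request", {}).get("params", {})
                # Fill simple {name} placeholders in the path template.
                url = base_url.rstrip("/") + path.format(**params)
                expected = example.get("response", {}).get("status", 200)
                checks.append((method.upper(), url, expected))
    return checks


def run_checks(checks: List[Check],
               fetch_status: Callable[[str, str], int]) -> List[str]:
    """Run every check and return a human-readable line per failure."""
    failures = []
    for method, url, expected in checks:
        actual = fetch_status(method, url)
        if actual != expected:
            failures.append(f"{method} {url}: expected {expected}, got {actual}")
    return failures
```

Pointing `base_url` at the public URL rather than at an internal service port is what would make such a check exercise the cache and TLS layers as well as the service itself.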

Event Timeline

MaxSem created this task. Jun 5 2017, 5:10 PM
Restricted Application added a subscriber: Aklapper. Jun 5 2017, 5:10 PM
GWicke triaged this task as Normal priority. Jun 5 2017, 6:11 PM
mobrovac claimed this task. Jun 5 2017, 8:37 PM
mobrovac edited projects, added User-mobrovac, Services (next); removed Services.

I will look into whether there is a possibility of having a generic solution for this.

Joe moved this task from Backlog to Doing on the User-Joe board. Jun 6 2017, 2:24 PM
Joe added a subscriber: Joe.

I would start monitoring restbase on text-lb and maps on text-upload.

Joe added a comment. Jun 7 2017, 3:04 PM

In order to do that, I want to do a local nrpe check on the cache edge servers, calling the SSL terminator, so that we cover as many logical layers as possible. Sadly, there is a bug in service-checker that I need to fix before this can go live, but apart from that it should be pretty straightforward.

faidon added a subscriber: faidon. Jun 7 2017, 3:08 PM

Why not from the Icinga host itself like we do with all high-level LVS checks?

Mentioned in SAL (#wikimedia-operations) [2017-06-08T05:59:34Z] <_joe_> uploading new service-checker version to reprepro, T167048

Joe added a comment. Jun 8 2017, 6:07 AM

@faidon at first I was thinking of implementing the checks on the LVS host (in the end, the puppetization is mostly the same), but I thought the nrpe checks on the caches would be better, because they monitor each cache host individually instead of round-robining across every host in a pool. That might also help in spotting problems on individual caches.
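
The trade-off described above can be sketched as follows. This is an illustration only: the host names, the `probe` callable, and both function names are hypothetical, and the pool check models the load balancer's choice with a simple index. The return values follow the Nagios plugin exit-code convention (0 for OK, 2 for CRITICAL).

```python
from typing import Callable, Sequence, Tuple

OK, CRITICAL = 0, 2  # Nagios plugin exit codes


def check_each_host(hosts: Sequence[str],
                    probe: Callable[[str], bool]) -> Tuple[int, str]:
    """Per-host check: probe every cache host and name the failing ones."""
    failing = [host for host in hosts if not probe(host)]
    if failing:
        return CRITICAL, (f"CRITICAL: {len(failing)}/{len(hosts)} hosts "
                          f"failing: {', '.join(failing)}")
    return OK, f"OK: all {len(hosts)} hosts healthy"


def check_pool(hosts: Sequence[str],
               probe: Callable[[str], bool],
               turn: int) -> Tuple[int, str]:
    """Pool-level check: one request, answered by whichever host is picked."""
    # The index stands in for the load balancer's backend selection.
    chosen = hosts[turn % len(hosts)]
    if probe(chosen):
        return OK, "OK: pool answered correctly"
    return CRITICAL, "CRITICAL: pool returned a bad response"
```

With one broken host out of three, the per-host check fails on every run and names the host, while the pool-level check fails only on the runs where the broken host happens to answer.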

Change 357805 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] cache: add monitoring of services at the SSL termination level

https://gerrit.wikimedia.org/r/357805

mobrovac reassigned this task from mobrovac to Joe. Jun 8 2017, 1:26 PM

Change 358032 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] cache: add monitoring of services at the SSL termination level

https://gerrit.wikimedia.org/r/358032

Change 358032 merged by Giuseppe Lavagetto:
[operations/puppet@production] cache: add monitoring of services at the SSL termination level

https://gerrit.wikimedia.org/r/358032

Mentioned in SAL (#wikimedia-operations) [2017-06-09T15:47:21Z] <_joe_> installed python-service-checker 0.1.3 on einsteinium,tegmen T167048

Joe added a comment. Jun 9 2017, 3:47 PM

Both maps and restbase are now monitored at the load balancers of the SSL terminators in all datacenters. Resolving.

GWicke closed this task as Resolved. Jun 9 2017, 3:56 PM

Thank you, @Joe!

Change 357805 abandoned by Giuseppe Lavagetto:
cache: add monitoring of services at the SSL termination level

Reason:
We opted to check at the LVS level instead.

https://gerrit.wikimedia.org/r/357805