Set up a simple service that pages when it is unreachable
Closed, Resolved · Public

Description

We used to page when tools.wmflabs.org was down. That was problematic because any of a varying number of subsystems could fail and trigger the page. Instead, create a simple webservice that does the simplest thing possible (serve a page, probably?) and alert when *that* goes down. An alert from this check could still be caused by:

  1. Proxy failure
  2. Webservice failure
  3. DNS failure
  4. Network failure

But it is still a lot more robust than the main page check.
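As a rough illustration of "the simplest thing possible", a canary webservice could look like the sketch below. This is not the service that was deployed; the `/canary` path, the `make_server` helper, and the default port are all hypothetical names chosen for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical path; the task does not name one for the canary service.
CANARY_PATH = "/canary"


class CanaryHandler(BaseHTTPRequestHandler):
    """Serve a fixed plain-text page and nothing else."""

    def do_GET(self):
        if self.path == CANARY_PATH:
            status, body = 200, b"OK\n"
        else:
            status, body = 404, b"not found\n"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence per-request logging; a canary has nothing useful to log.
        pass


def make_server(host="127.0.0.1", port=8000):
    """Build the canary server; the caller decides when to serve."""
    return HTTPServer((host, port), CanaryHandler)
```

Calling `make_server().serve_forever()` would run it. With no database, no file system reads, and no dependencies outside the standard library, a failed request to it points at one of the four causes listed above rather than at the application.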

Event Timeline

Restricted Application added a project: Cloud-Services. Aug 23 2016, 6:06 AM
Restricted Application added a subscriber: Aklapper.

Not that I'm against a canary check, but having only a check that spans several layers as a de facto health monitor was exhausting and low value. DNS failure checks already exist, and network failure checks (depending on what and where) exist for tools implicitly.

Proxy failure is totally blind for sure, and so is webservice failure. How about a check that monitors the state of the proxy itself and backend failures? Then this simple webservice, in conjunction with the other component checks, won't have the "ugh, this is down again and it could be anything" fatigue.

I think complex end-to-end tests have been a bad substitute for smaller, component-limited checks.
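To make the contrast concrete, here is a minimal sketch of probing each layer separately so that an alert names the layer that failed. The `diagnose` function and its layer labels are invented for this example. It also cannot tell a proxy failure from a webservice failure on its own, which is why a proxy-specific check is still needed.

```python
import http.client
import socket


def diagnose(host, port=80, path="/", timeout=5.0):
    """Probe DNS, TCP, and HTTP in turn.

    Returns (layer, detail), where layer is 'dns', 'network', 'http',
    or 'ok'. The labels are illustrative, not an existing convention.
    """
    # Layer 1: name resolution.
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return "dns", str(exc)

    # Layer 2: plain TCP connect to the first resolved address.
    family, socktype, proto, _, sockaddr = infos[0]
    try:
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
    except OSError as exc:
        return "network", str(exc)

    # Layer 3: an HTTP request that must come back with status 200.
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        status = conn.getresponse().status
    except (OSError, http.client.HTTPException) as exc:
        return "http", str(exc)
    finally:
        conn.close()
    if status != 200:
        return "http", f"unexpected status {status}"
    return "ok", "all layers responded"
```

A caller could page only on `'http'` for a service it owns and route `'dns'` or `'network'` results to whoever owns those layers, which is the split between component checks argued for above.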

Stopgap thought: should a PAWS outage page all @labs personnel?

How about we set up a simple endpoint in the proxy configuration itself and check that? That would catch the proxy specifically.

Sounds like a good idea to me.

We agreed today to make a check for the proxy health itself. I'll get added to the PAWS check, and we'll iterate on this from there.

Change 314707 had a related patch set uploaded (by Madhuvishy):
tools proxy: Add health check and icinga monitoring

https://gerrit.wikimedia.org/r/314707

Mentioned in SAL (#wikimedia-labs) [2016-10-26T23:20:27Z] <madhuvishy> Disabling puppet on tools proxy hosts for applying proxy health check endpoint T143638

Change 314707 merged by Madhuvishy:
tools proxy: Add health check and icinga monitoring

https://gerrit.wikimedia.org/r/314707

Change 318226 had a related patch set uploaded (by Madhuvishy):
dynamicproxy: Fix health check endpoint location

https://gerrit.wikimedia.org/r/318226

Change 318226 merged by Madhuvishy:
dynamicproxy: Fix health check endpoint location

https://gerrit.wikimedia.org/r/318226

Mentioned in SAL (#wikimedia-operations) [2016-10-26T23:46:05Z] <madhuvishy> tools reenabled puppet across proxy hosts. /.well-known/healthz now live on tools-proxy T143638
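For readers who want to probe an endpoint like `/.well-known/healthz` by hand, the following is a minimal sketch of a check in the Nagios/Icinga plugin style, where exit code 0 means OK and 2 means CRITICAL. It is not the check added in change 314707. The `check_health` function is hypothetical, and the sketch assumes that a healthy endpoint answers with HTTP 200.

```python
import http.client
from urllib.parse import urlsplit

# Nagios/Icinga plugin exit codes: 0 = OK, 2 = CRITICAL.
OK, CRITICAL = 0, 2


def check_health(url, timeout=5.0):
    """Return (exit_code, message) for a single GET against url."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn_cls = http.client.HTTPSConnection
    else:
        conn_cls = http.client.HTTPConnection
    conn = conn_cls(parts.hostname, parts.port, timeout=timeout)
    try:
        conn.request("GET", parts.path or "/")
        status = conn.getresponse().status
    except (OSError, http.client.HTTPException) as exc:
        return CRITICAL, f"CRITICAL - {url} unreachable: {exc}"
    finally:
        conn.close()
    # Assumption: anything other than 200 counts as unhealthy.
    if status == 200:
        return OK, f"OK - {url} returned 200"
    return CRITICAL, f"CRITICAL - {url} returned {status}"
```

A wrapper script would print the message and pass the code to `sys.exit`, so that the monitoring system picks up both the status line and the exit status. The URL to pass in depends on how the proxy host is addressed from the monitoring host, which this task does not spell out.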

madhuvishy closed this task as Resolved. Oct 27 2016, 3:27 PM