
OCG checks should be CRITICAL when reading from the server times out
Closed, Resolved (Public)

Description

This morning I noticed the following alarm in Icinga:

ocg1003;OCG health;WARNING;HARD;3;WARNING: connection error: HTTPConnectionPool(host='localhost', port=8000): Read timed out. (read timeout=5)

As it stands, the check results in only a WARNING when the read times out or any other connection error happens.

This should not only be CRITICAL; we should also get paged whenever any ocg server is unreachable until T120077 is solved, because right now a malfunction of a single ocg server results in user-noticeable downtime.
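
To illustrate the requested behaviour, here is a minimal sketch of a Nagios/Icinga-style check that maps read timeouts and connection errors to CRITICAL instead of WARNING. It is not the actual OCG check: the function names (`fetch_status`, `check_health`, `main`), the URL path, and the choice of the standard library's `urllib` are assumptions made for illustration. The alarm text above suggests the real check uses a different HTTP client. Only the host, port and 5-second timeout come from the alarm.

```python
import urllib.error
import urllib.request

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def fetch_status(url, timeout):
    """Fetch the URL and return the HTTP status code (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def check_health(url, timeout=5, fetch=fetch_status):
    """Return (exit_code, message) for a health check against `url`.

    Any failure to get an answer from the server (timeout, refused
    connection, HTTP error status) is reported as CRITICAL.
    """
    try:
        status = fetch(url, timeout)
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        return CRITICAL, "CRITICAL: HTTP %d from %s" % (exc.code, url)
    except OSError as exc:
        # URLError, socket.timeout and ConnectionError are all OSError
        # subclasses, so read timeouts and connection errors land here.
        return CRITICAL, "CRITICAL: connection error: %s" % exc
    if status != 200:
        return WARNING, "WARNING: unexpected HTTP status %d" % status
    return OK, "OK: health check responded"


def main():
    # Host and port are taken from the alarm; the path is an assumption.
    code, message = check_health("http://localhost:8000/", timeout=5)
    print(message)
    return code
```

A plugin script built on this sketch would end with `sys.exit(main())` so that Icinga sees the exit code. Whether a CRITICAL state also pages someone is decided by the monitoring configuration, not by the check script itself.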

Event Timeline

Joe raised the priority of this task to Unbreak Now!.
Joe updated the task description.
Joe added projects: Services, SRE, Puppet, observability.
Joe added subscribers: Dzahn, Aklapper, cscott and 8 others.
Joe set Security to None.

Change 256412 had a related patch set uploaded (by Giuseppe Lavagetto):
ocg: send out an alarm when ocg doesn't respond to health checks

https://gerrit.wikimedia.org/r/256412

Change 256412 merged by Giuseppe Lavagetto:
ocg: send out an alarm when ocg doesn't respond to health checks

https://gerrit.wikimedia.org/r/256412