Monitor Phabricator and Gerrit availability
Closed, Resolved (Public)

Description

Phabricator and Gerrit do not receive enough traffic to trip our catchall 5xx monitoring alerts. But these two services are absolutely essential to our engineering workflows, and we need to ensure high availability for them by having high-sensitivity monitoring of errors and latency.
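As a rough illustration of what per-service monitoring of errors and latency could mean here, the sketch below probes a URL directly instead of waiting for organic traffic to produce 5xx responses, so it does not depend on request volume. It is only a sketch under stated assumptions: the names `classify` and `probe` and the latency thresholds are hypothetical, not taken from any existing check. The exit codes follow the usual Nagios/Icinga plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```python
import time
import urllib.error
import urllib.request

# Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(status, latency, warn_s=2.0, crit_s=5.0):
    """Map an HTTP status and a latency in seconds to a plugin exit code.

    The thresholds are hypothetical defaults, not values from this task.
    """
    if status is None or status >= 500 or latency >= crit_s:
        return CRITICAL
    if latency >= warn_s:
        return WARNING
    return OK


def probe(url, timeout=10.0):
    """Fetch url once; return (status or None, elapsed seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # 4xx/5xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        status = None  # connection failure or timeout
    return status, time.monotonic() - start
```

A wrapper could call `probe` on each service's front page and exit with `classify(status, latency)`, which is the shape a plugin-style check would expect.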

Event Timeline

ori raised the priority of this task to High.
ori updated the task description.
ori subscribed.

Gerrit availability:

http://status.wikimedia.org/8777/249692/Gerrit

Phabricator availability:

http://status.wikimedia.org/8777/388149/Phabricator

Probably Catchpoint as well, in addition to Icinga and Watchmouse.

Just to be sure: we are NOT talking about the traditional sense of HA (https://en.wikipedia.org/wiki/High_availability) with things like redundancy, quick failover, etc. That'd be way out of the ordinary for our dev tools (not that it wouldn't be good in theory, but way more effort than WMF has ever given).

Are these enough? They seem like they give us "monitoring of errors and latency".

I don't know much about the current monitoring stack, but my main question would be: do failures get reported to IRC? That's all that really matters, I imagine. (Though I personally found that awful icinga-wm bot annoying enough to ignore it. Presumably other people are listening to it.)

Yes, icinga announces in IRC when either of those two things listed above (the icinga links) fails.

icinga has paged me, and opsen, on multiple occasions when phabricator was down. I'm pretty sure that it's working.

hashar claimed this task.
hashar subscribed.

Based on our experience, we have good enough monitoring for both Gerrit and Phabricator. The critical bits are monitored via Icinga (e.g. process existence), and we have enough experienced users who poke us about potential failures even before monitoring reports them.

We talked about this task a bit during our weekly meeting, and are unsure what the original intent was. Feel free to reopen with a better description of what can be done to enhance the monitoring of both services.

Thanks!