Monitor Phabricator and Gerrit availability
Closed, ResolvedPublic

Assigned To

Authored By

	ori
	Oct 15 2015, 2:44 PM

Description

Phabricator and Gerrit do not receive enough traffic to trip our catchall 5xx monitoring alerts. But these two services are absolutely essential to our engineering workflows, and we need to ensure high availability for them by having high-sensitivity monitoring of errors and latency.
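To illustrate the kind of high-sensitivity check this asks for, here is a minimal sketch that evaluates each service against its own 5xx ratio instead of a site-wide total. Everything in it is an assumption made for illustration: the `WindowStats` type, the `should_alert` function, and the 5% ratio and 20-request floor are hypothetical and do not describe the existing alerting setup.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request counts for one service over one evaluation window (hypothetical)."""
    total: int
    errors_5xx: int


def should_alert(stats: WindowStats,
                 max_error_ratio: float = 0.05,
                 min_requests: int = 20) -> bool:
    """Alert on the service's own 5xx ratio rather than an absolute count.

    The default thresholds are illustrative assumptions, not tuned values.
    """
    if stats.total < min_requests:
        # Too few samples to judge a ratio; an active probe would have to
        # cover windows this quiet.
        return False
    return stats.errors_5xx / stats.total > max_error_ratio


# 30 errors out of 200 requests is a 15% error ratio for this service,
# even though 30 errors could go unnoticed in a site-wide total.
print(should_alert(WindowStats(total=200, errors_5xx=30)))
```

The request floor keeps a single failed request in a nearly idle window from paging anyone, which is why a ratio check like this would normally be paired with an active probe.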

Related Objects

Mentioned Here: T109279: Phabricator creates MySQL connection spikes: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections.

Event Timeline

ori created this task. Oct 15 2015, 2:44 PM

ori set the priority of this task to High.

ori updated the task description.

ori added projects: acl*sre-team, Release-Engineering-Team.

ori subscribed.

Restricted Application added subscribers: Matanya, Aklapper. Oct 15 2015, 2:44 PM

To some extent related: T109279: Phabricator creates MySQL connection spikes: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections.

Gerrit process monitoring:

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=ytterbium&service=gerrit+process

Phabricator HTTP monitoring:

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=iridium&service=https%3A%2F%2Fphabricator.wikimedia.org

Gerrit availability:

http://status.wikimedia.org/8777/249692/Gerrit

Phabricator availability:

http://status.wikimedia.org/8777/388149/Phabricator

Probably Catchpoint as well, in addition to Icinga and Watchmouse.
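For reference, an HTTP check that covers both errors and latency can be sketched as below. This is only an illustration and not the configuration of any of the checks linked above: the function names and the 2 s / 5 s thresholds are hypothetical. The exit codes follow the usual Nagios/Icinga plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
import time
import urllib.error
import urllib.request

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(status, latency_s, warn_s=2.0, crit_s=5.0):
    """Map an HTTP status and a latency to (exit_code, message).

    `status` is None when no HTTP response was received at all.
    The thresholds are illustrative assumptions.
    """
    if status is None:
        return CRITICAL, "CRITICAL: no HTTP response"
    if status >= 500:
        return CRITICAL, f"CRITICAL: HTTP {status}"
    if latency_s >= crit_s:
        return CRITICAL, f"CRITICAL: HTTP {status} in {latency_s:.2f}s"
    if latency_s >= warn_s:
        return WARNING, f"WARNING: HTTP {status} in {latency_s:.2f}s"
    return OK, f"OK: HTTP {status} in {latency_s:.2f}s"


def probe(url, timeout_s=10.0):
    """Fetch `url` once and return (status or None, elapsed seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        status = err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        status = None
    return status, time.monotonic() - start
```

A wrapper script would call `classify(*probe(url))` for each service, print the message, and exit with the returned code so that Icinga can pick up the state.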

Just to be sure: we are NOT talking about the traditional sense of HA (https://en.wikipedia.org/wiki/High_availability) with things like redundancy, quick failover, etc. That'd be way out of the ordinary for our dev tools (not that it wouldn't be good in theory, but way more effort than WMF has ever given).

In T115611#1730626, @Dzahn wrote:

Gerrit availability:

http://status.wikimedia.org/8777/249692/Gerrit

Phabricator availability:

http://status.wikimedia.org/8777/388149/Phabricator

Probably Catchpoint as well, in addition to Icinga and Watchmouse.

Are these enough? They seem like they give us "monitoring of errors and latency".

MZMcBride subscribed. Oct 18 2015, 5:22 PM

In T115611#1730798, @greg wrote:

Are these enough? They seem like they give us "monitoring of errors and latency".

I don't know much about the current monitoring stack, but my main question would be: do failures get reported to IRC? That's all that really matters, I imagine. (Though I personally found that awful icinga-wm bot annoying enough to ignore it. Presumably other people are listening to it.)

Yes, Icinga announces in IRC when either of the two checks listed above (the Icinga links) fails.

Icinga has paged me, and opsen, on multiple occasions when Phabricator was down. I'm pretty sure that it's working.

Based on our experience, we have good enough monitoring for both Gerrit and Phabricator. The critical bits are monitored via Icinga (e.g. process existence), and we have enough experienced users who poke us about potential failures, often before the monitoring notifies us.

We talked about this task a bit during our weekly meeting and are unsure what the original intent was. Feel free to reopen with a better description of what can be done to enhance the monitoring of both services.

Thanks!

greg added a project: Essential-Work. Jan 11 2016, 10:49 PM
