
Improve monitoring of https://git.wikimedia.org/
Closed, Resolved (Public)

Description

Just now we had an outage of gitblit on antimony. The symptom was a very slow response from https://git.wikimedia.org/, followed by "Internal error". The fix was to restart gitblit (I didn't find any logs, so the root cause is unknown); it took about 2 minutes to start serving requests again. During that time we got a different error message from misc-varnish.

Thanks to @mmodell and @hoo for reporting & advice.

Task: improve existing Icinga monitor to detect this condition.
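One way to catch the condition described above (a slow response, or an "Internal error" page) is a check that looks at latency and body content as well as the status code. The following is only a minimal sketch, not the existing Icinga check: the function names, the 5-second warning threshold, and the `"Internal error"` marker string are assumptions chosen for illustration. The exit codes follow the usual Nagios/Icinga plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```python
import time
import urllib.error
import urllib.request

OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios/Icinga plugin exit codes


def classify(status, elapsed, body, warn_secs=5.0, error_marker="Internal error"):
    """Map one HTTP response to an (exit_code, message) pair.

    warn_secs and error_marker are illustrative assumptions, not values
    taken from the real check configuration.
    """
    if status != 200:
        return CRITICAL, f"CRITICAL - HTTP {status}"
    if error_marker in body:
        return CRITICAL, f"CRITICAL - body contains '{error_marker}'"
    if elapsed > warn_secs:
        return WARNING, f"WARNING - slow response: {elapsed:.1f}s"
    return OK, f"OK - HTTP 200 in {elapsed:.3f}s"


def check_url(url, timeout=10.0):
    """Fetch url once and classify the result; a timeout counts as CRITICAL."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            status = resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx answers (e.g. a 5xx from the proxy) arrive as HTTPError.
        return classify(err.code, time.monotonic() - start, "")
    except OSError as err:
        # URLError and socket timeouts are both OSError subclasses.
        return CRITICAL, f"CRITICAL - {err}"
    return classify(status, time.monotonic() - start, body)
```

A plugin wrapper would call `check_url("https://git.wikimedia.org/")`, print the message, and exit with the returned code.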

Event Timeline

Gage set the priority of this task to Needs Triage.
Gage updated the task description.
Gage added subscribers: Gage, hoo, mmodell.
Gage renamed this task from Monitor https://git.wikimedia.org/ to Improve monitoring of https://git.wikimedia.org/. (Mar 28 2015, 11:57 PM)
Gage updated the task description.
Gage set Security to None.
faidon triaged this task as Lowest priority. (Mar 30 2015, 9:27 AM)
faidon subscribed.

git.wm.org is known to be broken, see T73974. Monitoring wouldn't help us all that much...

I even suggested a patch in the past that would let Icinga automatically restart gitblit when monitoring detects it as down, but it was rejected for a couple of reasons.

Gage changed the task status from Open to Stalled. (May 6 2015, 6:00 PM)

The monitoring isn't the problem, the service is :p This ticket as it currently stands is resolved and has been for a long time.

I once suggested letting Icinga auto-restart gitblit when it goes down, but that was rejected. See the Gerrit comments for the reasons.

So there is nothing left to do here, except that gitblit is unstable, which we already track in T83702.
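For readers unfamiliar with the auto-restart idea mentioned above: Icinga 1 and Nagios support event handlers that receive the `$SERVICESTATE$`, `$SERVICESTATETYPE$`, and `$SERVICEATTEMPT$` macros as arguments. The sketch below shows only the decision logic such a handler might use. It is not the rejected Gerrit patch, whose contents are not shown in this task; the function name, the soft-attempt threshold of 3, and the restart command are all hypothetical.

```python
# Hypothetical restart command; the real service name and init system
# on antimony are not stated in this task.
RESTART_COMMAND = ["service", "gitblit", "restart"]


def should_restart(state, state_type, attempt, soft_attempt_threshold=3):
    """Decide whether an event handler should restart the service.

    state      -- value of $SERVICESTATE$ ("OK", "WARNING", "CRITICAL", ...)
    state_type -- value of $SERVICESTATETYPE$ ("SOFT" or "HARD")
    attempt    -- value of $SERVICEATTEMPT$ as an integer
    """
    if state != "CRITICAL":
        return False  # never restart a service that is not reported down
    if state_type == "HARD":
        return True  # all retries are exhausted, the failure is confirmed
    # Optionally act slightly before the HARD state, after several soft failures.
    return state_type == "SOFT" and attempt >= soft_attempt_threshold
```

A handler script would run `RESTART_COMMAND` (for example via `subprocess.run`) only when `should_restart(...)` returns `True`. This sketch does not address the objections raised in the Gerrit comments.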

Dzahn claimed this task.

Monitoring and notifications work. An example from IRC:

17:35 < icinga-wm> PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
20:31 < icinga-wm> RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 62383 bytes in 0.116 second response time

We even have both checks: one for the running process and one for HTTP from an external host.

gitblit process monitoring: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=antimony&service=gitblit+process

git.wikimedia.org http monitoring: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=antimony&service=git.wikimedia.org