Improve monitoring of https://git.wikimedia.org/
Closed, Resolved · Public

Description

Just now we had an outage of gitblit on antimony. Symptom was a very slow response from https://git.wikimedia.org/, and then "Internal error". Solution was to restart gitblit (I didn't find any logs, so root cause is unknown), which took about 2 minutes to start serving requests. During that time we got a different error message from misc-varnish.

Thanks to @mmodell and @hoo for reporting & advice.

Task: improve existing Icinga monitor to detect this condition.
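As a rough illustration of what "detect this condition" could mean, here is a minimal sketch of an Icinga/Nagios-style plugin that treats a slow response or an "Internal error" page as a problem. It is not the check actually deployed on antimony: the function names `classify` and `probe`, the 2 second warning and 10 second critical thresholds, and the idea of matching the literal string "Internal error" in the body are all assumptions made for this example. Only the exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL) follow the standard plugin convention.

```python
import http.client
import sys
import time
import urllib.error
import urllib.request

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(status, elapsed, body, warn=2.0, crit=10.0, bad_text="Internal error"):
    """Map one HTTP probe result to (exit_code, message).

    status is None when no HTTP response was received at all.
    warn, crit and bad_text are illustrative assumptions, not production values.
    """
    if status is None:
        return CRITICAL, "CRITICAL - no HTTP response after %.3f seconds" % elapsed
    if status >= 500 or bad_text in body:
        return CRITICAL, "CRITICAL - HTTP %d, server-side error page" % status
    if elapsed >= crit:
        return CRITICAL, "CRITICAL - HTTP %d in %.3f seconds" % (status, elapsed)
    if elapsed >= warn:
        return WARNING, "WARNING - HTTP %d in %.3f seconds" % (status, elapsed)
    return OK, "HTTP OK - HTTP %d in %.3f seconds" % (status, elapsed)


def probe(url, timeout=10.0):
    """Fetch url once and return (status, elapsed_seconds, body)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status code and a body.
        status = err.code
        body = err.read().decode("utf-8", "replace")
    except (OSError, http.client.HTTPException):
        # Connection refused, DNS failure, socket timeout, broken response.
        status, body = None, ""
    return status, time.monotonic() - start, body


if __name__ == "__main__":
    code, message = classify(*probe(sys.argv[1]))
    print(message)
    sys.exit(code)
```

Run by hand it would be invoked with the URL as its only argument, for example `python3 check_gitblit.py https://git.wikimedia.org/` (the script name is hypothetical). Keeping `classify` separate from the network call makes the thresholds easy to test without a live service.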

Event Timeline

Gage raised the priority of this task from to Needs Triage.
Gage updated the task description.
Gage added subscribers: Gage, hoo, mmodell.
Restricted Application added a subscriber: Aklapper. · Mar 28 2015, 11:54 PM
Gage renamed this task from Monitor https://git.wikimedia.org/ to Improve monitoring of https://git.wikimedia.org/. · Mar 28 2015, 11:57 PM
Gage updated the task description.
Gage set Security to None.
faidon triaged this task as Lowest priority. · Mar 30 2015, 9:27 AM
faidon added a subscriber: faidon.

git.wm.org is known to be broken, see T73974. Monitoring wouldn't help us all that much...

Dzahn added a subscriber: Dzahn. · Mar 30 2015, 3:50 PM

I even suggested a patch in the past that would let Icinga automatically restart gitblit when monitoring detects it as down, but it was rejected for a couple of reasons.

Gage changed the task status from Open to Stalled. · May 6 2015, 6:00 PM
greg edited projects, added Gitblit; removed Gerrit. · Sep 18 2015, 9:03 PM
Restricted Application added a subscriber: Matanya. · Sep 18 2015, 9:03 PM
Dzahn added a comment. · Oct 7 2015, 5:39 AM

The monitoring isn't the problem, the service is :p This ticket as it currently stands is resolved and has been for a long time.

I once suggested letting Icinga auto-restart gitblit when it goes down, but that was rejected. See the Gerrit comments for the reasons.

So there is nothing left to do here, except that gitblit is unstable, which we already track in T83702.

Dzahn closed this task as Resolved. · Oct 7 2015, 5:45 AM
Dzahn claimed this task.

Monitoring and notifications work. Example from IRC:

17:35 < icinga-wm> PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
20:31 < icinga-wm> RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 62383 bytes in 0.116 second response time

We even have both kinds of check: one for the running process and one for HTTP from an external host.

gitblit process monitoring: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=antimony&service=gitblit+process

git.wikimedia.org http monitoring: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=antimony&service=git.wikimedia.org