Track and alert based on gridengine error states
Closed, Resolved, Public

Description

Queues in error states can't accept new tasks, so we should monitor these and alert if any nodes are in error state.

Event Timeline

yuvipanda raised the priority of this task to Needs Triage.
yuvipanda updated the task description.
yuvipanda subscribed.

A thousand years ago, when I ran SGE, there was an administrator email setting that would send emails if there were problems. Nagios also had SGE plugins. You whippersnappers run some newfangled gadget, but hopefully it has a plugin too. Oh, get off my lawn!!

Oh, mails are sent indeed :-). (My) problem, however, was (and is) that two very different failures arrive with the same innocuous subject line, "GE 6.2u5: Job 7654321 failed": a job that failed because its output file could not be created due to the user's choice of name, and a failed job that caused the whole queue to be disabled. Only by reading the mail body can you see "Queue 'task@tools-exec-06.eqiad.wmflabs' set to ERROR" in the latter case, and the number of mails of the first type makes manual inspection impractical.
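As a rough illustration of the point above, the two kinds of mail could be told apart by scanning the body rather than the subject. This is only a sketch: the one thing taken from the real mails is the quoted phrase "Queue '...' set to ERROR"; the function name and the rest of the body layout are assumptions.

```python
import re
from typing import List

# Only this phrase is taken from the mail quoted above; everything else
# about the mail body layout is an assumption.
QUEUE_ERROR_RE = re.compile(r"Queue '([^']+)' set to ERROR")


def queues_set_to_error(mail_body: str) -> List[str]:
    """Return the queue instances that a job-failure mail reports as set to ERROR."""
    return QUEUE_ERROR_RE.findall(mail_body)


if __name__ == "__main__":
    # Hypothetical mail body built around the quoted phrase.
    body = "Job 7654321 failed\nQueue 'task@tools-exec-06.eqiad.wmflabs' set to ERROR\n"
    print(queues_set_to_error(body))
```

A mail for which this returns an empty list would be an ordinary job failure; a non-empty list would be the kind worth alerting on.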

I guess the easiest(?) thing to do is to write a Diamond collector that tracks gridengine queue stats :)
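A minimal sketch of the parsing half of such a collector, assuming it is fed the text of `qstat -f` and that queue instance lines look like `queue@host qtype resv/used/tot. load_avg arch [states]`, with `E` in the optional states column marking an error state. The function names, metric names, and sample text are hypothetical; a real collector would run the command itself and publish the counts through whatever metrics system is in use.

```python
from typing import Dict, Iterator, Tuple

# Hypothetical sample in the assumed `qstat -f` layout.
SAMPLE = """\
queuename                        qtype resv/used/tot. load_avg arch       states
---------------------------------------------------------------------------------
task@tools-exec-06.eqiad.wmflabs BI    0/3/50         0.45     lx26-amd64 E
   7654321 0.50000 myjob      someuser     r     04/06/2015 11:32:00     1
---------------------------------------------------------------------------------
task@tools-exec-07.eqiad.wmflabs BI    0/2/50         0.30     lx26-amd64
"""


def parse_queue_states(qstat_output: str) -> Iterator[Tuple[str, str]]:
    """Yield (queue_instance, states) for each queue line in the output."""
    for line in qstat_output.splitlines():
        fields = line.split()
        # Queue instance lines start with "queue@host"; headers, separators
        # and job lines do not, so they are skipped here.
        if len(fields) < 5 or "@" not in fields[0]:
            continue
        # The states column is absent when the queue has no special state.
        states = fields[5] if len(fields) > 5 else ""
        yield fields[0], states


def queue_state_counts(qstat_output: str) -> Dict[str, int]:
    """Count all queue instances and those whose states include 'E' (error)."""
    total = 0
    errored = 0
    for _name, states in parse_queue_states(qstat_output):
        total += 1
        if "E" in states:
            errored += 1
    return {"queues.total": total, "queues.error": errored}


if __name__ == "__main__":
    print(queue_state_counts(SAMPLE))
```

An alert could then fire whenever the error count is above zero, which avoids reading the failure mails at all.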

scfc triaged this task as Medium priority. Apr 6 2015, 11:32 AM
scfc moved this task from Backlog to Ready to be worked on on the Toolforge board.
taavi claimed this task.
taavi subscribed.

Done with metricsinfra Prometheus a while ago.