Queues in error states can't accept new tasks, so we should monitor these and alert if any nodes are in error state.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Bstorm | T199271 Upgrade the tools gridengine system
Resolved | | taavi | T88237 Track and alert based on gridengine error states
Event Timeline
A thousand years ago when I ran SGE, there was an administrator email setting. This would send emails if there were problems. Nagios also had SGE plugins. You whippersnappers run some newfangled gadget, but hopefully it has the plugin. Oh, get off my lawn!!
Oh, mails are sent indeed :-). (My) problem however was (and is) that two very different failures arrive with the same innocuous subject line, "GE 6.2u5: Job 7654321 failed": a job that failed because the output file could not be created due to the user's choice of name, and a failed job that causes the whole queue to be disabled. Only by opening the mail can you see "Queue 'task@tools-exec-06.eqiad.wmflabs' set to ERROR" in the latter. And the number of mails of the first type makes manual inspection impractical.
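One way to separate the two kinds of mail would be to filter on the body rather than the subject. A minimal sketch, assuming the body contains the phrase quoted above verbatim; the function name `queues_set_to_error` is hypothetical, and the exact wording may differ between gridengine versions:

```python
import re

# Matches the phrase quoted in the comment above, e.g.
# "Queue 'task@tools-exec-06.eqiad.wmflabs' set to ERROR".
# Assumption: this wording is stable across mails; adjust the pattern if not.
QUEUE_ERROR_RE = re.compile(r"Queue '([^']+)' set to ERROR")


def queues_set_to_error(mail_body):
    """Return the queue instances that a failure mail reports as set to ERROR.

    An empty list means the mail is an ordinary job failure.
    """
    return QUEUE_ERROR_RE.findall(mail_body)
```

A mail filter could then forward only the messages for which this returns a non-empty list.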
I guess the easiest(?) thing to do is to write a Diamond collector that tracks gridengine queue stats :)
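The core of such a collector could look roughly like the sketch below. It assumes that `qstat -f` prints one line per queue instance with the state flags in an optional sixth column, and that an `E` in that column marks the error state; the function names are hypothetical and the column layout should be checked against the installed gridengine version:

```python
import subprocess


def parse_error_queues(qstat_output):
    """Return queue instances whose state column contains 'E' (error).

    Assumes `qstat -f` style output: queue lines start at column 0 and look
    like "queue@host qtype resv/used/tot. load_avg arch [states]".
    """
    errored = []
    for line in qstat_output.splitlines():
        # Job lines are indented; blank lines carry no information.
        if not line or line[0].isspace():
            continue
        fields = line.split()
        # Skip the header, separator lines, and queues with an empty state column.
        if len(fields) < 6 or "@" not in fields[0]:
            continue
        if "E" in fields[5]:
            errored.append(fields[0])
    return errored


def collect_error_queues():
    """Run `qstat -f` and return the queue instances currently in error state."""
    output = subprocess.check_output(["qstat", "-f"], universal_newlines=True)
    return parse_error_queues(output)
```

A Diamond collector could call `collect_error_queues()` from its `collect()` method and publish the length of the result as a metric, so that an alert fires whenever the value is above zero.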