
Set up Icinga monitoring for grid
Closed, Duplicate (Public)

Description

In addition to the Ganglia statistics, the grid's status should be properly monitored and alarms set up. Off the top of my head, and without data to back it up:

  • Master alive and well (no threads in error state!),
  • every execution daemon alive and well,
  • count of jobs in error state doesn't exceed 5 % of all jobs running,
  • count of jobs pending doesn't exceed 5 % of all jobs running.
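The two percentage thresholds above could be expressed as a small check in the style of a Nagios/Icinga plugin. This is only a sketch under assumptions: the function name `check_ratio` and its parameters are hypothetical, the job counts are assumed to be gathered elsewhere (for example by parsing the grid scheduler's job listing), and a breach is mapped straight to CRITICAL because the task does not specify a separate warning level. The exit codes 0 to 3 are the standard plugin return codes.

```python
# Standard Nagios/Icinga plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_ratio(label, count, running, limit_percent=5):
    """Check that `count` jobs do not exceed `limit_percent` % of `running` jobs.

    Returns an (exit_code, message) pair. All names here are illustrative;
    how the counts are obtained is left to the caller.
    """
    if count < 0 or running < 0:
        return UNKNOWN, f"UNKNOWN - invalid job counts for '{label}'"

    # Integer comparison avoids division by zero and float rounding:
    # count / running > limit_percent / 100  <=>  count * 100 > limit_percent * running
    if count * 100 > limit_percent * running:
        return CRITICAL, (
            f"CRITICAL - {count} {label} jobs vs. {running} running "
            f"(limit {limit_percent} %)"
        )
    return OK, (
        f"OK - {count} {label} jobs vs. {running} running "
        f"(limit {limit_percent} %)"
    )
```

A wrapper script would call this once for jobs in error state and once for pending jobs, print the message, and exit with the returned code so that Icinga can pick up the state. Note that with zero running jobs, any job in error or pending state counts as a breach under this reading of the thresholds.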

Version: unspecified
Severity: enhancement

Details

Reference
bz48668

Event Timeline

bzimport raised the priority of this task to High. (Nov 22 2014, 1:20 AM)
bzimport added a project: Toolforge.
bzimport set Reference to bz48668.

sumanah wrote:

As with bug 51434, I think this would be a very good step for improving the reliability of the services we provide -- and getting stats to show it. :-)