Besides the Ganglia statistics, the grid's status should be properly monitored, with alarms set up. Off the top of my head, and without data to back it up:
- Master alive and well (no threads in error state!),
- every execution daemon alive and well,
- count of jobs in error state doesn't exceed 5 % of all jobs running,
- count of jobs pending doesn't exceed 5 % of all jobs running.
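
The checks above could be sketched roughly as follows. This is only an illustration, not a finished monitoring script: the `GridStatus` fields and the `check_grid` function are hypothetical names, and it assumes the counts and health flags have already been collected elsewhere with whatever tools the scheduler provides. The 5 % limit is the rough guess from the list, exposed as a parameter so it can be tuned once real data is available.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GridStatus:
    """Snapshot of the grid, filled in by some external collector (hypothetical)."""
    master_ok: bool                 # master alive, no threads in error state
    daemons_down: List[str] = field(default_factory=list)  # hosts whose execution daemon is not responding
    running: int = 0                # jobs currently running
    errored: int = 0                # jobs in error state
    pending: int = 0                # jobs waiting to be scheduled


def check_grid(status: GridStatus, max_ratio: float = 0.05) -> List[str]:
    """Return a list of alarm messages; an empty list means all checks passed."""
    alarms = []

    if not status.master_ok:
        alarms.append("master is not healthy")

    for host in status.daemons_down:
        alarms.append(f"execution daemon down on {host}")

    # Both job thresholds are relative to the number of running jobs.
    # With zero running jobs the limit is 0, so any errored or pending job alarms.
    limit = max_ratio * status.running

    if status.errored > limit:
        alarms.append(f"{status.errored} jobs in error state (limit {limit:.1f})")

    if status.pending > limit:
        alarms.append(f"{status.pending} jobs pending (limit {limit:.1f})")

    return alarms


if __name__ == "__main__":
    snapshot = GridStatus(master_ok=True, daemons_down=["node17"],
                          running=200, errored=3, pending=40)
    for alarm in check_grid(snapshot):
        print("ALARM:", alarm)
```

Returning the alarms as a list rather than raising on the first failure means a single run reports every violated condition at once, which tends to be more useful when several things break together.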