Nothing currently collects the state of grid queues (and thus nodes) in Toolforge. We have been surprised more than once to find several queues disabled because of errors on the nodes. A "queue" in this sense is what SGE calls a queue instance: a host within a specific queue context (such as task@tools-sgeexec-0901.tools.eqiad.wmflabs).
Collecting that info so it can be displayed on tools-basic-alerts or similar should help.
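As a rough idea of what the collection step could look like, here is a minimal sketch that parses `qstat -f` output into a mapping of queue instance to state string. It assumes the usual column layout (queue name, qtype, slots, load average, arch, then an optional states column) and treats any state containing `e`, `d` or `u` (case-insensitive) as unusable; both are assumptions to check against the real grid, and the function names are made up for illustration.

```python
import subprocess

# Assumption: any of these letters (compared case-insensitively) in the
# states column means the queue instance cannot take jobs.
UNUSABLE_LETTERS = set("edu")


def parse_qstat_f(text):
    """Map 'queue@host' -> state string from `qstat -f`-style text."""
    states = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or "@" not in fields[0]:
            continue  # header, separator, or job line
        # Assumed columns: name, qtype, slots, load_avg, arch, [states]
        states[fields[0]] = fields[5] if len(fields) > 5 else ""
    return states


def unusable(states):
    """Return only the queue instances whose state looks unusable."""
    return {
        queue: state
        for queue, state in states.items()
        if UNUSABLE_LETTERS & set(state.lower())
    }


def collect():
    """Run `qstat -f` on a grid host and parse its output."""
    result = subprocess.run(
        ["qstat", "-f"], capture_output=True, text=True, check=True
    )
    return parse_qstat_f(result.stdout)
```

On a host with the grid tools installed, `unusable(collect())` would give the set of queue instances to show on a dashboard.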
So far we've found that errors in the webservice script itself (since it is involved in job submission and such) are enough to drop a queue, and LDAP errors on job submission will do the same. Either puts the queue into the "e" state, which is just as useless as the "d" (depooled) state or the "au" (unreachable) state. If this happens repeatedly across various queue/host combinations, it can take the entire grid offline over time.
Remediation is documented for error states at: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#Clearing_error_state
However, if you aren't worried about troubleshooting, error states can be reset with a simple `qmod -c '*'`.
We need at least an email if the number of available node queues is declining (and if any are in a persistent "e" state).
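The alert condition could be sketched roughly as follows. This is only an illustration: it assumes periodic snapshots are already available as dictionaries of queue instance to state string (oldest first), reuses the same "e"/"d"/"u" convention for unusable states, and picks an arbitrary persistence window of three snapshots. The function names, the addresses and the local SMTP relay are all hypothetical.

```python
import smtplib
from email.message import EmailMessage


def count_available(snapshot):
    """Count queue instances with no e/d/u letter in their state."""
    return sum(
        1 for state in snapshot.values() if not set("edu") & set(state.lower())
    )


def check_snapshots(snapshots, persist=3):
    """Return alert strings for a list of snapshots, oldest first."""
    alerts = []
    counts = [count_available(snap) for snap in snapshots]
    if len(counts) >= 2 and counts[-1] < counts[0]:
        alerts.append(
            f"available queues dropped from {counts[0]} to {counts[-1]}"
        )
    recent = snapshots[-persist:]
    if len(recent) == persist:
        # Queues in an error state in every one of the recent snapshots.
        stuck = sorted(
            queue
            for queue in recent[-1]
            if all("e" in snap.get(queue, "").lower() for snap in recent)
        )
        if stuck:
            alerts.append("persistent error state: " + ", ".join(stuck))
    return alerts


def send_alert(alerts, sender="grid@example.org", to="admins@example.org"):
    """Mail the alerts via a local SMTP relay (assumed to exist)."""
    msg = EmailMessage()
    msg["Subject"] = "Toolforge grid queue alert"
    msg["From"] = sender
    msg["To"] = to
    msg.set_content("\n".join(alerts))
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)
```

A cron job could append a snapshot on each run and call `send_alert` only when `check_snapshots` returns a non-empty list.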