We need to notice queues in alarm or error states before they cause problems. Because we have no such monitoring, we missed:
- the lack of available webgrid hosts (https://wikitech.wikimedia.org/wiki/Incident_documentation/20150817-ToolLabs-WebgridOutage)
- queues being offline due to a DNS issue (T109605)
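
A minimal sketch of what such a check could look like, assuming the queues are Grid Engine queues and that the check is fed the plain-text output of `qstat -f`. The function name `find_problem_queues`, the choice of which state letters count as a problem, and the sample host names are illustrative assumptions, not an existing tool. The state letters follow the `qstat` man page (`a` load alarm, `A` suspend alarm, `E` error, `u` unknown).

```python
# State letters that should raise an alert; the selection is an assumption.
PROBLEM_STATES = {
    "a": "load alarm",
    "A": "suspend alarm",
    "E": "error",
    "u": "unknown",
}


def find_problem_queues(qstat_output):
    """Return {queue_instance: [problem, ...]} from `qstat -f` text output.

    Assumes queue lines are unindented, name the instance as queue@host,
    and carry the state letters in an optional sixth column.
    """
    problems = {}
    for line in qstat_output.splitlines():
        # Job lines are indented; separator lines and blanks are skipped too.
        if not line or line[0].isspace():
            continue
        fields = line.split()
        # A healthy queue has an empty states column (five fields only).
        if len(fields) != 6 or "@" not in fields[0]:
            continue
        flagged = [PROBLEM_STATES[c] for c in fields[5] if c in PROBLEM_STATES]
        if flagged:
            problems[fields[0]] = flagged
    return problems


# Hypothetical sample output; host names and numbers are made up.
SAMPLE = """\
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
task@host-01.example           BI    0/3/50         0.42     lx26-amd64
---------------------------------------------------------------------------------
webgrid@host-02.example        B     0/0/256        -NA-     lx26-amd64    au
"""

if __name__ == "__main__":
    for queue, reasons in find_problem_queues(SAMPLE).items():
        print(f"{queue}: {', '.join(reasons)}")
```

A check like this could run periodically and alert whenever the returned mapping is non-empty, so that a queue in an alarm or error state is noticed before users are affected.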