During the reimage of an-master100[1,2] we got a storm of alerts about NodeManagers being down for various reasons. We should get rid of the per-host alerts and create aggregate ones.
For example, instead of:
18:53 <icinga-wm> PROBLEM - Hadoop NodeManager on an-worker1099 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
18:53 <icinga-wm> PROBLEM - Hadoop NodeManager on an-worker1113 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[..long list..]
we should get something like:
18:53 <icinga-wm> PROBLEM - Hadoop NodeManager on more than X workers is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
That is, a single alert that points to the problem. It would reduce spam on IRC and confusion when debugging what's wrong.
One idea is to rely on the availability of the Prometheus metrics themselves (namely, whether the Prometheus masters can scrape metrics from each worker): all the daemons we manage expose metrics via the javaagent, so if the daemon is down, the host's Prometheus endpoint is unavailable and no metrics are collected.
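As a rough sketch, this could be expressed as a Prometheus alerting rule over the `up` metric, which is 0 when a scrape fails. The job label `jmx_yarn_nodemanager` and the threshold of 5 workers below are hypothetical placeholders, not the actual names used in our Prometheus config:

```yaml
groups:
  - name: hadoop_yarn
    rules:
      - alert: HadoopNodeManagersDown
        # `up` is 0 when Prometheus cannot scrape the target; for a javaagent
        # exporter that means the NodeManager JVM itself is down.
        # Job label and threshold are illustrative assumptions.
        expr: count(up{job="jmx_yarn_nodemanager"} == 0) > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Hadoop NodeManager down on more than 5 workers"
          runbook: "https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#Yarn_Nodemanager_process"
```

This fires a single aggregate alert once more than the threshold of workers stop exposing metrics, instead of one alert per host.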