
Create aggregate alarms for Hadoop daemons running on worker nodes
Open, High, Public

Description

During the reimage of an-master100[1,2] we got a storm of alerts about NodeManagers being down for various reasons. We should get rid of the per-host alerts and create aggregate ones instead.

For example, instead of having:

18:53  <icinga-wm> PROBLEM - Hadoop NodeManager on an-worker1099 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
                   org.apache.hadoop.yarn.server.nodemanager.NodeManager 
                   https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
18:53  <icinga-wm> PROBLEM - Hadoop NodeManager on an-worker1113 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
                   org.apache.hadoop.yarn.server.nodemanager.NodeManager 
                   https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[..long list..]

we should get something like:

18:53  <icinga-wm> PROBLEM - Hadoop NodeManager on more than X workers is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
                   org.apache.hadoop.yarn.server.nodemanager.NodeManager 
                   https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process

In other words: a single alert that points to the problem. That would reduce IRC spam and make it less confusing to debug what's wrong.

One idea would be to rely on the availability of the Prometheus metrics, namely whether the Prometheus masters can poll metrics from X workers. All the exporters that we manage are deployed as a javaagent, so if the overall daemon is down there are no metrics and no Prometheus endpoint available on the host.
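A minimal sketch of what this could look like as a Prometheus alerting rule. The job label (jmx_yarn_nodemanager), the threshold (5 workers) and the for: duration are placeholders, not our actual configuration; they would need to match whatever job name the NodeManager javaagent exporters are scraped under:

  groups:
    - name: hadoop_workers
      rules:
        - alert: HadoopNodeManagersDown
          # 'up' is 1 while Prometheus can scrape the javaagent endpoint on a
          # worker, and drops to 0 when the NodeManager JVM (and with it the
          # exporter) is down. Job name and threshold are placeholders.
          expr: count(up{job="jmx_yarn_nodemanager"} == 0) > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Hadoop NodeManager down on {{ $value }} workers"
            runbook: "https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#Yarn_Nodemanager_process"

A single rule like this would fire once for the whole cluster, which could then be relayed to IRC as one notification instead of one per host.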

Event Timeline

We discussed this in the SRE sync and agreed that this should be high priority, given the level of IRC logspam caused by a NodeManager failure.
I'm happy to take this on if you think that's sensible? cc: @Ottomata, @odimitrijevic

odimitrijevic triaged this task as High priority.
odimitrijevic moved this task from Incoming to Operational Excellence on the Analytics board.

@BTullis: Per emails from Sep 18 and Oct 20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of this task because there has not been progress lately (please correct me if I am wrong!). Resetting the assignee avoids the impression that somebody is already working on this task. It also allows others to potentially work towards fixing this task. Please claim this task again when you plan to work on it (via Add Action... > Assign / Claim in the dropdown menu) - it would be welcome. Thanks for your understanding!