
Create aggregate alarms for Hadoop daemons running on worker nodes
Open, High, Public

Description

During the reimage of an-master100[1,2] we got a storm of alerts about NodeManagers being down for various reasons. We should get rid of the per-host alerts and create aggregate ones instead.

For example, instead of having:

18:53  <icinga-wm> PROBLEM - Hadoop NodeManager on an-worker1099 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
                   org.apache.hadoop.yarn.server.nodemanager.NodeManager 
                   https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
18:53  <icinga-wm> PROBLEM - Hadoop NodeManager on an-worker1113 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
                   org.apache.hadoop.yarn.server.nodemanager.NodeManager 
                   https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[..long list..]

we should get something like:

18:53  <icinga-wm> PROBLEM - Hadoop NodeManager on more than X workers is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
                   org.apache.hadoop.yarn.server.nodemanager.NodeManager 
                   https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process

In other words: a single alert that points to the problem. That would reduce IRC spam and make it less confusing to debug what's wrong.

One idea would be to rely on the availability of the Prometheus metrics, namely whether the Prometheus masters can poll metrics from X workers. All the exporters that we manage are deployed as a javaagent, so if the overall daemon is down there are no metrics and no Prometheus endpoint available on the host.
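A minimal sketch of what this could look like as a Prometheus alerting rule. The job label (jmx_yarn_nodemanager), the threshold (5 workers) and the for: duration are placeholders, not our actual configuration; they would need to match whatever job name the NodeManager javaagent exporters are scraped under:

  groups:
    - name: hadoop_workers
      rules:
        - alert: HadoopNodeManagersDown
          # 'up' is 1 while Prometheus can scrape the javaagent endpoint on a
          # worker, and drops to 0 when the NodeManager JVM (and with it the
          # exporter) is down. Job name and threshold are placeholders.
          expr: count(up{job="jmx_yarn_nodemanager"} == 0) > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Hadoop NodeManager down on {{ $value }} workers"
            runbook: "https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#Yarn_Nodemanager_process"

A single rule like this would fire once for the whole cluster, which could then be relayed to IRC as one notification instead of one per host.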

Event Timeline

We discussed this in the SRE sync and agreed that this should be high priority, given the level of IRC logspam caused by a NodeManager failure.
I'm happy to take this on if you think that's sensible? cc: @Ottomata, @odimitrijevic

odimitrijevic triaged this task as High priority.
odimitrijevic moved this task from Incoming to Operational Excellence on the Analytics board.

@BTullis: Per emails from Sep 18 and Oct 20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of this task because there has not been progress lately (please correct me if I am wrong!). Resetting the assignee avoids the impression that somebody is already working on this task. It also allows others to potentially work towards fixing this task. Please claim this task again when you plan to work on it (via Add Action... > Assign / Claim in the dropdown menu) - it would be welcome. Thanks for your understanding!