
Monitor Hadoop cluster running out of HEAP space with Icinga
Closed, Resolved | Public | 8 Estimated Story Points

Description

For all JVMs we have, if HEAP space hits a limit (90%?), send an alert.

Event Timeline

Nuria raised the priority of this task from to High.
Nuria updated the task description.
Nuria added a project: Analytics-Kanban.
Nuria subscribed.
Nuria renamed this task from Icinga Monitoring should detect cluster running out of space to Icinga Monitoring should detect cluster running out of HEAP space. Feb 5 2015, 1:13 AM
Nuria set Security to None.
kevinator renamed this task from Icinga Monitoring should detect cluster running out of HEAP space to Monitor (Icinga) cluster running out of HEAP space. Feb 5 2015, 1:19 AM
kevinator renamed this task from Monitor (Icinga) cluster running out of HEAP space to Monitor cluster running out of HEAP space with Icinga. Feb 5 2015, 1:21 AM

Namenodes only (that's where the problem is really serious)

Ottomata subscribed.

Note: we have created https://grafana.wikimedia.org/dashboard/db/analytics-hadoop to monitor all the GC/heap metrics of the Hadoop cluster daemons, but we are still missing alarms. Maybe something based on Graphite thresholds? I can see definitions like the following in Puppet:

monitoring::graphite_threshold { 'restbase_analytics_<<some-metric-name>>':
    description   => 'Analytics RESTBase req/s returning 5xx http://grafana.wikimedia.org/#/dashboard/db/restbase',
    metric        => '<<the metric and any transformations>>',
    from          => '10min',
    warning       => '<<warning threshold>>', # <<explain>>
    critical      => '<<critical threshold>>', # <<explain>>
    percentage    => '20',
    contact_group => 'aqs-admins',
}

So we already have all the Graphite metrics we need; we only have to figure out the correct thresholds. I'm going to ask for this ticket to be prioritized, since it seems rather important.
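As a rough sketch only, the template above could be filled in for NameNode heap usage along these lines. The resource title and the 95% critical value are hypothetical, the metric paths and contact group are left as placeholders because the real names are not given here, and the 90% warning simply reuses the limit floated in the task description. `asPercent` is a standard Graphite function; `from` and `percentage` are copied unchanged from the example.

```puppet
# Hypothetical heap alarm for the Hadoop NameNode JVM (sketch, not the merged change).
monitoring::graphite_threshold { 'analytics_hadoop_namenode_heap_usage':  # hypothetical title
    description   => 'Hadoop NameNode JVM heap usage https://grafana.wikimedia.org/dashboard/db/analytics-hadoop',
    # Express used heap as a percentage of max heap; both metric paths are placeholders.
    metric        => 'asPercent(<<heap used metric>>, <<heap max metric>>)',
    from          => '10min',
    warning       => '90', # limit suggested in the task description
    critical      => '95', # assumed value, to be tuned against real data
    percentage    => '20', # copied from the example above
    contact_group => '<<contact group>>',
}
```

Alerting on a percentage of max heap rather than on absolute bytes would let the same thresholds apply to daemons with different heap sizes.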

Milimetric updated the task description.
Milimetric set the point value for this task to 8.
Milimetric moved this task from Operational Excellence Future to Dashiki on the Analytics board.
Milimetric moved this task from Dashiki to Backlog (Later) on the Analytics board.
elukey added a project: User-Elukey.

Agreed, this seems like a good one to add to ops-excellence next quarter.

Change 330154 had a related patch set uploaded (by Elukey):
Add JVM Heap usage alarms for basic Hadoop daemons

https://gerrit.wikimedia.org/r/330154

Change 330154 merged by Elukey:
Add JVM Heap usage alarms for basic Hadoop daemons

https://gerrit.wikimedia.org/r/330154

Change 337574 had a related patch set uploaded (by Elukey):
Fix and tune the new Analytics Hadoop alarms

https://gerrit.wikimedia.org/r/337574

Change 337575 had a related patch set uploaded (by Elukey):
Fix and tune the new Analytics Hadoop alarms

https://gerrit.wikimedia.org/r/337575

Change 337575 merged by Elukey:
Fix and tune the new Analytics Hadoop alarms

https://gerrit.wikimedia.org/r/337575

elukey renamed this task from Monitor cluster running out of HEAP space with Icinga to Monitor Hadoop cluster running out of HEAP space with Icinga. Feb 14 2017, 1:39 PM
elukey moved this task from Paused to Done on the Analytics-Kanban board.

Change 337886 had a related patch set uploaded (by Ottomata):
Fix heapsize alert conditionals so that they work in labs

https://gerrit.wikimedia.org/r/337886

Change 337886 merged by Ottomata:
Fix heapsize alert conditionals so that they work in labs

https://gerrit.wikimedia.org/r/337886

Milimetric reopened this task as Open.

Change 337574 abandoned by Elukey:
Fix and tune the new Analytics Hadoop alarms

https://gerrit.wikimedia.org/r/337574