
Monitor Hadoop cluster running out of HEAP space with Icinga
Closed, Resolved · Public · 8 Story Points

Description

For all JVMs we have, if HEAP space hits a limit (90%?), send an alert.

Event Timeline

Nuria created this task. Feb 5 2015, 1:13 AM
Nuria raised the priority of this task from to High.
Nuria updated the task description.
Nuria added a project: Analytics-Kanban.
Nuria added a subscriber: Nuria.
Restricted Application added a subscriber: Aklapper. Feb 5 2015, 1:13 AM
Nuria renamed this task from "Icinga Monitoring should detect cluster running out of space" to "Icinga Monitoring should detect cluster running out of HEAP space". Feb 5 2015, 1:13 AM
Nuria set Security to None.
kevinator renamed this task from "Icinga Monitoring should detect cluster running out of HEAP space" to "Monitor (Icinga) cluster running out of HEAP space". Feb 5 2015, 1:19 AM
kevinator renamed this task from "Monitor (Icinga) cluster running out of HEAP space" to "Monitor cluster running out of HEAP space with Icinga". Feb 5 2015, 1:21 AM

NameNodes only: that is where the problem is really serious.

elukey added a subscriber: elukey.
Ottomata removed Ottomata as the assignee of this task. Aug 8 2016, 7:45 PM
Ottomata added a subscriber: Ottomata.

Note: we have created https://grafana.wikimedia.org/dashboard/db/analytics-hadoop to monitor all the GC/heap metrics of the Hadoop cluster daemons, but we are still missing alarms. Maybe something based on Graphite thresholds? I can see definitions like the following in Puppet:

monitoring::graphite_threshold { 'restbase_analytics_<<some-metric-name>>':
    description   => 'Analytics RESTBase req/s returning 5xx http://grafana.wikimedia.org/#/dashboard/db/restbase',
    metric        => '<<the metric and any transformations>>',
    from          => '10min',
    warning       => '<<warning threshold>>', # <<explain>>
    critical      => '<<critical threshold>>', # <<explain>>
    percentage    => '20',
    contact_group => 'aqs-admins',
}

So we have all the Graphite metrics needed and would only need to figure out the correct thresholds. I am going to ask for this ticket to be prioritized, since it seems rather important.
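
For illustration only, here is how the template above might be filled in for a NameNode heap check. This is a sketch under assumptions, not the content of any merged patch: the resource title, the two metric placeholders, the contact group placeholder, and the warning value are hypothetical. Only the parameter names come from the template, and the 90% critical value is the limit floated in the task description. `asPercent()` is a standard Graphite function that divides one series by another.

```puppet
# Hypothetical sketch: title, metric paths, and thresholds are assumptions.
monitoring::graphite_threshold { 'analytics_hadoop_namenode_heap_usage':
    description   => 'HDFS NameNode JVM heap usage (% of max) https://grafana.wikimedia.org/dashboard/db/analytics-hadoop',
    # Used heap as a percentage of max heap; both series names are placeholders.
    metric        => 'asPercent(<<heap-used-metric>>, <<heap-max-metric>>)',
    from          => '10min',
    warning       => '80', # assumed value, leaves headroom below the critical limit
    critical      => '90', # the "90%?" limit suggested in the task description
    percentage    => '20', # kept unchanged from the template above
    contact_group => '<<contact group>>',
}
```

Expressing the check as a percentage of max heap rather than as an absolute size would let one definition cover daemons configured with different heap sizes, but the real thresholds would still need tuning against the Grafana data.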

Milimetric assigned this task to elukey. Sep 15 2016, 4:12 PM
Milimetric updated the task description.
Milimetric set the point value for this task to 8.
Milimetric moved this task from Operational Excellence Future to Dashiki on the Analytics board.
Milimetric moved this task from Dashiki to Backlog (Later) on the Analytics board.

Let's do it!

elukey removed elukey as the assignee of this task. Dec 14 2016, 11:02 AM
elukey added a project: User-Elukey.
Nuria added a comment. Dec 14 2016, 5:07 PM

Agreed, this seems like a good one to add to ops-excellence next quarter.

elukey moved this task from Backlog to Analytics Backlog on the User-Elukey board. Dec 14 2016, 5:44 PM

While reviewing this task, I opened https://phabricator.wikimedia.org/T153951 :D

elukey claimed this task. Dec 22 2016, 4:03 PM
Nuria edited projects, added Analytics-Kanban; removed Analytics. Dec 22 2016, 4:10 PM

Change 330154 had a related patch set uploaded (by Elukey):
Add JVM Heap usage alarms for basic Hadoop daemons

https://gerrit.wikimedia.org/r/330154

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board. Jan 17 2017, 4:04 PM
Nuria moved this task from In Progress to Paused on the Analytics-Kanban board. Jan 30 2017, 4:07 PM

Change 330154 merged by Elukey:
Add JVM Heap usage alarms for basic Hadoop daemons

https://gerrit.wikimedia.org/r/330154

Change 337574 had a related patch set uploaded (by Elukey):
Fix and tune the new Analytics Hadoop alarms

https://gerrit.wikimedia.org/r/337574

Change 337575 had a related patch set uploaded (by Elukey):
Fix and tune the new Analytics Hadoop alarms

https://gerrit.wikimedia.org/r/337575

Change 337575 merged by Elukey:
Fix and tune the new Analytics Hadoop alarms

https://gerrit.wikimedia.org/r/337575

elukey renamed this task from "Monitor cluster running out of HEAP space with Icinga" to "Monitor Hadoop cluster running out of HEAP space with Icinga". Feb 14 2017, 1:39 PM
elukey moved this task from Paused to Done on the Analytics-Kanban board.
Tnegrin removed a subscriber: Tnegrin. Feb 14 2017, 1:45 PM

Change 337886 had a related patch set uploaded (by Ottomata):
Fix heapsize alert conditionals so that they work in labs

https://gerrit.wikimedia.org/r/337886

Change 337886 merged by Ottomata:
Fix heapsize alert conditionals so that they work in labs

https://gerrit.wikimedia.org/r/337886

Milimetric closed this task as Resolved. Feb 23 2017, 4:46 PM
Milimetric reopened this task as Open.
elukey closed this task as Resolved. Feb 24 2017, 9:49 AM

Change 337574 abandoned by Elukey:
Fix and tune the new Analytics Hadoop alarms

https://gerrit.wikimedia.org/r/337574