
Icinga monitoring for elasticsearch doesn't notice OOM conditions (this is happening on cloud)
Closed, Resolved, Public

Description

The monitoring for Elasticsearch doesn't seem to notice when an OOM has happened on a node. I have very seldom seen a node recover on its own from an OOM, so it seems worth alerting about.
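
For illustration, a minimal sketch of what such a check could look like as a Nagios/Icinga-style plugin: it scans a log file for `java.lang.OutOfMemoryError` and maps the result to the conventional plugin exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN). The function names and the idea of scanning the Elasticsearch log are assumptions made for this example; nothing in this task says a check like this was deployed.

```python
# Conventional Nagios/Icinga plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3

# Text the JVM writes when it reports an out-of-memory error.
OOM_MARKER = "java.lang.OutOfMemoryError"


def check_oom(log_lines):
    """Return (exit_code, message) for an iterable of log lines."""
    hits = [line.rstrip("\n") for line in log_lines if OOM_MARKER in line]
    if hits:
        return CRITICAL, (
            f"CRITICAL: {len(hits)} OutOfMemoryError line(s); last: {hits[-1]}"
        )
    return OK, "OK: no OutOfMemoryError found"


def run_check(path):
    """Run the check against a log file (path is supplied by the caller)."""
    try:
        with open(path, encoding="utf-8", errors="replace") as fh:
            code, message = check_oom(fh)
    except OSError as exc:
        # An unreadable log is reported as UNKNOWN rather than OK.
        code, message = UNKNOWN, f"UNKNOWN: cannot read {path}: {exc}"
    print(message)
    return code
```

A wrapper script would call `sys.exit(run_check(path))` with whatever log path the node actually uses. A real check would also need to ignore old entries (for example by tracking the file offset), which this sketch does not do.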

Event Timeline

bd808 created this task. Nov 26 2014, 9:13 PM
bd808 raised the priority of this task from to Needs Triage.
bd808 updated the task description.
bd808 added a project: Wikimedia-Logstash.
bd808 changed Security from none to None.
bd808 added subscribers: bd808, Gage, Manybubbles.
bd808 added a comment. Nov 26 2014, 9:17 PM

I think I remember there being a Java JRE command-line flag that lets you install an OOM signal handler. I have vague memories of using that to send alerts in a long-forgotten past as a Java Shop Administrator™.

bd808 triaged this task as Medium priority. Feb 2 2015, 5:07 PM
hashar added a subscriber: hashar.

Moving that monitoring task from Release-Engineering-Team to Operations

Restricted Application added a project: Discovery-Search. Jan 26 2019, 7:50 PM

Change 487787 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] elasticsearch: exit the JVM on OutOfMemoryError

https://gerrit.wikimedia.org/r/487787

Gehel added a subscriber: Gehel. Feb 3 2019, 11:26 PM

Instead of monitoring this specific error, let's just configure the JVM to restart on memory errors.
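
As a rough illustration of that approach (the actual Puppet change is in the linked Gerrit patch and is not reproduced here): HotSpot JVMs since JDK 8u92 accept `-XX:+ExitOnOutOfMemoryError`, which makes the process exit on the first OutOfMemoryError. The restart itself then has to come from whatever supervises the service. The sketch below assumes the flag is set through a `jvm.options`-style file with one option per line and `#` comments; the helper name is hypothetical.

```python
# HotSpot flag that makes the JVM exit on the first OutOfMemoryError.
EXIT_ON_OOM = "-XX:+ExitOnOutOfMemoryError"


def ensure_exit_on_oom(jvm_options_text):
    """Return jvm.options-style text with the flag present as an active line.

    Commented-out occurrences (lines starting with '#') do not count.
    """
    lines = jvm_options_text.splitlines()
    active = [ln.strip() for ln in lines if not ln.lstrip().startswith("#")]
    if EXIT_ON_OOM in active:
        # Already configured: leave the text untouched.
        return jvm_options_text
    lines.append(EXIT_ON_OOM)
    return "\n".join(lines) + "\n"
```

For example, `ensure_exit_on_oom("-Xms1g\n-Xmx1g\n")` appends the flag as a third line, and calling it again on the result changes nothing.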

Gehel claimed this task. Feb 3 2019, 11:26 PM
Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.
Gehel moved this task from Incoming to Needs review on the Discovery-Search (Current work) board.
hashar removed a subscriber: hashar. Feb 4 2019, 11:58 AM
Nuria renamed this task from Icinga monitoring for elasticsearch doesn't notice OOM conditions to Icinga monitoring for elasticsearch doesn't notice OOM conditions (this is happening on cloud). Mar 5 2019, 6:28 PM

Change 487787 merged by Gehel:
[operations/puppet@production] elasticsearch: exit the JVM on OutOfMemoryError

https://gerrit.wikimedia.org/r/487787

Merged, will take effect with the next cluster restarts.

debt closed this task as Resolved. Mar 8 2019, 5:52 PM