
Figure out why Elasticsearch doesn't recover from out of memory - even after we bounce the node that was out of memory
Closed, Resolved (Public)

Description

While I'd like Elasticsearch to never run out of memory, shit happens. When one of the Elasticsearch nodes fills its heap, the whole cluster goes unstable and you get lots of these messages:

[2015-06-15 07:08:27,157][WARN ][transport                ] [elastic1001] Received response for a request that has timed out, sent [85169ms] ago, timed out [55169ms] ago, action [/cluster/nodes/indices/shard/store/n], node [[elastic1018][dNTrd0-ET-CNTNHKk8hdHA][elastic1018][inet[/10.64.48.40:9300]]{rack=D3, row=D, master=false}], id [2666516874]
[2015-06-15 07:08:27,169][WARN ][transport                ] [elastic1001] Received response for a request that has timed out, sent [186476ms] ago, timed out [156476ms] ago, action [discovery/zen/fd/ping], node [[elastic1018][dNTrd0-ET-CNTNHKk8hdHA][elastic1018][inet[/10.64.48.40:9300]]{rack=D3, row=D, master=false}], id [2666499840]
[2015-06-15 07:08:27,169][WARN ][transport                ] [elastic1001] Received response for a request that has timed out, sent [156476ms] ago, timed out [126475ms] ago, action [discovery/zen/fd/ping], node [[elastic1018][dNTrd0-ET-CNTNHKk8hdHA][elastic1018][inet[/10.64.48.40:9300]]{rack=D3, row=D, master=false}], id [2666507408]
[2015-06-15 07:08:27,169][WARN ][transport                ] [elastic1001] Received response for a request that has timed out, sent [126475ms] ago, timed out [96475ms] ago, action [discovery/zen/fd/ping], node [[elastic1018][dNTrd0-ET-CNTNHKk8hdHA][elastic1018][inet[/10.64.48.40:9300]]{rack=D3, row=D, master=false}], id [2666508860]
[2015-06-15 07:08:27,176][WARN ][transport                ] [elastic1001] Received response for a request that has timed out, sent [175193ms] ago, timed out [145193ms] ago, action [/cluster/nodes/indices/shard/store/n], node [[elastic1018][dNTrd0-ET-CNTNHKk8hdHA][elastic1018][inet[/10.64.48.40:9300]]{rack=D3, row=D, master=false}], id [2666500327]
[2015-06-15 07:08:27,177][WARN ][transport                ] [elastic1001] Received response for a request that has timed out, sent [115191ms] ago, timed out [85191ms] ago, action [/cluster/nodes/indices/shard/store/n], node [[elastic1018][dNTrd0-ET-CNTNHKk8hdHA][elastic1018][inet[/10.64.48.40:9300]]{rack=D3, row=D, master=false}], id [2666509258]
[2015-06-15 07:08:27,187][WARN ][transport                ] [elastic1001] Received response for a request that has timed out, sent [55198ms] ago, timed out [25198ms] ago, action [/cluster/nodes/indices/shard/store/n], node [[elastic1018][dNTrd0-ET-CNTNHKk8hdHA][elastic1018][inet[/10.64.48.40:9300]]{rack=D3, row=D, master=false}], id [2666518331]
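For reference, here's a rough sketch of how to spot which node has run its heap dry while this is happening - it just polls the _nodes/stats/jvm API. The localhost:9200 endpoint is an assumption; substitute whichever node you can actually reach.

import json
import urllib.request

HOST = "http://localhost:9200"  # assumption: any reachable node in the cluster works

# Print per-node heap usage so the node that filled its heap stands out.
with urllib.request.urlopen(HOST + "/_nodes/stats/jvm", timeout=10) as resp:
    stats = json.load(resp)

for node in stats["nodes"].values():
    heap = node["jvm"]["mem"]["heap_used_percent"]
    print("%s: heap %d%% used" % (node["name"], heap))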

Event Timeline

Manybubbles raised the priority of this task from to High.
Manybubbles updated the task description.
Manybubbles changed the visibility from "Public (No Login Required)" to "WMF-NDA (Project)".
Manybubbles changed the edit policy from "All Users" to "WMF-NDA (Project)".

These messages may just be a symptom - the underlying issue is that the master stops responding to its master duties (the fault-detection pings and shard store requests in the log above are all timing out). Maybe this is our fault somehow. Maybe Elasticsearch is broken somehow. Maybe it'll go away with the 1.6 upgrade. I dunno.
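A quick way to check whether the master really is stuck on master work is to look at who is elected and whether pending cluster tasks are piling up. This is only a sketch - again assuming localhost:9200 is reachable:

import json
import urllib.request

HOST = "http://localhost:9200"  # assumption: any reachable node works

# Which node does the cluster currently think is master?
with urllib.request.urlopen(HOST + "/_cat/master?v", timeout=10) as resp:
    print(resp.read().decode())

# A wedged master tends to let cluster-state updates pile up here.
with urllib.request.urlopen(HOST + "/_cluster/pending_tasks", timeout=10) as resp:
    tasks = json.load(resp)["tasks"]

print("pending cluster tasks:", len(tasks))
for task in tasks[:5]:
    print(task["priority"], task["time_in_queue"], task["source"])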

Manybubbles changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)". Jun 16 2015, 8:42 AM
Manybubbles changed the edit policy from "WMF-NDA (Project)" to "All Users".
Manybubbles set Security to None.

Removing NDA access restriction - this doesn't have any private information or give anyone information on how to bring us down.

Are these messages coming from the master, or from a non-master node? Could we simulate/debug this by forcing network timeouts between the two?

The master.

I talked to the guy who maintains these algorithms. A _ton_ has changed between 1.3 and 1.6 - enough that, while he didn't recognize these exact issues, he thinks they are fixed in 1.6.

As far as simulating network timeouts - have a look at the Jepsen tests:
https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0

He links to his earlier analysis in that one, and it's much crumblier.
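If we ever want to reproduce these timeouts ourselves without a full Jepsen setup, something like the rough sketch below might do: it drops transport traffic (port 9300, as in the log above) towards one peer for a couple of minutes, then restores it. Run as root on a test node only - the peer IP is just the elastic1018 address from the log, used as an example, not a suggestion to do this in production.

import subprocess
import time

PEER = "10.64.48.40"  # elastic1018's transport address from the log above; an example only
RULE = ["OUTPUT", "-d", PEER, "-p", "tcp", "--dport", "9300", "-j", "DROP"]

subprocess.check_call(["iptables", "-A"] + RULE)  # start dropping transport packets
try:
    # Long enough to blow well past the ~30s timeouts visible in the log above.
    time.sleep(120)
finally:
    subprocess.check_call(["iptables", "-D"] + RULE)  # delete the rule, restoring connectivity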

As far as this bug goes, I'd be OK saying "this is probably fixed in 1.6, but we'll open a new task if it happens again after the upgrade".

Manybubbles claimed this task.

At this point, the best we're going to get is upgrading to Elasticsearch 1.6. That way, if we go belly up like this again, we can take it back to them.

We should also work more on getting that failover cluster set up in Texas - which, conveniently, ops has just emailed me about. I don't know when they'll start buying the nodes, but we can get Cirrus to support it once they do.

Deskana renamed this task from Figure out why Elasticsearch doesn't recovery from oom - even after we bounce the node that was out of memory to Figure out why Elasticsearch doesn't recover from out of memory - even after we bounce the node that was out of memory. Sep 12 2015, 2:45 AM
Deskana closed this task as Resolved.