
Figure out why Elasticsearch doesn't recover from out of memory - even after we bounce the node that was out of memory
Closed, Resolved (Public)

Description

While I'd like Elasticsearch to never run out of memory - shit happens. When one of the Elasticsearch nodes fills its memory, the cluster goes unstable. You get lots of these messages:

[2015-06-15 07:08:27,157][WARN ][transport                ] [elastic1001] Received response for a request that has timed out, sent [85169ms] ago, timed out [55169ms] ago, action [/cluster/nodes/indices/shard/store/n], node [[elastic1018][dNTrd0-ET-CNTNHKk8hdHA][elastic1018][inet[/10.64.48.40:9300]]{rack=D3, row=D, master=false}], id [2666516874]
[2015-06-15 07:08:27,169][WARN ][transport                ] [elastic1001] Received response for a request that has timed out, sent [186476ms] ago, timed out [156476ms] ago, action [discovery/zen/fd/ping], node [[elastic1018][dNTrd0-ET-CNTNHKk8hdHA][elastic1018][inet[/10.64.48.40:9300]]{rack=D3, row=D, master=false}], id [2666499840]
[2015-06-15 07:08:27,169][WARN ][transport                ] [elastic1001] Received response for a request that has timed out, sent [156476ms] ago, timed out [126475ms] ago, action [discovery/zen/fd/ping], node [[elastic1018][dNTrd0-ET-CNTNHKk8hdHA][elastic1018][inet[/10.64.48.40:9300]]{rack=D3, row=D, master=false}], id [2666507408]
[2015-06-15 07:08:27,169][WARN ][transport                ] [elastic1001] Received response for a request that has timed out, sent [126475ms] ago, timed out [96475ms] ago, action [discovery/zen/fd/ping], node [[elastic1018][dNTrd0-ET-CNTNHKk8hdHA][elastic1018][inet[/10.64.48.40:9300]]{rack=D3, row=D, master=false}], id [2666508860]
[2015-06-15 07:08:27,176][WARN ][transport                ] [elastic1001] Received response for a request that has timed out, sent [175193ms] ago, timed out [145193ms] ago, action [/cluster/nodes/indices/shard/store/n], node [[elastic1018][dNTrd0-ET-CNTNHKk8hdHA][elastic1018][inet[/10.64.48.40:9300]]{rack=D3, row=D, master=false}], id [2666500327]
[2015-06-15 07:08:27,177][WARN ][transport                ] [elastic1001] Received response for a request that has timed out, sent [115191ms] ago, timed out [85191ms] ago, action [/cluster/nodes/indices/shard/store/n], node [[elastic1018][dNTrd0-ET-CNTNHKk8hdHA][elastic1018][inet[/10.64.48.40:9300]]{rack=D3, row=D, master=false}], id [2666509258]
[2015-06-15 07:08:27,187][WARN ][transport                ] [elastic1001] Received response for a request that has timed out, sent [55198ms] ago, timed out [25198ms] ago, action [/cluster/nodes/indices/shard/store/n], node [[elastic1018][dNTrd0-ET-CNTNHKk8hdHA][elastic1018][inet[/10.64.48.40:9300]]{rack=D3, row=D, master=false}], id [2666518331]
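
To see at a glance which node and which actions are answering late, warnings like the ones above can be tallied with a short script. This is only a sketch: the regular expression is written against the exact message layout shown in this task, and the function name `summarize_timeouts` is made up for illustration, not something Elasticsearch provides.

```python
import re
from collections import Counter

# Matches the "Received response for a request that has timed out" warnings
# in the layout shown in this task (an assumption; other versions may differ).
TIMEOUT_RE = re.compile(
    r"sent \[(?P<sent>\d+)ms\] ago, "
    r"timed out \[(?P<late>\d+)ms\] ago, "
    r"action \[(?P<action>[^\]]+)\], "
    r"node \[\[(?P<node>[^\]]+)\]"
)


def summarize_timeouts(lines):
    """Count late responses per (node, action) and record the largest
    'sent ... ago' delay seen for each pair, in milliseconds."""
    counts = Counter()
    worst_ms = {}
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match is None:
            continue  # not a timed-out-response warning
        key = (match.group("node"), match.group("action"))
        counts[key] += 1
        worst_ms[key] = max(worst_ms.get(key, 0), int(match.group("sent")))
    return counts, worst_ms
```

Run over the seven lines above, this reports a single node, elastic1018, with four late `/cluster/nodes/indices/shard/store/n` responses (worst 175193 ms) and three late `discovery/zen/fd/ping` responses (worst 186476 ms).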

Event Timeline

Manybubbles updated the task description.
Manybubbles raised the priority of this task to High.
Manybubbles changed the visibility from "Public (No Login Required)" to "WMF-NDA (Project)".
Manybubbles changed the edit policy from "All Users" to "WMF-NDA (Project)".

These messages may just be a symptom - the issue is that the master no longer responds to master things. Maybe this is our fault somehow. Maybe Elasticsearch's broken somehow. Maybe it'll go away with the 1.6 upgrade. I dunno.

Manybubbles added a parent task: Restricted Task. Jun 16 2015, 8:34 AM
Manybubbles changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)". Jun 16 2015, 8:42 AM
Manybubbles changed the edit policy from "WMF-NDA (Project)" to "All Users".
Manybubbles set Security to None.

Removing NDA access restriction - this doesn't have any private information or give anyone information on how to bring us down.

Are these messages coming from master, or non-master? Could we simulate/debug this by forcing network timeouts between the two?

The master.

I talked to the guy who maintains these algorithms. A _ton_ has changed between 1.3 and 1.6 - enough that, while he didn't recognize exactly these issues, he thinks they are fixed by 1.6.

As far as simulating network timeouts - have a look at the Jepsen tests:
https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0

He links to his earlier analysis in that one and it's much crumblier.

As far as this bug goes I'd be ok saying "this is probably fixed in 1.6 but we will open a new one if it happens again after that".

Manybubbles closed this task as Resolved. Jun 25 2015, 6:38 PM
Manybubbles claimed this task.

At this point the best we're going to get is upgrading to Elasticsearch 1.6. That way if we go belly up like this again we can take it back to them.

We should also work more on getting that failover cluster set up in Texas, which, conveniently, ops has just emailed me about. I don't know when they'll start buying the nodes, but we can get Cirrus to support it then.

Manybubbles reopened this task as Open.
Deskana renamed this task from "Figure out why Elasticsearch doesn't recovery from oom - even after we bounce the node that was out of memory" to "Figure out why Elasticsearch doesn't recover from out of memory - even after we bounce the node that was out of memory". Sep 12 2015, 2:45 AM
Deskana closed this task as Resolved.