Make elasticsearch more resilient to small network hiccups
Open, NormalPublic

Description

From time to time, a small network hiccup has very bad consequences for the cluster, forcing many shards into recovery.
We should work on tuning the various timeouts to make elastic more robust.

dcausse created this task. Jan 6 2017, 10:14 AM
Restricted Application added a project: Discovery. Jan 6 2017, 10:14 AM
Restricted Application added a subscriber: Aklapper.

Change 316976 had a related patch set uploaded (by DCausse):
elasticsearch: tuning of zen discovery settings

https://gerrit.wikimedia.org/r/316976
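
The settings such a patch touches are the zen discovery fault-detection tunables. The sketch below uses real setting names, but the values are illustrative only, not the ones in the actual Gerrit change:

```yaml
# elasticsearch.yml -- illustrative values, not the actual patch.
# Zen fault detection: how often and how patiently nodes ping each other
# before declaring a peer dead.
discovery.zen.fd.ping_interval: 2s   # default 1s: ping peers less often
discovery.zen.fd.ping_timeout: 60s   # default 30s: wait longer for a reply
discovery.zen.fd.ping_retries: 5     # default 3: tolerate more lost pings

# How long a node waits for other nodes during discovery, so a short
# blip does not immediately trigger a re-election.
discovery.zen.ping_timeout: 10s      # default 3s
```

The trade-off is that more patient fault detection also means a genuinely dead node is removed from the cluster more slowly.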

Mentioned in SAL (#wikimedia-operations) [2017-01-07T13:31:30Z] <dcausse> elastic@codfw removing/readding replicas for viwiki_general and zhwiki_content (affected by something similar to https://github.com/elastic/elasticsearch/issues/12661) - T154765

debt triaged this task as Normal priority.
debt moved this task from Backlog to Needs review on the Discovery-Search (Current work) board.
debt added subscribers: Gehel, debt.

Hi @Gehel - can you take a look at this when you have a chance and deploy it? Thanks!

Gehel claimed this task. Jan 24 2017, 6:14 PM

Summary of discussions with @dcausse and @EBernhardson, in no particular order:

  • testing failure modes is not trivial; it requires:
    • generating synthetic read and write traffic
    • simulating network failure
    • asserting what kind of errors occur in that traffic (examples: partial read results, failed writes, successful writes that are lost, ...)
  • Elasticsearch's fault detection component and its associated tunables are the active part of fault detection. It seems (to be tested) that passive fault detection is also at play: for example, a write that fails to be acked by a node might cause the cluster to declare that node down (this is assumed from reading logs). If that is the case, given the fairly high write frequency that we have, we might not be able to keep nodes in the cluster even in the face of minor network disruption.
  • The failure mode that we have seen so far related to network interruption actually seems quite robust, even if it is somewhat scary. A large number of shards end up unallocated while the cluster recovers, and full recovery takes a long time (half a day in some cases), but traffic is still being served.
  • We might be able to test failure mode on a 5 to 10 node cluster, with small nodes. This might be done on labs VMs.
  • The failure mode as seen from LVS is interesting and we have not looked at it closely enough. Our current LVS check only tests that the node is running, but does not check its cluster state. This might be a problem in the rare case where a node is reachable from LVS but has lost connectivity to the rest of the elasticsearch cluster.
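
A smarter LVS check could ask the node for its own view of cluster health rather than only probing the port. A minimal sketch, assuming a node reachable on localhost:9200 and the `_cluster/health` API with `local=true` (this is not the production check, and the depooling criteria here are illustrative):

```python
import json
from urllib.error import URLError
from urllib.request import urlopen

def is_pool_worthy(health):
    """Decide from a _cluster/health response whether this node should
    stay in the LVS pool. Criteria here are illustrative assumptions."""
    # 'red' means at least one primary shard is unassigned.
    if health.get("status") == "red":
        return False
    # A node that reports no peers has effectively lost the cluster.
    if health.get("number_of_nodes", 0) < 1:
        return False
    return True

def check(node="http://localhost:9200"):
    # local=true asks for this node's own view, so the check does not
    # depend on forwarding the request to a possibly unreachable master.
    try:
        with urlopen(node + "/_cluster/health?local=true", timeout=2) as resp:
            health = json.load(resp)
    except (URLError, OSError, ValueError):
        # Connection failure, HTTP error (e.g. 503 when no master is
        # discovered), or unparseable body: depool.
        return False
    return is_pool_worthy(health)
```

A node that has lost its master typically answers `_cluster/health` with an error status, so the exception path would depool it even though its HTTP port is still up.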