Make elasticsearch more resilient to small network hiccups
Open, NormalPublic

Description

From time to time, a small network hiccup has very bad consequences for the cluster, forcing many shards into recovery.
We should work on tuning the various timeouts to make elastic more robust.

dcausse created this task. Jan 6 2017, 10:14 AM
Restricted Application added a project: Discovery. Jan 6 2017, 10:14 AM
Restricted Application added a subscriber: Aklapper.

Change 316976 had a related patch set uploaded (by DCausse):
elasticsearch: tuning of zen discovery settings

https://gerrit.wikimedia.org/r/316976
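
The settings such a patch touches are the zen discovery fault-detection tunables. The sketch below uses real setting names, but the values are illustrative only, not the ones in the actual Gerrit change:

```yaml
# elasticsearch.yml -- illustrative values, not the actual patch.
# Zen fault detection: how often and how patiently nodes ping each other
# before declaring a peer dead.
discovery.zen.fd.ping_interval: 2s   # default 1s: ping peers less often
discovery.zen.fd.ping_timeout: 60s   # default 30s: wait longer for a reply
discovery.zen.fd.ping_retries: 5     # default 3: tolerate more lost pings

# How long a node waits for other nodes during discovery, so a short
# blip does not immediately trigger a re-election.
discovery.zen.ping_timeout: 10s      # default 3s
```

The trade-off is that more patient fault detection also means a genuinely dead node is removed from the cluster more slowly.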

Mentioned in SAL (#wikimedia-operations) [2017-01-07T13:31:30Z] <dcausse> elastic@codfw removing/readding replicas for viwiki_general and zhwiki_content (affected by something similar to https://github.com/elastic/elasticsearch/issues/12661) - T154765

debt triaged this task as Normal priority.
debt moved this task from Backlog to Needs review on the Discovery-Search (Current work) board.
debt added subscribers: Gehel, debt.

Hi @Gehel - can you take a look at this when you have a chance and deploy it? Thanks!

Gehel claimed this task. Jan 24 2017, 6:14 PM

Summary of discussions with @dcausse and @EBernhardson, in no particular order:

  • testing failure modes is not trivial; it requires:
    • generating synthetic read and write traffic
    • simulating network failure
    • asserting what kind of errors occur in that traffic (examples: partial read results, failed writes, successful writes that are lost, ...)
  • Elasticsearch's fault detection component and its associated tunables are the active part of fault detection. It seems (to be tested) that passive fault detection is also at play: for example, a write that fails to be acked by a node might cause the cluster to declare that node down (this is assumed from reading logs). If that is the case, given the fairly high write frequency that we have, we might not be able to keep nodes in the cluster even in the face of minor network disruption.
  • The failure mode that we have seen so far related to network interruption actually seems quite robust, even if it is somewhat scary. A large number of shards end up unallocated while the cluster recovers, and full recovery takes a long time (half a day in some cases), but traffic is still being served.
  • We might be able to test failure mode on a 5 to 10 node cluster, with small nodes. This might be done on labs VMs.
  • The failure mode as seen from LVS is interesting and we have not looked at it closely enough. Our current LVS check only tests that the node is running, but does not check its cluster state. This might be a problem in the rare case where a node is reachable from LVS but has lost connectivity to the rest of the elasticsearch cluster.
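
A smarter LVS check could ask the node for its own view of cluster health rather than only probing the port. A minimal sketch, assuming a node reachable on localhost:9200 and the `_cluster/health` API with `local=true` (this is not the production check, and the depooling criteria here are illustrative):

```python
import json
from urllib.error import URLError
from urllib.request import urlopen

def is_pool_worthy(health):
    """Decide from a _cluster/health response whether this node should
    stay in the LVS pool. Criteria here are illustrative assumptions."""
    # 'red' means at least one primary shard is unassigned.
    if health.get("status") == "red":
        return False
    # A node that reports no peers has effectively lost the cluster.
    if health.get("number_of_nodes", 0) < 1:
        return False
    return True

def check(node="http://localhost:9200"):
    # local=true asks for this node's own view, so the check does not
    # depend on forwarding the request to a possibly unreachable master.
    try:
        with urlopen(node + "/_cluster/health?local=true", timeout=2) as resp:
            health = json.load(resp)
    except (URLError, OSError, ValueError):
        # Connection failure, HTTP error (e.g. 503 when no master is
        # discovered), or unparseable body: depool.
        return False
    return is_pool_worthy(health)
```

A node that has lost its master typically answers `_cluster/health` with an error status, so the exception path would depool it even though its HTTP port is still up.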