Make elasticsearch configuration more robust to loss of network connectivity
Closed, Declined. Public

Description

During recent network maintenance, Elasticsearch nodes left the cluster, resulting in a search outage. The configuration should be more robust to this kind of maintenance. The fault detection intervals and timeouts should be increased so that a "standard" loss of networking does not cause nodes to leave the cluster. For example:

discovery.zen.fd.ping_interval: 15s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5

The actual values still need to be determined.

The risk of increasing the fault detection time is that a dead node takes longer to detect, so more requests fail that could otherwise have been avoided. Data coherence should not suffer.

Event Timeline

Gehel created this task. Aug 22 2016, 2:15 PM
Restricted Application added a project: Discovery-Search. Aug 22 2016, 2:15 PM
Restricted Application added a subscriber: Aklapper.

Would it make sense to tune our settings the same way we tune MySQL?

debt triaged this task as Medium priority. Aug 25 2016, 10:19 PM
debt moved this task from needs triage to This Quarter on the Discovery-Search board.
debt added a subscriber: debt.

This looks like something we'll need to discuss, since it could make things run more efficiently.

Thoughts on how to tune the settings, @EBernhardson and @dcausse? Do we need to have a meeting on this?

Gehel added a comment. Aug 30 2016, 3:20 PM

I actually think that the example in the description is a good start. This will detect a failing node in a bit over 2 minutes. With the robustness improvements in T143571, we will already be in a much better situation than today.
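
As a rough check of that estimate, here is a minimal Python sketch of the arithmetic. It is not taken from Elasticsearch itself: the `FaultDetection` class and `detection_estimates` function are hypothetical names, and both timing models are assumptions. The first assumes pings go out every `ping_interval` and the node is dropped once `ping_retries` of them have gone unanswered, with the last one waiting a full `ping_timeout`; that gives a bit over 2 minutes. The second, more pessimistic, assumes every retry waits the full `ping_timeout` before the next one is sent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultDetection:
    """Values of the discovery.zen.fd.* settings, in seconds (hypothetical helper)."""
    ping_interval: int
    ping_timeout: int
    ping_retries: int


def detection_estimates(fd: FaultDetection) -> tuple[int, int]:
    """Return (interval_paced, timeout_paced) detection-time estimates in seconds.

    Both models are assumptions, not a description of the actual
    Elasticsearch implementation.
    """
    # Model 1: one ping per interval; the last failed ping waits a full timeout.
    interval_paced = fd.ping_retries * fd.ping_interval + fd.ping_timeout
    # Model 2: wait one interval, then every retry blocks for the full timeout.
    timeout_paced = fd.ping_interval + fd.ping_retries * fd.ping_timeout
    return interval_paced, timeout_paced


# Values from the example in the task description.
proposed = FaultDetection(ping_interval=15, ping_timeout=60, ping_retries=5)
print(detection_estimates(proposed))  # (135, 315)
```

Under the first model the proposed values give 135 seconds; under the second they give 315 seconds, so the real detection time depends on how retries are actually scheduled and should be confirmed against the running Elasticsearch version before settling on values.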

Gehel removed Gehel as the assignee of this task. Jun 10 2020, 8:19 AM
Gehel closed this task as Declined. Fri, Jul 24, 8:30 AM

This has not been an issue recently. The current configuration does raise a few alerts in case of loss of connectivity, but it is robust and does not lead to service degradation in case of failure, which is what we need.