Make elasticsearch configuration more robust to loss of network connectivity
Closed, Declined · Public

Assigned To

None

Authored By

	Gehel
	Aug 22 2016, 2:15 PM

Description

During recent network maintenance, elasticsearch nodes left the cluster, resulting in a search outage. The configuration should be more robust to this kind of maintenance. Fault detection timeouts should be increased so that a "standard" loss of networking does not result in nodes leaving the cluster. For example:

discovery.zen.fd.ping_interval: 15s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5

The actual values still need to be defined.

The risk of increasing the fault detection time is that a dead node takes longer to detect, so requests keep failing with errors that faster detection would have prevented. Data coherence should not suffer.
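
To put a number on that trade-off, here is a rough sketch of an upper bound on detection time. It assumes a model that is not stated in this task: a node that silently drops packets is pinged once per ping_interval, each ping waits the full ping_timeout, and retries follow back to back until ping_retries is reached. The function name is made up for illustration, and the default values (1s, 30s, 3) are taken from the zen discovery documentation as I remember it, so they should be checked against the deployed version. Real detection can be much faster, for example when the connection is actively refused.

```python
def worst_case_detection_seconds(ping_interval: float,
                                 ping_timeout: float,
                                 ping_retries: int) -> float:
    """Rough upper bound on the time before a silent node is marked failed.

    Assumed model (hypothetical, not taken from the task): wait up to one
    ping_interval for the next ping, then ping_retries pings in a row that
    each run into the full ping_timeout.
    """
    return ping_interval + ping_retries * ping_timeout


# Values proposed in the description: 15s interval, 60s timeout, 5 retries.
proposed = worst_case_detection_seconds(15, 60, 5)

# Assumed zen discovery defaults: 1s interval, 30s timeout, 3 retries.
defaults = worst_case_detection_seconds(1, 30, 3)

print(f"proposed: {proposed:.0f}s, defaults: {defaults:.0f}s")
```

Under this model the proposed values tolerate a connectivity loss several minutes long before a node is dropped, at the cost of the same delay before a genuinely dead node is noticed.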

Related Objects

Mentioned Here: T143571: Make elasticsearch actually uses shard allocation awareness

Event Timeline

Gehel created this task. Aug 22 2016, 2:15 PM

Restricted Application added a project: Discovery-Search. Aug 22 2016, 2:15 PM

Restricted Application added a subscriber: Aklapper.

Would it make sense to tune our settings the same way we tune mysql?

This looks like something we need to discuss, since tuning it could make this run more efficiently.

Thoughts on how to tune the settings, @EBernhardson and @dcausse? Do we need to have a meeting on this?

I actually think that the example in the description is a good start. This will detect a failing node in a bit over 2 minutes. With the robustness improvements in T143571, we will already be in a much better situation than today.

Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. Jan 26 2019, 8:48 PM

Gehel moved this task from This Quarter to Ops / SRE on the Discovery-Search board. Jan 29 2019, 6:52 PM

Gehel removed Gehel as the assignee of this task. Jun 10 2020, 8:19 AM

This has not been an issue recently. The current configuration does raise a few alerts in case of loss of connectivity, but it is robust and does not lead to service degradation in case of failure, which is what we need.
