During recent network maintenance, Elasticsearch nodes left the cluster, resulting in a search outage. The configuration should be more robust to this kind of maintenance. Fault detection timeouts should be increased so that a "standard" loss of networking does not cause nodes to leave the cluster. For example:
discovery.zen.fd.ping_interval: 15s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5
The actual values still need to be defined.
The risk of increasing the fault detection time is that a dead node takes longer to be detected, so more requests fail that could otherwise have been avoided. Data coherence should not suffer.
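To help pick the actual values, the trade-off can be put into rough numbers. The sketch below is only an estimate and rests on an assumption about how Zen fault detection behaves: a ping is sent every ping_interval, each ping waits up to ping_timeout, and a node is declared failed after ping_retries consecutive failures with retries sent immediately. The function name is hypothetical, and the "default" values used for comparison (1s, 30s, 3) should be checked against the deployed Elasticsearch version.

```python
def worst_case_detection_seconds(ping_interval, ping_timeout, ping_retries):
    """Rough upper bound on the time to detect a silently dead node.

    Assumption (not taken from the task): the node dies just after a
    successful ping, so we wait one full interval, then every retry
    runs into the full timeout before the node is dropped.
    """
    return ping_interval + ping_retries * ping_timeout


# Values proposed in this task.
proposed = worst_case_detection_seconds(ping_interval=15, ping_timeout=60, ping_retries=5)

# Assumed Zen defaults, for comparison only.
defaults = worst_case_detection_seconds(ping_interval=1, ping_timeout=30, ping_retries=3)

print(f"proposed: ~{proposed}s, assumed defaults: ~{defaults}s")
```

Under these assumptions the proposed values allow a network interruption of roughly five minutes before nodes are dropped, at the cost of a similar delay before a truly dead node is noticed.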