
🟣 Elasticsearch nodes should only become ready once the cluster is healthy
Closed, Resolved (Public)

Description

By default, Elasticsearch marks a node as ready the moment it is able to accept traffic. Unfortunately, this makes it possible for a node to be marked as ready before it has actually finished loading all of its indices. We deploy changes in our cluster by replacing one node at a time and waiting for the new node to spin up successfully before replacing the next one. Our Wikibase instances are also configured to keep a total of two copies of their indices, stored on two different Elasticsearch nodes. The issue we are experiencing is that once a node is marked as ready while its indices are still loading, the cluster has already begun to replace the next node. As a result, some of our instances have no copies of their indices available, because one copy is still being loaded while the other copy is on a node that is being replaced.

We can add an additional health check to Elasticsearch that only marks the node as ready once the cluster is also healthy ("green") again. This will prevent the next node from being removed until all indices are loaded and healthy again. However, this check must only be run as a startupProbe. Running it as a readinessProbe could stop all traffic, and running it as a livenessProbe could kill all nodes, whenever a single index becomes unhealthy ("yellow" or "red").
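As a rough illustration of the proposed check, here is a minimal Python sketch of a script that a startupProbe exec command could run. It is not the patch that was applied for this task. It assumes the node answers on `http://localhost:9200` and relies on the Elasticsearch `_cluster/health` endpoint, which reports a `status` field of "green", "yellow" or "red". The names `HEALTH_URL`, `fetch_health`, `cluster_is_green` and `probe_exit_code` are hypothetical.

```python
import json
import urllib.request

# Assumed address of the local node; adjust for the actual deployment.
HEALTH_URL = "http://localhost:9200/_cluster/health"


def fetch_health(url=HEALTH_URL, timeout=5.0):
    """Fetch the cluster health document from Elasticsearch as a dict."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def cluster_is_green(health):
    """Return True only when the reported cluster status is 'green'."""
    return health.get("status") == "green"


def probe_exit_code(fetch=fetch_health):
    """Return 0 when the cluster is green, 1 otherwise (including errors)."""
    try:
        return 0 if cluster_is_green(fetch()) else 1
    except (OSError, ValueError):
        # Unreachable node or unparsable response: treat as not started yet.
        return 1
```

A probe script would end with `sys.exit(probe_exit_code())`, so that a non-zero exit code keeps the startupProbe failing until the cluster is green. Because a startupProbe stops running once it has succeeded, a later "yellow" or "red" status would not take the node out of service.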

Patches:

Event Timeline

Andrew-WMDE renamed this task from "🟣 Elasticsearch nodes only ready when cluster is healthy" to "🟣 Elasticsearch nodes should only become ready once the cluster is healthy". Nov 16 2023, 1:36 PM
Evelien_WMDE claimed this task.