
Warn user about cluster replication settings/LB pool status after elasticsearch rolling-operation cookbook fails; alert for cluster replication settings
Open, Medium, Public

Description

When invoked, the rolling-operation cookbook changes the setting cluster.routing.allocation.enable from all to primaries. We do this to prevent a lot of changes to the cluster state during the operation. After the operation, we need to change it back to all as soon as possible. Otherwise we risk losing data, because no replica shards can be allocated while the setting is primaries.
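For reference, a minimal sketch of the request body involved, assuming the change goes through Elasticsearch's `PUT _cluster/settings` API. The helper name `allocation_settings_body` and the default `transient` scope are illustrative assumptions, not taken from the cookbook.

```python
# Sketch only: builds the JSON body for PUT _cluster/settings.
# The helper name and the default scope are assumptions, not cookbook code.
import json

SETTING = "cluster.routing.allocation.enable"
# Values Elasticsearch accepts for this setting.
VALID_VALUES = {"all", "primaries", "new_primaries", "none"}
VALID_SCOPES = {"transient", "persistent"}


def allocation_settings_body(value: str, scope: str = "transient") -> dict:
    """Return the request body that sets shard allocation to `value`."""
    if value not in VALID_VALUES:
        raise ValueError(f"unsupported allocation value: {value!r}")
    if scope not in VALID_SCOPES:
        raise ValueError(f"unsupported settings scope: {scope!r}")
    return {scope: {SETTING: value}}


if __name__ == "__main__":
    # Before the rolling operation: restrict allocation to primaries.
    print(json.dumps(allocation_settings_body("primaries")))
    # After the rolling operation: allow all allocation again.
    print(json.dumps(allocation_settings_body("all")))
```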

Currently, when the cookbook fails, it leaves cluster.routing.allocation.enable set to primaries. This is OK, because we generally rerun the cookbook until it succeeds. However, in certain circumstances, we might forget to rerun it. We have the same issue with the load balancer pools: if the cookbook fails, it leaves the host(s) depooled. This is what we want, but we should warn the user regardless.
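A rough sketch of what such a warning could look like. `failure_warning` and its arguments are hypothetical; how the cookbook would actually obtain the current setting and the list of depooled hosts is not covered here.

```python
# Sketch only: formats a warning for the cookbook failure path.
# The function name and inputs are hypothetical, not existing cookbook code.
from __future__ import annotations


def failure_warning(allocation_value: str, depooled_hosts: list[str]) -> str:
    """Describe the state a failed rolling operation left behind."""
    lines = ["Rolling operation failed. The cluster was left in this state:"]
    if allocation_value != "all":
        lines.append(
            f"  - cluster.routing.allocation.enable is '{allocation_value}' "
            "(expected 'all')"
        )
    if depooled_hosts:
        lines.append("  - still depooled: " + ", ".join(sorted(depooled_hosts)))
    lines.append("Rerun the cookbook until it succeeds, or restore these manually.")
    return "\n".join(lines)


if __name__ == "__main__":
    # Host name below is a placeholder for illustration.
    print(failure_warning("primaries", ["host-b.example", "host-a.example"]))
```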

Creating this ticket to:

  • Add a warning to the cookbook failure message
  • Create monitors and alerts for the cluster.routing.allocation.enable setting. Our typical rolling restarts take around 2-3 hours, so we should probably alert at around the 4-hour mark.
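The alerting rule could be sketched roughly as below. This assumes the monitor reads `GET _cluster/settings?flat_settings=true` and separately records when the setting was first seen as something other than all. The function names and the way `restricted_since` is tracked are assumptions, and the 4-hour threshold is just the figure suggested above.

```python
# Sketch only: decides whether the allocation setting has been restricted
# for too long. Names and state tracking are assumptions.
from datetime import datetime, timedelta, timezone
from typing import Optional

SETTING = "cluster.routing.allocation.enable"
ALERT_AFTER = timedelta(hours=4)  # rolling restarts usually take 2-3 hours


def effective_allocation(flat_settings: dict) -> str:
    """Read the setting from a flat_settings=true response body.

    Transient settings take precedence over persistent ones; when neither
    is set, Elasticsearch's default of "all" applies.
    """
    for scope in ("transient", "persistent"):
        value = flat_settings.get(scope, {}).get(SETTING)
        if value is not None:
            return value
    return "all"


def should_alert(
    value: str,
    restricted_since: Optional[datetime],
    now: datetime,
    threshold: timedelta = ALERT_AFTER,
) -> bool:
    """Alert once the setting has been non-"all" for at least `threshold`."""
    if value == "all" or restricted_since is None:
        return False
    return now - restricted_since >= threshold


if __name__ == "__main__":
    response = {"persistent": {}, "transient": {SETTING: "primaries"}}
    now = datetime.now(timezone.utc)
    value = effective_allocation(response)
    print(value, should_alert(value, now - timedelta(hours=5), now))
```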

Event Timeline

bking renamed this task from "Warn user about cluster replication settings after elasticsearch rolling-operation cookbook fails/alert for cluster replication settings" to "Warn user about cluster replication settings/LB pool status after elasticsearch rolling-operation cookbook fails; alert for cluster replication settings". Jul 8 2025, 2:35 PM
bking updated the task description.
Gehel triaged this task as Medium priority. Sep 9 2025, 2:20 PM
Gehel moved this task from Incoming to Toil / Automation on the Data-Platform-SRE board.