When invoked, the rolling-operation cookbook changes the cluster setting cluster.routing.allocation.enable from all to primaries. We do this to limit cluster state changes while the operation is running. After the operation, we need to set it back to all as soon as possible; otherwise we risk losing data, because no new replica shards can be allocated.
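For reference, a minimal sketch of the request body that toggles this setting through the cluster settings API (PUT /_cluster/settings). The helper name and the default transient scope are assumptions for illustration; the ticket does not say which scope the cookbook uses.

```python
import json

# Values accepted by cluster.routing.allocation.enable.
VALID_VALUES = {"all", "primaries", "new_primaries", "none"}


def allocation_settings_body(value, scope="transient"):
    """Build the JSON body for PUT /_cluster/settings.

    `scope` is an assumption: the cookbook may use "persistent" instead.
    """
    if value not in VALID_VALUES:
        raise ValueError(f"unsupported allocation value: {value!r}")
    if scope not in ("transient", "persistent"):
        raise ValueError(f"unsupported settings scope: {scope!r}")
    return {scope: {"cluster.routing.allocation.enable": value}}


if __name__ == "__main__":
    # Body sent before the operation, then the one that restores the default.
    print(json.dumps(allocation_settings_body("primaries")))
    print(json.dumps(allocation_settings_body("all")))
```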
Currently, when the cookbook fails, it leaves cluster.routing.allocation.enable set to primaries. This is OK, because we generally rerun the cookbook until it succeeds. However, we might forget to rerun it. We have the same issue with the load balancer pools: if the cookbook fails, it leaves the host(s) depooled. This is the behavior we want, but we should warn the user regardless.
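One possible shape for that warning, as a sketch only: a context manager that logs what may have been left behind and then re-raises, so the failure behavior itself does not change. The logger name, the function name, and the way depooled hosts are passed in are all hypothetical and not taken from the cookbook.

```python
import logging
from contextlib import contextmanager

# Hypothetical logger name; the real cookbook would use its own.
logger = logging.getLogger("rolling-operation")


@contextmanager
def warn_on_failure(depooled_hosts):
    """Log a warning about leftover state if the wrapped block raises.

    The exception is re-raised, so the cookbook still fails as before.
    """
    try:
        yield
    except Exception:
        logger.warning(
            "Cookbook failed: cluster.routing.allocation.enable may still be "
            "set to primaries, and these hosts may still be depooled: %s. "
            "Rerun the cookbook or restore both manually.",
            ", ".join(depooled_hosts) or "(none)",
        )
        raise
```

Usage would look like wrapping the per-host work in `with warn_on_failure(hosts): ...`, where `hosts` is whatever list of depooled hosts the cookbook already tracks.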
Creating this ticket to:
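- Add a warning to the cookbook failure message, stating that cluster.routing.allocation.enable is still set to primaries and that the host(s) are still depooled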
- Create monitors and alerts for the cluster.routing.allocation.enable setting. Our typical rolling restarts take around 2-3 hours, so we should probably alert if the setting has not returned to all after about 4 hours.
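The alert condition could be sketched roughly as follows. This assumes the monitor records when it first saw the setting at a value other than all; the function name, the parameters, and the exact 4-hour threshold are illustrative, not a decided design.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold, based on rolling restarts typically taking 2-3 hours.
ALERT_AFTER = timedelta(hours=4)


def should_alert(current_value, restricted_since, now):
    """Return True if allocation has been restricted for longer than ALERT_AFTER.

    current_value: the current cluster.routing.allocation.enable value.
    restricted_since: when the monitor first saw a value other than "all",
        or None if it has not seen one.
    now: the current time, passed in so the check is easy to test.
    """
    if current_value == "all" or restricted_since is None:
        return False
    return now - restricted_since > ALERT_AFTER


if __name__ == "__main__":
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(should_alert("primaries", start, start + timedelta(hours=5)))
```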