Currently, a full restart of the Cirrus Elasticsearch clusters takes days. Various strategies could improve the situation, some already known, some not yet identified. Ideas are welcome! In all cases, we need a way to test and validate those strategies, which is non-trivial because the time needed for a restart depends on cluster size, data size and load.
Some ideas:
- Now that we have row-aware shard allocation, we should be able to restart a full row at a time, or at least multiple nodes at once.
- Tuning delayed allocation (how long Elasticsearch waits before reallocating the shards of a node that has left the cluster) might help reduce recovery time.
- ...
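
As a rough illustration of the first two ideas, here is a minimal Python sketch. It assumes each node carries a `row` awareness attribute and that restarts would be batched per row. The host names, row values and the `5m` timeout are hypothetical placeholders, not values from our clusters. The only Elasticsearch-specific name used is the index setting `index.unassigned.node_left.delayed_timeout`, which would be sent as the body of a `PUT _all/_settings` request before the restart. Nothing here talks to a cluster; it only builds the restart batches and the settings payload.

```python
from collections import defaultdict


def group_nodes_by_row(node_rows):
    """Group node names by their row attribute.

    node_rows maps a node name to the value of its (assumed) `row`
    awareness attribute. Each returned batch could be restarted together,
    since row-aware allocation keeps replicas in a different row.
    """
    batches = defaultdict(list)
    for node, row in sorted(node_rows.items()):
        batches[row].append(node)
    return dict(batches)


def delayed_allocation_body(timeout):
    """Build the index settings body that delays shard reallocation.

    `timeout` is an Elasticsearch time value such as "5m". The right
    value for our clusters is unknown and would need testing.
    """
    return {
        "settings": {
            "index.unassigned.node_left.delayed_timeout": timeout,
        }
    }


# Hypothetical node names and rows, for illustration only.
example_nodes = {
    "node-a1": "A",
    "node-b1": "B",
    "node-a2": "A",
    "node-b2": "B",
}

restart_batches = group_nodes_by_row(example_nodes)
settings_body = delayed_allocation_body("5m")
```

Whether restarting a whole row at once is safe still depends on replica counts and on load, so a sketch like this only helps plan the batches; it does not validate the strategy.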