
Decrease time required to fully restart the Cirrus elasticsearch clusters
Closed, ResolvedPublic

Description

Currently, a full restart of the Cirrus elasticsearch clusters takes days. There are various strategies that could be applied to improve the situation, some of them known, some not. Ideas are welcome! In all cases, we need to find a way to test and validate those strategies, which is non-trivial as the time needed for a restart is linked to cluster size, data size and load.

Some ideas:

  • Now that we have row-aware shard allocation, we should be able to restart a full row at a time, or at least multiple nodes (see the sketch after this list).
  • Tuning delayed allocation might help reduce recovery time
  • ...
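
As a quick aid for the first idea above, here is a minimal sketch (Python with the `requests` library) that groups data nodes by their row attribute to see how many nodes could be restarted together. The localhost:9200 endpoint and the attribute name "row" are assumptions about the cluster configuration, not facts from this task.

```python
# Sketch: group data nodes by their "row" attribute to plan multi-node restarts
# under row-aware shard allocation. Assumes the HTTP API is reachable on
# localhost:9200 and that the awareness attribute is literally named "row".
from collections import defaultdict
import requests

ES = "http://localhost:9200"

nodes = requests.get(ES + "/_nodes").json()["nodes"]
by_row = defaultdict(list)
for node_id, info in nodes.items():
    row = info.get("attributes", {}).get("row", "unknown")
    by_row[row].append(info["name"])

for row, names in sorted(by_row.items()):
    print(row, "-", len(names), "nodes:", ", ".join(sorted(names)))
```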

Event Timeline

Restricted Application added a subscriber: Aklapper.
debt triaged this task as Medium priority. Sep 8 2016, 10:18 PM
debt moved this task from needs triage to Up Next on the Discovery-Search board.

Some notes of a discussion with @EBernhardson and @dcausse:

  • Our most promising option is probably tuning delayed allocation. Delayed allocation needs to be long enough to allow shards to fully recover before being moved, but short enough that if a node is down for real, shards are still relocated. Raising it to 10 minutes is probably a good start (a sketch of this, together with the write freeze below, follows the list).
  • Freezing writes and flushing indices before restarting a node should help. This step was removed from the restart procedure as it did not improve the situation; we need to revisit this.
  • Tuning of initial recovery settings (number of recoveries, throttling, ...) might help.
  • While restarting a full row at a time carries a large risk, and probably some performance impact on reads, restarting 2 nodes in the same row at the same time should be safe. We need some experimentation to find the optimal number of nodes to restart at the same time.
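
A rough sketch of the preparation steps from the first two points (delayed allocation, write freeze plus flush), assuming the cluster API is reachable on localhost:9200 and that the actual write freeze is handled outside Elasticsearch; both are assumptions, and the values are only starting points.

```python
# Sketch of per-restart preparation: raise delayed allocation, synced-flush the
# indices, and optionally disable shard allocation while a node is down.
import requests

ES = "http://localhost:9200"

# 1. Give shards of a restarting node time to come back before they are
#    reallocated elsewhere: raise delayed allocation to 10 minutes.
requests.put(ES + "/_all/_settings", json={
    "settings": {"index.unassigned.node_left.delayed_timeout": "10m"}
})

# 2. With writes frozen (mechanism not shown here), issue a synced flush so
#    shards get a sync_id and can recover quickly from their local copies.
requests.post(ES + "/_flush/synced")

# 3. Optionally stop shard allocation entirely while the node restarts, then
#    re-enable it once the node has rejoined the cluster.
requests.put(ES + "/_cluster/settings", json={
    "transient": {"cluster.routing.allocation.enable": "none"}
})
# ... restart the node here ...
requests.put(ES + "/_cluster/settings", json={
    "transient": {"cluster.routing.allocation.enable": "all"}
})
```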

Improvements seem unlikely:

  • Delayed allocation does not seem to work as expected. It delays the time before shards start being reallocated, but does not reduce the time it takes to reallocate them.
  • Freezing writes and flushing indices has no impact on recovery time.
  • The current throttling configuration already has some impact on response time during recovery, so it does not seem wise to raise the limits much (see the sketch after this list).
  • Restarting multiple nodes at a time requires improvement to the current automation. We might want to revisit this at a later point.
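
For reference, a sketch of how the recovery throttling settings could be inspected and cautiously bumped; the endpoint and the example values are assumptions for illustration, not recommendations.

```python
# Sketch: look at the current recovery throttling overrides and apply a small,
# cautious bump. Setting names follow the Elasticsearch documentation of the
# time; the values are only examples.
import requests

ES = "http://localhost:9200"

# Recovery throttling overrides, if any, live in the cluster settings.
print(requests.get(ES + "/_cluster/settings").json())

# Small example adjustment; as noted above, pushing these much higher already
# shows up in query response times during recovery.
requests.put(ES + "/_cluster/settings", json={
    "transient": {
        "indices.recovery.max_bytes_per_sec": "60mb",
        "cluster.routing.allocation.node_concurrent_recoveries": 3,
    }
})
```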

Conclusion: we looked, we tried and we failed. In the process, the restarts themselves have been better automated and are less painful.

Re-opening this and linking it to the upstream ticket: https://github.com/elastic/elasticsearch/issues/21884

Let's see if the good folks at elastic have more ideas than we have.

Mentioned in SAL (#wikimedia-operations) [2016-12-05T10:27:41Z] <gehel> enabling trace logging on indices recovery on elasticsearch codfw - T145065
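
For context, recovery trace logging of the kind mentioned in that SAL entry can be toggled at runtime through the cluster settings API. This is a generic sketch rather than the exact command that was run, and it assumes the API is reachable on localhost:9200; on Elasticsearch 5.x and later the logger key is the fully qualified `logger.org.elasticsearch.indices.recovery`.

```python
# Sketch: enable trace logging for indices recovery while investigating, then
# turn it back down to avoid noisy logs. Assumes localhost:9200.
import requests

ES = "http://localhost:9200"

# Enable trace logging for recoveries (2.x-style logger key).
requests.put(ES + "/_cluster/settings", json={
    "transient": {"logger.indices.recovery": "TRACE"}
})

# ... observe a few recoveries ...

# Set the logger back to a quieter level afterwards.
requests.put(ES + "/_cluster/settings", json={
    "transient": {"logger.indices.recovery": "INFO"}
})
```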

In the latest cluster restart, we did manage to restart a few nodes in < 3 minutes with writes disabled. There is evidence that writes are still happening in the cluster during the write freeze (the sync_id changes while writes are supposedly frozen). The current scripts include a lot of wait time to ensure writes are processed between node restarts; this needs to be improved to actually gain time during cluster restarts.
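
A sketch of how the sync_id observation could be checked automatically: collect the per-shard sync_id before and after a pause during the write freeze and report any that changed. It assumes the API is reachable on localhost:9200 and that sync_id is exposed under commit.user_data in shard-level stats, which matches the Elasticsearch versions of that era but is worth re-verifying.

```python
# Sketch: detect writes sneaking in during a write freeze by watching sync_ids.
import time
import requests

ES = "http://localhost:9200"

def sync_ids():
    """Map (index, shard number, copy index) -> sync_id from shard-level stats."""
    stats = requests.get(ES + "/_stats?level=shards").json()
    result = {}
    for index, data in stats["indices"].items():
        for shard, copies in data["shards"].items():
            for i, copy in enumerate(copies):
                user_data = copy.get("commit", {}).get("user_data", {})
                result[(index, shard, i)] = user_data.get("sync_id")
    return result

before = sync_ids()
time.sleep(60)  # writes are supposed to be frozen during this window
after = sync_ids()

changed = [key for key in before if before[key] != after.get(key)]
print("shard copies whose sync_id changed during the freeze:", changed)
```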