
Decrease time required to fully restart the Cirrus elasticsearch clusters
Closed, ResolvedPublic


Currently, a full restart of the Cirrus elasticsearch clusters takes days. Various strategies could be applied to improve the situation, some already known, some not yet identified. Ideas are welcome! In all cases, we need to find a way to test and validate those strategies, which is nontrivial because the time needed for a restart depends on cluster size, data size and load.

Some ideas:

  • Now that we have row-aware shard allocation, we should be able to restart a full row at a time, or at least multiple nodes.
  • Tuning delayed allocation might help reduce recovery time.
  • ...
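
A minimal sketch of how restart batches could be planned so that a batch never spans more than one row, assuming each node is tagged with the row it lives in (as row-aware allocation requires). The host names, the `plan_batches` helper and the batch size are hypothetical, and nothing here talks to a real cluster.

```python
from collections import defaultdict
from typing import Dict, List


def plan_batches(node_rows: Dict[str, str], batch_size: int) -> List[List[str]]:
    """Group nodes into restart batches; every batch stays within one row."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    by_row = defaultdict(list)
    for node, row in sorted(node_rows.items()):
        by_row[row].append(node)
    batches = []
    for row in sorted(by_row):
        nodes = by_row[row]
        # Slice each row into chunks of at most batch_size nodes.
        for i in range(0, len(nodes), batch_size):
            batches.append(nodes[i:i + batch_size])
    return batches


# Hypothetical inventory: node name -> row.
nodes = {"es1": "A", "es2": "A", "es3": "A", "es4": "B", "es5": "B"}
batches = plan_batches(nodes, batch_size=2)
for batch in batches:
    print(batch)
```

Setting `batch_size` to the size of the largest row gives the "full row at a time" variant; smaller values give the more conservative "multiple nodes" variant.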

Event Timeline

Gehel created this task.Sep 8 2016, 12:10 PM
Restricted Application added a project: Discovery.Sep 8 2016, 12:10 PM
Restricted Application added a subscriber: Aklapper.
debt triaged this task as Medium priority.Sep 8 2016, 10:18 PM
debt moved this task from needs triage to Up Next on the Discovery-Search board.
Gehel added a comment.Sep 9 2016, 8:57 AM

Some notes of a discussion with @EBernhardson and @dcausse:

  • Our most promising option is probably tuning delayed allocation. Delayed allocation needs to be long enough to allow shards to fully recover before being moved, but short enough so that if a node is down for real, shards are relocated. Raising it to 10 minutes is probably a good start.
  • Freezing writes and flushing indices before restarting a node should help. This step was removed from the restart procedure because it did not improve the situation, but we need to revisit it.
  • Tuning of initial recovery settings (number of recoveries, throttling, ...) might help.
  • While restarting a full row at a time carries a large risk, and probably has some performance impact on reads, restarting 2 nodes in the same row at the same time should be safe. We need some experimentation to find the optimal number of nodes to restart at the same time.
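
The first three points above map onto a handful of Elasticsearch requests. The sketch below only builds the request bodies and does not send them. The setting names (`index.unassigned.node_left.delayed_timeout`, `cluster.routing.allocation.node_concurrent_recoveries`, `indices.recovery.max_bytes_per_sec`) and the synced flush endpoint are the ones Elasticsearch documented for the versions of that era (synced flush was removed in later releases). The `pre_restart_requests` helper is hypothetical, and apart from the 10 minute delay mentioned above, any values passed in are placeholders rather than recommendations.

```python
import json
from typing import Any, Dict, List, Optional, Tuple

# (HTTP method, path, JSON body or None)
Request = Tuple[str, str, Optional[Dict[str, Any]]]


def pre_restart_requests(delay: str = "10m",
                         concurrent_recoveries: Optional[int] = None,
                         max_bytes_per_sec: Optional[str] = None) -> List[Request]:
    """Build the requests one might issue before restarting a node."""
    requests: List[Request] = [
        # Delay reallocation of shards from a node that left the cluster.
        ("PUT", "/_all/_settings",
         {"settings": {"index.unassigned.node_left.delayed_timeout": delay}}),
    ]
    recovery: Dict[str, Any] = {}
    if concurrent_recoveries is not None:
        recovery["cluster.routing.allocation.node_concurrent_recoveries"] = concurrent_recoveries
    if max_bytes_per_sec is not None:
        recovery["indices.recovery.max_bytes_per_sec"] = max_bytes_per_sec
    if recovery:
        # Transient settings do not survive a full cluster restart.
        requests.append(("PUT", "/_cluster/settings", {"transient": recovery}))
    # Synced flush, meant to be issued once writes are frozen.
    requests.append(("POST", "/_flush/synced", None))
    return requests


for method, path, body in pre_restart_requests():
    print(method, path, json.dumps(body) if body else "")
```

Keeping the payloads as plain data makes it easy to review exactly what a restart script would change before it touches the cluster.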
Gehel claimed this task.Sep 12 2016, 3:39 PM
debt moved this task from Up Next to This Quarter on the Discovery-Search board.
Gehel closed this task as Declined.Oct 25 2016, 5:39 PM

Improvement seems unlikely:

  • Delayed allocation does not seem to work as expected. It delays the point at which shards start being reallocated, but does not reduce the time it takes to reallocate them.
  • Freezing writes and flushing indices has no impact on recovery time.
  • Our current throttling configuration already has some impact on response time during recovery. It does not seem wise to raise the limit much.
  • Restarting multiple nodes at a time requires improvements to the current automation. We might want to revisit this at a later point.

Conclusion: we looked, we tried and we failed. In the process, the restarts themselves have been better automated and are less painful.

Gehel reopened this task as Open.Nov 30 2016, 2:05 PM

Re-opening this and linking it to the upstream ticket:

Let's see if the good folks at Elastic have more ideas than we have.

Mentioned in SAL (#wikimedia-operations) [2016-12-05T10:27:41Z] <gehel> enabling trace logging on indices recovery on elasticsearch codfw - T145065

Gehel added a comment.Dec 12 2016, 3:24 PM

In the latest cluster restart, we did manage to restart a few nodes in < 3 minutes when writes were disabled. There is evidence that writes are still happening in the cluster during the write freeze (the sync_id changes while writes are frozen). The current scripts have a lot of wait time to ensure writes are processed between node restarts. This needs to be improved to actually gain time during cluster restarts.
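
A minimal sketch of the check described above: compare two snapshots of per-shard sync_id values taken during a write freeze and report the shard copies whose sync_id changed. How the snapshots are collected is left out (shard-level index statistics are one possible source, which is an assumption here). The `index/shard/node` key format and the sample values are hypothetical.

```python
from typing import Dict, List, Optional

# Key: hypothetical "index/shard/node" label for a shard copy.
# Value: the sync_id observed for that copy, or None if it has none.
Snapshot = Dict[str, Optional[str]]


def changed_sync_ids(before: Snapshot, after: Snapshot) -> List[str]:
    """Return shard copies present in both snapshots whose sync_id differs."""
    return sorted(
        key for key in before.keys() & after.keys()
        if before[key] != after[key]
    )


# Made-up sample data, for illustration only.
before = {"enwiki/0/es1": "abc", "enwiki/1/es2": "def"}
after = {"enwiki/0/es1": "abc", "enwiki/1/es2": "xyz"}
changed = changed_sync_ids(before, after)
print(changed)
```

A non-empty result during a freeze suggests that something is still writing to those shards, since a copy that has received writes no longer carries the same sync_id.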

TJones closed this task as Resolved.Jan 29 2019, 6:39 PM