
Make elasticsearch actually uses shard allocation awareness
Closed, Resolved · Public

Description

A quick look at the elasticsearch configuration suggests that we do set rack and row node attributes, but we do not actually use them to spread shards across racks / rows.

It seems that we do not set cluster.routing.allocation.awareness.attributes. We probably want to set cluster.routing.allocation.awareness.attributes: rack, but that will probably move a lot of shards around, and we need some way to check how well balanced the cluster is once this is set.
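One way to check how well balanced the cluster is would be to count, for each shard, how many copies sit in each rack or row. The following is a minimal sketch only: the helper names (`copies_per_zone`, `concentrated_shards`), the node names and the sample data are all hypothetical. In practice the shard-to-node mapping could be taken from something like the `_cat/shards` output, and the node-to-zone mapping from the rack / row node attributes.

```python
from collections import Counter, defaultdict


def copies_per_zone(shard_nodes, node_zone):
    """Map each shard to a Counter of zone -> number of copies in that zone."""
    result = defaultdict(Counter)
    for shard, nodes in shard_nodes.items():
        for node in nodes:
            result[shard][node_zone[node]] += 1
    return dict(result)


def concentrated_shards(shard_nodes, node_zone):
    """Return shards that have several copies, all located in a single zone."""
    per_zone = copies_per_zone(shard_nodes, node_zone)
    return sorted(
        shard
        for shard, zones in per_zone.items()
        if sum(zones.values()) > 1 and len(zones) == 1
    )


# Hypothetical sample data: node -> zone, and shard -> nodes holding a copy.
node_zone = {"es1": "A", "es2": "A", "es3": "B", "es4": "C"}
shard_nodes = {
    "index_a[0]": ["es1", "es3", "es4"],  # spread over three zones
    "index_a[1]": ["es1", "es2"],         # both copies in zone A
}

print(concentrated_shards(shard_nodes, node_zone))
```

Any shard reported by `concentrated_shards` would lose all of its copies if that single zone went down, which is exactly what allocation awareness is meant to prevent.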

Event Timeline

Restricted Application added a subscriber: Aklapper.
debt claimed this task.
debt triaged this task as Medium priority.
debt subscribed.

It looks like this was already done - closing for now.

Gehel renamed this task from Check that elasticsearch actually uses shard allocation awareness to Make elasticsearch actually uses shard allocation awareness. · Aug 29 2016, 2:00 PM
Gehel reopened this task as Open.

I'm reopening this and have changed the title. The previous title was just about checking the configuration, but this should really be about correcting it, which is not yet done.

debt moved this task from needs triage to Up Next on the Discovery-Search board.

Thanks!

Change 308561 had a related patch set uploaded (by Gehel):
elasticsearch - enable row aware shard allocation

https://gerrit.wikimedia.org/r/308561

Allocation will be row aware, not rack aware. Spreading shards across rows will ensure that they are also spread across racks, and elasticsearch does not seem to support multi-dimensional allocation awareness.
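For illustration, row awareness comes down to a single setting. The sketch below builds a request body in the shape accepted by the cluster settings API (`PUT /_cluster/settings`). The setting name comes from the task description and the `row` value from the comment above; the `awareness_settings` helper is hypothetical, and this is not necessarily how the patch itself applies the change.

```python
import json


def awareness_settings(attribute="row", persistent=True):
    """Build a cluster settings body enabling allocation awareness.

    `attribute` is the node attribute to spread shard copies across.
    """
    scope = "persistent" if persistent else "transient"
    return {scope: {"cluster.routing.allocation.awareness.attributes": attribute}}


body = awareness_settings("row")
print(json.dumps(body, indent=2))
```

The same key could also be set in the file-based configuration, so that it survives independently of the cluster state.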

Discussion with @dcausse:

It is not entirely clear from the documentation what happens if we have more shards than awareness zones. On one hand it seems that it should be OK unless we enable forced awareness. On the other, there is a note in the docs:

When the number of nodes in groups is unbalanced and there are many replicas, replica shards may be left unassigned.

This note probably applies to the case where we have very different numbers of nodes in each allocation zone.
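To make that note concrete, here is a rough model of when replicas could be left unassigned. It is a simplification based on two assumptions, not the actual allocation logic: each zone accepts at most ceil(copies / zones) copies of a shard, and a node never holds two copies of the same shard. The `unassigned_copies` helper and the node counts are hypothetical.

```python
import math


def unassigned_copies(copies, nodes_per_zone):
    """Estimate how many copies of one shard cannot be placed.

    Simplified model (an assumption, not the real allocation decider):
    - a zone holds at most ceil(copies / number_of_zones) copies;
    - a node holds at most one copy of a given shard.
    """
    zones = len(nodes_per_zone)
    cap = math.ceil(copies / zones)
    placeable = sum(min(cap, nodes) for nodes in nodes_per_zone.values())
    return max(0, copies - placeable)


# Balanced zones: 4 copies over 4 equally sized zones all fit.
print(unassigned_copies(4, {"A": 8, "B": 8, "C": 8, "D": 8}))

# Unbalanced zones with many replicas: the single-node zone cannot take its share.
print(unassigned_copies(6, {"A": 10, "B": 1}))
```

Under this model, problems only appear when a zone has fewer nodes than its share of copies, which matches the reading that the note is about very unbalanced zones.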

The risk with enabling allocation awareness is that suddenly some shards can't be allocated. This would have a negative impact on performance, but should not impact H/A much (we should still have up to 4 copies of the same shard).

I will enable allocation awareness on Tuesday Sept 6 in the morning so that we have the full day to keep a close eye on shard allocation.

Mentioned in SAL [2016-09-07T05:38:52Z] <gehel> enabling row aware allocation on elasticsearch codfw - T143571

Mentioned in SAL [2016-09-07T07:00:52Z] <gehel> increase cluster_concurrent_rebalance on elasticsearch codfw - T143571

Mentioned in SAL [2016-09-07T12:58:36Z] <gehel> enabling row aware allocation on elasticsearch eqiad - T143571

Change 308561 merged by Gehel:
elasticsearch - enable row aware shard allocation

https://gerrit.wikimedia.org/r/308561

Row-aware allocation is active on both the eqiad and codfw clusters. The file-based configuration is updated. Neither cluster is showing any sign of distress related to this change. This task can be closed.

Additional notes:
We *should* be able to restart a full row at a time during cluster restarts, now that we ensure that shards are spread across multiple rows. This still brings some level of risk, as we would run on 1/4 of the cluster for a short time. This is a very interesting possibility, as it would allow a much faster cluster restart, but we need to come up with a test plan first...
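As a possible starting point for such a test plan, a pre-flight check could verify that no shard has all of its copies in the row about to be restarted. This is a sketch under assumptions: the function names, node names and sample data are hypothetical, and the mappings would have to be collected from the live cluster.

```python
def shards_lost_if_row_down(shard_nodes, node_row, row):
    """Return shards whose every copy lives on a node in `row`."""
    return sorted(
        shard
        for shard, nodes in shard_nodes.items()
        if nodes and all(node_row[node] == row for node in nodes)
    )


def safe_to_restart(shard_nodes, node_row, row):
    """True if every shard keeps at least one copy outside `row`."""
    return not shards_lost_if_row_down(shard_nodes, node_row, row)


# Hypothetical sample data: node -> row, and shard -> nodes holding a copy.
node_row = {"es1": "A", "es2": "B", "es3": "C", "es4": "D", "es5": "A"}
shard_nodes = {
    "index_a[0]": ["es1", "es2", "es3"],  # spread across rows A, B, C
    "index_a[1]": ["es1", "es5"],         # both copies in row A
}

print(safe_to_restart(shard_nodes, node_row, "B"))
print(shards_lost_if_row_down(shard_nodes, node_row, "A"))
```

A check like this only covers availability of at least one copy; it says nothing about whether the remaining nodes can absorb the query load while a row is down.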