Investigate reducing number of servers in the elasticsearch cluster
Closed, Resolved · Public

Description

We might not be able to buy as many servers as expected for the codfw cluster refresh, so we need to investigate the impact of reducing the number of machines in the cluster.

Testing strategy:

  • depool + ban older servers in the eqiad cluster
  • check the impact
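
As an illustration of the "ban" step, here is a minimal sketch that builds the request body for the Elasticsearch cluster settings API (PUT /_cluster/settings) using the `cluster.routing.allocation.exclude._name` filter. The task does not say which tooling was actually used, so the helper name `build_ban_settings` and the node names below are assumptions made for the example. Depooling from the load balancer is a separate step and is not covered here.

```python
import json


def build_ban_settings(node_names, transient=True):
    """Build a body for PUT /_cluster/settings that excludes nodes by name.

    Shards are moved off excluded nodes when other allocation rules allow it.
    An empty list yields an empty string, which clears the exclusion (unban).
    """
    scope = "transient" if transient else "persistent"
    return {
        scope: {
            "cluster.routing.allocation.exclude._name": ",".join(node_names)
        }
    }


if __name__ == "__main__":
    # Hypothetical node names; the real names depend on the cluster.
    body = build_ban_settings(["node-a", "node-b"])
    print(json.dumps(body, indent=2))
```

The function only builds the payload, so it can be reviewed or logged before anything is sent to the cluster.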

Things we expect to be problematic at some point:

  • shard allocation: the current sharding strategy will need to be adapted to fewer machines, and large indices will need to be resharded
  • latency might increase with fewer resources
  • we need to keep enough headroom to be able to lose a few servers accidentally
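
The shard allocation and headroom concerns above can be made concrete with a little arithmetic. This is a rough sketch, not the cluster's actual configuration: the shard and replica counts in the example are hypothetical, and the function names are made up for illustration. It computes the smallest per-node shard cap that still lets every shard copy of one index be placed, and checks whether that cap still holds after losing some nodes.

```python
import math


def min_shards_per_node(primaries, replicas, nodes):
    """Smallest per-node cap that lets all copies of one index be allocated."""
    total_copies = primaries * (1 + replicas)
    return math.ceil(total_copies / nodes)


def survives_node_loss(primaries, replicas, nodes, lost, per_node_limit):
    """True if all shard copies still fit under the cap after losing nodes."""
    remaining = nodes - lost
    if remaining <= 0:
        return False
    return min_shards_per_node(primaries, replicas, remaining) <= per_node_limit


if __name__ == "__main__":
    # Hypothetical index: 7 primaries with 3 replicas each, i.e. 28 copies.
    for nodes in (36, 30, 28, 24):
        print(nodes, "nodes -> cap of", min_shards_per_node(7, 3, nodes))
```

With these made-up numbers, a cap of 1 shard per node stops being sufficient once fewer than 28 nodes remain, which is the kind of threshold that forces either a higher cap or resharding.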

Mitigations:

  • we can disable a number of functionalities to reduce the load at the cost of functional degradation
  • since this reduced cluster size is only for codfw, we could keep the same configuration for now and have a plan to reduce functionalities if a switch to codfw is needed

Related Objects

Status    Assigned    Task
Resolved  Gehel

Event Timeline

Gehel triaged this task as Normal priority. · Oct 23 2018, 8:22 AM
Gehel created this task.
Restricted Application added a subscriber: Aklapper. · Oct 23 2018, 8:22 AM

Mentioned in SAL (#wikimedia-operations) [2018-10-23T09:31:41Z] <gehel> depooling / banning elastics1017 and 1022 - T207724

Mentioned in SAL (#wikimedia-operations) [2018-10-23T12:29:40Z] <gehel> depooling / banning elastics1028 and 1030 - T207724

Mentioned in SAL (#wikimedia-operations) [2018-10-23T13:22:37Z] <gehel> depooling / banning elastics1018 - T207724

Mentioned in SAL (#wikimedia-operations) [2018-10-23T13:43:51Z] <gehel> depooling / banning elastics1029 - T207724

Mentioned in SAL (#wikimedia-operations) [2018-10-23T13:48:59Z] <gehel> depooling / banning elastics1031 - T207724

Mentioned in SAL (#wikimedia-operations) [2018-10-23T14:02:32Z] <gehel> repooling / banning elastics1031 - T207724

Gehel added a comment. Edited · Oct 23 2018, 2:40 PM

With 6 servers depooled / banned, the cluster seems to be just fine. Starting at 7 nodes depooled, I see the load rising on some of the other servers. The response times don't show any significant change.

Note that there are still 2 dewiki_content shards and 1 enwiki_content shard on the depooled servers, so not all the load has been transferred.

Given that eqiad has less capacity overall than codfw (some older servers, and 35 nodes vs 36) and that we have some options to degrade functionality to reduce load, this already looks like a good indication that we could remove 6 servers from codfw and still be fine.

Next steps:

  • keep the eqiad cluster running on 29 nodes and under surveillance
  • tweak the total_shards_per_node to move the remaining shards away from the banned nodes
  • evaluate the results
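
For the total_shards_per_node step, a minimal sketch of the index settings update follows. It assumes the setting meant is `index.routing.allocation.total_shards_per_node`, applied through the index settings API (PUT /<index>/_settings). Presumably the remaining shards stayed on the banned nodes because the current cap left no other node free to take them, so the cap would need to be raised. The helper names and the value 2 are hypothetical.

```python
import json


def build_shards_per_node_settings(limit):
    """Build a body for PUT /<index>/_settings that sets the per-node shard cap."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return {"index.routing.allocation.total_shards_per_node": limit}


def settings_path(index_name):
    """Return the index settings API path for the given index."""
    return f"/{index_name}/_settings"


if __name__ == "__main__":
    # The index name is taken from the comment above; the value 2 is made up.
    print("PUT", settings_path("enwiki_content"))
    print(json.dumps(build_shards_per_node_settings(2), indent=2))
```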

Gehel added a comment. · Oct 24 2018, 5:30 PM

Actually, we are already seeing some pool counter errors with 29 nodes on eqiad.

All nodes are repooled. This gives us a baseline: I would be mostly confident with 30 servers on codfw, given that we have strategies to reduce the load if need be.

Gehel mentioned this in Unknown Object (Task). · Oct 24 2018, 5:34 PM
debt closed this task as Resolved. · Nov 2 2018, 10:05 PM