We realized that Elasticsearch is not configured to balance shards across the different rows. To enable this while keeping shards balanced across nodes, we first need to spread the nodes themselves more evenly across rows. In particular, we have 17 nodes in row D, which caused some pain when we lost networking on row D.
Current situation:
A 3: elastic10(30|31|32|33|34|35) - 6 nodes
B 3: elastic10(36|37|38|39) - 4 nodes
C 5: elastic10(40|41|42|43) - 4 nodes
D 3: elastic10(17|18|19|20|21|22) - 6 nodes
D 4: elastic10(23|24|25|26|27|28|29|44|45|46|47) - 11 nodes
Current master eligible: elastic10(30|36|40)
(masters have to be spread across different rows as well)
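As a rough sanity check of the numbers above, the sketch below sums the nodes per row and works out one possible even split. The even-split target, and the choice to leave the remainder on the currently largest rows, are assumptions made for illustration and not a decided plan. It also assumes the first letter of each rack position is the row.

```python
from collections import Counter

# Node counts per rack position, copied from the "Current situation" list.
racks = {
    "A 3": 6,
    "B 3": 4,
    "C 5": 4,
    "D 3": 6,
    "D 4": 11,
}

# Aggregate rack positions into rows (assumption: the row is the leading letter).
per_row = Counter()
for rack, count in racks.items():
    per_row[rack.split()[0]] += count

total = sum(per_row.values())
base, extra = divmod(total, len(per_row))

# Assumption: aim for an even split, giving the remainder to the largest rows
# so that as few nodes as possible have to move.
rows_by_size = sorted(per_row, key=per_row.get, reverse=True)
target = {row: base + (1 if i < extra else 0) for i, row in enumerate(rows_by_size)}

# Positive delta: the row should gain nodes. Negative: it should lose nodes.
delta = {row: target[row] - per_row[row] for row in sorted(per_row)}
nodes_to_move = sum(d for d in delta.values() if d > 0)

print("current:", dict(per_row))
print("target: ", target)
print("delta:  ", delta)
print("nodes to move:", nodes_to_move)
```

Under these assumptions the number of nodes to move comes out at 9, which matches the 4 + 5 batches in the procedure.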
Procedure to move nodes around, to be done in 2 batches (4 nodes + 5 nodes):
- ban the nodes to be moved from the cluster: es-tool ban-node <IP_of_node_to_ban>
- move the nodes
- update [[ https://github.com/wikimedia/operations-puppet/blob/production/hieradata/regex.yaml | regex.yaml ]] with new row information
- update IP configuration (DNS, DHCP, ...) and documentation (racktables, ...) (we probably have a checklist somewhere, but I can't find it)
- preemptively ban the new IPs of the nodes to prevent them from joining the cluster before being reprovisioned
- reprovision the nodes
- unban all nodes from the cluster: es-tool unban-node <IP_of_node_to_unban>
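For reference, the ban and unban steps above can be illustrated with Elasticsearch's allocation filtering setting `cluster.routing.allocation.exclude._ip`. This is a minimal sketch that assumes es-tool does something equivalent under the hood, which has not been checked here. The helper name `build_ban_body` and all IP addresses are hypothetical. The code only builds the request body for `PUT /_cluster/settings` and does not contact any cluster.

```python
import json


def build_ban_body(banned_ips):
    """Build a cluster settings body excluding the given IPs from shard allocation.

    An empty list maps to None (JSON null), which resets the setting on
    Elasticsearch 5.0 and later; older versions may need a different value.
    """
    value = ",".join(sorted(set(banned_ips))) if banned_ips else None
    return {"transient": {"cluster.routing.allocation.exclude._ip": value}}


# Hypothetical addresses: old IPs of the nodes being moved, and the new IPs
# they will get in their new row (banned preemptively until reprovisioned).
old_ips = ["192.0.2.23", "192.0.2.24"]
new_ips = ["198.51.100.23", "198.51.100.24"]

ban_body = build_ban_body(old_ips + new_ips)
unban_body = build_ban_body([])

print(json.dumps(ban_body, indent=2))
print(json.dumps(unban_body, indent=2))
```

Banning the old and new addresses in a single setting keeps the exclusion list in one place, so clearing it at the end unbans everything at once.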
Notes:
- we definitely don't want to take more than 6 nodes out of the cluster at the same time
- a rolling restart of the eqiad cluster is in progress; we need to wait for it to finish before moving servers around