
Install and configure new elasticsearch servers in eqiad
Closed, Resolved, Public

Description

New elasticsearch servers have been received and racked (T129381). We can now install and configure them and have them join our existing cluster.

Left to do:

  • remove old servers from LVS
  • reboot all servers to take the new unicast_hosts configuration into account
  • ban nodes 1001 to 1016 (es-tool ban-node)
  • shut down nodes 1001 to 1016
  • remove references to nodes 1001 to 1016 from puppet
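
The ban step above could also be expressed directly against the Elasticsearch cluster settings API. The following is a minimal sketch only: it assumes node names match the short hostnames (elastic1001 to elastic1016), and it does not claim to reproduce what es-tool ban-node does internally. The helper names are hypothetical; `cluster.routing.allocation.exclude._name` is a standard Elasticsearch allocation filter. The code only builds the request body and sends nothing.

```python
import json


def node_names(first, last, prefix="elastic"):
    """Return node names prefix+first .. prefix+last (inclusive).

    Assumption: Elasticsearch node names equal the short hostnames.
    """
    return [f"{prefix}{n}" for n in range(first, last + 1)]


def ban_settings(names):
    """Build a transient cluster-settings body excluding the given nodes.

    Shards are moved away from excluded nodes, so the nodes can later
    be shut down without losing replicas.
    """
    return {
        "transient": {
            "cluster.routing.allocation.exclude._name": ",".join(names)
        }
    }


# Build (but do not send) the body for the old servers.
request_body = json.dumps(ban_settings(node_names(1001, 1016)), indent=2)
```

The resulting body would be sent with a PUT to `/_cluster/settings`. Once the banned nodes hold no shards, they can be shut down.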

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper.

Configuration of new servers was done in https://gerrit.wikimedia.org/r/#/c/294918/ (sorry, did not add bug ID to commit before merging).

elastic1032 is installed and configured. It joined the cluster without issues and is starting to receive shards. It has not yet been added to LVS.

Change 295369 had a related patch set uploaded (by Gehel):
Adding missing dependency in exposing puppet SSL certs on elasticsearch

https://gerrit.wikimedia.org/r/295369

Change 295473 had a related patch set uploaded (by Gehel):
Configuring new elastic1033-1037 servers

https://gerrit.wikimedia.org/r/295473

Change 295369 merged by Gehel:
Adding missing dependency in exposing puppet SSL certs on elasticsearch

https://gerrit.wikimedia.org/r/295369

Change 295473 merged by Gehel:
Configuring new elastic1033-1037 servers

https://gerrit.wikimedia.org/r/295473

Change 295490 had a related patch set uploaded (by Gehel):
Configuring new elastic1038-1042 servers

https://gerrit.wikimedia.org/r/295490

Change 295490 merged by Gehel:
Configuring new elastic1038-1042 servers

https://gerrit.wikimedia.org/r/295490

Change 295524 had a related patch set uploaded (by Gehel):
Configuring new elastic1043-1047 servers

https://gerrit.wikimedia.org/r/295524

Change 295524 merged by Gehel:
Configuring new elastic1043-1047 servers

https://gerrit.wikimedia.org/r/295524

Change 295536 had a related patch set uploaded (by Gehel):
Adding rack location of new elasticsearch servers

https://gerrit.wikimedia.org/r/295536

Change 295536 merged by Gehel:
Adding rack location of new elasticsearch servers

https://gerrit.wikimedia.org/r/295536

Change 295537 had a related patch set uploaded (by Gehel):
Fixed missing location of elastic1045

https://gerrit.wikimedia.org/r/295537

Change 295537 merged by Gehel:
Fixed missing location of elastic1045

https://gerrit.wikimedia.org/r/295537

Change 295585 had a related patch set uploaded (by Gehel):
Moving elasticsearch masters to new servers

https://gerrit.wikimedia.org/r/295585

Plan to deploy https://gerrit.wikimedia.org/r/295585 (move elastic masters to new servers):

  1. merge and deploy change
  2. restart one old master
  3. let the cluster go back to green
  4. restart one of the new masters
  5. let the cluster go back to green
  6. GOTO 2 unless no more old masters to restart
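
The loop in steps 2 to 6 can be sketched as follows. This is an illustration under stated assumptions, not the procedure that was actually run: `restart` and `cluster_status` are hypothetical callables supplied by the operator (for example, a wrapper around a service restart and a wrapper that reads the `status` field of `GET /_cluster/health`), and the host names in the usage note are placeholders.

```python
def wait_for_green(cluster_status, pause, max_polls):
    """Poll until the cluster reports 'green', or give up after max_polls."""
    for _ in range(max_polls):
        if cluster_status() == "green":
            return
        pause()
    raise TimeoutError("cluster did not return to green")


def rolling_master_restart(old_masters, new_masters, restart, cluster_status,
                           pause=lambda: None, max_polls=1000):
    """Alternate restarts: one old master, then one new master.

    After every restart, wait for the cluster to go back to green
    before touching the next node. Stops when no old masters remain.
    """
    remaining_new = list(new_masters)
    for old in old_masters:
        restart(old)                      # step 2
        wait_for_green(cluster_status, pause, max_polls)  # step 3
        if remaining_new:
            restart(remaining_new.pop(0))  # step 4
            wait_for_green(cluster_status, pause, max_polls)  # step 5
```

In practice `pause` would be something like `lambda: time.sleep(10)`, and the call would look like `rolling_master_restart(["old-a", "old-b", "old-c"], ["new-a", "new-b", "new-c"], restart, cluster_status)`. Waiting for green between restarts keeps at most one master-eligible node out of the cluster at any time.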

@dcausse / @EBernhardson does this plan look sound to you?

Seems sane enough. Both ways give us opportunity for failure. I think the 3->2->3 masters route is the easier one to recover from (in the relatively unlikely occurrence of a network partition / master node failure).

Going from 3 masters to 2, then back to 3, risks the second master dropping out while we only have 2. I'm not completely sure, but I think that with minimum_master_nodes set to 2, if we only have 2 masters and a network partition occurs, the cluster will stop responding to requests until at least 2 masters are able to communicate again.

Going from 3 masters to 4, then back to 3 risks two groups of two masters having a network partition, and both sides thinking they are in control of the cluster. This is probably worse than the option above.
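
The trade-off between the two routes can be checked with a small model. This is a simplified sketch, assuming a partition side can elect a master only if it sees at least minimum_master_nodes master-eligible nodes; the function names are hypothetical and nothing here talks to a real cluster.

```python
def quorum(master_eligible):
    """Majority of master-eligible nodes: (n / 2) + 1 with integer division."""
    return master_eligible // 2 + 1


def partition_outcome(partition_sizes, minimum_master_nodes):
    """Classify what happens when master-eligible nodes split into groups.

    partition_sizes: number of master-eligible nodes on each side.
    """
    electing = [s for s in partition_sizes if s >= minimum_master_nodes]
    if not electing:
        return "unavailable"   # no side can elect a master
    if len(electing) == 1:
        return "ok"            # exactly one side keeps a master
    return "split-brain"       # several sides each elect a master


# 3 masters, one isolated: the side with 2 keeps working.
three = partition_outcome([2, 1], minimum_master_nodes=2)
# 2 masters split apart: nobody reaches 2, the cluster stops.
two = partition_outcome([1, 1], minimum_master_nodes=2)
# 4 masters split 2/2 with the setting still at 2: both sides elect.
four = partition_outcome([2, 2], minimum_master_nodes=2)
```

Under this model the 3->2->3 route fails closed (the cluster becomes unavailable), while the 3->4->3 route with minimum_master_nodes left at 2 can fail open into a split brain. With 4 master-eligible nodes, `quorum(4)` is 3, so avoiding that risk would require raising the setting for the duration of the change.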

Change 295649 had a related patch set uploaded (by Gehel):
Decommission old maps servers

https://gerrit.wikimedia.org/r/295649

Change 295657 had a related patch set uploaded (by Gehel):
Add new elasticsearch servers to LVS

https://gerrit.wikimedia.org/r/295657

Change 295585 merged by Gehel:
Moving elasticsearch masters to new servers

https://gerrit.wikimedia.org/r/295585

Change 295657 merged by Gehel:
Add new elasticsearch servers to LVS

https://gerrit.wikimedia.org/r/295657

Mentioned in SAL [2016-06-27T14:34:51Z] <gehel> removing old elasticsearch servers in eqiad from LVS (elastic1001-1016 - T138329)

Mentioned in SAL [2016-06-27T15:16:08Z] <gehel> banning elastic1001 to prepare its decommissioning (T138329)

Mentioned in SAL [2016-06-29T09:23:20Z] <gehel> banning elastic1001 to 1016 from cluster to prepare their decommissioning (T138329)

Change 297274 had a related patch set uploaded (by Gehel):
Remove old elasticsearch servers from LVS

https://gerrit.wikimedia.org/r/297274

Change 297274 merged by Gehel:
Remove old elasticsearch servers from LVS

https://gerrit.wikimedia.org/r/297274

Closing this, as the new elasticsearch servers are installed and serving traffic. The old servers still need to be decommissioned, but that is tracked in a separate task, T139758.