
Only use newer (elastic10{16..47}) servers as master capable elasticsearch nodes
Closed, Resolved · Public

Description

ES is currently configured with three master-capable nodes: elastic1001, elastic1008 and elastic1013. That is one server in rack A3 and two servers in rack C5. With the new elastic1032+ hardware installed, we should now be able to distribute master capability across any set of three machines; mainly, it just needs to be moved away from the machines that are being removed.

Event Timeline

EBernhardson assigned this task to chasemp.
EBernhardson raised the priority of this task to Needs Triage.
EBernhardson updated the task description.
EBernhardson added a subscriber: EBernhardson.
Restricted Application added a subscriber: Aklapper.

I doubt Chase will do the actual re-racking of the servers, but I've assigned this to him for now because I'm not sure how else to move it forward.

Deskana triaged this task as Medium priority. Sep 14 2015, 8:15 PM
Deskana added a subscriber: Deskana.

Removing this from Discovery-Search (Current work), as @EBernhardson and I don't believe this is actionable on our end. Let me know if this is incorrect.

chasemp added a subscriber: chasemp.

I am not going to get to this within the week or even next week, so I'm marking it as up for grabs to reflect that. Hopefully we can knock this out in a few weeks, but it's a somewhat careful and time-consuming operation at the moment.

Change 251024 had a related patch set uploaded (by EBernhardson):
Make three of the newer ES nodes master eligible

https://gerrit.wikimedia.org/r/251024

Change 251025 had a related patch set uploaded (by EBernhardson):
Remove old ES nodes from master capable list

https://gerrit.wikimedia.org/r/251025

It seems this fell by the wayside as Chase and I were both otherwise occupied. I think we should be able to merge the first patch above and reboot each of the instances it touches, then merge the second and reboot the three old instances.
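The "one instance at a time" part of this plan can be sketched as a loop that only restarts the next node once the cluster reports green health. This is a minimal, hypothetical sketch, not the WMF tooling: it assumes the standard `GET /_cluster/health` endpoint, a placeholder base URL, and a hypothetical `restart_node` callback that blocks until the node has rejoined the cluster.

```python
import json
import time
import urllib.request
from typing import Callable, Iterable, List


def cluster_status(base_url: str) -> str:
    """Return the cluster health status ("green", "yellow" or "red").

    Uses GET /_cluster/health; base_url is a placeholder such as
    "http://localhost:9200".
    """
    with urllib.request.urlopen(base_url + "/_cluster/health", timeout=10) as resp:
        return json.load(resp)["status"]


def wait_for_green(get_status: Callable[[], str], poll_seconds: float, max_polls: int) -> None:
    """Poll until the cluster reports green, or raise after max_polls attempts."""
    for _ in range(max_polls):
        if get_status() == "green":
            return
        time.sleep(poll_seconds)
    raise RuntimeError("cluster did not return to green health")


def rolling_restart(nodes: Iterable[str],
                    restart_node: Callable[[str], None],
                    get_status: Callable[[], str],
                    poll_seconds: float = 30.0,
                    max_polls: int = 120) -> List[str]:
    """Restart nodes one at a time, never touching a node unless the cluster is green.

    restart_node is a hypothetical callback assumed to block until the
    restarted node has rejoined the cluster.
    """
    restarted = []
    for node in nodes:
        wait_for_green(get_status, poll_seconds, max_polls)
        restart_node(node)
        restarted.append(node)
    # Make sure the last restart has fully recovered before declaring success.
    wait_for_green(get_status, poll_seconds, max_polls)
    return restarted
```

For the first patch this might be called as `rolling_restart(["elastic1020", "elastic1030", "elastic1031"], my_restart, lambda: cluster_status("http://localhost:9200"))`, where `my_restart` is whatever actually restarts Elasticsearch on a host. Waiting for green before each restart, rather than only after, means a cluster that was already degraded is never made worse.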

Is there a disadvantage to having four eligible masters? I know our minimum master nodes setting is currently 2, which means that if we do maintenance on one eligible master we are in a real danger zone: one more failure and the cluster can no longer elect a master. We haven't been paying much heed to this, and I'm wondering if we shouldn't just make

1030
1031
1024
1025

all eligible (specifics negotiable)

With four nodes we would need to increase discovery.zen.minimum_master_nodes to 3 to prevent a split brain. That is basically no better than having three master-capable nodes, since either way the cluster can tolerate the loss of only one of them. We could go to five with a minimum of 3, I suppose; I can't think of any particular reason that would be bad.
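The quorum arithmetic can be checked with a few lines of Python. This is a minimal sketch assuming the pre-7.x Zen discovery rule, under which Elasticsearch recommends setting `discovery.zen.minimum_master_nodes` to `(master_eligible_nodes / 2) + 1` with integer division; the helper names are hypothetical and not part of any Elasticsearch client.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum size: a strict majority of the master-eligible nodes, (N // 2) + 1."""
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int, in_maintenance: int = 0) -> int:
    """How many further master-eligible nodes can fail before no master can be elected.

    The quorum is derived from the full set of master-eligible nodes, so nodes
    down for maintenance still count towards N but are no longer live.
    A negative result means the quorum is already lost.
    """
    live = master_eligible - in_maintenance
    return live - minimum_master_nodes(master_eligible)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} master-eligible: minimum_master_nodes={minimum_master_nodes(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s), "
              f"{tolerated_failures(n, in_maintenance=1)} with one node in maintenance")
```

This gives a quorum of 2, 3 and 3 for three, four and five nodes, tolerating one, one and two failures respectively. With one node in maintenance, the three- and four-node layouts have no headroom left, which is the danger zone described above; five nodes with a minimum of 3 still tolerate one more failure.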

According to RackTables, we have new Elasticsearch hardware in rows A and D, but not in row C. The servers planned to be the new eligible masters (1020, 1030, 1031; see https://gerrit.wikimedia.org/r/#/c/251024/) are in D/D3, A/A3 and A/A3. Let's wait for the new Elasticsearch servers (T128000), rack them in rows that make sense, and only then switch the eligible masters.
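The concern about rows can be made concrete with a small check. This is a minimal sketch that assumes the first letter of each rack name is its row (as in D3 and A3 above) and treats a whole row going down at once, for example through a row-level network or power outage, as the failure to guard against; `survives_row_loss` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict

# Proposed masters and their racks, as listed in the comment above.
PROPOSED_MASTERS = {
    "elastic1020": "D3",
    "elastic1030": "A3",
    "elastic1031": "A3",
}


def survives_row_loss(masters: Dict[str, str], quorum: int) -> Dict[str, bool]:
    """For each row hosting a master, report whether the masters left after
    losing that entire row still reach the quorum."""
    per_row = Counter(rack[0] for rack in masters.values())  # row letter -> master count
    total = len(masters)
    return {row: total - count >= quorum for row, count in per_row.items()}


if __name__ == "__main__":
    print(survives_row_loss(PROPOSED_MASTERS, quorum=2))
```

With a quorum of 2, losing row D leaves two masters and an election can still succeed, but losing row A leaves only elastic1020, so no master can be elected. Three masters in three different rows would survive the loss of any single row.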

Change 251024 abandoned by EBernhardson:
Make three of the newer ES nodes master eligible

Reason:
with these servers being replaced, this patch is no longer necessary

https://gerrit.wikimedia.org/r/251024

Change 251025 abandoned by EBernhardson:
Remove old ES nodes from master capable list

Reason:
the old nodes are all being replaced, this patch is no longer necessary

https://gerrit.wikimedia.org/r/251025

These nodes are being removed from the cluster, as they have outlived their warranty. The new nodes are a closer match to the rest of the cluster, making this unnecessary.

EBernhardson updated the task description.
EBernhardson renamed this task from Only use newer (elastic10{16..31}) servers as master capable elasticsearch nodes to Only use newer (elastic10{16..47}) servers as master capable elasticsearch nodes. Jun 20 2016, 3:41 PM

This is also partly tracked in T138329. Masters have moved to the new servers (elastic1030, 1036 and 1040). A cluster restart is in progress, and discovery will now use those new servers, but that is actually more related to the decommissioning of elastic1001-1016 than to the move of masters. We can consider this resolved.