Test OpenSearch cluster behavior when we add non-existent hosts to static config
Closed, Resolved (Public)

Description

We need to figure out what happens when we add a nonexistent host to the Elasticsearch static config (yaml file) and restart the service. Will the service be OK as long as it can reach at least X master-eligibles, or will it spin forever?

Creating this ticket to:

  • Test in relforge (or another non-prod environment if Relforge doesn't work)

Details

Other Assignee
RKemper

Event Timeline

I've tested this on the relforge-small-alpha cluster by adding a nonexistent host, garbage1001.eqiad.wmnet, to the list of discovery.zen.ping.unicast.hosts in one node's opensearch.yml config file and restarting the service on that node. Observations:

  • The service starts cleanly, but cannot join the cluster.
  • Other nodes in the same cluster are not affected, even after a service restart, as long as they have valid configuration in discovery.zen.ping.unicast.hosts.
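
Given these observations, it may be worth checking a host list before it goes into discovery.zen.ping.unicast.hosts. The following is a minimal sketch, assuming that failing name resolution is a reasonable proxy for "nonexistent"; the function name unresolvable_hosts and the injectable resolver are hypothetical and not part of any existing tooling.

```python
import socket
from typing import Callable, Iterable, List


def unresolvable_hosts(
    hosts: Iterable[str],
    resolve: Callable[[str], object] = socket.gethostbyname,
) -> List[str]:
    """Return the hosts that the resolver cannot look up.

    `resolve` is injectable so the check can be exercised without DNS.
    """
    missing = []
    for host in hosts:
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            missing.append(host)
    return missing
```

Calling unresolvable_hosts(["garbage1001.eqiad.wmnet"]) with the default resolver performs a real DNS lookup, so the result depends on where it is run.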

Some thoughts about what this means for our migration, specifically about how to handle the masters:

  • We are migrating row-by-row
  • (Needs confirmation) The rolling-operation cookbook deliberately avoids touching the masters until the very end of its run

Scenario 1: The cookbook behaves as expected, and we migrate 100% of the non-master hosts. The only hosts left on Elastic are the master-eligibles. What are the risks? Do we care?

Scenario 2: The cookbook does not consider master status, and we start losing masters as we migrate. It takes a cluster restart for new masters to be recognized. In that case, we would probably use the voting configuration exclusions API to prevent quorum issues while we're reimaging. We could also follow our procedure for adding new masters. It'll be slow because we'll need to roll restart the entire cluster, but it should work.
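
The quorum concern in Scenario 2 can be made concrete with a little arithmetic. This sketch assumes the usual majority rule (more than half of the voting master-eligibles must be reachable) and builds a request path for the voting configuration exclusions API. The function names and node names are hypothetical, and the node_names query parameter is an assumption that should be checked against the version actually running.

```python
from typing import Iterable
from urllib.parse import urlencode


def quorum_size(voting_nodes: int) -> int:
    """Smallest strict majority of the voting master-eligible nodes."""
    return voting_nodes // 2 + 1


def tolerable_losses(voting_nodes: int) -> int:
    """How many voting nodes can be offline while a majority remains."""
    return voting_nodes - quorum_size(voting_nodes)


def exclusions_path(node_names: Iterable[str]) -> str:
    """Path for a POST that asks the cluster to exclude the named nodes."""
    query = urlencode({"node_names": ",".join(node_names)}, safe=",")
    return "/_cluster/voting_config_exclusions?" + query
```

With three master-eligibles, quorum_size(3) is 2 and tolerable_losses(3) is 1, so under this rule only one master can be offline at a time.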

Note: here's a one-liner to get all master-eligibles from the API
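
One way to list master-eligibles, not necessarily the one-liner referred to above, is sketched here. It assumes the _cat/nodes API called with format=json and the name, node.role and master columns, where an "m" in node.role marks a master-eligible node. The function name and the sample rows are hypothetical.

```python
from typing import Dict, List

# Assumed request: GET /_cat/nodes?format=json&h=name,node.role,master
# Hypothetical sample of the parsed JSON response.
SAMPLE_ROWS = [
    {"name": "node-a", "node.role": "dimr", "master": "*"},
    {"name": "node-b", "node.role": "dir", "master": "-"},
    {"name": "node-c", "node.role": "dimr", "master": "-"},
]


def master_eligible_names(rows: List[Dict[str, str]]) -> List[str]:
    """Return sorted names of nodes whose role string contains 'm'."""
    return sorted(
        row["name"] for row in rows if "m" in row.get("node.role", "")
    )
```

For the sample rows, master_eligible_names(SAMPLE_ROWS) returns node-a and node-c, regardless of which one is currently the elected master.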

Spicerack will not do the masters until all non-masters in all rows have been done: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/spicerack/+/refs/heads/master/spicerack/elasticsearch_cluster.py#295

We've confirmed what @RKemper said in his last statement:

Spicerack will not do the masters until all non-masters in all rows have been done

I've also inadvertently reproduced the problem with non-existent masters in production (fixed by this CR). As such, I think we've learned what we need to know from this ticket. Closing...