rename cassandra cluster
Closed, Resolved, Public

Description

We've been running the test and production clusters under the same name, which "works" at the moment only because their seeds are kept separate as well. We should rename the test cluster to something sane; a proposed procedure is at https://stackoverflow.com/questions/22006887/cassandra-saved-cluster-name-test-cluster-configured-name
A gerrit patch is coming up as well.

Event Timeline

fgiunchedi claimed this task.
fgiunchedi raised the priority of this task from to High.
fgiunchedi updated the task description.

Change 237643 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: adjust test cluster name

https://gerrit.wikimedia.org/r/237643

I suppose updating the system.local table will happen before applying the patch?

Based on a cursory code-dive, I'd say the Stack Overflow answer cited above has it right. In fact, it seems the attribute in system.local is only used at startup, to validate that the configuration hasn't changed since cluster initialization; everything else that cares reads the value directly from the config. So, yeah: a) first update system.local, then b) update the config.

Of course it goes without saying that altering the cluster name after-the-fact is unsupported, so I for one am happy to be trying this for the first time on a test cluster. :)
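
To illustrate why the order matters, here is a minimal toy model in Python. It is only a sketch of the behaviour described in the comment above, not Cassandra's actual code: the `Node` class, its two attributes, and the exception name are hypothetical stand-ins for the value saved in system.local, the value in the configuration file, and the startup check.

```python
class ClusterNameMismatch(Exception):
    """Raised when the saved and configured cluster names differ."""


class Node:
    """Hypothetical stand-in for a single Cassandra node."""

    def __init__(self, name):
        self.saved_name = name    # models cluster_name stored in system.local
        self.config_name = name   # models cluster_name in the configuration

    def start(self):
        # Startup-time validation only: the saved value must match the config.
        if self.saved_name != self.config_name:
            raise ClusterNameMismatch(
                f"saved cluster name {self.saved_name!r} != "
                f"configured name {self.config_name!r}"
            )
        # Everything else is assumed to read the configured value.
        return self.config_name


def rename(node, new_name):
    node.saved_name = new_name    # a) update system.local on the running node
    node.config_name = new_name   # b) update the configuration
    return node.start()           # restart: the startup check passes


print(rename(Node("old-name"), "new-name"))

# Changing only the configuration trips the startup check.
stale = Node("old-name")
stale.config_name = "new-name"
try:
    stale.start()
except ClusterNameMismatch as exc:
    print("refused to start:", exc)
```

In this model, skipping step a) leaves the node unable to start, which is the failure the Stack Overflow question title refers to.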

Change 237643 merged by Filippo Giunchedi:
cassandra: adjust test cluster name

https://gerrit.wikimedia.org/r/237643

The rename was successful; it involved following the above procedure and a rolling restart of the cluster. Of course, since this is only three machines, quorum was lost along the way; once all nodes were up again there was full recovery.
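
A rough way to see where quorum goes missing during a rolling rename of three machines is sketched below. This is a toy model under loudly stated assumptions: every key is replicated on all three nodes (replication factor 3), quorum is `replication_factor // 2 + 1`, and nodes only count toward the same quorum when their cluster names match. The node names and the helper functions are made up for illustration.

```python
from collections import Counter


def quorum(replication_factor):
    # Majority of replicas.
    return replication_factor // 2 + 1


def largest_group(names_by_node):
    # Size of the biggest set of live nodes sharing one cluster name.
    counts = Counter(names_by_node.values())
    return max(counts.values(), default=0)


def rolling_rename(nodes, old, new, replication_factor):
    """Return (step, quorum_available) pairs for a one-node-at-a-time rename."""
    names = {node: old for node in nodes}
    needed = quorum(replication_factor)
    timeline = []
    for node in nodes:
        # The node is down while it restarts with the new name.
        live = {n: names[n] for n in nodes if n != node}
        timeline.append((f"{node} restarting", largest_group(live) >= needed))
        # The node is back, now under the new cluster name.
        names[node] = new
        timeline.append((f"{node} back", largest_group(names) >= needed))
    return timeline


for step, ok in rolling_rename(["n1", "n2", "n3"], "old", "new", 3):
    print(f"{step:15} quorum available: {ok}")
```

In this model the only step without quorum is the restart of the second node, when one renamed node and one not-yet-renamed node are the only ones up.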

fgiunchedi renamed this task from rename cassandra test cluster to rename cassandra cluster. Sep 22 2015, 5:04 PM
fgiunchedi reopened this task as Open.
fgiunchedi lowered the priority of this task from High to Medium.
fgiunchedi removed a project: Patch-For-Review.
fgiunchedi set Security to None.
fgiunchedi edited subscribers, added: GWicke; removed: gerritbot.

Reopening: the cassandra configuration template uses cluster_name: %{::site}, which is an unfortunate choice because it means cassandra clusters in different sites won't talk to each other even when they should (as in the restbase case).

The rename was carried out on the test cluster for safety reasons, because nodes with the same cassandra cluster_name can have membership information gossiped to each other if they share seed nodes.
Going forward with the expansion, there are at least two options that @Eevans and I have been talking about:

  • rename the production cluster to something other than eqiad. This entails running the procedure outlined above on each node in turn, and returning errors to cassandra clients in the process; the full consequences are not known at this time
  • match cluster_name: eqiad in codfw too. This is the easiest and safest option at this point. There is some potential for confusion; however, cluster_name is specified in the configuration only as a safety net, to prevent already-initialized nodes from joining a different cluster and to let multiple clusters coexist, for example with more dynamic discovery methods.
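
The effect of deriving the name from the site can be sketched in a few lines of Python. This is only an illustration: the string replacement stands in for the template interpolation of %{::site} and is not how Puppet or Hiera is actually invoked, and `can_gossip` is a hypothetical helper modelling the rule that nodes only exchange membership information when their cluster names match.

```python
def render_cluster_name(template, site):
    # Stand-in for the template interpolation of %{::site}.
    return template.replace("%{::site}", site)


def can_gossip(name_a, name_b):
    # Assumed rule: nodes only talk to each other if cluster names match.
    return name_a == name_b


sites = ["eqiad", "codfw"]

# Site-derived name: every site ends up with its own cluster name.
per_site = {s: render_cluster_name("%{::site}", s) for s in sites}
print(per_site, can_gossip(per_site["eqiad"], per_site["codfw"]))

# Second option above: the same fixed name in both sites.
fixed = {s: render_cluster_name("eqiad", s) for s in sites}
print(fixed, can_gossip(fixed["eqiad"], fixed["codfw"]))
```

With the site-derived template the two sites cannot form one cluster, while a fixed name lets nodes in both sites join the same one.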

I prefer the latter:

  • Changing the cluster name is unsupported. As @fgiunchedi says, the impact isn't fully understood (and should be before attempting such a thing in production). It absolutely will introduce a split-brain scenario for some window of time.
  • This name isn't used anywhere else. It isn't exposed to any of the management tools. It exists for the sole purpose of preventing you from crossing clusters, and could just as easily be set to the first 12 printable characters from /dev/random. Having this string be 'eqiad' is unfortunate, and will almost certainly jab at my OCD (and others'), but it's as good as any other for this purpose (from a technical standpoint).

> I prefer the latter:

I tend to agree. Entering risky procedures at this point would be plain foolish.

> Having this string be 'eqiad' is unfortunate, and will almost certainly jab at my OCD (and others),

I feel your pain, bro. Better to fire and forget :)

Oh, and let's please document this in ops/puppet where we set the cluster name to eqiad.

Change 240321 had a related patch set uploaded (by Filippo Giunchedi):
WIP: cassandra: stop setting cluster_name as %{::site}

https://gerrit.wikimedia.org/r/240321

>> I prefer the latter:

> I tend to agree. Entering risky procedures at this point would be plain foolish.

Yes, it would be. I say let's see if we can sketch out a procedure to actually do that in whatever the correct way is, thus making it less risky. After that, we should be able to define a point in time when we can do the renaming safely. This will probably entail some downtime, btw.

Change 240321 merged by Filippo Giunchedi:
cassandra: stop setting cluster_name as %{::site}

https://gerrit.wikimedia.org/r/240321

fgiunchedi changed the task status from Open to Stalled. Nov 20 2015, 9:40 AM

Stalling: figuring out a procedure to rename safely is still needed, but we've agreed to leave eqiad alone for now.

fgiunchedi added a subscriber: fgiunchedi.
fgiunchedi lowered the priority of this task from Medium to Low. Jul 19 2017, 12:58 PM
Eevans claimed this task.

All clusters have a unique name (and have had for some time); closing this ticket.