
Figure out if nodes in different DCs can be bootstrapped in parallel
Closed, ResolvedPublic

Description

In order to speed up the conversion to the multi-instance setup & finish the cluster expansion, it would be useful to bootstrap nodes in different DCs in parallel. I asked the cassandra-users mailing list about this, and the reply suggests that this should indeed work:

http://mail-archives.apache.org/mod_mbox/cassandra-user/201603.mbox/%3CCA%2BVSrLoXb7m0Ww8x7zYdtqrnu%2B-fu4e0e1hbszHM7h0xwtAypg%40mail.gmail.com%3E

So, I am proposing to try the following on the staging cluster:

  1. start a bootstrap of another instance on one of the eqiad nodes, and
  2. while that is running, bootstrap another instance in codfw.

Event Timeline

> I asked the cassandra-users mailing list about this, and the reply suggests that this should indeed work:
>
> http://mail-archives.apache.org/mod_mbox/cassandra-user/201603.mbox/%3CCA%2BVSrLoXb7m0Ww8x7zYdtqrnu%2B-fu4e0e1hbszHM7h0xwtAypg%40mail.gmail.com%3E

This answer is wrong; the relevant Cassandra issue where this is discussed is: https://issues.apache.org/jira/browse/CASSANDRA-2434

This is also consistent with the DataStax docs (though oddly enough the doc suggests you can bootstrap more than one node in a rack, which I also think is wrong, and so do others; it has been reported as a bug).

TL;DR: This can be done, but it requires bypassing a safety check meant to preserve consistency guarantees. Given the liberties we've taken in the past, maybe there is precedent for doing that, but given the aspirations to use RESTBase as more than a durable cache, we're going to need to start taking this seriously eventually.

@Eevans: As discussed before on IRC, I don't see https://issues.apache.org/jira/browse/CASSANDRA-2434 spelling out specific reasons for not allowing bootstrapping in two DCs. Instead, it is noted that with NTS (NetworkTopologyStrategy), the node giving up token ranges will always be in the same DC: https://issues.apache.org/jira/browse/CASSANDRA-2434?focusedCommentId=13094846&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13094846

Which technical reasons do you see for bootstrapping in different DCs being an issue, given that the token ranges of those bootstraps won't overlap?

> Which technical reasons do you see for bootstrapping in different DCs being an issue, given that the token ranges of those bootstraps won't overlap?

I'm not sure what you mean when you say this. They do overlap. There is only one token space, and a bootstrapping node is going to (assuming perfect distribution) bisect 256 existing token ranges (which with high probability will include ranges from every node in the cluster).

Eevans moved this task from Backlog to Next on the Cassandra board.

After looking at this further, I believe it is the case that we can safely bootstrap two nodes in parallel, so long as a) each of them is in a distinct datacenter, and b) no writes are performed with a consistency level that would span these datacenters. It is the case in our environment (for the moment) that (b) always applies[1].

Relative safety aside, this will still not work without some intervention. When Cassandra starts up to perform a bootstrap, it checks gossip state to see if any other nodes are in a JOINING state. If any are, the node will refuse to bootstrap, regardless of the relationship between the two nodes (vis-à-vis NTS). This is the aforementioned "safety" designed to protect consistency during range movements, and the reasoning is sound, as consistency cannot be guaranteed under all supported consistency levels (in other words, I see nothing here that requires follow-up with upstream).
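The decision logic behaves roughly like the sketch below. This is illustrative pseudologic only, not Cassandra's actual implementation; the function and state names are assumptions:

```python
def may_bootstrap(peer_states, consistent_rangemovement=True):
    """Refuse to join if any peer is already JOINING, unless the operator
    has explicitly disabled consistent range movements
    (-Dcassandra.consistent.rangemovement=false)."""
    if not consistent_rangemovement:
        return True  # operator has accepted responsibility for safety
    # the check is cluster-wide: DC/rack placement of the peer is irrelevant
    return not any(state == "JOINING" for state in peer_states.values())

# a node bootstrapping in codfw while another is JOINING in eqiad:
states = {"eqiad-node": "JOINING", "codfw-node": "NORMAL"}
print(may_bootstrap(states))         # False: bootstrap refused
print(may_bootstrap(states, False))  # True: operator override
```

Note that the check is deliberately blind to topology: it does not (and cannot, in general) know whether the two pending range movements would ever serve the same quorum.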

We can choose to override the constraint using -Dcassandra.consistent.rangemovement=false at startup on a case-by-case basis, for those range movements that we know are safe (read: when conditions (a) and (b) above both apply).
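For a one-off bootstrap, one way to pass the flag is via the JVM options sourced at startup; this assumes a Debian-style install where /etc/cassandra/cassandra-env.sh is read by the init script (paths may differ in our setup):

```shell
# Append the override for this bootstrap only; remove it again afterwards
# so that subsequent (unvetted) range movements keep the safety check.
echo 'JVM_OPTS="$JVM_OPTS -Dcassandra.consistent.rangemovement=false"' \
  >> /etc/cassandra/cassandra-env.sh
service cassandra start
```

Reverting the change after the bootstrap completes is important, since leaving the override in place would silently disable the check for every future join.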

[1]: The only exception I could come up with is authentication operations that involve a write for the superuser, which are implicitly performed at QUORUM.

Eevans claimed this task.