We are maintaining two datacenters with the goal of surviving a DC-wide outage, such as a power failure or fiber cut. Supporting multi-DC replication has been a design consideration for RESTBase from the start. By choosing Cassandra as the storage backend, we get to use Cassandra's relatively mature cross-DC replication support.
We are in the process of purchasing a replica cluster for codfw (see T93790). The hardware there will hopefully come online before the end of this quarter. In the meantime, we should prepare and test cross-DC replication.
We don't have a general IPsec setup between the datacenters, so we'll likely need to encrypt and strongly authenticate the cross-DC connections at the Cassandra level. Assuming one instance per hardware node (this depends on T95253), testing at the full replication volume might require six nodes to keep up with compaction. Unless there are that many spares in codfw, we might not be able to test this fully with the production cluster. We could, however, consider setting this up for the staging cluster, which has modest and controllable resource needs. Any set of three nodes (SSD or not) in codfw should be sufficient to test this in staging.
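Cassandra's internode encryption can be scoped to cross-DC traffic only, which would cover this requirement without a general IPsec setup. A minimal cassandra.yaml sketch, assuming a Java keystore and truststore have already been generated and distributed to each node (paths and passwords are placeholders):

    server_encryption_options:
        internode_encryption: dc       # encrypt only traffic between datacenters
        keystore: /etc/cassandra/server-keystore.jks
        keystore_password: <redacted>
        truststore: /etc/cassandra/server-truststore.jks
        truststore_password: <redacted>
        require_client_auth: true      # mutual certificate authentication of peers

With internode_encryption: dc, intra-DC traffic stays in the clear while eqiad<->codfw traffic is encrypted; setting it to 'all' would encrypt both.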
Proposed Online Migration Process
1. Implement multi-DC-aware RESTBase keyspace creation (T76494)
2. Update the system_auth keyspace replication ({'eqiad': 9, 'codfw': 6}); see the sketch after this list
3. Set up the new nodes in codfw:
   - Set auto_bootstrap: false
   - Configure cassandra-rackdc.properties accordingly
4. Ensure that eqiad RESTBase clients do not auto-discover codfw nodes (or enable T111113: Cassandra client encryption)
5. Ensure that localQuorum or localOne consistency is used throughout RESTBase
6. Start Cassandra on the new codfw nodes
7. Alter the existing keyspaces to set replication accordingly (see T76494)
8. Rebuild each codfw node against the eqiad DC (nodetool rebuild -- eqiad)
9. Set auto_bootstrap: true on the codfw nodes
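For reference, a rough sketch of the commands and configuration behind steps 2, 3, 7, and 8. The content keyspace name, its per-DC replication counts, and the codfw rack name are illustrative placeholders; the actual values come from T76494 and the codfw row/rack layout:

    # Step 3: per-node configuration on the new codfw hosts
    #   cassandra-rackdc.properties ->  dc=codfw, rack=<rack>
    #   cassandra.yaml              ->  auto_bootstrap: false

    # Step 2: extend system_auth replication to both DCs
    cqlsh -e "ALTER KEYSPACE system_auth WITH replication = {
        'class': 'NetworkTopologyStrategy', 'eqiad': 9, 'codfw': 6 };"

    # Step 7: repeat for each existing RESTBase keyspace (name and counts illustrative)
    cqlsh -e "ALTER KEYSPACE some_restbase_keyspace WITH replication = {
        'class': 'NetworkTopologyStrategy', 'eqiad': 3, 'codfw': 3 };"

    # Step 8 (when performed): on each codfw node, stream existing data from eqiad
    nodetool rebuild -- eqiad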
Rationale: Rebuilds are preferred to bootstraps here because they let us decouple membership from data transfer, putting the modified topology in place before any data movement. If we were to bootstrap the new nodes individually, the first new node would end up with an entire cluster's worth of data (a replica for every range).
Notes:
Step 4 was meant to prevent unencrypted cross-DC client traffic in the event that the hot RESTBase instances (in eqiad) needed to fail over their connections to codfw. Such a fail-over scenario is highly unlikely, and the unencrypted traffic would transit a private link, so this step is on hold until after codfw comes online (both to expedite this issue and to simplify the transition to client encryption).
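For reference, the DC pinning in step 4 and the local consistency levels in step 5 are driver-level settings. A sketch using the Node.js cassandra-driver directly (contact points are placeholders, and this is not RESTBase's actual configuration format):

    var cassandra = require('cassandra-driver');

    var client = new cassandra.Client({
        // List only eqiad nodes; the DC-aware policy below keeps queries on
        // eqiad even when codfw peers are discovered via gossip.
        contactPoints: ['<eqiad-node-1>', '<eqiad-node-2>'],
        policies: {
            loadBalancing: new cassandra.policies.loadBalancing.DCAwareRoundRobinPolicy('eqiad')
        },
        queryOptions: {
            // localQuorum/localOne never block on codfw replicas
            consistency: cassandra.types.consistencies.localQuorum
        }
    });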
Step 8 would have streamed each node's data from eqiad, bringing the codfw cluster up to date with a full copy of eqiad. However, this would bring the six newly provisioned codfw nodes to an average of 966GB each, which is precariously close to the point at which we began to experience bootstrapping problems. Therefore, this step will be postponed. In the absence of a complete rebuild, newly written data will still be actively replicated to codfw and available for reads.
Once T95253 is complete, each new instance can be fully bootstrapped as originally planned.