We are maintaining two datacenters with the goal of surviving a DC-wide outage, such as a power failure or fiber cut. Supporting multi-DC replication has been a design consideration for RESTBase from the start. By choosing Cassandra as the storage backend, we get to use Cassandra's relatively mature cross-DC replication support.
We are in the process of purchasing a replica cluster for codfw (see T93790). The hardware there will hopefully come online before the end of this quarter. In the meantime, we should prepare and test cross-DC replication.
We don't have a general IPsec setup between the datacenters, so we'll likely need to [encrypt and strongly authenticate](http://docs.datastax.com/en/cassandra/2.1/cassandra/security/secureSSLCertificates_t.html) the cross-DC connections at the Cassandra level.
Assuming one instance per hardware node (depends on T95253), testing at the full replication volume might require six nodes to keep up with compaction. Unless there are that many spares in codfw, we might not be able to test this fully against the production cluster. We could, however, set this up for the staging cluster, which has modest and controllable resource needs. Any set of three nodes (SSD or not) in codfw should be sufficient to test this in staging.
= Proposed Online Migration Process =
# Update `system_auth` keyspace replication (`{'eqiad': 9, 'codfw': 3}`)
# Set up the nodes in codfw.
## Set `auto_bootstrap: false`
## Configure `cassandra-rackdc.properties` accordingly
# Ensure that eqiad RESTBase clients do not auto-discover codfw nodes (or enable {T111113})
# Ensure that either `localQuorum` or `localOne` is used throughout RESTBase
# Start Cassandra on the new codfw nodes
# Alter the existing keyspaces to set replication accordingly
# Rebuild each codfw node against the eqiad DC (`nodetool rebuild -- eqiad`)
# Set `auto_bootstrap: true` on codfw nodes
# Implement multi-DC-aware RESTBase keyspace creation
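
The replication changes in the steps above could be generated along these lines. This is only a sketch: `replication_map` and `alter_keyspace` are hypothetical helpers, it assumes the keyspaces use `NetworkTopologyStrategy` with DC names `eqiad` and `codfw` matching what the snitch reports, and the keyspace `example_keyspace` with its 3/3 replica counts is a placeholder. Only the `system_auth` map comes from this task.

```python
def replication_map(dc_replicas):
    """Render a CQL replication map for NetworkTopologyStrategy."""
    parts = ["'class': 'NetworkTopologyStrategy'"]
    parts += [f"'{dc}': {count}" for dc, count in dc_replicas.items()]
    return "{" + ", ".join(parts) + "}"


def alter_keyspace(keyspace, dc_replicas):
    """Build the ALTER KEYSPACE statement that sets per-DC replica counts."""
    return (
        f'ALTER KEYSPACE "{keyspace}" '
        f"WITH replication = {replication_map(dc_replicas)};"
    )


if __name__ == "__main__":
    # Replica counts for system_auth as proposed in this task.
    print(alter_keyspace("system_auth", {"eqiad": 9, "codfw": 3}))
    # Placeholder keyspace and counts; the real values are not specified here.
    print(alter_keyspace("example_keyspace", {"eqiad": 3, "codfw": 3}))
```

The generated statements would still need to be reviewed and run through `cqlsh` by hand; the sketch does not talk to the cluster.
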
= Alternate Proposed Online Migration Process =
# Ensure that eqiad RESTBase clients will not auto-discover codfw nodes (or enable {T111113})
# Ensure that either `localQuorum` or `localOne` is used throughout RESTBase
# Implement multi-DC-aware RESTBase keyspace creation
# Alter existing keyspaces to set replication accordingly
# Update `system_auth` keyspace replication (`{'eqiad': 9, 'codfw': 3}`)
# For each node in codfw:
## Set `auto_bootstrap: false`
## Configure `cassandra-rackdc.properties` accordingly
## Start Cassandra
## Rebuild node against the eqiad DC (`nodetool rebuild -- eqiad`)
## Set `auto_bootstrap: true`
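
Both processes require DC-local consistency levels. The arithmetic below sketches why, using Cassandra's quorum rule (half the replicas, rounded down, plus one). The function names are hypothetical, and the `{'eqiad': 9, 'codfw': 3}` map is the `system_auth` proposal from this task, reused here purely as an illustration.

```python
def quorum(replicas):
    """Replicas that must respond for a quorum: floor(n / 2) + 1."""
    return replicas // 2 + 1


def local_quorum(dc_replicas, dc):
    """Responses needed for localQuorum, counted only within one DC."""
    return quorum(dc_replicas[dc])


def global_quorum(dc_replicas):
    """Responses needed for a cluster-wide quorum across all DCs."""
    return quorum(sum(dc_replicas.values()))


def quorum_survives_loss_of(dc_replicas, lost_dc):
    """True if a cluster-wide quorum is still reachable after losing a DC."""
    remaining = sum(n for dc, n in dc_replicas.items() if dc != lost_dc)
    return remaining >= global_quorum(dc_replicas)


if __name__ == "__main__":
    replicas = {"eqiad": 9, "codfw": 3}
    print(local_quorum(replicas, "eqiad"))              # 5
    print(local_quorum(replicas, "codfw"))              # 2
    print(global_quorum(replicas))                      # 7
    print(quorum_survives_loss_of(replicas, "eqiad"))   # False
    print(quorum_survives_loss_of(replicas, "codfw"))   # True
```

With these counts, a cluster-wide quorum needs 7 of 12 replicas, so it cannot be met from codfw alone if eqiad is lost, whereas `localQuorum` in codfw needs only 2 of its 3 local replicas.
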