
Set up multi-DC replication for Cassandra
Closed, Resolved (Public)

Description

We are maintaining two datacenters with the goal of surviving a DC-wide outage, such as a power failure or fiber cut. Supporting multi-DC replication has been a design consideration for RESTBase from the start. By choosing Cassandra as the storage backend, we get to use Cassandra's relatively mature cross-DC replication support.

We are in the process of purchasing a replica cluster for codfw (see T93790). The hardware there will hopefully come online before the end of this quarter. In the meantime, we should prepare and test cross-DC replication.

We don't have a general ipsec setup between the datacenters, so we'll likely need to encrypt and strongly authenticate the cross-DC connections at the Cassandra level. Assuming one instance per hardware node (depends on T95253), testing at the full replication volume might require six nodes to keep up with compaction. Unless there are that many spares in codfw we might not be able to test this fully with the production cluster. We could however consider setting this up for the staging cluster, which has modest and controllable resource needs. Any set of three nodes (SSD or not) in codfw should be sufficient to test this in staging.

Proposed Online Migration Process

  1. Implement multi-DC-aware RESTBase keyspace creation (T76494)
  2. Update the system_auth keyspace replication to {'eqiad': 9, 'codfw': 6} (see the example CQL below the rationale)
  3. Set up the new nodes in codfw.
    1. Set auto_bootstrap: false
    2. Configure cassandra-rackdc.properties accordingly
  4. Ensure that eqiad RESTBase clients do not auto-discover codfw nodes (or enable T111113: Cassandra client encryption)
  5. Ensure that localQuorum or localOne are used throughout RESTBase
  6. Start Cassandra on the new codfw nodes
  7. Alter the existing keyspaces to set replication accordingly (see T76494)
  8. Rebuild each codfw node against the eqiad DC (nodetool rebuild -- eqiad)
  9. Set auto_bootstrap: true on codfw nodes

Rationale: Rebuilds are preferred to bootstraps here because they allow us to decouple membership from data transfer, putting the modified topology in place before any data movement. If we were to bootstrap the new nodes individually, the first new node would end up with an entire cluster's worth of data (a replica for every range).
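
For reference, the keyspace alterations in steps 2 and 7 amount to CQL along these lines. This is only a sketch: the data-keyspace name and its replication factors below are placeholders, since the actual keyspaces and settings are driven by RESTBase (see T76494):

    ALTER KEYSPACE system_auth
        WITH replication = {'class': 'NetworkTopologyStrategy', 'eqiad': 9, 'codfw': 6};

    ALTER KEYSPACE "example_restbase_keyspace"
        WITH replication = {'class': 'NetworkTopologyStrategy', 'eqiad': 3, 'codfw': 3};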

Notes:

Step 4 was meant to prevent unencrypted cross-DC client traffic in the event that the hot RESTBase instances (in eqiad) needed to fail over connections to codfw. Such a fail-over scenario is highly unlikely, and the unencrypted traffic would transit a private link, so this step is on hold until after codfw comes online (both to expedite this issue, and to simplify the transition to client encryption).

Step 8 would have streamed the nodes' data from eqiad, bringing the codfw cluster up to date with a full copy of eqiad. However, this would bring the 6 newly provisioned codfw nodes to an average of 966GB each, which is precariously close to the point at which we began to experience problems with bootstrapping. This step will therefore be postponed. In the absence of a complete rebuild, newly written data will still be actively replicated and available for reads.

Once T95253 is complete, each new instance can be fully bootstrapped as originally planned.

Event Timeline

GWicke raised the priority of this task to High.
GWicke updated the task description.
GWicke subscribed.

Considering that we have made little progress on this task, and that it's near crunch time, I propose the following:

We focus on T108953, for the purposes of encrypting inter-DC traffic only (intra-DC to be enabled at a later date). Once complete, we test with the process outlined above, using the staging environment on the eqiad side, and in codfw either the new production machines, or some spares if the new machines are not yet available. Upon successful review, we can then bring the new codfw nodes online in the production environment.

The implication here is that every other concern becomes secondary (incl. xInstance and client encryption). Based on this, I see a tentative timeline that looks like the following:

         Internode encryption                                     Testing                             Production
         testing / implementation.                                eqiad -> codfw                      rollout.
|------------------------------------------------|         |------------------------|         |------------------------|
| 2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|-----------------------------------------------------------------------------------|
                         Production (nodetool) repairs

xInstance (T95253) and client encryption (T111113) can be worked into this timeline on a best-effort basis; the idea here is to remove them from the list of hard dependencies in order to ensure we make the end-of-quarter goal.

Note: if we do punt on client encryption, we'll need to use a load-balancing configuration that limits connections to the local data-center.

Ensure that eqiad RESTBase clients do not auto-discover codfw nodes

Afaik they will discover those nodes. The default DCAwareRoundRobinPolicy behavior is to use the local DC first, and only fall back to remote connections if that fails. We might be able to avoid connections to the remote DC with a custom load-balancing policy.

Or use a WhiteListPolicy, but yeah, working around this needs to be weighed against the savings of punting on this item.

Change 238135 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: add auxiliary (non-seed) codfw test hosts

https://gerrit.wikimedia.org/r/238135

Change 238138 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: add codfw test nodes

https://gerrit.wikimedia.org/r/238138

Change 238135 abandoned by Filippo Giunchedi:
cassandra: add auxiliary (non-seed) codfw test hosts

Reason:
see related, https://gerrit.wikimedia.org/r/#/c/238138/

https://gerrit.wikimedia.org/r/238135

Change 238138 merged by Filippo Giunchedi:
cassandra: add codfw test nodes

https://gerrit.wikimedia.org/r/238138

OK, Cassandra is up in codfw with encryption enabled and auto_bootstrap: false, so codfw and eqiad are seeing each other (step #4).

restbase-test2002:~$ nodetool status -r
Datacenter: codfw
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                        Load       Tokens  Owns    Host ID                               Rack
UN  restbase-test2002.codfw.wmnet  476.5 KB   256     ?       4afd4cc2-fe90-4868-bb99-3bf6496d5a57  rack1
UN  restbase-test2003.codfw.wmnet  357.55 KB  256     ?       ec1d70ad-aab6-4f68-af44-2bc31b500804  rack1
UN  restbase-test2001.codfw.wmnet  586.1 KB   256     ?       a3da4b35-1a2c-4df5-acfd-5e7c774dee30  rack1
Datacenter: eqiad
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                        Load       Tokens  Owns    Host ID                               Rack
UN  xenon.eqiad.wmnet              124.55 GB  256     ?       20ec6bb4-195b-479c-a540-e6a7c70b589d  rack1
UN  praseodymium.eqiad.wmnet       125.07 GB  256     ?       5be66290-4784-4fed-a4f1-d496ea09d803  rack1
UN  cerium.eqiad.wmnet             125.2 GB   256     ?       235934bc-33f0-4744-b374-f7c96c8edc95  rack1

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
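
For context, the internode encryption mentioned above is controlled by server_encryption_options in cassandra.yaml. A rough sketch of what that looks like when restricted to cross-DC traffic follows; the keystore paths and passwords are placeholders, not the values actually deployed:

    server_encryption_options:
        internode_encryption: dc                       # encrypt only traffic that crosses datacenters
        keystore: /etc/cassandra/keystore.jks          # placeholder path
        keystore_password: CHANGEME                    # placeholder
        truststore: /etc/cassandra/truststore.jks      # placeholder path
        truststore_password: CHANGEME                  # placeholder
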
Eevans updated the task description.

With the 3 test nodes in codfw up, we are now on step #6 above. The next step is to deploy RESTBase 0.7.12 (restbase-mod-table-cassandra 0.8.0), in order to test the process of altering keyspace replication.

thanks @GWicke, no problem!

I noticed that restbase-test2002 and restbase-test2003 are experiencing much higher GC times than restbase-test2001, despite the hardware being very similar:

restbase-test2002:~$ awk '$11 > 3 && /stopped/ {print }' /var/log/cassandra/gc.log.6.current  | tail -10
2015-09-16T09:34:44.357+0000: 150494.065: Total time for which application threads were stopped: 37.2599484 seconds, Stopping threads took: 0.0000919 seconds
2015-09-16T09:35:20.533+0000: 150530.241: Total time for which application threads were stopped: 35.7252108 seconds, Stopping threads took: 0.0001009 seconds
2015-09-16T09:35:20.885+0000: 2015-09-16T09:35:20.885+0000150530.592: : Total time for which application threads were stopped: 0.0148445 seconds, Stopping threads took: 0.0000682 seconds
2015-09-16T09:35:32.871+0000: 150542.579: Total time for which application threads were stopped: 11.9442972 seconds, Stopping threads took: 0.0000378 seconds
Total time for which application threads were stopped: 0.0163604 seconds, Stopping threads took: 0.0000994 seconds
2015-09-16T09:36:09.663+0000: 150579.371: Total time for which application threads were stopped: 36.0322118 seconds, Stopping threads took: 0.0000902 seconds
2015-09-16T09:36:45.283+0000: 150614.990: Total time for which application threads were stopped: 34.6218309 seconds, Stopping threads took: 0.0000938 seconds
150616.971: Total time for which application threads were stopped: 0.6611734 seconds, Stopping threads took: 0.0001122 seconds
2015-09-16T09:37:23.153+0000: 150652.861: Total time for which application threads were stopped: 35.7840382 seconds, Stopping threads took: 0.0001193 seconds
2015-09-16T09:37:36.781+0000: 150666.488: Total time for which application threads were stopped: 13.1927194 seconds, Stopping threads took: 0.0000605 seconds
restbase-test2001:~$ awk '$11 > 3 && /stopped/ {print }' /var/log/cassandra/gc.log.1.current 
2015-09-16T06:39:26.159+0000: 55081.043: Total time for which application threads were stopped: 3.8178377 seconds, Stopping threads took: 0.0001169 seconds
2015-09-16T06:39:54.645+0000: 55109.530: Total time for which application threads were stopped: 5.8848611 seconds, Stopping threads took: 0.0002625 seconds
2015-09-16T06:50:52.226+0000: 55767.110: Total time for which application threads were stopped: 4.2121900 seconds, Stopping threads took: 0.0001485 seconds
2015-09-16T07:51:03.963+0000: 59378.848: Total time for which application threads were stopped: 4.1229277 seconds, Stopping threads took: 0.0001086 seconds

Investigating further: this manifests, for example, in the gossiper marking nodes as DOWN.

Enabled hyperthreading on 2002 and rebooted; will watch GC times to see if that makes a difference.

Another side effect of having 2001 and 2002 down is that Cassandra lost quorum while authenticating/authorizing queries from RESTBase, which in turn was spewing 500s, e.g. on xenon:

ERROR [SharedPool-Worker-2] 2015-09-16 09:55:36,712 ErrorMessage.java:251 - Unexpected exception during request
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level QUORUM
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2201) ~[guava-16.0.jar:na]
        at com.google.common.cache.LocalCache.get(LocalCache.java:3934) ~[guava-16.0.jar:na]
        at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3938) ~[guava-16.0.jar:na]
        at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4821) ~[guava-16.0.jar:na]
        at org.apache.cassandra.auth.PermissionsCache.getPermissions(PermissionsCache.java:72) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.auth.Auth.getPermissions(Auth.java:75) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.service.ClientState.authorize(ClientState.java:353) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:251) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:245) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:229) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.cql3.statements.SelectStatement.checkAccess(SelectStatement.java:195) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:235) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.cql3.QueryProcessor.processPrepared(QueryProcessor.java:493) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.transport.messages.ExecuteMessage.execute(ExecuteMessage.java:134) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:439) [apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:335) [apache-cassandra-2.1.8.jar:2.1.8]
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.23.Final.jar:4.0.23.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) [netty-all-4.0.23.Final.jar:4.0.23.Final]
        at io.netty.channel.AbstractChannelHandlerContext.access$700(AbstractChannelHandlerContext.java:32) [netty-all-4.0.23.Final.jar:4.0.23.Final]
        at io.netty.channel.AbstractChannelHandlerContext$8.run(AbstractChannelHandlerContext.java:324) [netty-all-4.0.23.Final.jar:4.0.23.Final]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_66-internal]
        at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164) [apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) [apache-cassandra-2.1.8.jar:2.1.8]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_66-internal]
Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level QUORUM
        at org.apache.cassandra.auth.Auth.selectUser(Auth.java:276) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.auth.Auth.isSuperuser(Auth.java:97) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.auth.AuthenticatedUser.isSuper(AuthenticatedUser.java:50) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:67) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.auth.PermissionsCache$1.load(PermissionsCache.java:124) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.auth.PermissionsCache$1.load(PermissionsCache.java:121) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3524) ~[guava-16.0.jar:na]
        at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2317) ~[guava-16.0.jar:na]
        at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2280) ~[guava-16.0.jar:na]
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2195) ~[guava-16.0.jar:na]
        ... 23 common frames omitted
Caused by: org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level QUORUM
        at org.apache.cassandra.db.ConsistencyLevel.assureSufficientLiveNodes(ConsistencyLevel.java:296) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.service.AbstractReadExecutor.getReadExecutor(AbstractReadExecutor.java:160) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1328) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1270) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1195) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:272) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:224) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.auth.Auth.selectUser(Auth.java:265) ~[apache-cassandra-2.1.8.jar:2.1.8]
        ... 32 common frames omitted

This seems to be due to CASSANDRA-5310.

tl;dr: we shouldn't be using the default cassandra user; we should instead create a dedicated user, which is good practice anyway.
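
A rough sketch of what that looks like in cqlsh; the user name and password are placeholders, not the actual credentials, and the grant could of course be narrowed to specific keyspaces:

    CREATE USER restbase_user WITH PASSWORD 'CHANGEME' NOSUPERUSER;
    GRANT ALL PERMISSIONS ON ALL KEYSPACES TO restbase_user;

As I understand it, the underlying issue is that authentication queries for the default cassandra superuser are performed at QUORUM, while other users authenticate at a lower consistency level, so a dedicated user keeps logins working when one DC's replicas are unavailable.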

The rebuilds of the codfw nodes are complete:

Datacenter: codfw
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns    Host ID                               Rack
UN  10.192.16.149  126.24 GB  256     ?       a3da4b35-1a2c-4df5-acfd-5e7c774dee30  rack1
UN  10.192.16.150  126.25 GB  256     ?       4afd4cc2-fe90-4868-bb99-3bf6496d5a57  rack1
UN  10.192.16.151  126.75 GB  256     ?       ec1d70ad-aab6-4f68-af44-2bc31b500804  rack1
Datacenter: eqiad
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns    Host ID                               Rack
UN  10.64.16.147   127.89 GB  256     ?       235934bc-33f0-4744-b374-f7c96c8edc95  rack1
UN  10.64.16.149   127.91 GB  256     ?       5be66290-4784-4fed-a4f1-d496ea09d803  rack1
UN  10.64.0.200    127.34 GB  256     ?       20ec6bb4-195b-479c-a540-e6a7c70b589d  rack1

I did some preliminary testing of replication correctness today.

I had hoped to use cassandra-stress for this, but I encountered issues, so I instead put together a simple Python script to write some test data and then read back and verify it (see attached).

eevans@xenon: $ cqlsh -u cassandra -f schema-staging.cql `hostname -i`
eevans@xenon: $ sh generate.sh 100000 > test-100000.dat
eevans@xenon: $ ./write --local-dc eqiad `hostname -i` < test-100000.dat
...
eevans@xenon: $ ./read --local-dc codfw restbase-test2001.codfw.wmnet < test-100000.dat
...

This wrote 100,000 records at LOCAL_QUORUM using eqiad as the local DC, then read at LOCAL_QUORUM using codfw as the local DC. All of the reads succeeded, and returned the expected data.
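
For anyone wanting to spot-check this by hand, cqlsh's CONSISTENCY command can be used to force LOCAL_QUORUM reads from either datacenter. A sketch, with placeholder keyspace/table/key names (the real schema is in the attached schema-staging.cql):

    cqlsh> CONSISTENCY LOCAL_QUORUM;
    cqlsh> SELECT * FROM test_keyspace.test_data WHERE key = 'some-key';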

nodetool cfstats output for eqiad and codfw, here and here respectively.

TL;DR

So far, so good.

Change 240060 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: add codfw production hosts

https://gerrit.wikimedia.org/r/240060

Change 240088 had a related patch set uploaded (by Filippo Giunchedi):
restbase: add LVS codfw configuration

https://gerrit.wikimedia.org/r/240088

Change 240060 merged by Filippo Giunchedi:
cassandra: add codfw production hosts

https://gerrit.wikimedia.org/r/240060

All 6 codfw nodes are now joined, and everything looks good.

Datacenter: eqiad
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns    Host ID                               Rack
UN  10.64.32.160   625 GB     256     ?       798ff758-8c91-46e0-b85e-dad356c46f20  b
UN  10.64.32.178   723.47 GB  256     ?       e9ab408a-309e-4e93-b145-9ac5f2365523  b
UN  10.64.48.99    652.08 GB  256     ?       325e01e8-debe-45f0-a8c2-93b3baa58968  d
UN  10.64.48.100   608.64 GB  256     ?       2abf437d-a16d-406b-a6de-8d28b7dda808  d
UN  10.64.0.220    705.44 GB  256     ?       c021a198-b7f1-4dc2-94d7-9cb8b8a8df28  a
UN  10.64.0.221    646.49 GB  256     ?       fc041cc8-cd28-4030-b29a-05b9a632cafc  a
UN  10.64.48.110   723.05 GB  256     ?       3ee62592-ef41-445a-863f-b05be9ae5ac8  d
UN  10.64.32.159   635.86 GB  256     ?       88d9ef9f-d81b-466e-babf-6a283b13f648  b
UN  10.64.0.223    623.67 GB  256     ?       c1b5a012-4840-4096-9a71-ce4d3afb0029  a
Datacenter: codfw
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns    Host ID                               Rack
UN  10.192.48.37   193.7 KB   256     ?       dec9c52c-9327-4707-aabd-cb8eb1f7cb21  d
UN  10.192.48.38   208.92 KB  256     ?       17291060-65d0-4096-a6a9-9d193fe1256d  d
UN  10.192.16.152  4.52 MB    256     ?       13128ee3-4d20-44a1-ab3d-c68097be030a  b
UN  10.192.16.153  6.85 MB    256     ?       ce3de07b-e055-4c5a-838e-45c920336496  b
UN  10.192.32.124  4.64 MB    256     ?       80dfa7d8-8478-4c08-b104-545f702f40e9  c
UN  10.192.32.125  456.83 KB  256     ?       7baf3975-7450-4d9a-9daf-d8fe6141ff0a  c

We're now on step #7 (see issue description above), which will require a RESTBase configuration change to add codfw to the restbase::cassandra_datacenters list (and per an earlier discussion with @fgiunchedi, we will do this tomorrow).
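
Assuming restbase::cassandra_datacenters is fed from hiera, the change is presumably a one-liner along these lines (a hypothetical sketch, not the actual patch; the file and surrounding structure may differ):

    restbase::cassandra_datacenters:
      - eqiad
      - codfw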

All 6 codfw nodes are now joined, and everything looks good.

Yupiii

Change 240578 had a related patch set uploaded (by Eevans):
WIP: configure RESTBase for codfw datacenter

https://gerrit.wikimedia.org/r/240578

Change 240578 merged by Filippo Giunchedi:
configure RESTBase for codfw datacenter

https://gerrit.wikimedia.org/r/240578

The list of datacenters in RESTBase config has been expanded to include codfw, replication has been updated, and data is now being replicated to both datacenters.

Change 240088 merged by Filippo Giunchedi:
restbase: add LVS codfw configuration

https://gerrit.wikimedia.org/r/240088

@fgiunchedi @Eevans is this task completed? If not, what's left to do? Is client-side encryption really necessary?

@Joe, we'd like to encrypt all cross-DC traffic, and some of that traffic is directly from clients to remote Cassandra nodes. We currently don't have blanket IPSec encryption of all cross-DC traffic, which means that we need to set this up at a per-service level.
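
For context, the server side of client encryption is a cassandra.yaml setting along these lines (a rough sketch with placeholder paths and passwords; the RESTBase driver side would also need TLS enabled, which is what T111113 tracks):

    client_encryption_options:
        enabled: true
        keystore: /etc/cassandra/client-keystore.jks   # placeholder path
        keystore_password: CHANGEME                    # placeholder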

Also, if I'm reading the documentation correctly, clients will fall back to codfw only when no eqiad nodes are reachable. WRT the "serve traffic from codfw" goal, I don't think we would shut down eqiad when failing over in this case.

GWicke lowered the priority of this task from High to Medium. (Jan 27 2016, 6:34 PM)

Lowered priority as the main multi-DC goal has been reached, and the main remaining bit is adding encryption for client connections.

Even though this is now unrelated to multi-DC, it has been a while since it was last worked on and it'd be nice to not lose momentum.

I'm not sure about the complexity of the remaining item — are the client libraries not capable of TLS, or are there any other major issues with enabling such a feature? Any update/incorporation into future planning would be highly appreciated — thanks :)

"Set up multi-DC replication for Cassandra" is a bit misleading nowadays — that part is done. The remaining item is essentially client encryption, which is tracked separately as T111113 (a blocker for this task). Per consensus with the services team, I'll resolve this task.