
Update to Cassandra 2.1.12
Closed, ResolvedPublic

Assigned To
Authored By
GWicke
Dec 8 2015, 6:02 AM
Referenced Files
F3087201: pasted_file
Dec 14 2015, 3:21 PM
F3064918: pasted_file
Dec 10 2015, 4:24 PM
F3064913: pasted_file
Dec 10 2015, 4:23 PM
F3057948: pasted_file
Dec 8 2015, 8:46 AM
F3057951: pasted_file
Dec 8 2015, 8:46 AM

Description

Cassandra 2.1.12 brings several welcome improvements, especially related to DTCS:

  • Limit window size in DTCS (CASSANDRA-10280)
  • Do STCS in DTCS windows (CASSANDRA-10276)

Together, these two changes will likely reduce the compaction load significantly, which in turn reduces latency & lets us push density further.
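
For reference, a rough sketch of how the new window limit could be applied to a table once the upgrade is in. The keyspace/table names are placeholders, and the option name reflects my reading of CASSANDRA-10280, so it is worth double-checking against the 2.1.12 release notes; the STCS-within-windows behaviour from CASSANDRA-10276 needs no extra options.

# hypothetical example: cap DTCS windows at ~60 days on a placeholder table
cat <<'CQL' | cqlsh restbase1001.eqiad.wmnet
ALTER TABLE my_keyspace.my_table WITH compaction = {
    'class': 'DateTieredCompactionStrategy',
    'max_window_size_seconds': '5184000'
};
CQL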

Event Timeline

GWicke raised the priority of this task from to Medium.
GWicke updated the task description. (Show Details)
GWicke added a project: SRE.
GWicke subscribed.

After testing 2.1.12 on cerium for a while I gradually proceeded to roll it out to the eqiad staging hosts, followed by restbase1007. That looked good after an hour, with significantly less compaction throughput and iowait, no doubt thanks to using STCS within each compaction window. I then proceeded to upgrade the struggling nodes in eqiad, followed by the remaining eqiad nodes.

Except for one spike on 1007, the latency trend is looking promising:

pasted_file (1×1 px, 451 KB)

Compaction IO is generally a lot lower:

pasted_file (1×1 px, 634 KB)

I don't see anything regarding this unscheduled upgrade in SAL; why is that?

After testing 2.1.12 on cerium for a while I gradually proceeded to roll it out to the eqiad staging hosts, followed by restbase1007. That looked good after an hour, with significantly less compaction throughput and iowait, no doubt thanks to using STCS within each compaction window. I then proceeded to upgrade the struggling nodes in eqiad, followed by the remaining eqiad nodes.

For some value of "a while", I guess (2.1.12 was released yesterday).

It should also be pointed out that we are now conducting range movements (a bootstrap) in a mixed-version environment, a cardinal sin of Cassandra ops.

eevans@agenor:~/dev/src/git/mediawiki/core(master)$ dsh -M -g c_prod -- "apt-cache policy cassandra | grep -i installed"
restbase1001.eqiad.wmnet:   Installed: 2.1.12
restbase1002.eqiad.wmnet:   Installed: 2.1.12
restbase1003.eqiad.wmnet:   Installed: 2.1.12
restbase1004.eqiad.wmnet:   Installed: 2.1.12
restbase1005.eqiad.wmnet:   Installed: 2.1.12
restbase1006.eqiad.wmnet:   Installed: 2.1.12
restbase1007.eqiad.wmnet:   Installed: 2.1.12
restbase1008.eqiad.wmnet:   Installed: 2.1.8
restbase1009.eqiad.wmnet:   Installed: 2.1.12
restbase2001.codfw.wmnet:   Installed: 2.1.8
restbase2002.codfw.wmnet:   Installed: 2.1.8
restbase2003.codfw.wmnet:   Installed: 2.1.8
restbase2004.codfw.wmnet:   Installed: 2.1.8
restbase2005.codfw.wmnet:   Installed: 2.1.8
restbase2006.codfw.wmnet:   Installed: 2.1.8
eevans@agenor:~/dev/src/git/mediawiki/core(master)$ ssh restbase1001.eqiad.wmnet nodetool status -r
Datacenter: codfw
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                     Load       Tokens  Owns    Host ID                               Rack
UN  restbase2001-a.codfw.wmnet  275.68 GB  256     ?       7a071b47-a5cb-4051-bd80-c9a6b11c469a  b
UN  restbase2001-b.codfw.wmnet  296.43 GB  256     ?       2acb7c90-c929-44a0-baa2-c50526cb08a5  b
UN  restbase2001-c.codfw.wmnet  291.97 GB  256     ?       dac7667c-05e4-4c75-8ccd-a27e56b5c614  b
UN  restbase2005.codfw.wmnet    858.13 GB  256     ?       dec9c52c-9327-4707-aabd-cb8eb1f7cb21  d
UN  restbase2002-a.codfw.wmnet  237.28 GB  256     ?       9c67467f-3e48-436c-8ee3-b3d44816a7e5  b
UN  restbase2002-b.codfw.wmnet  312.21 GB  256     ?       3c5fdd02-d97c-4385-8ed4-3ee578b75255  b
UN  restbase2006.codfw.wmnet    842.5 GB   256     ?       17291060-65d0-4096-a6a9-9d193fe1256d  d
UN  restbase2002-c.codfw.wmnet  271.7 GB   256     ?       f902b388-6143-4ad9-99e5-a379d11ac315  b
UN  restbase2003.codfw.wmnet    827.38 GB  256     ?       80dfa7d8-8478-4c08-b104-545f702f40e9  c
UN  restbase2004.codfw.wmnet    868.89 GB  256     ?       7baf3975-7450-4d9a-9daf-d8fe6141ff0a  c
Datacenter: eqiad
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                     Load       Tokens  Owns    Host ID                               Rack
UN  restbase1004.eqiad.wmnet    1.85 TB    256     ?       798ff758-8c91-46e0-b85e-dad356c46f20  b
UN  restbase1005.eqiad.wmnet    1.5 TB     256     ?       325e01e8-debe-45f0-a8c2-93b3baa58968  d
UN  restbase1006.eqiad.wmnet    1.31 TB    256     ?       2abf437d-a16d-406b-a6de-8d28b7dda808  d
UN  restbase1009-a.eqiad.wmnet  665.33 GB  128     ?       0ca32463-de76-40f1-b0c0-715430dab2f7  d
UJ  restbase1008-a.eqiad.wmnet  17.99 GB   128     ?       e2813bb9-f1f2-4d21-ac19-95a7a35b4513  b
UN  restbase1001.eqiad.wmnet    1.08 TB    256     ?       c021a198-b7f1-4dc2-94d7-9cb8b8a8df28  a
UN  restbase1002.eqiad.wmnet    1.02 TB    256     ?       fc041cc8-cd28-4030-b29a-05b9a632cafc  a
UN  restbase1003.eqiad.wmnet    1.45 TB    256     ?       88d9ef9f-d81b-466e-babf-6a283b13f648  b
UN  restbase1007.eqiad.wmnet    1012.82 GB  256     ?       c1b5a012-4840-4096-9a71-ce4d3afb0029  a

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

1008 bootstrapped successfully, and is now also upgraded to 2.1.12.

General metrics are continuing to look significantly better with 2.1.12, with less than half the iowait and fewer SSTables in steady-state operation, thanks to using STCS within time windows. We are now also able to limit the window size to two months, which reduces write amplification by implicitly limiting the maximum SSTable size.
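
A quick way to sanity-check the SSTable trend (the keyspace name below is a placeholder for one of the RESTBase keyspaces):

# per-table SSTable counts; these should stay noticeably lower with STCS-in-windows
nodetool cfstats my_keyspace | grep -E 'Table:|SSTable count'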

@fgiunchedi, could you bootstrap 1007 with 2.1.12 right away once it has finished decommissioning? Let's also import the 2.1.12 packages into reprepro, so that we can upgrade all boxes in the cluster.
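
Rough sketch of the import; the repository base directory, distribution name and package file names below are assumptions and need to be adjusted to our actual apt setup:

reprepro -b /srv/wikimedia includedeb jessie-wikimedia cassandra_2.1.12_all.deb
reprepro -b /srv/wikimedia includedeb jessie-wikimedia cassandra-tools_2.1.12_all.deb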

Latency improvement since the upgrade:

pasted_file (1×1 px, 650 KB)

iowait:

pasted_file (1×1 px, 575 KB)

Effect of upgrading to 2.1.12 in codfw:

pasted_file (3×1 px, 1 MB)

I don't know how it will affect our use of Cassandra (heavy load once a day), but we can definitely try :)

@JAllemandou: It will likely reduce compaction costs, but the effect should be smaller for you. Upgrading is a matter of apt-get install cassandra, leaving the config untouched. Keyspace settings also benefit from some tweaking, although the defaults should already work quite well for you.
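
For example, to see what a table currently uses before changing anything (the keyspace/table names here are just placeholders for the AQS schema):

echo "DESCRIBE TABLE my_keyspace.my_table;" | cqlsh aqs1001.eqiad.wmnet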

This ticket is resolved for RESTBase.

This leaves AQS (@JAllemandou) and the maps service (@Yurik or @MaxSem?).

Upgrade instructions:

sudo apt-get install cassandra cassandra-tools
# press enter on prompts, keeping the existing config files

Neither of us has access, so leaving it to @akosiaris. From IRC, the procedure is to upgrade one server at a time:

<gwicke> yes, just make sure to wait long enough after each upgrade for the server to fully re-join the cluster
<gwicke> in doubt, wait 5 minutes
<gwicke> or check `nodetool status`
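
Roughly, something like this per node (the host list is illustrative, and the dpkg prompts still need to be answered by keeping the existing config files, as in the instructions above):

for host in aqs1001.eqiad.wmnet aqs1002.eqiad.wmnet aqs1003.eqiad.wmnet; do
    ssh -t "$host" 'sudo apt-get install cassandra cassandra-tools'
    # give the node time to fully re-join the cluster before touching the next one
    sleep 300
    # every node should be back to UN (Up/Normal) before moving on
    ssh "$host" 'nodetool status'
done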

Same issue for me: I'm not root on the AQS machines :(
Either @akosiaris or @Ottomata?

I've upgraded Cassandra on maps-test200{1,2,3,4}.codfw.wmnet and everything seems fine.

Btw, I must point out that the Cassandra devs somehow managed to break cqlsh between 2.1.8 and 2.1.12.

If you try to connect to a node running Cassandra 2.1.12 from a node that still has the 2.1.8 package installed, you get

Connection error: ('Unable to connect to any servers', {'maps-test2004.codfw.wmnet': ProtocolError("cql_version '3.2.0' is not supported by remote (w/ native protocol). Supported versions: [u'3.2.1']",)})

The same thing happens if you try to connect from a node with 2.1.12 installed to a node running 2.1.8; of course, the protocol versions in the above message are swapped (3.2.0 <=> 3.2.1).

While it's immediately obvious what's going on and one can adapt, I can't really understand why this would happen across what most people would assume are just four patch-level releases.
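
A possible workaround until both ends run the same package is to pin the CQL version cqlsh asks for to whatever the remote supports, e.g.

cqlsh --cqlversion=3.2.1 maps-test2004.codfw.wmnet

(and --cqlversion=3.2.0 in the opposite direction, from a 2.1.12 cqlsh to a 2.1.8 node).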

Upgrade completed on aqs100[123]. Also note that, given how much data these nodes currently have on spinning disks, it takes ~20 min for each node to start back up; we've seen RESTBase throwing 500s while this is happening.

This comment was removed by fgiunchedi.
GWicke claimed this task.

note that, given how much data these nodes currently have on spinning disks, it takes ~20 min for each node to start back up; we've seen RESTBase throwing 500s while this is happening

I think AQS is configured to read with localQuorum by default, which means that each of the two remaining nodes needs to return a result within the timeout. If that doesn't happen (2s by default), a 500 is returned, as no quorum can be reached in time. This is more likely if IO bandwidth is mostly taken up by compactions; 2.1.12 should reduce the compaction load a bit.

I think it's also worth considering using a read consistency of ONE for AQS, as the consistency requirements in this use case are fairly low / the information is basically static & primarily additive.
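
The consistency level AQS uses is set in its client configuration rather than on the server, but the difference is easy to illustrate from cqlsh (the keyspace/table names below are placeholders):

cat <<'CQL' | cqlsh aqs1001.eqiad.wmnet
CONSISTENCY LOCAL_QUORUM;
SELECT * FROM my_keyspace.my_table LIMIT 1;
CONSISTENCY ONE;
SELECT * FROM my_keyspace.my_table LIMIT 1;
CQL

With LOCAL_QUORUM and RF=3, two local replicas have to answer within the timeout; with ONE, a single healthy replica is enough, which is why it rides out a slow restart much better.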

As AQS becomes more popular, random read IO will likely become the bottleneck. At that point, it's probably time to replace spinning disks with SSDs.

I'm closing this task as done. Thank you, @fgiunchedi and @akosiaris!

I agree the bottleneck will be on IO, and it might come soon depending on expected response times.
Also, read consistency could indeed be set to ONE.

Thanks @GWicke, @fgiunchedi and @akosiaris :)