Cassandra 3.x Tracking
Closed, ResolvedPublic

Description

Cassandra 3 (currently 3.11 / 3.0.12) has a number of significant improvements, including completely rewritten storage, materialized views, and time-windowed compaction.

As part of T156199: Low-latency current revision storage, we are considering a migration strategy that would downsize the current cluster to free up enough hardware nodes to set up a secondary cluster, deploy the updated storage module, and migrate use-cases to it incrementally. If we decide on this approach, it would provide a unique opportunity to deploy this new cluster with Cassandra 3, and avoid the high costs associated with an in-place upgrade later.

The first step is to research outstanding issues, attempt to identify potential risks to our use-cases, and select a version to begin testing with.

Cassandra Releases

2.2.x

We are currently using 2.2.6 in production (the latest is 2.2.9). The 2.2 series took quite a bit longer to release than was planned; it was held up by ambitious features that took longer to complete than expected (most notably, a complete rewrite of storage). At some point, a decision was made to release with the (stable) features that had accumulated, rather than make users continue to wait. The net result is that the 2.2 series carried fewer new features than previous major releases; it's basically a bridge between 2.1 and 3.x (which is what should have been the next major release).

The 2.2 series is set to be EOL with the release of 4.0 (TBD).

3.0.x

Starting with v3, the Cassandra project moved to a tick-tock release cycle. Releases are made monthly: even-numbered releases contain new features and bug fixes, while odd-numbered releases contain bug fixes only. The idea was to keep trunk as close to releasable as possible at all times, and to get new features (and fixes) out to users for testing more incrementally. Since it was recognized that this might take some time to achieve, the 3.0 release was special-cased to receive all bug fixes from the 3.x series.
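
As a rough illustration of the even/odd rule described above, here is a minimal Python sketch. The function name `tick_tock_kind` is hypothetical and exists only for this example; it is not part of Cassandra or any of our tooling. It deliberately rejects 3.0.x, since that series is special-cased as described.

```python
def tick_tock_kind(version: str) -> str:
    """Classify a Cassandra 3.x tick-tock release by its minor number.

    Hypothetical helper for illustration only.
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    # 3.0.x is a special-cased bug-fix series, not a tick-tock release.
    if major != 3 or minor == 0:
        raise ValueError(f"{version} is not a 3.x tick-tock release")
    # Even minor numbers carry new features; odd ones are bug fixes only.
    return "features and bug fixes" if minor % 2 == 0 else "bug fixes only"


for release in ("3.2", "3.7", "3.10"):
    print(release, "->", tick_tock_kind(release))
```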

The 3.0 series (currently at 3.0.12) is set to be EOL 6 months after 4.0 (TBD).

3.x (tick-tock releases)

Amongst the various tick-tock versions (currently up to 3.10), there are two of interest for what they introduce: version 3.2 introduced a feature that pins one compactor to each data directory, and version 3.6 made row indexing more memory efficient (effectively raising the limit on partition width).
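
To make the version thresholds above concrete, the following Python sketch checks which of these two enhancements a candidate version would include. The minimum versions come from the paragraph above; the feature labels and the helper names (`parse_version`, `available_features`) are informal and hypothetical, and the comparison assumes that a vendor suffix such as `-instaclustr` does not remove features from the base release.

```python
import re

# Minimum versions as stated in this task; the labels are informal.
FEATURE_MIN_VERSION = {
    "one compactor pinned per data directory": (3, 2),
    "memory-efficient row indexing": (3, 6),
}


def parse_version(version):
    """Return the leading numeric components, ignoring any vendor suffix."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def available_features(version):
    """List the features whose minimum version is met by `version`."""
    parsed = parse_version(version)
    return [name for name, minimum in FEATURE_MIN_VERSION.items()
            if parsed >= minimum]


for candidate in ("2.2.6", "3.0.12", "3.7.3-instaclustr", "3.10"):
    print(candidate, "->", available_features(candidate))
```

Under these assumptions, 2.2.6 and 3.0.12 report neither feature, while 3.7.3-instaclustr and 3.10 report both.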

4.0

Tick-tock releases have not been popular, so beginning with version 4.0, they are being abandoned. Additionally, after recent changes within the project, sentiment seems to be shifting strongly in the direction of slowing the pace of destabilizing change, and concentrating on code debt and testability. There is no firm release date for 4.0 yet.

Selecting a Version

Prior to the 3.x tick-tock releases, Cassandra was somewhat legendary for needing several minor bugfix releases to a new version before it could even be considered for production use. Once tick-tock started, things got more complicated, as iterations were as likely to introduce new regressions as they were to solve existing ones. Community consensus seems to be that once you start down the tick-tock path, it's best to ride it through to the end, upgrading to each new odd-numbered release. However, consensus also seems to suggest that this is not common; most people who choose to run Cassandra 3 in production avoid the tick-tock releases and run 3.0.

3.0.12

Having received primarily bug fixes, and no disruptive changes since the original 3.0 release, this is the safest of the 3.x line. However, the most significant benefit it would provide is to simply get us on the 3.x train so that we might avoid an in-place upgrade from a 2.2 release.

3.7.3-instaclustr

Most users with non-trivial deployments accept a greater degree of ownership of the code they run; it isn't uncommon to maintain builds from Cassandra forks where critical bugfixes are backported. Instaclustr does this, and has published a fork of 3.7 + fixes (a so-called LTS). They are hopeful that the 3.x releases are stabilizing, and may end maintenance of their branch in favor of a release >= 3.10.

3.10

The most current of the 3.x tick-tock releases, and the first release made in the wake of recent upstream changes that seem to have triggered a sentiment that it is time to take our foot off the gas and concentrate on stabilization, code debt, and testability.

3.11.0

https://lists.apache.org/thread.html/b3afcf80349e29957704560a5931c5fbd7a8bc8126c3658fefb16cc1@%3Cdev.cassandra.apache.org%3E


It should go without saying that upgrading to any Cassandra 3 release at this time is a somewhat daring move; these releases are still considered (to varying degrees) "bleeding edge". The opportunity to start with a new cluster (side-stepping any upgrade issues), and to migrate use cases incrementally, is what buys us the leeway to be a little daring. There is, however, the real possibility that problems could arise only after we are more committed. If (when?) we were to find ourselves in this position, given the bleeding edge nature of these releases, it would fall more upon us to work on software-based solutions, a proposition we have been resistant to in the past when it comes to Cassandra.

Recommendation (@Eevans)

Start with 3.7.3-instaclustr, and commit to a reasonable effort in making it work. If 3.7.3-instaclustr proves too dicey, drop back to 3.0.12 (and of course, if necessary, 2.2.6).

I have done some preliminary testing of 3.7.3-instaclustr; it so far seems solid, and is the 3.x version that Instaclustr offers its customers. I make a point of saying "reasonable effort" here only because a >= 3.7 release is required to take advantage of important enhancements to compaction and row indexing; falling back to 3.0.12 will mean we see significantly less benefit from this, and so avoiding this outcome is worth a commensurate level of additional effort.

We should plan to closely monitor what is happening in successive 3.x releases (>= 3.10), work with upstream and/or Instaclustr, and tentatively plan to upgrade when Instaclustr does (i.e. when they EOL their LTS).

Finally, we should only embark on this with the knowledge that these are less hardened releases, and be prepared to roll up our sleeves and get our hands dirty if any unforeseen issues become apparent at a later stage.

Current Status

The dev and newly minted 3.x production environments are running patched builds of Cassandra 3.11.0 (see the README for our fork).

Outstanding Issues

Related Objects

Event Timeline

Eevans triaged this task as Medium priority. Mar 15 2017, 8:29 PM
Eevans renamed this task from Cassandra 3.0.x to Cassandra 3.x. Mar 21 2017, 7:16 PM
Eevans updated the task description.
Eevans updated the task description.
Eevans moved this task from Next to In-Progress on the Cassandra board.

Change 349668 had a related patch set uploaded (by Eevans):
[operations/puppet@production] WIP: Create a Cassandra 3.7 configuration

https://gerrit.wikimedia.org/r/349668

Change 349668 merged by Filippo Giunchedi:
[operations/puppet@production] Create a Cassandra 3.7 configuration

https://gerrit.wikimedia.org/r/349668

Change 350249 had a related patch set uploaded (by Eevans):
[operations/puppet@production] Assign a hints directory (Cassandra >= 3.0)

https://gerrit.wikimedia.org/r/350249

Change 350249 merged by Filippo Giunchedi:
[operations/puppet@production] Assign a hints directory (Cassandra >= 3.0)

https://gerrit.wikimedia.org/r/350249

Eevans renamed this task from Cassandra 3.x to Cassandra 3.x Tracking. May 22 2017, 3:58 PM

Mentioned in SAL (#wikimedia-operations) [2017-05-30T18:37:19Z] <urandom> T160570: Upgrading dev env to Cassandra 3.11 (snapshot)

Change 357882 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Use Cassandra version that corresponds to what is being tested

https://gerrit.wikimedia.org/r/357882

Change 357882 merged by Dzahn:
[operations/puppet@production] Use Cassandra version that corresponds to what is being tested

https://gerrit.wikimedia.org/r/357882

Mentioned in SAL (#wikimedia-operations) [2017-06-08T21:42:33Z] <urandom> T160570: Rolling Cassandra restart, restbase-dev

Mentioned in SAL (#wikimedia-operations) [2017-06-26T18:32:48Z] <urandom> T160570: Upgrading restbase-dev1001 to Cassandra 3.11.0 (release)

Mentioned in SAL (#wikimedia-operations) [2017-06-26T19:30:54Z] <urandom> T160570: Upgrading restbase-dev1002 to Cassandra 3.11.0 (release)

Mentioned in SAL (#wikimedia-operations) [2017-06-26T19:35:15Z] <urandom> T160570: Upgrading restbase-dev1003 to Cassandra 3.11.0 (release)

Mentioned in SAL (#wikimedia-operations) [2017-09-18T18:56:59Z] <urandom> T160570: Upgrading restbase-dev1004.eqiad.wmnet to Cassandra 3.11.0-wmf5 (canary)

Mentioned in SAL (#wikimedia-operations) [2017-09-18T19:10:10Z] <urandom> T160570: Upgrading restbase-dev100[5-6].eqiad.wmnet to Cassandra 3.11.0-wmf5

Change 378926 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Upgrade Cassanra build to 3.11.0-wmf5

https://gerrit.wikimedia.org/r/378926

Change 378926 merged by Dzahn:
[operations/puppet@production] cassandra: bump dev cluster version to 3.11.0-wmf5

https://gerrit.wikimedia.org/r/378926

Mentioned in SAL (#wikimedia-operations) [2017-09-19T18:35:21Z] <urandom> T160570, T169940: Upgrade restbase-ng environment to Cassandra 3.11.0-wmf5

Change 378983 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Upgrade restbase-ng env to Cassandra 3.11.0-wmf5

https://gerrit.wikimedia.org/r/378983

Change 378983 merged by Dzahn:
[operations/puppet@production] cassandra: Upgrade restbase-ng env to Cassandra 3.11.0-wmf5

https://gerrit.wikimedia.org/r/378983

This ticket's scope (which I think has lately started to become muddled) was originally to facilitate a discussion of which version to use for T156199: Low-latency current revision storage, and subsequently to track the testing and deployment of that version. At this time, Cassandra 3.11.0 is deployed to a cluster of 6 machines, and is serving live traffic for both mobileapps and feeds, so I think it is reasonable to consider this done and close the ticket.

T177621: Apache Cassandra Tracking has been opened for general purpose tracking of outstanding issues with Cassandra itself.