
Upgrade aqs cluster to Cassandra 4.1.1
Closed, Resolved · Public

Description

Upgrade aqs to Cassandra 4.1.1


Upgrade steps
  1. Override cassandra:settings in host-specific hiera files (hieradata/hosts/aqs*.yaml) and set:
    1. internode_encryption: all
    2. server_encryption_optional: true
    3. legacy_ssl_storage_port_enabled: true
    4. target_version: '4.x'
  2. c-foreach-nt snapshot --tag 3_11_14
  3. sudo rm /etc/cassandra-[a-z]/service-enabled
  4. Merge Puppet changeset
  5. sudo run-puppet-agent
  6. Restart each instance (id=X; sudo touch /etc/cassandra-$id/service-enabled && sudo systemctl restart cassandra-$id)
  7. check-aqs
NOTE: The aqs service logging seems to be broken. Prior to the upgrade, locally hack /etc/aqs/config.yaml to add a logging type of stdout, restart the service, and use journalctl -f _UID=498 to ensure there are no errors.
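
Steps 3 and 6 above can be sketched as a pair of shell functions. This is only an illustration, not the procedure actually used: the function names and the DRY_RUN switch are hypothetical, and it assumes instance ids are single lowercase letters, as the /etc/cassandra-[a-z] glob suggests.

```shell
#!/usr/bin/env bash
# Hypothetical helper for steps 3 and 6 on a single host.
set -eu

# Print the command when DRY_RUN=1 (the default here), otherwise run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 3: remove the service-enabled flag file for each instance.
disable_instances() {
    local id
    for id in "$@"; do
        run sudo rm "/etc/cassandra-${id}/service-enabled"
    done
}

# Step 6: restore the flag file and restart, one instance at a time.
restart_instances() {
    local id
    for id in "$@"; do
        run sudo touch "/etc/cassandra-${id}/service-enabled"
        run sudo systemctl restart "cassandra-${id}"
    done
}
```

With the default DRY_RUN=1, `restart_instances a b` only prints the commands it would run; setting DRY_RUN=0 executes them.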

codfw
  • a_c
    • aqs2001
    • aqs2002
    • aqs2003
    • aqs2004
  • b_e
    • aqs2005
    • aqs2006
    • aqs2007
    • aqs2008
  • c_f
    • aqs2009
    • aqs2010
    • aqs2011
    • aqs2012
eqiad
  • rack1
    • aqs1010
    • aqs1013
    • aqs1016
    • aqs1019
  • rack2
    • aqs1011
    • aqs1014
    • aqs1017
    • aqs1020
  • rack3
    • aqs1012
    • aqs1015
    • aqs1018
    • aqs1021

Post-upgrade steps
  • Move per-host hiera settings back to role (no-op)
  • Set profile::cassandra::monitor_tls_port: 7000 ¹
  • Set legacy_ssl_storage_port_enabled: false (remove assignment)
  • Set server_encryption_optional: false (remove assignment)
  • Clear snapshots
¹ Push out the change, run puppet on cluster nodes, then alert[1-2]001.wikimedia.org, and verify before continuing. Failure to do so will result in alerts.
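
The final item, clearing snapshots, might look like the sketch below. It assumes that c-foreach-nt passes its arguments through to nodetool for each local instance (as the snapshot step in the upgrade list implies), and it uses nodetool's clearsnapshot -t option with the same tag. The function names and the DRY_RUN switch are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper for the "Clear snapshots" post-upgrade item.
set -eu

SNAPSHOT_TAG="3_11_14"   # tag used by the pre-upgrade snapshot step

# Print the command when DRY_RUN=1 (the default here), otherwise run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Assumption: c-foreach-nt forwards "clearsnapshot -t <tag>" to nodetool
# on every local Cassandra instance.
clear_upgrade_snapshots() {
    run c-foreach-nt clearsnapshot -t "$SNAPSHOT_TAG"
}
```

Run with the default DRY_RUN=1 first to confirm the command line, then with DRY_RUN=0 on each host.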


Event Timeline

Eevans triaged this task as Medium priority.

I spent a bit of time trying to run the test suite against Cassandra 4.1.1, but failed. The tests rely on having the database wiped between iterations (the keyspaces are dropped), and the code that (re)creates the schema does not work on 4.1.1 (it (ab)uses system tables that have since changed).

After a bit more finagling, I was able to get some clean test runs (all 281 tests pass).

For this to work:
  • Cassandra must be running, with cqlsh in your PATH.
  • Docker must be running, with the aqs-test-local image available (run make docker).
  • test/utils/cleandb.sh must be patched to change the system keyspace/table name (see: cleandb.sh.patch below).

Finally:

bash test/utils/cleandb.sh \
    && cqlsh -f aqs_test_schema.cql \
    && docker run --rm -it --network host -v $(pwd):/work -w /work aqs-test-local node_modules/mocha/bin/mocha

By creating the expected schema ahead of time (à la aqs_test_schema.cql), the code path broken by the newer Cassandra version is never invoked, and the tests proceed.
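
The same run can be wrapped with checks for the prerequisites listed above, so that a missing piece fails early with a message rather than partway through. This is a rough sketch: require_cmd and run_aqs_tests are hypothetical names, and it does not check whether cleandb.sh has been patched.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the test run described above.
set -eu

# Fail with a message if a prerequisite command is missing from PATH.
require_cmd() {
    command -v "$1" >/dev/null 2>&1 || {
        echo "missing prerequisite: $1" >&2
        return 1
    }
}

run_aqs_tests() {
    require_cmd cqlsh
    require_cmd docker
    # The aqs-test-local image is built by `make docker`.
    docker image inspect aqs-test-local >/dev/null 2>&1 || {
        echo "image aqs-test-local not found; run: make docker" >&2
        return 1
    }
    # Wipe the database, pre-create the schema, then run mocha in Docker.
    bash test/utils/cleandb.sh
    cqlsh -f aqs_test_schema.cql
    docker run --rm -it --network host -v "$(pwd)":/work -w /work \
        aqs-test-local node_modules/mocha/bin/mocha
}
```

Call run_aqs_tests from the repository root, with Cassandra already running locally.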

Obviously nothing here constitutes a "fix" (either for properly running the test suite, or schema creation), but it at least seems to demonstrate that the driver is compatible with Cassandra 4.1.



If there are no objections, I'd like to begin an upgrade on Monday (2023-08-21).

/cc @Milimetric @BTullis

Change 951127 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] aqs: upgrade aqs2001 to Cassandra 4.1.1

https://gerrit.wikimedia.org/r/951127

Change 951127 merged by Eevans:

[operations/puppet@production] aqs: upgrade aqs2001 to Cassandra 4.1.1

https://gerrit.wikimedia.org/r/951127

Mentioned in SAL (#wikimedia-operations) [2023-08-21T14:32:33Z] <urandom> Upgrading aq2001/cassandra-a (canary) to Cassandra 4.1.1 — T339299

Mentioned in SAL (#wikimedia-operations) [2023-08-21T14:39:24Z] <urandom> Upgrading aq2001/cassandra-b (canary) to Cassandra 4.1.1 — T339299

Change 951145 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] aqs: upgrade aqs1010 to Cassandra 4.1.1

https://gerrit.wikimedia.org/r/951145

Change 951145 merged by Eevans:

[operations/puppet@production] aqs: upgrade aqs1010 to Cassandra 4.1.1

https://gerrit.wikimedia.org/r/951145

Mentioned in SAL (#wikimedia-operations) [2023-08-21T18:56:34Z] <urandom> Upgrading aq1010/cassandra-{a,b} (canary) to Cassandra 4.1.1 — T339299

Change 951476 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] aqs: upgrade rack1 nodes to Cassandra 4.1.1

https://gerrit.wikimedia.org/r/951476

Change 951478 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] aqs: upgrade rack2 nodes to Cassandra 4.1.1

https://gerrit.wikimedia.org/r/951478

Change 951479 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] aqs: upgrade rack3 nodes to Cassandra 4.1.1

https://gerrit.wikimedia.org/r/951479

Change 951476 merged by Eevans:

[operations/puppet@production] aqs: upgrade rack1 nodes to Cassandra 4.1.1

https://gerrit.wikimedia.org/r/951476

Mentioned in SAL (#wikimedia-operations) [2023-08-22T14:37:01Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching aqs101[6,9].eqiad.wmnet: Upgrade Cassandra to 4.1.1 — T339299 - eevans@cumin1001

Mentioned in SAL (#wikimedia-operations) [2023-08-22T14:44:56Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs101[6,9].eqiad.wmnet: Upgrade Cassandra to 4.1.1 — T339299 - eevans@cumin1001

Eevans updated the task description.

Change 951478 merged by Eevans:

[operations/puppet@production] aqs: upgrade rack2 nodes to Cassandra 4.1.1

https://gerrit.wikimedia.org/r/951478

Mentioned in SAL (#wikimedia-operations) [2023-08-22T14:50:47Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching aqs10[11,14,17,20].eqiad.wmnet: Upgrade Cassandra to 4.1.1 — T339299 - eevans@cumin1001

Mentioned in SAL (#wikimedia-operations) [2023-08-22T15:07:23Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs10[11,14,17,20].eqiad.wmnet: Upgrade Cassandra to 4.1.1 — T339299 - eevans@cumin1001

Change 951479 merged by Eevans:

[operations/puppet@production] aqs: upgrade rack3 nodes to Cassandra 4.1.1

https://gerrit.wikimedia.org/r/951479

Mentioned in SAL (#wikimedia-operations) [2023-08-22T15:12:28Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching aqs10[12,15,18,21].eqiad.wmnet: Upgrade Cassandra to 4.1.1 — T339299 - eevans@cumin1001

Mentioned in SAL (#wikimedia-operations) [2023-08-22T15:29:17Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs10[12,15,18,21].eqiad.wmnet: Upgrade Cassandra to 4.1.1 — T339299 - eevans@cumin1001

Change 951535 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] aqs: upgrade codfw/a_c nodes to Cassandra 4.1.1

https://gerrit.wikimedia.org/r/951535

Change 951536 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] aqs: upgrade codfw/b_c nodes to Cassandra 4.1.1

https://gerrit.wikimedia.org/r/951536

Change 951537 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] aqs: upgrade codfw/c_f nodes to Cassandra 4.1.1

https://gerrit.wikimedia.org/r/951537

Change 951535 merged by Eevans:

[operations/puppet@production] aqs: upgrade codfw/a_c nodes to Cassandra 4.1.1

https://gerrit.wikimedia.org/r/951535

Mentioned in SAL (#wikimedia-operations) [2023-08-22T16:58:44Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching aqs200[2-4].codfw.wmnet: Upgrade Cassandra to 4.1.1 — T339299 - eevans@cumin1001

Mentioned in SAL (#wikimedia-operations) [2023-08-22T17:13:38Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs200[2-4].codfw.wmnet: Upgrade Cassandra to 4.1.1 — T339299 - eevans@cumin1001

Change 951536 merged by Eevans:

[operations/puppet@production] aqs: upgrade codfw/b_c nodes to Cassandra 4.1.1

https://gerrit.wikimedia.org/r/951536

Mentioned in SAL (#wikimedia-operations) [2023-08-22T17:30:00Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching aqs200[5-8].codfw.wmnet: Upgrade Cassandra to 4.1.1 — T339299 - eevans@cumin1001

Mentioned in SAL (#wikimedia-operations) [2023-08-22T17:48:44Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs200[5-8].codfw.wmnet: Upgrade Cassandra to 4.1.1 — T339299 - eevans@cumin1001

Change 951537 merged by Eevans:

[operations/puppet@production] aqs: upgrade codfw/c_f nodes to Cassandra 4.1.1

https://gerrit.wikimedia.org/r/951537

Mentioned in SAL (#wikimedia-operations) [2023-08-22T20:44:17Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching aqs20[09-12].codfw.wmnet: Upgrade Cassandra to 4.1.1 — T339299 - eevans@cumin1001

Mentioned in SAL (#wikimedia-operations) [2023-08-22T21:02:27Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs20[09-12].codfw.wmnet: Upgrade Cassandra to 4.1.1 — T339299 - eevans@cumin1001

Eevans updated the task description.

Change 951585 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] aqs: move per-host settings back to role

https://gerrit.wikimedia.org/r/951585

Change 951585 merged by Eevans:

[operations/puppet@production] aqs: move per-host settings back to role

https://gerrit.wikimedia.org/r/951585

Change 951589 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] aqs: set legacy ssl port & optional encryption to false

https://gerrit.wikimedia.org/r/951589

Change 951589 merged by Eevans:

[operations/puppet@production] aqs: set legacy ssl port & optional encryption to false

https://gerrit.wikimedia.org/r/951589

Mentioned in SAL (#wikimedia-operations) [2023-08-22T22:59:43Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-codfw: Disable legacy SSL port — T339299 - eevans@cumin1001

Mentioned in SAL (#wikimedia-operations) [2023-08-22T23:45:04Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-codfw: Disable legacy SSL port — T339299 - eevans@cumin1001

Mentioned in SAL (#wikimedia-operations) [2023-08-22T23:49:52Z] <eevans@cumin1001> START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-eqiad: Disable legacy SSL port — T339299 - eevans@cumin1001

Mentioned in SAL (#wikimedia-operations) [2023-08-23T00:35:58Z] <eevans@cumin1001> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-eqiad: Disable legacy SSL port — T339299 - eevans@cumin1001

This upgrade is, for all intents and purposes, complete; the only remaining item is to clean up the snapshots.

I can't see us having to roll back at this stage, but out of an abundance of caution I will leave the snapshots in place until Monday (Aug. 28), and close this ticket after they've been cleaned up.

Mentioned in SAL (#wikimedia-operations) [2023-08-28T20:27:48Z] <urandom> clear pre-upgrade aqs snapshots — T339299

Eevans claimed this task.