
Unable to bootstrap restbase1030-{a,b,c}
Closed, Resolved · Public

Description

After extended Cassandra node outages (the result of storage device issues), the instances hosted on restbase1030 were removed and the host re-imaged to Bullseye. Afterward, attempts to re-bootstrap the instances resulted in high client error rates, caused by UnavailableExceptions like the following:

Server timeout during read query at consistency LOCAL_QUORUM (2 replica(s) responded over 3 required)

This error seems to say that a quorum was requested but could not be met: "only" 2 replicas responded, where 3 were required. This is quite strange, since these keyspaces use a replication factor of 3 (per data-center), which makes a quorum 2 replicas; 3 replicas would only be a quorum for a replication factor of 4 or 5.
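For reference, Cassandra derives a quorum as floor(RF / 2) + 1, so RF=3 gives a quorum of 2, while RF=4 or 5 gives 3. A quick way to double-check the configured replication, as a sketch only (the instance hostname and keyspace name below are illustrative, and cqlsh may need TLS/auth options in this environment):

  # Inspect the configured replication for a keyspace (names are illustrative)
  cqlsh restbase1030-a.eqiad.wmnet -e \
    "SELECT keyspace_name, replication FROM system_schema.keyspaces
     WHERE keyspace_name = 'local_group_default_T_parsoid_html';"

  # Quorum arithmetic: quorum = floor(RF / 2) + 1
  #   RF=3      -> quorum of 2
  #   RF=4 or 5 -> quorum of 3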

Event Timeline

I tried to reproduce this by using siege to generate traffic against https://staging.svc.eqiad.wmnet:8081 while simultaneously decommissioning and bootstrapping nodes, without success.
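Roughly, the reproduction attempt amounted to something like the following sketch (the siege concurrency/duration and the decommission step are illustrative choices, not a record of what was actually run):

  # Sustained read traffic against the staging endpoint
  siege -c 20 -t 15M https://staging.svc.eqiad.wmnet:8081/

  # In parallel, on a staging Cassandra node: leave the ring cleanly...
  nodetool decommission

  # ...then clear its data directories and restart the instance so it
  # bootstraps back in (paths and service names depend on the instance layout)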

Mentioned in SAL (#wikimedia-operations) [2023-10-04T13:14:28Z] <urandom> Cassandra bootstrap, restbase1030-a (auto_bootstrap: false) — T346803

Mentioned in SAL (#wikimedia-operations) [2023-10-04T14:16:45Z] <urandom> starting Cassandra rebuild, restbase1030-a — T346803

Mentioned in SAL (#wikimedia-operations) [2023-10-04T22:02:40Z] <urandom> starting Cassandra rebuild, restbase1030-b — T346803

Mentioned in SAL (#wikimedia-operations) [2023-10-05T13:32:26Z] <urandom> starting Cassandra rebuild, restbase1030-c — T346803

restbase1030-c is having issues rejoining the cluster; it appears to hit this error, give up, and then retry: https://phabricator.wikimedia.org/P52849


Thanks @hnowlan.

I had left a full repair running last evening. It failed (not enough file descriptors), causing the JVM to exit in a less-than-graceful manner; a corrupt transaction log seems to have been the result. I've moved the offending file(s) out of the way, increased nofile (temporarily), and restarted the node. It is currently working its way through an enormous compaction backlog (due to the repair), but it seems stable now (I'll continue monitoring it).
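For posterity, the recovery looked roughly like this sketch; the service name, limit value, and file paths are illustrative rather than the exact ones used on restbase1030:

  # Temporarily raise the open-file limit via a systemd drop-in
  # (service name and value are illustrative)
  mkdir -p /etc/systemd/system/cassandra-c.service.d
  printf '[Service]\nLimitNOFILE=200000\n' > /etc/systemd/system/cassandra-c.service.d/nofile.conf
  systemctl daemon-reload

  # Move the corrupt sstable transaction log(s) out of the data directory
  # (keyspace/table directory below is illustrative)
  table_dir="/srv/cassandra-c/data/local_group_default_T_parsoid_html/data-0123456789abcdef"
  mkdir -p /srv/cassandra-c/quarantine
  mv "${table_dir}"/*_txn_*.log /srv/cassandra-c/quarantine/

  # Restart the instance and watch the compaction backlog drain
  systemctl restart cassandra-c
  nodetool compactionstats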

Eevans triaged this task as Medium priority. (Oct 6 2023, 3:54 PM)

Change 964072 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] cassandra: add utility wrapper & instance symlinks for sstableutil

https://gerrit.wikimedia.org/r/964072
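The patch itself isn't reproduced here. For context, sstableutil is the upstream tool for listing a table's sstable files and cleaning up leftover transaction logs (relevant to the corrupt-transaction-log episode above); on a multi-instance host it has to be pointed at the right instance's config and data directories, which is presumably what the wrapper and symlinks provide. A rough illustration, with illustrative keyspace/table names:

  # List a table's sstable files, including any leftover transaction logs
  # (keyspace/table names are illustrative)
  sstableutil local_group_default_T_parsoid_html data

  # Clean up orphaned transaction logs left behind by an unclean shutdown
  sstableutil --cleanup local_group_default_T_parsoid_html data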

Eevans claimed this task.

I was not able to reproduce this in the dev cluster, and wasn't able to uncover a root cause from restbase1030 (at least not without exceeding my comfort level for experimentation in a production setting). I got the instances back online by (temporarily) setting auto_bootstrap: false, performing a nodetool rebuild, and finally running a full repair (for good measure).
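In case it's useful later, the per-instance sequence was roughly the following; the config path, service name, and host addressing are illustrative of this multi-instance setup, not exact:

  # 1. Set auto_bootstrap: false in the instance's cassandra.yaml (puppet-managed
  #    here; the path is illustrative), then start the instance
  #      /etc/cassandra-a/cassandra.yaml:  auto_bootstrap: false
  systemctl start cassandra-a

  # 2. Once the node has joined the ring, stream its data explicitly
  nodetool -h restbase1030-a.eqiad.wmnet rebuild

  # 3. Re-enable auto_bootstrap and, for good measure, run a full repair
  nodetool -h restbase1030-a.eqiad.wmnet repair --full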

I'm going to assume for now that this was the result of how the instances were removed (something that should be avoided anyway), and that we aren't likely to see it again. If it does recur, we may need to spend more time getting to the bottom of it.

Change 964072 merged by Eevans:

[operations/puppet@production] cassandra: add utility wrapper & instance symlinks for sstableutil

https://gerrit.wikimedia.org/r/964072

Change 965521 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] cassandra: fix incorrect path to sstable utilities

https://gerrit.wikimedia.org/r/965521

Change 965521 merged by Eevans:

[operations/puppet@production] cassandra: fix incorrect path to sstable utilities

https://gerrit.wikimedia.org/r/965521