Cassandra quorum read timeouts during node decommissions
Open, Stalled, Medium, Public

Assigned To
None
Authored By
Eevans
Mar 20 2024, 5:03 PM
Referenced Files
F43777569: image.png
Fri, Mar 29, 4:02 PM
F43071959: image.png
Mar 22 2024, 7:17 PM
F42959085: image.png
Mar 21 2024, 5:10 PM
Description

After today's data-center switchover (codfw to eqiad) elevated errors were observed for changeprop. These errors were bubbling up from RESTBase, and ultimately Cassandra:

changeprop
{"type":"https://mediawiki.org/wiki/HyperSwitch/errors/update_error","title":"Internal error in Cassandra table storage backend","method":"get","uri":"/de.wikipedia.org/v1/page/html/Benutzer%3AHerzi_Pinki%2FsyncTest/243290995","internalURI":"https://restbase-async.discovery.wmnet:7443/de.wikipedia.org/v1/page/html/Benutzer%3AHerzi_Pinki%2FsyncTest/243290995","internalMethod":"get"}
restbase
ResponseError: Server timeout during read query at consistency LOCAL_QUORUM (2 replica(s) responded over 3 required)
    at FrameReader.readError (/srv/deployment/restbase/deploy-cache/revs/7e5e72087d8331131669babfb8f40b269c024cd7/node_modules/cassandra-driver/lib/readers.js:326:15)
    at Parser.parseBody (/srv/deployment/restbase/deploy-cache/revs/7e5e72087d8331131669babfb8f40b269c024cd7/node_modules/cassandra-driver/lib/streams.js:194:66)
    at Parser._transform (/srv/deployment/restbase/deploy-cache/revs/7e5e72087d8331131669babfb8f40b269c024cd7/node_modules/cassandra-driver/lib/streams.js:137:10)
    at Parser.Transform._read (_stream_transform.js:191:10)
    at Parser.Transform._write (_stream_transform.js:179:12)
    at doWrite (_stream_writable.js:403:12)
    at writeOrBuffer (_stream_writable.js:387:5)
    at Parser.Writable.write (_stream_writable.js:318:11)
    at Protocol.ondata (_stream_readable.js:718:22)
    at Protocol.emit (events.js:314:20)
    at addChunk (_stream_readable.js:297:12)
    at readableAddChunk (_stream_readable.js:272:9)
    at Protocol.Readable.push (_stream_readable.js:213:10)
    at Protocol.Transform.push (_stream_transform.js:152:32)
    at Protocol.readItems (/srv/deployment/restbase/deploy-cache/revs/7e5e72087d8331131669babfb8f40b269c024cd7/node_modules/cassandra-driver/lib/streams.js:109:10)
    at Protocol._transform (/srv/deployment/restbase/deploy-cache/revs/7e5e72087d8331131669babfb8f40b269c024cd7/node_modules/cassandra-driver/lib/streams.js:32:10)

This error indicates a read timeout while attempting a LOCAL_QUORUM read: only 2 replicas responded within the timeout period, when at least 3 were required. What is puzzling here is that each data-center holds only 3 replicas, so a LOCAL_QUORUM should in fact be 2 copies!

Working hypothesis

There was a decommission running in eqiad at the time of the switchover (T354561). Decommissions (and bootstraps) temporarily increase the replication by one (to include the node joining/leaving), and enforce consistency level requirements for writes. This is necessary to maintain consistency guarantees and bootstrap/decommission atomicity. The failed operations were reads, but these tables have blocking read repair enabled, which means that (blocking) writes can be triggered (probabilistically?) by a read. This would help explain the biggest mystery, namely why a read operation would fail with such an unexpected quorum requirement.
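
For reference, the arithmetic behind this hypothesis can be sketched as follows. This is a simplified model and not Cassandra's actual code: it assumes a quorum is a simple majority of the replicas, and that a write additionally waits for one acknowledgement per pending (joining/leaving) replica. The function names are made up for illustration.

```python
def quorum(replicas: int) -> int:
    """Smallest majority of `replicas`."""
    return replicas // 2 + 1


def write_block_for(replication_factor: int, pending: int = 0) -> int:
    """Acknowledgements a quorum write waits for in this simplified model:
    the normal quorum plus one per pending (joining/leaving) replica."""
    return quorum(replication_factor) + pending


# Steady state: 3 replicas per data-center, so a local quorum is 2.
print(write_block_for(3))             # 2
# With one node leaving, one extra replica is pending, so 3 are required.
print(write_block_for(3, pending=1))  # 3
```

Under these assumptions, a blocking read repair issued during the decommission would wait on 3 replicas, which matches the "3 required" in the error above.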

This still does not explain why this larger-than-expected quorum would fail; even a quorum of 3 should succeed in an otherwise healthy cluster (the odd transient/spurious failure notwithstanding). CASSANDRA-19120 might offer the final piece to this puzzle.

Next steps

Terminate the decommission to see if the errors cease. If they do, disable blocking read-repair and attempt another decommission. If the errors do not return, we can assume CASSANDRA-19120 is the culprit, and either backport the patch, or wait for a 4.1.5 release.

Update: Terminating the decommission stopped the read timeouts.
Update: Disabling read-repair did not make the errors disappear entirely, but the rate of errors dropped quite low.

Event Timeline

Eevans triaged this task as High priority. Mar 20 2024, 5:05 PM
Eevans updated the task description.
Eevans updated the task description.
disable read repair
ALTER TABLE "commons_T_mobileoZCBVtILw5eSrwi0VIGaFVSr2jY".data WITH read_repair = 'NONE';
ALTER TABLE "commons_T_mobileoZCBVtILw5eSrwi0VIGaFVSr2jY".meta WITH read_repair = 'NONE';
ALTER TABLE "commons_T_page__summary".data WITH read_repair = 'NONE';
ALTER TABLE "commons_T_page__summary".meta WITH read_repair = 'NONE';
ALTER TABLE "commons_T_parsoidphp".data WITH read_repair = 'NONE';
ALTER TABLE "commons_T_parsoidphp".meta WITH read_repair = 'NONE';
ALTER TABLE "commons_T_parsoidphpAetaDBIXR_iWCeS5MknjkVQgnFw".data WITH read_repair = 'NONE';
ALTER TABLE "commons_T_parsoidphpAetaDBIXR_iWCeS5MknjkVQgnFw".meta WITH read_repair = 'NONE';
ALTER TABLE "commons_T_title__revisions3WsaB42Wia1E_eq_KmoYTH".data WITH read_repair = 'NONE';
ALTER TABLE "commons_T_title__revisions3WsaB42Wia1E_eq_KmoYTH".meta WITH read_repair = 'NONE';
ALTER TABLE echostore.values WITH read_repair = 'NONE';
ALTER TABLE "enwiki_T_mediaEKhjXRLvi0_47tdHyL7nahpjtY4".data WITH read_repair = 'NONE';
ALTER TABLE "enwiki_T_mediaEKhjXRLvi0_47tdHyL7nahpjtY4".meta WITH read_repair = 'NONE';
ALTER TABLE "enwiki_T_mobileoZCBVtILw5eSrwi0VIGaFVSr2jY".data WITH read_repair = 'NONE';
ALTER TABLE "enwiki_T_mobileoZCBVtILw5eSrwi0VIGaFVSr2jY".meta WITH read_repair = 'NONE';
ALTER TABLE "enwiki_T_mobileVuV78NVnWlTWLhDwwr5v2FMH2wA".data WITH read_repair = 'NONE';
ALTER TABLE "enwiki_T_mobileVuV78NVnWlTWLhDwwr5v2FMH2wA".meta WITH read_repair = 'NONE';
ALTER TABLE "enwiki_T_page__summary".data WITH read_repair = 'NONE';
ALTER TABLE "enwiki_T_page__summary".meta WITH read_repair = 'NONE';
ALTER TABLE "enwiki_T_parsoidphp".data WITH read_repair = 'NONE';
ALTER TABLE "enwiki_T_parsoidphp".meta WITH read_repair = 'NONE';
ALTER TABLE "enwiki_T_parsoidphpAetaDBIXR_iWCeS5MknjkVQgnFw".data WITH read_repair = 'NONE';
ALTER TABLE "enwiki_T_parsoidphpAetaDBIXR_iWCeS5MknjkVQgnFw".meta WITH read_repair = 'NONE';
ALTER TABLE "enwiki_T_title__revisions3WsaB42Wia1E_eq_KmoYTHe".data WITH read_repair = 'NONE';
ALTER TABLE "enwiki_T_title__revisions3WsaB42Wia1E_eq_KmoYTHe".meta WITH read_repair = 'NONE';
ALTER TABLE "globaldomain_T_mathoid__ng_check".data WITH read_repair = 'NONE';
ALTER TABLE "globaldomain_T_mathoid__ng_check".meta WITH read_repair = 'NONE';
ALTER TABLE "globaldomain_T_mathoid__ng_hash__table".data WITH read_repair = 'NONE';
ALTER TABLE "globaldomain_T_mathoid__ng_hash__table".meta WITH read_repair = 'NONE';
ALTER TABLE "globaldomain_T_mathoid__ng_input".data WITH read_repair = 'NONE';
ALTER TABLE "globaldomain_T_mathoid__ng_input".meta WITH read_repair = 'NONE';
ALTER TABLE "globaldomain_T_mathoid__ng_mml".data WITH read_repair = 'NONE';
ALTER TABLE "globaldomain_T_mathoid__ng_mml".meta WITH read_repair = 'NONE';
ALTER TABLE "globaldomain_T_mathoid__ng_png".data WITH read_repair = 'NONE';
ALTER TABLE "globaldomain_T_mathoid__ng_png".meta WITH read_repair = 'NONE';
ALTER TABLE "globaldomain_T_mathoid__ng_svg".data WITH read_repair = 'NONE';
ALTER TABLE "globaldomain_T_mathoid__ng_svg".meta WITH read_repair = 'NONE';
ALTER TABLE "others_T_mediaEKhjXRLvi0_47tdHyL7nahpjtY4".data WITH read_repair = 'NONE';
ALTER TABLE "others_T_mediaEKhjXRLvi0_47tdHyL7nahpjtY4".meta WITH read_repair = 'NONE';
ALTER TABLE "others_T_mobileoZCBVtILw5eSrwi0VIGaFVSr2jY".data WITH read_repair = 'NONE';
ALTER TABLE "others_T_mobileoZCBVtILw5eSrwi0VIGaFVSr2jY".meta WITH read_repair = 'NONE';
ALTER TABLE "others_T_mobileVuV78NVnWlTWLhDwwr5v2FMH2wA".data WITH read_repair = 'NONE';
ALTER TABLE "others_T_mobileVuV78NVnWlTWLhDwwr5v2FMH2wA".meta WITH read_repair = 'NONE';
ALTER TABLE "others_T_page__summary".data WITH read_repair = 'NONE';
ALTER TABLE "others_T_page__summary".meta WITH read_repair = 'NONE';
ALTER TABLE "others_T_parsoidphp".data WITH read_repair = 'NONE';
ALTER TABLE "others_T_parsoidphp".meta WITH read_repair = 'NONE';
ALTER TABLE "others_T_parsoidphpAetaDBIXR_iWCeS5MknjkVQgnFw".data WITH read_repair = 'NONE';
ALTER TABLE "others_T_parsoidphpAetaDBIXR_iWCeS5MknjkVQgnFw".meta WITH read_repair = 'NONE';
ALTER TABLE "others_T_title__revisions3WsaB42Wia1E_eq_KmoYTHe".data WITH read_repair = 'NONE';
ALTER TABLE "others_T_title__revisions3WsaB42Wia1E_eq_KmoYTHe".meta WITH read_repair = 'NONE';
ALTER TABLE pregenerated_cache.media_list WITH read_repair = 'NONE';
ALTER TABLE pregenerated_cache.mobile_html WITH read_repair = 'NONE';
ALTER TABLE pregenerated_cache.page_summary WITH read_repair = 'NONE';
ALTER TABLE "wikipedia_T_mediaEKhjXRLvi0_47tdHyL7nahpjtY4".data WITH read_repair = 'NONE';
ALTER TABLE "wikipedia_T_mediaEKhjXRLvi0_47tdHyL7nahpjtY4".meta WITH read_repair = 'NONE';
ALTER TABLE "wikipedia_T_mobileoZCBVtILw5eSrwi0VIGaFVSr2jY".data WITH read_repair = 'NONE';
ALTER TABLE "wikipedia_T_mobileoZCBVtILw5eSrwi0VIGaFVSr2jY".meta WITH read_repair = 'NONE';
ALTER TABLE "wikipedia_T_mobileVuV78NVnWlTWLhDwwr5v2FMH2wA".data WITH read_repair = 'NONE';
ALTER TABLE "wikipedia_T_mobileVuV78NVnWlTWLhDwwr5v2FMH2wA".meta WITH read_repair = 'NONE';
ALTER TABLE "wikipedia_T_page__summary".data WITH read_repair = 'NONE';
ALTER TABLE "wikipedia_T_page__summary".meta WITH read_repair = 'NONE';
ALTER TABLE "wikipedia_T_parsoidphp".data WITH read_repair = 'NONE';
ALTER TABLE "wikipedia_T_parsoidphp".meta WITH read_repair = 'NONE';
ALTER TABLE "wikipedia_T_parsoidphpAetaDBIXR_iWCeS5MknjkVQgnF".data WITH read_repair = 'NONE';
ALTER TABLE "wikipedia_T_parsoidphpAetaDBIXR_iWCeS5MknjkVQgnF".meta WITH read_repair = 'NONE';
ALTER TABLE "wikipedia_T_title__revisions3WsaB42Wia1E_eq_KmoY".data WITH read_repair = 'NONE';
ALTER TABLE "wikipedia_T_title__revisions3WsaB42Wia1E_eq_KmoY".meta WITH read_repair = 'NONE';
(re)enable read repair
ALTER TABLE "commons_T_mobileoZCBVtILw5eSrwi0VIGaFVSr2jY".data WITH read_repair = 'BLOCKING';
ALTER TABLE "commons_T_mobileoZCBVtILw5eSrwi0VIGaFVSr2jY".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "commons_T_page__summary".data WITH read_repair = 'BLOCKING';
ALTER TABLE "commons_T_page__summary".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "commons_T_parsoidphp".data WITH read_repair = 'BLOCKING';
ALTER TABLE "commons_T_parsoidphp".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "commons_T_parsoidphpAetaDBIXR_iWCeS5MknjkVQgnFw".data WITH read_repair = 'BLOCKING';
ALTER TABLE "commons_T_parsoidphpAetaDBIXR_iWCeS5MknjkVQgnFw".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "commons_T_title__revisions3WsaB42Wia1E_eq_KmoYTH".data WITH read_repair = 'BLOCKING';
ALTER TABLE "commons_T_title__revisions3WsaB42Wia1E_eq_KmoYTH".meta WITH read_repair = 'BLOCKING';
ALTER TABLE echostore.values WITH read_repair = 'BLOCKING';
ALTER TABLE "enwiki_T_mediaEKhjXRLvi0_47tdHyL7nahpjtY4".data WITH read_repair = 'BLOCKING';
ALTER TABLE "enwiki_T_mediaEKhjXRLvi0_47tdHyL7nahpjtY4".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "enwiki_T_mobileoZCBVtILw5eSrwi0VIGaFVSr2jY".data WITH read_repair = 'BLOCKING';
ALTER TABLE "enwiki_T_mobileoZCBVtILw5eSrwi0VIGaFVSr2jY".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "enwiki_T_mobileVuV78NVnWlTWLhDwwr5v2FMH2wA".data WITH read_repair = 'BLOCKING';
ALTER TABLE "enwiki_T_mobileVuV78NVnWlTWLhDwwr5v2FMH2wA".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "enwiki_T_page__summary".data WITH read_repair = 'BLOCKING';
ALTER TABLE "enwiki_T_page__summary".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "enwiki_T_parsoidphp".data WITH read_repair = 'BLOCKING';
ALTER TABLE "enwiki_T_parsoidphp".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "enwiki_T_parsoidphpAetaDBIXR_iWCeS5MknjkVQgnFw".data WITH read_repair = 'BLOCKING';
ALTER TABLE "enwiki_T_parsoidphpAetaDBIXR_iWCeS5MknjkVQgnFw".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "enwiki_T_title__revisions3WsaB42Wia1E_eq_KmoYTHe".data WITH read_repair = 'BLOCKING';
ALTER TABLE "enwiki_T_title__revisions3WsaB42Wia1E_eq_KmoYTHe".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "globaldomain_T_mathoid__ng_check".data WITH read_repair = 'BLOCKING';
ALTER TABLE "globaldomain_T_mathoid__ng_check".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "globaldomain_T_mathoid__ng_hash__table".data WITH read_repair = 'BLOCKING';
ALTER TABLE "globaldomain_T_mathoid__ng_hash__table".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "globaldomain_T_mathoid__ng_input".data WITH read_repair = 'BLOCKING';
ALTER TABLE "globaldomain_T_mathoid__ng_input".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "globaldomain_T_mathoid__ng_mml".data WITH read_repair = 'BLOCKING';
ALTER TABLE "globaldomain_T_mathoid__ng_mml".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "globaldomain_T_mathoid__ng_png".data WITH read_repair = 'BLOCKING';
ALTER TABLE "globaldomain_T_mathoid__ng_png".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "globaldomain_T_mathoid__ng_svg".data WITH read_repair = 'BLOCKING';
ALTER TABLE "globaldomain_T_mathoid__ng_svg".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "others_T_mediaEKhjXRLvi0_47tdHyL7nahpjtY4".data WITH read_repair = 'BLOCKING';
ALTER TABLE "others_T_mediaEKhjXRLvi0_47tdHyL7nahpjtY4".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "others_T_mobileoZCBVtILw5eSrwi0VIGaFVSr2jY".data WITH read_repair = 'BLOCKING';
ALTER TABLE "others_T_mobileoZCBVtILw5eSrwi0VIGaFVSr2jY".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "others_T_mobileVuV78NVnWlTWLhDwwr5v2FMH2wA".data WITH read_repair = 'BLOCKING';
ALTER TABLE "others_T_mobileVuV78NVnWlTWLhDwwr5v2FMH2wA".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "others_T_page__summary".data WITH read_repair = 'BLOCKING';
ALTER TABLE "others_T_page__summary".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "others_T_parsoidphp".data WITH read_repair = 'BLOCKING';
ALTER TABLE "others_T_parsoidphp".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "others_T_parsoidphpAetaDBIXR_iWCeS5MknjkVQgnFw".data WITH read_repair = 'BLOCKING';
ALTER TABLE "others_T_parsoidphpAetaDBIXR_iWCeS5MknjkVQgnFw".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "others_T_title__revisions3WsaB42Wia1E_eq_KmoYTHe".data WITH read_repair = 'BLOCKING';
ALTER TABLE "others_T_title__revisions3WsaB42Wia1E_eq_KmoYTHe".meta WITH read_repair = 'BLOCKING';
ALTER TABLE pregenerated_cache.media_list WITH read_repair = 'BLOCKING';
ALTER TABLE pregenerated_cache.mobile_html WITH read_repair = 'BLOCKING';
ALTER TABLE pregenerated_cache.page_summary WITH read_repair = 'BLOCKING';
ALTER TABLE "wikipedia_T_mediaEKhjXRLvi0_47tdHyL7nahpjtY4".data WITH read_repair = 'BLOCKING';
ALTER TABLE "wikipedia_T_mediaEKhjXRLvi0_47tdHyL7nahpjtY4".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "wikipedia_T_mobileoZCBVtILw5eSrwi0VIGaFVSr2jY".data WITH read_repair = 'BLOCKING';
ALTER TABLE "wikipedia_T_mobileoZCBVtILw5eSrwi0VIGaFVSr2jY".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "wikipedia_T_mobileVuV78NVnWlTWLhDwwr5v2FMH2wA".data WITH read_repair = 'BLOCKING';
ALTER TABLE "wikipedia_T_mobileVuV78NVnWlTWLhDwwr5v2FMH2wA".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "wikipedia_T_page__summary".data WITH read_repair = 'BLOCKING';
ALTER TABLE "wikipedia_T_page__summary".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "wikipedia_T_parsoidphp".data WITH read_repair = 'BLOCKING';
ALTER TABLE "wikipedia_T_parsoidphp".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "wikipedia_T_parsoidphpAetaDBIXR_iWCeS5MknjkVQgnF".data WITH read_repair = 'BLOCKING';
ALTER TABLE "wikipedia_T_parsoidphpAetaDBIXR_iWCeS5MknjkVQgnF".meta WITH read_repair = 'BLOCKING';
ALTER TABLE "wikipedia_T_title__revisions3WsaB42Wia1E_eq_KmoY".data WITH read_repair = 'BLOCKING';
ALTER TABLE "wikipedia_T_title__revisions3WsaB42Wia1E_eq_KmoY".meta WITH read_repair = 'BLOCKING';
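
Statement lists like the two above are tedious to keep in sync by hand. A minimal sketch of how they could be generated, assuming the (keyspace, table) pairs are already known; the helper names and the short TABLES list are hypothetical, and the quoting rule is simplified (it does not check for reserved CQL keywords):

```python
import re


def cql_ident(name: str) -> str:
    """Return a CQL identifier, double-quoted when it is case-sensitive.

    Simplified rule: names made only of lowercase letters, digits and
    underscores (starting with a letter) are left unquoted. Reserved
    keywords are NOT detected here.
    """
    if re.fullmatch(r"[a-z][a-z0-9_]*", name):
        return name
    return '"' + name.replace('"', '""') + '"'


def alter_read_repair(keyspace: str, table: str, mode: str) -> str:
    """Build one ALTER TABLE statement setting read_repair to `mode`."""
    if mode not in ("NONE", "BLOCKING"):
        raise ValueError(f"unsupported read_repair mode: {mode!r}")
    return (
        f"ALTER TABLE {cql_ident(keyspace)}.{cql_ident(table)} "
        f"WITH read_repair = '{mode}';"
    )


# A few (keyspace, table) pairs taken from the lists above.
TABLES = [
    ("commons_T_page__summary", "data"),
    ("commons_T_page__summary", "meta"),
    ("echostore", "values"),
]

for keyspace, table in TABLES:
    print(alter_read_repair(keyspace, table, "NONE"))
```

Swapping "NONE" for "BLOCKING" yields the matching statements to re-enable read repair.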

Mentioned in SAL (#wikimedia-operations) [2024-03-21T16:07:03Z] <urandom> disabling read-repair (Cassandra) for restbase tables — T360548

Mentioned in SAL (#wikimedia-operations) [2024-03-21T17:06:01Z] <urandom> restarting decommissions (restbase1024-{b,c}) — T360548

It would seem this is still happening, even with read-repair disabled.

image.png (753×1 px, 133 KB)

Actually, the output of nodetool describecluster (from various nodes) leads me to think that the schema may not be in complete agreement (read: perhaps some nodes are still doing blocking read-repair). It's worth making sure this isn't the case before going further.
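
A rough sketch of how such a check could be automated, assuming the "Schema versions:" section of the nodetool describecluster output lists one `<uuid>: [hosts]` line per schema version (plus an optional `UNREACHABLE: [hosts]` line). The sample text and function names below are made up for illustration and are not output from this cluster:

```python
import re

# Matches "<schema-version-uuid>: [host, host]" or "UNREACHABLE: [host]".
_VERSION_LINE = re.compile(
    r"^\s*([0-9a-fA-F-]{36}|UNREACHABLE):\s*\[(.*)\]\s*$"
)


def schema_versions(describecluster_output: str) -> dict[str, list[str]]:
    """Map each schema version (or UNREACHABLE) to the hosts reporting it."""
    versions: dict[str, list[str]] = {}
    in_section = False
    for line in describecluster_output.splitlines():
        if line.strip() == "Schema versions:":
            in_section = True
            continue
        if not in_section:
            continue
        match = _VERSION_LINE.match(line)
        if match:
            hosts = [h.strip() for h in match.group(2).split(",") if h.strip()]
            versions[match.group(1)] = hosts
    return versions


def in_agreement(versions: dict[str, list[str]]) -> bool:
    """True when exactly one version is reported and no host is unreachable."""
    return len(versions) == 1 and "UNREACHABLE" not in versions


# Illustrative sample only (hypothetical hosts and versions).
SAMPLE = """Cluster Information:
\tName: example
\tSchema versions:
\t\t2207c2a9-f598-3971-986b-2926e09e239d: [10.0.0.1, 10.0.0.2]
\t\tfe3b4c1a-0d2e-3f47-9a5b-6c7d8e9f0a1b: [10.0.0.3]
"""

found = schema_versions(SAMPLE)
print(found)
print("schema agreement:", in_agreement(found))
```

Running this against the output from each node would show whether any node reports more than one schema version or lists unreachable peers.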

Mentioned in SAL (#wikimedia-operations) [2024-03-21T19:22:35Z] <eevans@cumin1002> START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[28,32,34-36].eqiad.wmnet: Turn it off, and then back on again (schema agreement/reachability)? — T360548 - eevans@cumin1002

Mentioned in SAL (#wikimedia-operations) [2024-03-21T20:11:04Z] <eevans@cumin1002> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase10[28,32,34-36].eqiad.wmnet: Turn it off, and then back on again (schema agreement/reachability)? — T360548 - eevans@cumin1002

Mentioned in SAL (#wikimedia-operations) [2024-03-21T20:50:33Z] <eevans@cumin1002> START - Cookbook sre.cassandra.roll-restart for nodes matching restbase10[29,32,37-39,25-27,30,33,40-42].eqiad.wmnet: Turn it off, and then back on again (schema agreement/reachability)? — T360548 - eevans@cumin1002

Mentioned in SAL (#wikimedia-operations) [2024-03-21T22:56:49Z] <eevans@cumin1002> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase10[29,32,37-39,25-27,30,33,40-42].eqiad.wmnet: Turn it off, and then back on again (schema agreement/reachability)? — T360548 - eevans@cumin1002

Mentioned in SAL (#wikimedia-operations) [2024-03-21T23:17:50Z] <eevans@cumin1002> START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Turn it off, and then back on again (schema agreement/reachability)? — T360548 - eevans@cumin1002

Mentioned in SAL (#wikimedia-operations) [2024-03-22T01:31:38Z] <eevans@cumin1002> END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: Turn it off, and then back on again (schema agreement/reachability)? — T360548 - eevans@cumin1002

Mentioned in SAL (#wikimedia-operations) [2024-03-22T14:20:54Z] <urandom> restarting Cassandra decommission of restbase1024-{b,c} — T360548

The problem persists after a rolling restart, though at a somewhat lower rate(?)

image.png (680×1 px, 201 KB)

As the last of the decommissions are finishing (there is one in progress, and one remaining), the error rate has become very low:

image.png (865×1 px, 272 KB)

I'm not sure what to make of the results of disabling read-repair. It did not stop the errors entirely, but we can't say there is no change either. The decommissions are now complete, which makes further experimentation difficult. I think CASSANDRA-19120 is the most promising thing, so I propose that we upgrade to Cassandra 4.1.5 when it becomes available, and leave this issue open until the next decommission is needed.

Eevans changed the task status from Open to Stalled. Fri, Apr 5, 8:39 PM
Eevans raised the priority of this task from High to Needs Triage.
Eevans triaged this task as Medium priority. Fri, Apr 5, 8:45 PM