After today's data-center switchover (codfw to eqiad), elevated error rates were observed for changeprop. These errors were bubbling up from RESTBase, and ultimately from Cassandra:
{"type":"https://mediawiki.org/wiki/HyperSwitch/errors/update_error","title":"Internal error in Cassandra table storage backend","method":"get","uri":"/de.wikipedia.org/v1/page/html/Benutzer%3AHerzi_Pinki%2FsyncTest/243290995","internalURI":"https://restbase-async.discovery.wmnet:7443/de.wikipedia.org/v1/page/html/Benutzer%3AHerzi_Pinki%2FsyncTest/243290995","internalMethod":"get"}
ResponseError: Server timeout during read query at consistency LOCAL_QUORUM (2 replica(s) responded over 3 required) at FrameReader.readError (/srv/deployment/restbase/deploy-cache/revs/7e5e72087d8331131669babfb8f40b269c024cd7/node_modules/cassandra-driver/lib/readers.js:326:15) at Parser.parseBody (/srv/deployment/restbase/deploy-cache/revs/7e5e72087d8331131669babfb8f40b269c024cd7/node_modules/cassandra-driver/lib/streams.js:194:66) at Parser._transform (/srv/deployment/restbase/deploy-cache/revs/7e5e72087d8331131669babfb8f40b269c024cd7/node_modules/cassandra-driver/lib/streams.js:137:10) at Parser.Transform._read (_stream_transform.js:191:10) at Parser.Transform._write (_stream_transform.js:179:12) at doWrite (_stream_writable.js:403:12) at writeOrBuffer (_stream_writable.js:387:5) at Parser.Writable.write (_stream_writable.js:318:11) at Protocol.ondata (_stream_readable.js:718:22) at Protocol.emit (events.js:314:20) at addChunk (_stream_readable.js:297:12) at readableAddChunk (_stream_readable.js:272:9) at Protocol.Readable.push (_stream_readable.js:213:10) at Protocol.Transform.push (_stream_transform.js:152:32) at Protocol.readItems (/srv/deployment/restbase/deploy-cache/revs/7e5e72087d8331131669babfb8f40b269c024cd7/node_modules/cassandra-driver/lib/streams.js:109:10) at Protocol._transform (/srv/deployment/restbase/deploy-cache/revs/7e5e72087d8331131669babfb8f40b269c024cd7/node_modules/cassandra-driver/lib/streams.js:32:10)
This error indicates a read timeout at LOCAL_QUORUM: only two replicas responded within the timeout period, when at least three were required. What is puzzling here is that each data-center holds only three replicas, so a LOCAL_QUORUM is in fact just two copies!
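For reference, here is the quorum arithmetic as a minimal sketch (illustrative code, not anything from RESTBase or Cassandra):

```python
# Minimal sketch of the per-data-center quorum arithmetic (illustrative, not project code).
def local_quorum(replication_factor: int) -> int:
    """Replicas that must respond for a LOCAL_QUORUM operation within one data-center."""
    return replication_factor // 2 + 1

print(local_quorum(3))  # 2 -- with RF=3 per DC, a LOCAL_QUORUM read should need only two replicas
```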
Working hypothesis
There was a decommission running in eqiad at the time of the switchover (T354561). Decommissions (and bootstraps) temporarily increase the effective replica count by one (to include the node joining/leaving), and writes must satisfy the consistency level against that larger set. This is necessary to maintain consistency guarantees and bootstrap/decommission atomicity. The failed operations were reads, but these tables have blocking read repair enabled, which means that (blocking) writes can be triggered (probabilistically?) by a read. This would help explain the biggest mystery, namely why a read operation would fail with such an unexpected quorum requirement.
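A rough sketch of that arithmetic as we understand it (illustrative only; the pending-replica behaviour is our reading of the coordinator's write path, not code lifted from Cassandra):

```python
# Illustrative sketch of the acknowledgements a coordinator waits for during a
# range movement (our reading of the behaviour, not Cassandra source code).
def local_quorum(replication_factor: int) -> int:
    return replication_factor // 2 + 1

def required_responses(replication_factor: int, pending_replicas: int) -> int:
    # Writes (including those triggered by blocking read repair) must also reach
    # replicas that are gaining or losing ranges, so the requirement grows by the
    # number of pending replicas.
    return local_quorum(replication_factor) + pending_replicas

print(required_responses(3, 0))  # 2 -- steady state
print(required_responses(3, 1))  # 3 -- one node decommissioning; matches the "3 required" above
```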
This still does not explain why this larger-than-expected quorum would fail; even a quorum of 3 should succeed in an otherwise healthy cluster (the odd transient/spurious failure notwithstanding). CASSANDRA-19129 might offer the final piece to this puzzle.
Next steps
Terminate the decommission to see if the errors cease. If they do, disable blocking read-repair and attempt another decommission. If the errors do not return, we can assume CASSANDRA-19129 is the culprit, and either backport the patch or wait for a 4.1.5 release.
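For the read-repair step, something along these lines should work (a sketch using the DataStax Python driver; the contact point and keyspace/table names are placeholders for the affected RESTBase tables, and it assumes the Cassandra 4.x per-table read_repair option):

```python
from cassandra.cluster import Cluster

# Placeholder contact point; not the production configuration.
session = Cluster(['cassandra.example.wmnet']).connect()

# Inspect the current setting: 'BLOCKING' is the 4.x default, 'NONE' disables it.
row = session.execute(
    "SELECT read_repair FROM system_schema.tables "
    "WHERE keyspace_name = %s AND table_name = %s",
    ('some_keyspace', 'some_table')  # placeholders for the affected RESTBase tables
).one()
print(row.read_repair)

# Disable blocking read repair so reads no longer trigger foreground writes.
session.execute("ALTER TABLE some_keyspace.some_table WITH read_repair = 'NONE'")
```

The same ALTER TABLE statement can of course be issued directly from cqlsh instead.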