
session storage: dissonant cluster status after reboot (was: 'cannot achieve consistency level' errors)
Closed, Resolved · Public

Description

An incident occurred on 2023-01-24 when (as part of routine maintenance) one of the Cassandra hosts in eqiad (sessionstore1001) was rebooted. When sessionstore1001 went down, connections failed over to the remaining two nodes as expected. However, as soon as the rebooted node rejoined the cluster, the sessionstore service began emitting 500s and logging errors.

...
{"msg":"Error writing to storage (Cannot achieve consistency level LOCAL_QUORUM)","appname":"sessionstore","time":"2023-01-24T21:10:27Z","level":"ERROR","request_id":"..."}
{"msg":"Error reading from storage (Cannot achieve consistency level LOCAL_QUORUM)","appname":"sessionstore","time":"2023-01-24T21:10:27Z","level":"ERROR","request_id":"..."}
{"msg":"Error deleting in storage (Cannot achieve consistency level EACH_QUORUM in DC eqiad","appname":"sessionstore","time":"2023-01-24T21:10:27Z","level":"ERROR","request_id":"..."}
...

This happened as a result of a kind of split-brain condition. After the reboot, sessionstore1001 started and transitioned to an UP state. The other two nodes (1002 & 1003) also transitioned 1001 from DOWN to UP. However, 1001 did not recognize 1002 & 1003 as being online; it saw their state as DOWN. Meanwhile, 1001 began accepting client connections, and the driver (which also observed 1001 come online) began routing a portion of client requests to it. We replicate sessions 3 ways (per datacenter) and use QUORUM reads & writes, which is to say: we must see at least two of the three replicas to satisfy a read, and synchronously replicate to at least two to successfully write. Clients that attempted to use 1001 as a coordinator node failed with an 'unavailable exception' (the source of the errors observed), because that node was unable to locate another replica in order to satisfy QUORUM.
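For illustration, the quorum arithmetic at play (a minimal sketch; RF=3 per datacenter is taken from the description above, the rest is just shell arithmetic):

# LOCAL_QUORUM requires floor(RF / 2) + 1 replicas
rf=3
echo "replicas required: $(( rf / 2 + 1 ))"   # => 2
# With 1001 believing both of its local peers were DOWN, a request it
# coordinated could only account for 1 of 3 replicas, hence the
# 'unavailable exception' above.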

sessionstore.drawio(1).png (412×541 px, 39 KB)


Summary of findings:
  • This problem only manifests when a host is rebooted
  • When Cassandra starts up immediately after a prolonged outage, that node sees an erroneous cluster state, in which one or more of the other nodes appear DOWN
    • From the perspective of all the other nodes, however, membership is green across the board — including the recently rebooted node
    • Connectivity, in every other respect, seems fine (Cassandra clients can open connections, SSH works, etc.)
  • The problem resolves itself after 15 minutes and some seconds (< 30); no intervention is required, and nothing would seem to extend or shorten this period
    • Nothing done to the rebootee would seem to shorten the period; a restart of the other nodes will correct their state on the problematic one
  • It can be reproduced on any of the 6 sessionstore hosts, in either of the two datacenters (quite reliably)
  • It only happens in the sessionstore cluster; It cannot be reproduced on any of restbase, aqs, or the cassandra-dev clusters
    • Cassandra, JVM, kernels, etc match across all clusters
  • It is dependent upon the timing between a node shutdown and startup. A simple restart (where the stop is immediately followed by a start) does not provoke the issue, but with sufficient time between the stop and start, it does. This is why it appeared to manifest only on reboots (that was the only time a stop & start were separated by minutes).
  • It seems to correspond to what was reported in CASSANDRA-13984. The problem lies with peer nodes (as opposed to the node that was restarted); peers shut down their outbound messaging when the restarting node announces shutdown, but then (inexplicably) attempt to reconnect. If the affected node immediately starts back up this reconnect succeeds, and all is fine. However, if the node is down for any significant period, the reconnected connection remains in a defunct state for (apparently) 20 minutes¹.
  • Issuing a nodetool disablegossip and nodetool drain prior to the restart seems to work around this issue (both result in an announced shutdown of Gossip, and apparently the second one succeeds); see the sketch after this list
  • Using kill -9 to simulate an ungraceful shutdown also results in correct state after startup (without the Gossip shutdown message, the racing reconnect never happens)

[¹] We observed 15 minutes, but that was 15 minutes after Cassandra startup. The additional 5 minutes is likely the period between shutdown and startup when the host is being rebooted
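As a concrete illustration of the workaround noted above (disablegossip + drain before the restart), a manual reboot might look roughly like this. This is a sketch only, mirroring the sequence used in the restart tests further down this task; the instance/service name and the sleep are assumptions for a single-instance host:

c-foreach-nt disablegossip        # announce Gossip shutdown to peers
c-foreach-nt drain                # flush memtables; also announces shutdown
sleep 10                          # give peers a moment to tear down connections
sudo systemctl stop cassandra-a.service
sudo reboot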


Next Steps
  • Follow-up with upstream
  • Remove de-initialization from c-foreach-restart
  • Add de-initialization to Cassandra systemd unit

See also:
REMINDER: UNINSTALL ADDITIONAL cassandra-dev PACKAGES (see: operations/puppet/modules/profile/manifests/cassandra_dev.pp) BEFORE CLOSING

Event Timeline


Mentioned in SAL (#wikimedia-operations) [2023-02-15T19:36:57Z] <eevans@cumin1001> START - Cookbook sre.discovery.service-route depool sessionstore in codfw: Depooling while we attempt to reproduce errors — T327954

Mentioned in SAL (#wikimedia-operations) [2023-02-15T19:42:00Z] <eevans@cumin1001> END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool sessionstore in codfw: Depooling while we attempt to reproduce errors — T327954

Mentioned in SAL (#wikimedia-operations) [2023-02-15T19:48:34Z] <urandom> setting Cassandra query trace probability to 0.25 on sessionstore cluster, codfw datacenter — T327954

Mentioned in SAL (#wikimedia-operations) [2023-02-15T20:04:46Z] <urandom> setting Cassandra query trace probability to 0 (disabled) on sessionstore cluster, codfw datacenter — T327954

Mentioned in SAL (#wikimedia-operations) [2023-02-15T20:39:15Z] <urandom> rebooting sessionstore2001 w/o cookbook — T327954

Mentioned in SAL (#wikimedia-operations) [2023-02-15T21:15:29Z] <urandom> rebooting sessionstore2001 w/o cookbook — T327954

It does indeed seem to be limited to reboots of the host, and not restarts of Cassandra(!). I performed 3 separate reboots, and at least as many restarts. Restarting Cassandra from a Good™ state is successful; restarting after a reboot while in a Bad™ state does nothing to improve/restore.

When the rebootee comes back online and Cassandra is started, some subset of the other nodes have a status of DOWN. I was able to reproduce this several times on sessionstore2001, each time that subset included the other two codfw nodes (2002 & 2003), but on at least one occasion it included an eqiad node as well:

eevans@sessionstore2001:~$ c-any-nt status -r                                                                                                                                                                      
Datacenter: codfw                                                                                                                                                                                                  
=================                                                                                                                                                                                                  
Status=Up/Down                                                                                                                                                                                                     
|/ State=Normal/Leaving/Joining/Moving                                                                                                                                                                             
--  Address                         Load       Tokens       Owns (effective)  Host ID                               Rack                                                                                           
DN  sessionstore2003-a.codfw.wmnet  23.93 GiB  256          100.0%            8d062d5f-b31f-4f58-bde2-ec0e33a489fd  d                                                                                              
DN  sessionstore2002-a.codfw.wmnet  13.54 GiB  256          100.0%            31a83e9e-63a6-4f51-99fb-0040b74cb51d  c                                                                                              
UN  sessionstore2001-a.codfw.wmnet  14.74 GiB  256          100.0%            b461130b-258f-4a93-81e5-63bd31d12406  b                                                                                              
Datacenter: eqiad                                                                                                                                                                                                  
=================                                                                                                                                                                                                  
Status=Up/Down                                                                                                                                                                                                     
|/ State=Normal/Leaving/Joining/Moving                                                                                                                                                                             
--  Address                         Load       Tokens       Owns (effective)  Host ID                               Rack                                                                                           
UN  sessionstore1001-a.eqiad.wmnet  27.42 GiB  256          100.0%            59918268-6b49-45bb-a29d-9ede6385bdce  a                                                                                              
DN  sessionstore1003-a.eqiad.wmnet  30.77 GiB  256          100.0%            bcdedfef-87e6-4f54-bc33-6e149da90115  d                                                                                              
UN  sessionstore1002-a.eqiad.wmnet  30.01 GiB  256          100.0%            5af82e95-c935-4492-9d8c-04542c0c6ede  c

From the perspective of the logs, when things go well, a line like:

...
INFO  [HANDSHAKE-sessionstore1002-a.eqiad.wmnet/sessionstore1002-a] 2023-02-15 21:18:51,083 OutboundTcpConnection.java:561 - Handshaking version with sessionstore1002-a.eqiad.wmnet/10.64.32.85
...

...is almost immediately followed by one like:

...
INFO  [GossipStage:1] 2023-02-15 21:18:51,684 Gossiper.java:1161 - InetAddress /10.64.32.85 is now UP
...

By contrast, here is what was logged when 2001 was the rebootee and 2002-a & 2003-a remained in a DOWN state afterward:

eevans@sessionstore2001:~$ grep -E "(10.192.32.101|10.192.48.132)" system-a.log
...
INFO  [HANDSHAKE-sessionstore2002-a.codfw.wmnet/10.192.32.101] 2023-02-15 21:18:50,764 OutboundTcpConnection.java:561 - Handshaking version with sessionstore2002-a.codfw.wmnet/10.192.32.101
INFO  [HANDSHAKE-sessionstore2003-a.codfw.wmnet/10.192.48.132] 2023-02-15 21:18:51,613 OutboundTcpConnection.java:561 - Handshaking version with sessionstore2003-a.codfw.wmnet/10.192.48.132
INFO  [HANDSHAKE-sessionstore2002-a.codfw.wmnet/10.192.32.101] 2023-02-15 21:19:29,281 OutboundTcpConnection.java:561 - Handshaking version with sessionstore2002-a.codfw.wmnet/10.192.32.101
INFO  [HANDSHAKE-sessionstore2003-a.codfw.wmnet/10.192.48.132] 2023-02-15 21:19:29,281 OutboundTcpConnection.java:561 - Handshaking version with sessionstore2003-a.codfw.wmnet/10.192.48.132
INFO  [GossipStage:1] 2023-02-15 21:34:28,333 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:28,336 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:28,337 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:28,338 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:28,339 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:28,340 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:28,341 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:28,341 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:28,342 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:28,343 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:28,343 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:28,344 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:28,344 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:28,345 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:33,961 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:33,962 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:33,963 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:33,964 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:33,965 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:33,966 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:33,967 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:33,967 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:33,968 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:33,969 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:33,970 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:33,971 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:33,972 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
INFO  [GossipStage:1] 2023-02-15 21:34:33,972 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP

In this example, 15 minutes (+/- 4 secs) elapsed between the handshakes and the corresponding UP transitions (and 14 "is now UP" messages were emitted for each node).

The debug log, though, seems to provide some indication of what was happening in the interim:

DEBUG [prometheus-http-1-2] 2023-02-15 21:19:20,643 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [MessagingService-Outgoing-sessionstore2002-a.codfw.wmnet/10.192.32.101-Large] 2023-02-15 21:19:29,279 OutboundTcpConnection.java:425 - Attempting to connect to sessionstore2002-a.codfw.wmnet/10.192.32.101
DEBUG [MessagingService-Outgoing-sessionstore2003-a.codfw.wmnet/10.192.48.132-Large] 2023-02-15 21:19:29,279 OutboundTcpConnection.java:425 - Attempting to connect to sessionstore2003-a.codfw.wmnet/10.192.48.132
INFO  [HANDSHAKE-sessionstore2002-a.codfw.wmnet/10.192.32.101] 2023-02-15 21:19:29,281 OutboundTcpConnection.java:561 - Handshaking version with sessionstore2002-a.codfw.wmnet/10.192.32.101
INFO  [HANDSHAKE-sessionstore2003-a.codfw.wmnet/10.192.48.132] 2023-02-15 21:19:29,281 OutboundTcpConnection.java:561 - Handshaking version with sessionstore2003-a.codfw.wmnet/10.192.48.132
DEBUG [MessagingService-Outgoing-sessionstore2002-a.codfw.wmnet/10.192.32.101-Large] 2023-02-15 21:19:29,282 OutboundTcpConnection.java:533 - Done connecting to sessionstore2002-a.codfw.wmnet/10.192.32.101
DEBUG [MessagingService-Outgoing-sessionstore2003-a.codfw.wmnet/10.192.48.132-Large] 2023-02-15 21:19:29,282 OutboundTcpConnection.java:533 - Done connecting to sessionstore2003-a.codfw.wmnet/10.192.48.132
DEBUG [prometheus-http-1-3] 2023-02-15 21:19:51,159 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-4] 2023-02-15 21:20:20,634 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-5] 2023-02-15 21:20:51,100 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-1] 2023-02-15 21:21:20,544 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-2] 2023-02-15 21:21:51,185 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-3] 2023-02-15 21:22:20,550 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-4] 2023-02-15 21:22:51,055 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-5] 2023-02-15 21:23:20,634 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-1] 2023-02-15 21:23:51,107 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-2] 2023-02-15 21:24:20,541 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-3] 2023-02-15 21:24:51,080 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-4] 2023-02-15 21:25:20,554 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-5] 2023-02-15 21:25:51,081 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-1] 2023-02-15 21:26:20,568 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-2] 2023-02-15 21:26:51,061 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-3] 2023-02-15 21:27:20,545 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-4] 2023-02-15 21:27:51,084 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-5] 2023-02-15 21:28:20,599 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-1] 2023-02-15 21:28:51,058 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-2] 2023-02-15 21:29:20,563 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-3] 2023-02-15 21:29:51,071 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-4] 2023-02-15 21:30:20,547 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-5] 2023-02-15 21:30:51,066 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-1] 2023-02-15 21:31:20,573 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-2] 2023-02-15 21:31:51,070 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-3] 2023-02-15 21:32:20,544 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-4] 2023-02-15 21:32:51,104 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-5] 2023-02-15 21:33:20,564 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-1] 2023-02-15 21:33:51,067 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [prometheus-http-1-2] 2023-02-15 21:34:20,563 StorageProxy.java:2537 - Hosts not in agreement. Didn't get a response from everybody: 10.192.48.132,10.192.32.101
DEBUG [GossipStage:1] 2023-02-15 21:34:28,333 Gossiper.java:1160 - removing expire time for endpoint : /10.192.32.101
INFO  [GossipStage:1] 2023-02-15 21:34:28,333 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:28,336 Gossiper.java:1160 - removing expire time for endpoint : /10.192.32.101
INFO  [GossipStage:1] 2023-02-15 21:34:28,336 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:28,337 Gossiper.java:1160 - removing expire time for endpoint : /10.192.32.101
INFO  [GossipStage:1] 2023-02-15 21:34:28,337 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:28,338 Gossiper.java:1160 - removing expire time for endpoint : /10.192.32.101
INFO  [GossipStage:1] 2023-02-15 21:34:28,338 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:28,339 Gossiper.java:1160 - removing expire time for endpoint : /10.192.32.101
INFO  [GossipStage:1] 2023-02-15 21:34:28,339 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:28,340 Gossiper.java:1160 - removing expire time for endpoint : /10.192.32.101
INFO  [GossipStage:1] 2023-02-15 21:34:28,340 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:28,340 Gossiper.java:1160 - removing expire time for endpoint : /10.192.32.101
INFO  [GossipStage:1] 2023-02-15 21:34:28,341 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:28,341 Gossiper.java:1160 - removing expire time for endpoint : /10.192.32.101
INFO  [GossipStage:1] 2023-02-15 21:34:28,341 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:28,342 Gossiper.java:1160 - removing expire time for endpoint : /10.192.32.101
INFO  [GossipStage:1] 2023-02-15 21:34:28,342 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:28,343 Gossiper.java:1160 - removing expire time for endpoint : /10.192.32.101
INFO  [GossipStage:1] 2023-02-15 21:34:28,343 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:28,343 Gossiper.java:1160 - removing expire time for endpoint : /10.192.32.101
INFO  [GossipStage:1] 2023-02-15 21:34:28,343 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:28,344 Gossiper.java:1160 - removing expire time for endpoint : /10.192.32.101
INFO  [GossipStage:1] 2023-02-15 21:34:28,344 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:28,344 Gossiper.java:1160 - removing expire time for endpoint : /10.192.32.101
INFO  [GossipStage:1] 2023-02-15 21:34:28,344 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:28,345 Gossiper.java:1160 - removing expire time for endpoint : /10.192.32.101
INFO  [GossipStage:1] 2023-02-15 21:34:28,345 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:33,960 Gossiper.java:1160 - removing expire time for endpoint : /10.192.48.132
INFO  [GossipStage:1] 2023-02-15 21:34:33,961 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:33,962 Gossiper.java:1160 - removing expire time for endpoint : /10.192.48.132
INFO  [GossipStage:1] 2023-02-15 21:34:33,962 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:33,963 Gossiper.java:1160 - removing expire time for endpoint : /10.192.48.132
INFO  [GossipStage:1] 2023-02-15 21:34:33,963 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:33,964 Gossiper.java:1160 - removing expire time for endpoint : /10.192.48.132
INFO  [GossipStage:1] 2023-02-15 21:34:33,964 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:33,965 Gossiper.java:1160 - removing expire time for endpoint : /10.192.48.132
INFO  [GossipStage:1] 2023-02-15 21:34:33,965 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:33,965 Gossiper.java:1160 - removing expire time for endpoint : /10.192.48.132
INFO  [GossipStage:1] 2023-02-15 21:34:33,966 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:33,967 Gossiper.java:1160 - removing expire time for endpoint : /10.192.48.132
INFO  [GossipStage:1] 2023-02-15 21:34:33,967 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:33,967 Gossiper.java:1160 - removing expire time for endpoint : /10.192.48.132
INFO  [GossipStage:1] 2023-02-15 21:34:33,967 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:33,968 Gossiper.java:1160 - removing expire time for endpoint : /10.192.48.132
INFO  [GossipStage:1] 2023-02-15 21:34:33,968 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:33,969 Gossiper.java:1160 - removing expire time for endpoint : /10.192.48.132
INFO  [GossipStage:1] 2023-02-15 21:34:33,969 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:33,970 Gossiper.java:1160 - removing expire time for endpoint : /10.192.48.132
INFO  [GossipStage:1] 2023-02-15 21:34:33,970 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:33,971 Gossiper.java:1160 - removing expire time for endpoint : /10.192.48.132
INFO  [GossipStage:1] 2023-02-15 21:34:33,971 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:33,972 Gossiper.java:1160 - removing expire time for endpoint : /10.192.48.132
INFO  [GossipStage:1] 2023-02-15 21:34:33,972 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
DEBUG [GossipStage:1] 2023-02-15 21:34:33,972 Gossiper.java:1160 - removing expire time for endpoint : /10.192.48.132
INFO  [GossipStage:1] 2023-02-15 21:34:33,972 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP

More investigation is needed...

Here are logs from the previous iteration, this one too eerily close to the 15 minute mark:

INFO  [main] 2023-02-15 20:45:43,346 Server.java:159 - Starting listening for CQL clients on /10.192.16.95:9042 (encrypted)...
INFO  [main] 2023-02-15 20:45:43,395 CassandraDaemon.java:564 - Not starting RPC server as requested. Use JMX (StorageService->startRPCServer()) or nodetool (enablethrift) to start it
INFO  [main] 2023-02-15 20:45:43,395 CassandraDaemon.java:650 - Startup complete
INFO  [Native-Transport-Requests-5] 2023-02-15 20:45:45,151 AuthCache.java:177 - (Re)initializing PermissionsCache (validity period/update interval/max entries) (600000/600000/1000)
INFO  [HANDSHAKE-/10.192.16.95] 2023-02-15 20:45:51,255 OutboundTcpConnection.java:561 - Handshaking version with /10.192.16.95
INFO  [Native-Transport-Requests-3] 2023-02-15 20:46:03,341 AuthCache.java:177 - (Re)initializing RolesCache (validity period/update interval/max entries) (2000/2000/1000)
WARN  [Native-Transport-Requests-3] 2023-02-15 20:46:03,364 FBUtilities.java:333 - Trigger directory doesn't exist, please create it and try again.
INFO  [HANDSHAKE-sessionstore2002-a.codfw.wmnet/10.192.32.101] 2023-02-15 20:46:10,377 OutboundTcpConnection.java:561 - Handshaking version with sessionstore2002-a.codfw.wmnet/10.192.32.101
INFO  [HANDSHAKE-sessionstore2003-a.codfw.wmnet/10.192.48.132] 2023-02-15 20:46:10,377 OutboundTcpConnection.java:561 - Handshaking version with sessionstore2003-a.codfw.wmnet/10.192.48.132
INFO  [GossipStage:1] 2023-02-15 21:00:08,046 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:08,048 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:08,049 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:08,050 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:08,050 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:08,051 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:08,052 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:08,053 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:08,054 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:08,054 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:08,055 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:08,056 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:08,057 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:08,057 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:08,058 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:08,059 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:08,059 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:08,060 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:08,061 Gossiper.java:1161 - InetAddress /10.192.32.101 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:13,673 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:13,675 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:13,676 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:13,677 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:13,678 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:13,679 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:13,680 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:13,680 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:13,681 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:13,682 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:13,683 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP
INFO  [GossipStage:1] 2023-02-15 21:00:13,683 Gossiper.java:1161 - InetAddress /10.192.48.132 is now UP

And to add further intrigue, 15 minutes was the duration of last month's incident.

It was suggested that perhaps the second Puppet run (the first to occur after the on-boot invocation) might mark the point where things begin working again (thanks @Volans!). Those invocations are quite close in some instances, but miss by minutes in others, and in all cases contain no changes.

[ ... ]
Feb 15 21:03:36 sessionstore2001 puppet-agent[10033]: Caching catalog for sessionstore2001.codfw.wmnet
Feb 15 21:03:36 sessionstore2001 puppet-agent[10033]: Unable to serialize catalog to json, retrying with pson
Feb 15 21:03:37 sessionstore2001 puppet-agent[10033]: Applying configuration version '(f9bcde2699) Sukhbir Singh - P:dns::auth::update: remove trailing slash for git safe.directory'
Feb 15 21:04:00 sessionstore2001 systemd[1]: Started Regular job to collect puppet agent stats.
Feb 15 21:04:00 sessionstore2001 systemd[1]: Started Export confd Prometheus metrics.
Feb 15 21:04:00 sessionstore2001 systemd[1]: confd_prometheus_metrics.service: Succeeded.
Feb 15 21:04:01 sessionstore2001 systemd[1]: prometheus_puppet_agent_stats.service: Succeeded.
Feb 15 21:04:01 sessionstore2001 puppet-agent[10033]: Applied catalog in 24.77 second
[ ... ]
[ ... ]
Feb 15 21:33:50 sessionstore2001 puppet-agent[6836]: Caching catalog for sessionstore2001.codfw.wmnet
Feb 15 21:33:51 sessionstore2001 puppet-agent[6836]: Unable to serialize catalog to json, retrying with pson
Feb 15 21:33:51 sessionstore2001 puppet-agent[6836]: Applying configuration version '(f9bcde2699) Sukhbir Singh - P:dns::auth::update: remove trailing slash for git safe.directory'
Feb 15 21:34:00 sessionstore2001 systemd[1]: Started Regular job to collect puppet agent stats.
Feb 15 21:34:00 sessionstore2001 systemd[1]: Started Export confd Prometheus metrics.
Feb 15 21:34:00 sessionstore2001 systemd[1]: confd_prometheus_metrics.service: Succeeded.
Feb 15 21:34:01 sessionstore2001 systemd[1]: prometheus_puppet_agent_stats.service: Succeeded.
Feb 15 21:34:16 sessionstore2001 puppet-agent[6836]: Applied catalog in 25.04 seconds
Feb 15 21:34:17 sessionstore2001 systemd[1]: puppet-agent-timer.service: Succeeded.
[ ... ]

image.png (332×1 px, 79 KB)

Some additional data points:

The 15 minute period seems to be absolute. Restarting Cassandra during this window —for example— does not reset the clock. It takes 15 minutes (plus some seconds; < 30s) from the time the host comes back up.

Looking at Cassandra TRACE logs, it appears as if the EchoMessages sent from the rebootee are not being answered during this blackout window. When the 15 minutes are up, a flood of replies to previously sent messages is logged.

Upgrading the BIOS to 2.16.1 (sessionstore100[1-3]), and the NIC firmware to 22.31.6 (sessionstore1001), does not seem to have had any effect. Interestingly though, the boot-up immediately following each BIOS or firmware upgrade works fine.

Comparing sysctl values, it seems that the restbase & aqs clusters —by virtue of using LVS— have net.ipv4.conf.all.arp_ignore=1 and net.ipv4.conf.all.arp_announce=2 (versus 0 & 0 on the sessionstore cluster). All of these clusters bind each instance of Cassandra to a secondary IPv4 address, and we've noted weirdness WRT the source IP of outgoing packets (see: T128590: Cassandra uses default ip address for outbound packets while bootstrapping).
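A quick way to make the comparison (a hedged sketch; the restbase/aqs hostnames here are purely illustrative):

for h in sessionstore1001.eqiad.wmnet restbase1016.eqiad.wmnet aqs1010.eqiad.wmnet; do
  echo "== $h =="
  ssh "$h" sysctl net.ipv4.conf.all.arp_ignore net.ipv4.conf.all.arp_announce
done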

Eevans added a subscriber: ayounsi.

@ayounsi would you happen to have any ideas here?

Question: is the list of all installed packages exactly the same (and exactly the same versions too)?

@ayounsi would you happen to have any ideas here?

To figure out if it's network related you can:

  • Compare the output of arp -n before/after the 15min (assuming it's IPv4 only as the VIPs are v4)
  • Run a packet capture on the host during the 15min window, to see if it's actually trying to send packets: sudo tcpdump -i eno1 -s0 -w output.pcap 'host XXX' (not sure what the filters should be but ideally filter for a specific remote host or specific port)

@ayounsi would you happen to have any ideas here?

To figure out if it's network related you can:

  • Compare the output of arp -n before/after the 15min (assuming it's IPv4 only as the VIPs are v4)

This I have done; the list before/after is the same. :/

@Eevans maybe also run a loop during those 15 minutes and capture ps aux and lsof every minute and see if there's something unexpected there too.
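Something like the following would do (a rough sketch of the suggested loop; the output paths, the lsof -i filter, and the 16-iteration window are arbitrary choices):

outdir=/tmp/T327954; mkdir -p "$outdir"
for i in $(seq 1 16); do                      # ~16 minutes, to cover the window
  ts=$(date +%Y%m%dT%H%M%S)
  ps aux               > "$outdir/ps.$ts"
  sudo lsof -n -P -i   > "$outdir/lsof.$ts"   # network sockets, no name/port resolution
  sleep 60
done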

Change 905746 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] sessionstore: assign values to net.ipv4.conf.all.arp_{ignore,announce}

https://gerrit.wikimedia.org/r/905746

Change 906131 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] sessionstore: make native transport (intentionally) unreachable

https://gerrit.wikimedia.org/r/906131

Change 906131 merged by Eevans:

[operations/puppet@production] sessionstore: make native transport (intentionally) unreachable

https://gerrit.wikimedia.org/r/906131

Mentioned in SAL (#wikimedia-operations) [2023-04-05T22:20:59Z] <urandom> restarting Cassandra on sessionstore1001 to apply (intentionally) unreachable native transport — T327954

Mentioned in SAL (#wikimedia-operations) [2023-04-05T22:33:30Z] <urandom> rebooting Cassandra on sessionstore1001 — T327954

Mentioned in SAL (#wikimedia-operations) [2023-04-05T23:55:50Z] <urandom> rebooting Cassandra on sessionstore1001 — T327954

Mentioned in SAL (#wikimedia-operations) [2023-04-06T00:18:59Z] <urandom> rebooting Cassandra on sessionstore1001 — T327954

Mentioned in SAL (#wikimedia-operations) [2023-04-06T00:50:07Z] <urandom> rebooting sessionstore1001 — T327954

Ok, some additional information:

@BBlack discovered this old upstream issue that bears some resemblance to what we are seeing. Based on information reported in that issue, I performed 4 new reboot tests:

1  drain && disablegossip && sleep 10 && reboot   success
2  reboot                                         fail
3  disablegossip && sleep 10 && reboot            fail
4  drain && sleep 10 && reboot                    fail

Some notes on this: A drain causes the Gossiper to announce a shutdown to the other nodes (i.e. a disablegossip is implicit); as a result, the disablegossip command I issued in test no. 1 failed (errored, and exited with non-zero status). Test no. 3 was an attempt to establish whether the success seen in 1 was actually the result of disabling gossip and announcing shutdown, and not the drain (which does not seem to be the case). Test no. 4 was an attempt to reproduce the results of test 1, but without the failed disablegossip invocation. This led me to believe that the success seen in 1 might have been random chance (not reproducible), but on a re-read of CASSANDRA-13984, the following caught my eye:

Based on this, it would appear that nodetool drain is correctly gossiping to its peers that it is shutting down, and the peers teardown their own gossip and small connections. The bug appears to be the fact that peers nearly instantly reconnect the small connection which then lingers around in a dead state until a 20 minute timeout kills it. The OutboundTcpConnection class does not seem to detect this condition, and OutboundTcpConnectionPool reuses this dead connection for hint playback which is causing the partially log lines and the JMX timeout metrics for the hints service.

We have worked around this issue by issuing a nodetool disablegossip after nodetool drain to further kill any of these connections and have not seen any issues on the 8 node restarts we've done since that process change. This has then caused the small connection to get killed correctly, which allows it to reconnect correctly once the node restarts and have hints playback correctly.

So it may in fact be worth re-running test no. 1 just as I originally did, failed invocation of disablegossip and all, just to be sure that it doesn't ... I don't know, extra-special double-disable gossip? The hour is late here though, so I will pick this up tomorrow.

Mentioned in SAL (#wikimedia-operations) [2023-04-06T13:56:48Z] <urandom> rebooting sessionstore1001 — T327954

[ ... ]
So it may in fact be worth re-running test no. 1 just as I originally did, failed invocation of disablegossip and all, just to be sure that it doesn't ... I don't know, extra-special double-disable gossip? The hour is late here though, so I will pick this up tomorrow.

5  drain && disablegossip; sleep 10 && reboot     success

🤯

Mentioned in SAL (#wikimedia-operations) [2023-04-07T00:39:26Z] <urandom> rebooting sessionstore1001 — T327954

Mentioned in SAL (#wikimedia-operations) [2023-04-07T01:02:39Z] <urandom> rebooting sessionstore1001 — T327954

Mentioned in SAL (#wikimedia-operations) [2023-04-07T01:10:44Z] <urandom> rebooting sessionstore1001 — T327954

Mentioned in SAL (#wikimedia-operations) [2023-04-07T01:17:35Z] <urandom> rebooting sessionstore1001 — T327954

6  drain && disablegossip && reboot               fail
7  drain && disablegossip; sleep 10 && reboot     success
8  drain && disablegossip; sleep 10 && reboot     success
9  drain && disablegossip; sleep 10 && reboot     success

Mentioned in SAL (#wikimedia-operations) [2023-04-07T17:02:06Z] <urandom> restart Cassandra, sessionstore1001-a (re-enabling CQL) — T327954

Eevans renamed this task from session storage: 'cannot achieve consistency level' errors to session storage: dissonant cluster status after reboot (was: 'cannot achieve consistency level' errors). Apr 7 2023, 6:32 PM
Eevans updated the task description. (Show Details)

Change 905746 merged by Eevans:

[operations/puppet@production] sessionstore: assign values to net.ipv4.conf.all.arp_{ignore,announce}

https://gerrit.wikimedia.org/r/905746

Change 907527 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] sessionstore: (intentionally) make native transport unreachable

https://gerrit.wikimedia.org/r/907527

Change 907527 merged by Eevans:

[operations/puppet@production] sessionstore: (intentionally) make native transport unreachable

https://gerrit.wikimedia.org/r/907527

Mentioned in SAL (#wikimedia-operations) [2023-04-10T21:04:55Z] <urandom> restarting Cassandra, sessionstore1002-a — T327954

Mentioned in SAL (#wikimedia-operations) [2023-04-10T21:06:26Z] <urandom> restarting Cassandra, sessionstore1003-a — T327954

Mentioned in SAL (#wikimedia-operations) [2023-04-10T21:31:08Z] <urandom> restarting Cassandra, sessionstore1002-a — T327954

As an experiment, net.ipv4.conf.all.arp_ignore and net.ipv4.conf.all.arp_announce were set to 1 & 2 respectively. This matches what is found on our other clusters, where it is configured as part of the wikimedia-lvs-realserver package (sessionstore is the only cluster where it is not installed). As in earlier tests, sessionstore1001 was chosen as the guinea pig (after fencing off client connections to prevent session loss), and rebooted 3 times.
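For reference, the runtime change amounts to something like this (a sketch; the values come from the comparison above, and a persistent change would of course go through puppet instead):

sudo sysctl -w net.ipv4.conf.all.arp_ignore=1     # reply only for addresses on the incoming interface
sudo sysctl -w net.ipv4.conf.all.arp_announce=2   # prefer the best local source address for ARP

The outcomes of the three reboots are tabulated below.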

#  outcome
1  1002-a marked DOWN, 1003-a UP
2  success (all)
3  fail (all)

Mentioned in SAL (#wikimedia-operations) [2023-04-10T21:46:04Z] <urandom> restarting Cassandra, sessionstore1001-a, to restore native transport settings — T327954

Eevans updated the task description. (Show Details)

Icinga downtime and Alertmanager silence (ID=03c64891-2856-4276-b526-34b6b048e307) set by eevans@cumin1001 for 6:00:00 on 1 host(s) and their services with reason: Reproducing dissonant cluster state

sessionstore1001.eqiad.wmnet

Change 908330 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] sessionstore: disable sessionstore1001 native transport

https://gerrit.wikimedia.org/r/908330

Change 908330 merged by Eevans:

[operations/puppet@production] sessionstore: disable sessionstore1001 native transport

https://gerrit.wikimedia.org/r/908330

Mentioned in SAL (#wikimedia-operations) [2023-04-12T19:28:24Z] <urandom> restart Cassandra —sessionstore1001— to disable native transport for testing — T327954

Mentioned in SAL (#wikimedia-operations) [2023-04-12T19:37:02Z] <urandom> sessionstore1001: systemctl stop cassandra-a.service && systemctl start cassandra-a.service — T327954

In service of the question: Is it in fact a reboot, or does it amount to a restart occurring under different conditions?

A reboot is effectively a systemctl stop cassandra & systemctl start cassandra, with a host shutdown/startup happening in between. When we restart Cassandra, we do so using c-foreach-restart, which does a stop & start using systemctl, preceded by a disablenative, a disablethrift, and a drain. So is the dissonant state caused by something related to a reboot, or is that in fact a red herring?

Let's see...

#  restart                                                                                              cluster state
1  sudo systemctl stop cassandra-a.service && sudo systemctl start cassandra-a.service                  correct
2  sudo systemctl stop cassandra-a.service && sudo systemctl start cassandra-a.service                  correct
3  sudo systemctl stop cassandra-a.service && sudo systemctl start cassandra-a.service                  correct
4  sudo systemctl stop cassandra-a.service && sleep 180 && sudo systemctl start cassandra-a.service     incorrect
5  sudo systemctl stop cassandra-a.service && sleep 180 && sudo systemctl start cassandra-a.service     incorrect
6  sudo systemctl stop cassandra-a.service && sleep 180 && sudo systemctl start cassandra-a.service     incorrect

So it would seem that use of c-foreach-restart (or the equivalent of a systemctl restart ...) succeeds because the period between the stop & start is brief (when compared to a reboot)?

Mentioned in SAL (#wikimedia-operations) [2023-04-12T21:35:50Z] <urandom> restarting Cassandra —sessionstore1001— to reenable native transport — T327954

I am convinced now that —distracting bits about hint delivery notwithstanding— we are essentially seeing the same thing described in CASSANDRA-13984. Peers are shutting down their TCP connections when a node announces shutdown (as expected), and then immediately initiating a new connection. This new, defunct connection hangs around well after the restarting node has come back up, preventing outbound messages from being delivered until it finally times out (the 15 minutes we've observed after startup is likely the same 20-minute timeout reported in CASSANDRA-13984, less the ~5 minutes the host spends rebooting).

CASSANDRA-13984 reports successful restarts using a combination of disablegossip and drain. Both commands gossip a shutdown; presumably the second succeeds where the first did not. Repeating the previous tests, but with a preceding disablegossip and drain, confirms that this works for us as well.

#  restart sequence                                                                                                                                    cluster state
1  c-foreach-nt disablegossip; c-foreach-nt drain; sudo systemctl stop cassandra-a.service && sleep 180 && sudo systemctl start cassandra-a.service    correct
2  c-foreach-nt disablegossip; c-foreach-nt drain; sudo systemctl stop cassandra-a.service && sleep 180 && sudo systemctl start cassandra-a.service    correct
3  c-foreach-nt disablegossip; c-foreach-nt drain; sudo systemctl stop cassandra-a.service && sleep 180 && sudo systemctl start cassandra-a.service    correct

Since (ironically) it would seem that a graceful shutdown is what triggers this issue, I had a hunch that a less-than-graceful termination might result in a clean startup:

#  restart sequence                                                                                                  cluster state
1  sudo kill -9 $(cat /var/run/cassandra/cassandra-a.pid) && sleep 180 && sudo systemctl start cassandra-a.service   correct
2  sudo kill -9 $(cat /var/run/cassandra/cassandra-a.pid) && sleep 180 && sudo systemctl start cassandra-a.service   correct
3  sudo kill -9 $(cat /var/run/cassandra/cassandra-a.pid) && sleep 180 && sudo systemctl start cassandra-a.service   correct

This is good news; it means we can work around this issue by including a disablegossip and drain in normal shutdowns (i.e. via the systemd unit), and shouldn't need to worry about an ungraceful shutdown (aka a crash) creating an outage when the node is brought back up.
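By way of illustration, the systemd half of that might look roughly like the following drop-in. This is purely a sketch: the drop-in path, unit name, nodetool path, and the lack of per-instance JMX addressing are all assumptions, and the eventual puppet change may well look different:

sudo mkdir -p /etc/systemd/system/cassandra-a.service.d
sudo tee /etc/systemd/system/cassandra-a.service.d/50-drain.conf >/dev/null <<'EOF'
[Service]
# Announce shutdown via Gossip, then flush and stop accepting writes,
# before systemd delivers SIGTERM to the JVM.
ExecStop=/usr/bin/nodetool disablegossip
ExecStop=/usr/bin/nodetool drain
EOF
sudo systemctl daemon-reload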

Mentioned in SAL (#wikimedia-operations) [2023-04-18T00:28:06Z] <eevans@cumin1001> START - Cookbook sre.hosts.downtime for 4:00:00 on cassandra-dev2001.codfw.wmnet with reason: testing systemd unit changes — T327954

Mentioned in SAL (#wikimedia-operations) [2023-04-18T00:28:21Z] <eevans@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cassandra-dev2001.codfw.wmnet with reason: testing systemd unit changes — T327954

Change 934399 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] cassandra_dev: remove (temp) packages used in debugging

https://gerrit.wikimedia.org/r/934399

Change 934399 merged by Eevans:

[operations/puppet@production] cassandra_dev: remove (temp) packages used in debugging

https://gerrit.wikimedia.org/r/934399

While the race condition that resulted in these incidents was never directly identified and fixed (in 3.11.x), an upgrade to Cassandra 4.1.1 —where messaging was all but rewritten— solved the issue for us. Since we're no longer running 3.11.x anywhere in production, we can close this issue.