[[ https://wikitech.wikimedia.org/wiki/Incidents/2023-01-24_sessionstore_quorum_issues | An incident occurred on 2023-01-24 ]] when ([[ https://phabricator.wikimedia.org/T325132 | as part of routine maintenance ]]) one of the Cassandra hosts in eqiad (sessionstore1001) was rebooted. When sessionstore1001 went down, connections failed over to the remaining two nodes as expected. However, as soon as the rebooted node rejoined the cluster, the sessionstore service began emitting 500s and logging errors.
```lang=json
...
{"msg":"Error writing to storage (Cannot achieve consistency level LOCAL_QUORUM)","appname":"sessionstore","time":"2023-01-24T21:10:27Z","level":"ERROR","request_id":"..."}
{"msg":"Error reading from storage (Cannot achieve consistency level LOCAL_QUORUM)","appname":"sessionstore","time":"2023-01-24T21:10:27Z","level":"ERROR","request_id":"..."}
{"msg":"Error deleting in storage (Cannot achieve consistency level EACH_QUORUM in DC eqiad","appname":"sessionstore","time":"2023-01-24T21:10:27Z","level":"ERROR","request_id":"..."}
...
```
~~A restart of the service may have restored normal operation (3 minutes elapsed, so it may instead have righted itself due to other factors).~~
This happened as a result of a kind of split-brain condition. After the reboot, sessionstore1001 started and transitioned to an UP state, and the other two nodes (1002 & 1003) likewise transitioned 1001 from DOWN to UP. However, 1001 did not recognize 1002 & 1003 as being online; it saw their state as DOWN. Meanwhile, 1001 began accepting client connections, and the driver (which also observed 1001 come online) began routing a portion of client requests to it. We replicate sessions 3 ways (per datacenter) and use QUORUM reads & writes, which is to say: we must read at least two of the three replicas to satisfy a read, and synchronously write at least two to successfully write. Clients that attempted to use 1001 as a coordinator node failed with an 'unavailable exception' (the source of the errors observed), because that node was unable to locate another replica in order to satisfy QUORUM.
{F36942653}
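The failure mode above can be illustrated with a small sketch of the quorum arithmetic (this is illustrative Python, not Cassandra source; the function names are invented for this sketch):

```python
# With RF=3 per datacenter, LOCAL_QUORUM requires floor(3/2) + 1 = 2
# replicas that the coordinator *believes* are alive. Availability is
# therefore judged from each coordinator's (possibly wrong) view of
# the cluster, which is what made the split-brain state harmful.

def local_quorum(replication_factor: int) -> int:
    """Number of replicas required to satisfy LOCAL_QUORUM."""
    return replication_factor // 2 + 1

def coordinator_can_serve(live_replicas_seen: int, replication_factor: int = 3) -> bool:
    """True if a coordinator, given its own view of replica liveness,
    can satisfy a LOCAL_QUORUM read or write."""
    return live_replicas_seen >= local_quorum(replication_factor)

# 1002 & 1003 as coordinators: they saw all three nodes UP, so
# requests routed to them succeeded.
assert coordinator_can_serve(live_replicas_seen=3)

# 1001 as coordinator: it saw only itself (1002 & 1003 as DOWN), so
# every request it coordinated failed with an unavailable exception,
# even though the other replicas were in fact healthy.
assert not coordinator_can_serve(live_replicas_seen=1)
```

Note that the quorum requirement itself never changed; only 1001's view of liveness was wrong, which is why a fraction of requests (those the driver routed to 1001) failed while the rest succeeded.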
----
Summary of findings (thus far):
* This problem only manifests when a host is //rebooted//
* When Cassandra starts up immediately after a reboot, that node sees an erroneous cluster state, where one or more of the other nodes are DOWN
** From the perspective of all other nodes however, membership is green across the board — including the recently rebooted node
** Connectivity, in every other respect seems fine (Cassandra clients can open connections, SSH works, etc)
* The problem resolves itself after 15 minutes and some seconds (< 30); no intervention is required, and nothing tried so far extends or shortens this period
* It can be reproduced (quite reliably) on //any// of the 6 sessionstore hosts, in either of the two datacenters
* It only happens in the sessionstore cluster; it cannot be reproduced on any of the restbase, aqs, or cassandra-dev clusters
** Cassandra, JVM, kernels, etc match across all clusters
----
#### See also:
- https://wikitech.wikimedia.org/wiki/Incidents/2023-01-24_sessionstore_quorum_issues (incident document)
- https://wikitech.wikimedia.org/wiki/Incidents/2022-09-15_sessionstore_quorum_issues (a past incident similar, if not identical, to this one)
- https://logstash.wikimedia.org/goto/d9a0a6a1e3663b452b3119ba51049e27 (snapshot of Kask logs from the incident)
- https://user.cassandra.apache.narkive.com/SWz6Jh2j/non-zero-nodes-are-marked-as-down-after-restarting-cassandra-process (similar report from another user)
- https://issues.apache.org/jira/browse/CASSANDRA-13984 (similar report from another user)
(WARNING) **REMINDER:** UNINSTALL ADDITIONAL cassandra-dev PACKAGES (see: [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/cassandra_dev.pp#10 | operations/puppet/modules/profile/manifests/cassandra_dev.pp ]]) BEFORE CLOSING
(WARNING) **REMINDER:** ~~ROLLBACK https://gerrit.wikimedia.org/r/c/operations/puppet/+/906131 BEFORE CLOSING~~