[[ https://wikitech.wikimedia.org/wiki/Incidents/2023-01-24_sessionstore_quorum_issues | An incident occurred on 2023-01-24 ]] when ([[ https://phabricator.wikimedia.org/T325132 | as part of routine maintenance ]]) one of the Cassandra hosts in eqiad (sessionstore1001) was rebooted. When sessionstore1001 went down, connections failed over to the remaining two nodes as expected. However, as soon as the rebooted node rejoined the cluster, the sessionstore service began emitting 500s and logging errors.
```lang=json
...
{"msg":"Error writing to storage (Cannot achieve consistency level LOCAL_QUORUM)","appname":"sessionstore","time":"2023-01-24T21:10:27Z","level":"ERROR","request_id":"..."}
{"msg":"Error reading from storage (Cannot achieve consistency level LOCAL_QUORUM)","appname":"sessionstore","time":"2023-01-24T21:10:27Z","level":"ERROR","request_id":"..."}
{"msg":"Error deleting in storage (Cannot achieve consistency level EACH_QUORUM in DC eqiad","appname":"sessionstore","time":"2023-01-24T21:10:27Z","level":"ERROR","request_id":"..."}
...
```
This happened as a result of a split-brain-like condition. After the reboot, sessionstore1001 started and transitioned to an UP state, and the other two nodes (1002 & 1003) likewise transitioned 1001 from DOWN to UP. However, 1001 did not recognize 1002 & 1003 as being online; it saw their state as DOWN. Meanwhile, 1001 began accepting client connections, and the driver (which also observed 1001 come online) began routing a portion of client requests to it. We replicate sessions 3 ways (per datacenter) and use QUORUM reads & writes, which is to say: we must read from at least two of the three replicas to satisfy a read, and synchronously write to at least two to complete a write. Requests that used 1001 as their coordinator node failed with an 'unavailable exception' (the source of the errors observed), because that node could not locate a second replica with which to satisfy QUORUM.
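To make the failure mode concrete, here is a minimal gocql-based sketch of a LOCAL_QUORUM write (the keyspace, table, and contact points are hypothetical; this is not Kask's actual code): with a replication factor of 3 per datacenter, any coordinator that believes the other replicas are DOWN cannot assemble the two replicas a quorum requires, and the driver surfaces that as an unavailable error.
```lang=go
package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Hypothetical contact points and keyspace, for illustration only.
	cluster := gocql.NewCluster(
		"sessionstore1001.eqiad.wmnet",
		"sessionstore1002.eqiad.wmnet",
		"sessionstore1003.eqiad.wmnet",
	)
	cluster.Keyspace = "sessions"

	// Reads and writes require a local quorum: with RF=3 per datacenter,
	// the coordinator must involve 2 of the 3 replicas.
	cluster.Consistency = gocql.LocalQuorum

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// If the coordinator (e.g. the freshly rebooted 1001) believes its peers
	// are DOWN, this fails with an unavailable error such as
	// "Cannot achieve consistency level LOCAL_QUORUM".
	err = session.Query(
		`INSERT INTO session_values (key, value) VALUES (?, ?)`, "k", "v",
	).Exec()
	if err != nil {
		log.Printf("write failed: %v", err)
	}
}
```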
{F36942653}
----
#### Summary of findings (thus far):
* This problem only manifests when a host is //rebooted//
* When Cassandra starts up immediately after a reboot, that node sees an erroneous cluster state, where one or more of the other nodes are DOWN
** From the perspective of all the other nodes, however, membership is green across the board, including the recently rebooted node
** Connectivity in every other respect seems fine (Cassandra clients can open connections, SSH works, etc.)
* The problem resolves itself after 15 minutes and some seconds (< 30s); no intervention is required, and nothing seems to extend or shorten this period
* It can be reproduced on //any// of the 6 sessionstore hosts, in either of the two datacenters (quite reliably)
* It only happens in the sessionstore cluster; it cannot be reproduced on any of the restbase, aqs, or cassandra-dev clusters
** Cassandra, JVM, kernel versions, etc. match across all clusters
* [[ https://issues.apache.org/jira/browse/CASSANDRA-13984 | CASSANDRA-13984 ]] and [[ https://user.cassandra.apache.narkive.com/SWz6Jh2j/non-zero-nodes-are-marked-as-down-after-restarting-cassandra-process | this user@cassandra mailing list post ]] both describe similar scenarios. Though neither surfaces an underlying cause, both detail some combination of `nodetool drain` & `nodetool disablegossip` that results in a reboot after which the node immediately returns to a stable state. This would seem to be the case for us as well; implementing a shutdown sequence that runs `drain`, `disablegossip`, and `sleep 10` before the reboot (sketched below) seems to avoid the 15 minutes of dissonant cluster state
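For reference, a sketch of that pre-reboot sequence, written in Go only to match the example above (in practice it would more likely be a couple of lines in a shell script or a reboot cookbook); it assumes `nodetool` is on the PATH and that the caller has the necessary privileges:
```lang=go
package main

import (
	"log"
	"os/exec"
	"time"
)

// runNodetool shells out to nodetool and aborts on failure.
func runNodetool(args ...string) {
	out, err := exec.Command("nodetool", args...).CombinedOutput()
	if err != nil {
		log.Fatalf("nodetool %v failed: %v (%s)", args, err, out)
	}
}

func main() {
	// Per the findings above: drain, disable gossip, then give the rest of
	// the cluster ~10 seconds to observe the shutdown before rebooting.
	runNodetool("drain")
	runNodetool("disablegossip")
	time.Sleep(10 * time.Second)
	log.Println("node drained and gossip disabled; safe to reboot")
}
```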
----
#### Additional explorations:
Our multi-instance configuration uses secondary IP interfaces (one per instance), bound to the same device as the host's main IPv4 address. This creates some odd, asymmetric behavior where responses to traffic addressed to the secondary IPs are sent //from// the host's main IP. This has always been the case (see for example T128590), and it is equally true of the other clusters, which do not exhibit this problem. However, there may be something unique to this cluster that provokes the issue where the others do not (the low node count, or the fact that these hosts do not also run LVS). So:
# [ ] Match `net.ipv4.conf.all.arp_{ignore,announce}` settings to those of the other clusters (where set by LVS) ([[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/905746 | r905746 ]])
# [ ] Configure routing tables to send responses via the receiving IP interface
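The second item would amount to source-based policy routing (per the unix.stackexchange link in the see-also list): a per-instance rule that sends traffic sourced from a secondary IP through its own routing table. A hypothetical sketch, again in Go for consistency (the instance IP, gateway, device, and table number are placeholders; in production this would be managed via Puppet, not an ad-hoc program):
```lang=go
package main

import (
	"fmt"
	"os/exec"
)

// Placeholder values; none of these reflect the real sessionstore hosts.
const (
	instanceIP = "10.64.0.101" // secondary (per-instance) IP
	gateway    = "10.64.0.1"   // default gateway
	device     = "eno1"        // device the secondary IP is bound to
	table      = "100"         // dedicated routing table for this instance
)

func run(args ...string) error {
	out, err := exec.Command(args[0], args[1:]...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("%v: %v (%s)", args, err, out)
	}
	return nil
}

func main() {
	// The standard recipe: packets sourced from the secondary IP consult a
	// dedicated table, so responses leave via the route associated with the
	// receiving address rather than the host's default.
	cmds := [][]string{
		{"ip", "rule", "add", "from", instanceIP + "/32", "table", table},
		{"ip", "route", "add", "default", "via", gateway, "dev", device, "table", table},
	}
	for _, c := range cmds {
		if err := run(c...); err != nil {
			fmt.Println(err)
			return
		}
	}
}
```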
----
#### See also:
- https://wikitech.wikimedia.org/wiki/Incidents/2023-01-24_sessionstore_quorum_issues (incident document)
- https://wikitech.wikimedia.org/wiki/Incidents/2022-09-15_sessionstore_quorum_issues (a past incident at least similar, if not identical, to this one)
- https://logstash.wikimedia.org/goto/d9a0a6a1e3663b452b3119ba51049e27 (snapshot of Kask logs from the incident)
- https://user.cassandra.apache.narkive.com/SWz6Jh2j/non-zero-nodes-are-marked-as-down-after-restarting-cassandra-process (similar report from another user)
- https://issues.apache.org/jira/browse/CASSANDRA-13984 (similar report from another user)
- https://unix.stackexchange.com/questions/4420/reply-on-same-interface-as-incoming (replying via the same interface a request arrived on)
(WARNING) **REMINDER:** UNINSTALL ADDITIONAL cassandra-dev PACKAGES (see: [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/cassandra_dev.pp#10 | operations/puppet/modules/profile/manifests/cassandra_dev.pp ]]) BEFORE CLOSING
(WARNING) **REMINDER:** ~~ROLLBACK https://gerrit.wikimedia.org/r/c/operations/puppet/+/906131 BEFORE CLOSING~~