[[ https://wikitech.wikimedia.org/wiki/Incidents/2023-01-24_sessionstore_quorum_issues | An incident occurred on 2023-01-24 ]] when ([[ https://phabricator.wikimedia.org/T325132 | as part of routine maintenance ]]) one of the Cassandra hosts in eqiad (sessionstore1001) was rebooted. When sessionstore1001 went down, connections failed over to the remaining two nodes as expected. However, as soon as the rebooted node rejoined the cluster, the sessionstore service began returning 500s and logging errors such as the following:
```lang=json
...
{"msg":"Error writing to storage (Cannot achieve consistency level LOCAL_QUORUM)","appname":"sessionstore","time":"2023-01-24T21:10:27Z","level":"ERROR","request_id":"..."}
{"msg":"Error reading from storage (Cannot achieve consistency level LOCAL_QUORUM)","appname":"sessionstore","time":"2023-01-24T21:10:27Z","level":"ERROR","request_id":"..."}
{"msg":"Error deleting in storage (Cannot achieve consistency level EACH_QUORUM in DC eqiad","appname":"sessionstore","time":"2023-01-24T21:10:27Z","level":"ERROR","request_id":"..."}
...
```
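The errors above are line-delimited JSON, which makes them easy to aggregate when sifting through a Logstash dump. A minimal sketch, assuming one JSON object per line with the `msg` and `level` fields shown in the excerpt (the sample lines below are copied from it):

```python
import json
from collections import Counter

# Sample lines shaped like the Kask excerpt above.
sample = [
    '{"msg":"Error writing to storage (Cannot achieve consistency level LOCAL_QUORUM)","appname":"sessionstore","time":"2023-01-24T21:10:27Z","level":"ERROR"}',
    '{"msg":"Error reading from storage (Cannot achieve consistency level LOCAL_QUORUM)","appname":"sessionstore","time":"2023-01-24T21:10:27Z","level":"ERROR"}',
]

def tally_consistency_errors(lines):
    """Count ERROR entries that mention a failed consistency level."""
    counts = Counter()
    for line in lines:
        entry = json.loads(line)
        if entry.get("level") == "ERROR" and "consistency level" in entry.get("msg", ""):
            counts[entry["msg"]] += 1
    return counts

print(tally_consistency_errors(sample))
```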
~~A restart of the service may have restored normal operation (roughly 3 minutes had elapsed, so it may instead have recovered due to other factors).~~
This was the result of a split-brain-like condition. After the reboot, sessionstore1001 started and transitioned to an UP state, and the other two nodes (1002 & 1003) likewise transitioned 1001 from DOWN to UP. However, 1001 did not recognize 1002 & 1003 as being online, and listed their state as DOWN. Meanwhile, 1001 began accepting client connections, and the driver began routing a portion of client requests to it.
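The quorum arithmetic explains the errors. A sketch, assuming a replication factor (RF) of 3 per datacenter (an assumption here, not confirmed above):

```python
# Why the rebooted node returned quorum errors, assuming RF=3 per DC
# (an assumption for illustration).

def quorum(rf: int) -> int:
    """Replicas that must acknowledge a request: floor(rf/2) + 1."""
    return rf // 2 + 1

RF = 3
print(quorum(RF))  # 2: LOCAL_QUORUM needs 2 of 3 replicas

# From sessionstore1001's (incorrect) view, 1002 and 1003 were DOWN,
# leaving 1 live replica -- short of the 2 required, hence the 500s.
live = 1
print(live >= quorum(RF))  # False
```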
{F36942653}
----
Summary of findings (thus far):
* The problem only manifests when a host is //rebooted//
* After a reboot, when Cassandra starts up, it sees an erroneous cluster state in which one or more of the other nodes appear DOWN
** From the perspective of all other nodes, however, membership is green across the board, including the recently rebooted node
** Connectivity in every other respect seems fine (Cassandra clients can open connections, SSH works, etc.)
* The problem resolves itself after 15 minutes plus a few seconds (< 30); no intervention is required, and nothing seems to extend or shorten this period
* It can be replicated (quite reliably) on //any// of the 6 sessionstore hosts, in either of the two datacenters
* It only happens in the sessionstore cluster; it cannot be replicated on any of the restbase, aqs, or cassandra-dev clusters
** Cassandra, JVM, kernel versions, etc. match across all clusters
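The asymmetric view can be confirmed by capturing `nodetool status` on every host right after a reboot and diffing the `DN` rows. A sketch of that comparison, assuming output in the usual `nodetool status` column format (hostnames, addresses, and IDs below are illustrative, not from the cluster):

```python
# Parse captured `nodetool status` output and report which peers a node
# marks DOWN ('DN' rows). Sample captures below are fabricated for
# illustration only.

def down_peers(nodetool_output: str) -> set:
    """Return the addresses a node reports as DOWN."""
    return {
        line.split()[1]
        for line in nodetool_output.splitlines()
        if line.startswith("DN")
    }

# View captured on the rebooted node: it marks both peers DOWN.
view_1001 = """\
--  Address    Load     Tokens  Owns  Host ID  Rack
UN  10.0.0.1   1 GiB    256     ?     aaaa     r1
DN  10.0.0.2   1 GiB    256     ?     bbbb     r1
DN  10.0.0.3   1 GiB    256     ?     cccc     r1
"""

# View captured on a peer at the same moment: everything UP.
view_1002 = """\
--  Address    Load     Tokens  Owns  Host ID  Rack
UN  10.0.0.1   1 GiB    256     ?     aaaa     r1
UN  10.0.0.2   1 GiB    256     ?     bbbb     r1
UN  10.0.0.3   1 GiB    256     ?     cccc     r1
"""

print(sorted(down_peers(view_1001)))  # ['10.0.0.2', '10.0.0.3']
print(down_peers(view_1002))          # set()
```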
----
#### See also:
- https://wikitech.wikimedia.org/wiki/Incidents/2023-01-24_sessionstore_quorum_issues (incident document)
- https://wikitech.wikimedia.org/wiki/Incidents/2022-09-15_sessionstore_quorum_issues (a past incident at least similar, if not identical, to this one)
- https://logstash.wikimedia.org/goto/d9a0a6a1e3663b452b3119ba51049e27 (snapshot of Kask logs from the incident)
- https://user.cassandra.apache.narkive.com/SWz6Jh2j/non-zero-nodes-are-marked-as-down-after-restarting-cassandra-process (similar report from another user)
- https://issues.apache.org/jira/browse/CASSANDRA-13984 (similar report from another user)
(WARNING) **REMINDER:** UNINSTALL ADDITIONAL cassandra-dev PACKAGES (see: [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/cassandra_dev.pp#10 | operations/puppet/modules/profile/manifests/cassandra_dev.pp ]]) BEFORE CLOSING
(WARNING) **REMINDER:** ~~ROLLBACK https://gerrit.wikimedia.org/r/c/operations/puppet/+/906131 BEFORE CLOSING~~