It would seem that if Kask loses connectivity to a Cassandra host (via the gocql driver), that host is permanently de-pooled and never re-pooled. Requests then fail with the following error message:
Error reading from storage (gocql: no hosts available in the pool)
Once this happens, the container running Kask must be restarted.
This seems to correlate with: gocql/gocql/issues/915
We should coordinate with upstream on a fix for this. In the meantime, it may be worth working around this in Kask by re-creating the session object when this error occurs.
See also:
- T253244: Upstream gocql bug effects Kask
- gocql/commit/312a614 (possibly upstream fix)
This issue has already resulted in two separate sessionstore incidents, most recently a spike in errors after a node was rebooted. While the affected node was rebooting, the remaining nodes de-pooled their connections to it but, as a result of this bug, were unable to reestablish those connections once it was back up. With the Cassandra cluster healthy (all nodes up) but one node unreachable from Kask, QUORUMs were chosen that included the unreachable node. This is easily reproducible; we are likely to see this happen again.
See: https://wikitech.wikimedia.org/wiki/Incidents/2022-09-15_sessionstore_quorum_issues