Kask: gocql: no hosts available in the pool errors
Closed, Resolved (Public)

Description

It would seem that if Kask loses connectivity to Cassandra (via the gocql driver), the host is permanently de-pooled (never to be re-pooled). This results in the following error message:

Error reading from storage (gocql: no hosts available in the pool)

Once this happens, the container running Kask must be restarted.

This seems to correlate with: gocql/gocql/issues/915

We should coordinate with upstream on a fix for this. In the meantime, it may be worth working around this in Kask by re-creating the session object when this error occurs.

IMPACT

This issue has already resulted in two separate sessionstore incidents, most recently a spike in errors after a node was rebooted. While the affected node was rebooting, the remaining nodes de-pooled their connections to it but, as a result of this bug, were unable to reestablish those connections once it was back up. With the Cassandra cluster healthy (all nodes up) but one node unreachable from Kask, QUORUMs were chosen that included the unreachable node. This is easily reproducible; we are likely to see it happen again.

See: https://wikitech.wikimedia.org/wiki/Incidents/2022-09-15_sessionstore_quorum_issues

Event Timeline

Eevans raised the priority of this task from Medium to High. (Oct 17 2022, 4:24 PM)
Eevans added a project: Cassandra.

Kask's dependencies are sourced entirely from Debian, the rationale for which is documented here. The most current version of the gocql driver in any version of Debian is 0.0~git20191102.0.9faa4c0-4 (the version we are already using). Continuing this practice will mean creating an updated package and adding it to a repository (preferably Debian, but possibly our own in the near term).

Updating to the latest gocql driver release (1.2.1 as of this writing) will roughly require:

  • Packaging golang-github-pierrec-lz4.v4-dev (not currently in any version of Debian, but package source already exists on Salsa), and uploading it to sid
  • Updating golang-github-gocql-gocql-dev, and uploading it to sid
  • Uploading the updated golang-github-pierrec-lz4.v4-dev & golang-github-gocql-gocql-dev packages to apt.wikimedia.org (as an interim solution)
  • (Eventually) uploading golang-github-pierrec-lz4.v4-dev & golang-github-gocql-gocql-dev to bullseye-backports
  • (Eventually) removing golang-github-pierrec-lz4.v4-dev & golang-github-gocql-gocql-dev from apt.wikimedia.org

The alternative would be to update Kask to Go Modules, and henceforth source dependencies from the respective GitHub repos. I still wholeheartedly believe in the rationale for Debian-sourced dependencies, but feel compelled to present this option since the former will take some hours of work, and the latter... minutes.

[ ... ]

To update to the latest gocql driver release (1.2.1 as of the time of this writing), will roughly require:

  • Packaging golang-github-pierrec-lz4.v4-dev (not currently in any version of Debian, but package source already exists on Salsa), and uploading it to sid
  • ...
NOTE: Upload of golang-github-pierrec-lz4.v4-dev to Debian is currently (since May) blocked on licensing issues.

Change 855102 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] Add component/gocql to bullseye

https://gerrit.wikimedia.org/r/855102

Change 855102 merged by Eevans:

[operations/puppet@production] Add component/gocql to bullseye

https://gerrit.wikimedia.org/r/855102

Change 856009 had a related patch set uploaded (by Eevans; author: Eevans):

[mediawiki/services/kask@master] Upgrade build environment & dependencies

https://gerrit.wikimedia.org/r/856009

Change 856009 merged by jenkins-bot:

[mediawiki/services/kask@master] Upgrade build environment & dependencies

https://gerrit.wikimedia.org/r/856009

This is complete with the deployment of Kask v1.0.10.