Upstream gocql bug affects Kask
Closed, Resolved, Public

Description

As observed when investigating T252898: echostore connection error in Beta Cluster, a transient Cassandra failure isn't (always) properly handled by the GoCQL driver.

May 20 15:37:33 deployment-echostore01 docker[10577]: {"msg":"Error reading from storage (gocql: no hosts available in the pool)","appname":"sessions","time":"2020-05-20T15:37:33Z","level":"ERROR","request_id":"00000000-0000-0000-0000-000000000000"}
May 20 15:37:33 deployment-echostore01 docker[10577]: {"msg":"Error reading from storage (gocql: no hosts available in the pool)","appname":"sessions","time":"2020-05-20T15:37:33Z","level":"ERROR","request_id":"00000000-0000-0000-0000-000000000000"}
May 20 15:37:33 deployment-echostore01 docker[10577]: {"msg":"Error reading from storage (gocql: no hosts available in the pool)","appname":"sessions","time":"2020-05-20T15:37:33Z","level":"ERROR","request_id":"00000000-0000-0000-0000-000000000000"}
May 20 15:37:42 deployment-echostore01 docker[10577]: {"msg":"Error reading from storage (gocql: no hosts available in the pool)","appname":"sessions","time":"2020-05-20T15:37:42Z","level":"ERROR","request_id":"00000000-0000-0000-0000-000000000000"}
May 20 15:37:42 deployment-echostore01 docker[10577]: {"msg":"Error reading from storage (gocql: no hosts available in the pool)","appname":"sessions","time":"2020-05-20T15:37:42Z","level":"ERROR","request_id":"00000000-0000-0000-0000-000000000000"}
May 20 15:41:29 deployment-echostore01 docker[10577]: {"msg":"Error writing to storage (gocql: no hosts available in the pool)","appname":"sessions","time":"2020-05-20T15:41:29Z","level":"ERROR","request_id":"00000000-0000-0000-0000-000000000000"}
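To check whether the error recurs, log lines like the ones above can be filtered mechanically. The following is a minimal Go sketch, assuming the journal format shown above (a syslog-style prefix followed by a single JSON object per line). The helper names `parseLine` and `countNoHostErrors` are hypothetical and are not part of Kask or gocql.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry mirrors the JSON fields visible in the log lines above.
type logEntry struct {
	Msg       string `json:"msg"`
	Appname   string `json:"appname"`
	Time      string `json:"time"`
	Level     string `json:"level"`
	RequestID string `json:"request_id"`
}

// noHostsMsg is the driver error text exactly as it appears in the logs.
const noHostsMsg = "gocql: no hosts available in the pool"

// parseLine (hypothetical helper) skips the syslog prefix by locating the
// first '{' and decodes the rest of the line as JSON. It reports false if
// the line carries no decodable JSON payload.
func parseLine(line string) (logEntry, bool) {
	var e logEntry
	i := strings.Index(line, "{")
	if i < 0 {
		return e, false
	}
	if err := json.Unmarshal([]byte(line[i:]), &e); err != nil {
		return e, false
	}
	return e, true
}

// countNoHostErrors (hypothetical helper) counts ERROR-level entries whose
// message contains the "no hosts available" text.
func countNoHostErrors(lines []string) int {
	n := 0
	for _, l := range lines {
		e, ok := parseLine(l)
		if ok && e.Level == "ERROR" && strings.Contains(e.Msg, noHostsMsg) {
			n++
		}
	}
	return n
}

func main() {
	// Sample input: one matching line in the format above, one unrelated line.
	sample := []string{
		`May 20 15:41:29 deployment-echostore01 docker[10577]: {"msg":"Error writing to storage (gocql: no hosts available in the pool)","appname":"sessions","time":"2020-05-20T15:41:29Z","level":"ERROR","request_id":"00000000-0000-0000-0000-000000000000"}`,
		`May 20 15:41:30 deployment-echostore01 systemd[1]: unrelated line without JSON`,
	}
	fmt.Println(countNoHostErrors(sample))
}
```

A non-zero count over a given time window would indicate that the condition has reappeared; this only detects the symptom and does nothing to recover the connection pool.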

This seems to be: gocql/gocql/issues/915

gocql/gocql/issues/915#issuecomment-325329596 documents how to go about reproducing this, and GitHub user Zariel has requested that someone do so with the gocql_debug build tag enabled and provide them with the output (we should probably begin there).

Event Timeline

Change 597881 had a related patch set uploaded (by Clarakosi; owner: Clarakosi):
[mediawiki/services/kask@master] Add gocql_debug build tag

https://gerrit.wikimedia.org/r/597881

Change 597881 merged by jenkins-bot:
[mediawiki/services/kask@deployment-prep-debug] Add gocql_debug build tag

https://gerrit.wikimedia.org/r/597881

eprodromou subscribed.

OK, we've done what we can with this, so please re-open if this appears again.

We deployed a new Kask image with the gocql_debug build tag to echostore and followed up on GitHub. Reopening the ticket for now and moving it to blocked externally, but if the error is not reproduced within 2 months, I suggest rolling back the image and removing the clinic duty tag.

Aklapper added a subscriber: Clarakosi.

Removing inactive task assignee

Eevans added a project: User-Eevans.
Eevans added a subscriber: LSobanski.
Eevans raised the priority of this task from Medium to High. Sep 15 2022, 3:34 PM

Change 857711 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/deployment-charts@master] sessionstore: bump container version to v1.0.10

https://gerrit.wikimedia.org/r/857711

Change 857711 merged by jenkins-bot:

[operations/deployment-charts@master] sessionstore: bump container version to v1.0.10

https://gerrit.wikimedia.org/r/857711

This has been deployed to sessionstore (production). It still needs to be deployed to:

  • sessionstore deployment-prep
  • echostore production
  • echostore deployment-prep

Change 861925 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/deployment-charts@master] echostore: bump container version to v1.0.10

https://gerrit.wikimedia.org/r/861925

Change 861925 merged by jenkins-bot:

[operations/deployment-charts@master] echostore: bump container version to v1.0.10

https://gerrit.wikimedia.org/r/861925

Change 862307 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/deployment-charts@master] echostore: bring codfw hosts up to date

https://gerrit.wikimedia.org/r/862307

This is complete with the deployment of Kask v1.0.10.

Change 862307 merged by jenkins-bot:

[operations/deployment-charts@master] echostore: bring codfw hosts up to date

https://gerrit.wikimedia.org/r/862307