Kartotherian service on maps100[2-4] timed out when trying to get tiles.
Closed, Resolved | Public

Description

During T198622, which otherwise went well, I changed the replication factor of the system_auth keyspace from 1 to 2 and of the v4 keyspace from 2 to 3. We then started seeing these errors in the logs:

{"name":"kartotherian","hostname":"maps1002","pid":158,"level":60,"err":{"message":"User kartotherian has no SELECT permission on <table v4.tiles> or any of its parents","name":"ResponseError","stack":"Error: User kartotherian has no SELECT permission on <table v4.tiles> or any of its parents\n    at ResponseError.DriverError (/srv/deployment/kartotherian/deploy-cache/revs/e847e7b40df808a760876f3a7a2cbea49e77ce3d/node_modules/cassandra-driver/lib/errors.js:14:19)\n    at new ResponseError (/srv/deployment/kartotherian/deploy-cache/revs/e847e7b40df808a760876f3a7a2cbea49e77ce3d/node_modules/cassandra-driver/lib/errors.js:51:24)\n    at FrameReader.readError (/srv/deployment/kartotherian/deploy-cache/revs/e847e7b40df808a760876f3a7a2cbea49e77ce3d/node_modules/cassandra-driver/lib/readers.js:316:13)\n    at Parser.parseBody (/srv/deployment/kartotherian/deploy-cache/revs/e847e7b40df808a760876f3a7a2cbea49e77ce3d/node_modules/cassandra-driver/lib/streams.js:180:66)\n    at Parser._transform (/srv/deployment/kartotherian/deploy-cache/revs/e847e7b40df808a760876f3a7a2cbea49e77ce3d/node_modules/cassandra-driver/lib/streams.js:135:10)\n    at Parser.Transform._read (_stream_transform.js:167:10)\n    at Parser.Transform._write (_stream_transform.js:155:12)\n    at doWrite (_stream_writable.js:331:12)\n    at writeOrBuffer (_stream_writable.js:317:5)\n    at Parser.Writable.write (_stream_writable.js:243:11)\n    at Protocol.ondata (_stream_readable.js:555:20)\n    at emitOne (events.js:96:13)\n    at Protocol.emit (events.js:188:7)\n    at readableAddChunk (_stream_readable.js:176:18)\n    at Protocol.Readable.push (_stream_readable.js:134:10)\n    at Protocol.Transform.push (_stream_transform.js:128:32)","code":8448,"info":"Represents an error message from the server","coordinator":"10.64.32.117:9042","query":"SELECT tile, WRITETIME(tile) AS wt FROM tiles WHERE zoom = ? AND idx = ? AND block = ?","moduleUri":"{\"maxzoom\":15,\"keyspace\":\"v4\",\"cp\":[\"10.64.48.154\",\"10.64.32.117\",\"10.64.16.42\"],\"username\":\"kartotherian\",\"setLastModified\":true}","levelPath":"fatal/service-runner/unhandled"},"msg":"User kartotherian has no SELECT permission on <table v4.tiles> or any of its parents","time":"2019-01-22T19:59:51.283Z","v":0}
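
The "code":8448 field in the record above is 0x2100, which the Cassandra native protocol uses for Unauthorized errors. A minimal sketch of how such records could be picked out of a log stream follows; the function name and the abbreviated sample record are illustrative assumptions, not part of Kartotherian or its tooling.

```python
import json

# Cassandra native-protocol error code for "Unauthorized" (0x2100 == 8448).
UNAUTHORIZED = 0x2100


def is_permission_error(log_line: str) -> bool:
    """Return True if a JSON log record carries a Cassandra Unauthorized error.

    Hypothetical helper for illustration only.
    """
    try:
        record = json.loads(log_line)
    except json.JSONDecodeError:
        return False  # not a complete JSON record (e.g. a wrapped or truncated line)
    if not isinstance(record, dict):
        return False
    err = record.get("err")
    return isinstance(err, dict) and err.get("code") == UNAUTHORIZED


# Abbreviated record with the same field names as the log entry above.
sample = json.dumps({
    "name": "kartotherian",
    "hostname": "maps1002",
    "level": 60,
    "err": {"name": "ResponseError", "code": 8448,
            "coordinator": "10.64.32.117:9042"},
})
print(is_permission_error(sample))
```

Matching on the numeric code rather than on the message text keeps the check independent of the user and table names that appear in the message.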

After further probing, we strongly believe this is related to T157354, where a similar issue was fixed by recreating the user that queries Cassandra.

Event Timeline

Steps taken to fix this:

  • run nodetool repair system_auth on all nodes
  • recreate users and permissions, based on the /usr/local/bin/maps-grants.cql script
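
The two steps above can be sketched as a dry-run plan that only prints the commands. The host list (taken from maps100[2-4] in the task title), the use of cqlsh -f to replay the grants script, and running it from the first host are assumptions; the actual cluster membership and the credentials passed to cqlsh are not stated in this task.

```python
# Dry-run sketch: builds and prints the recovery commands, executes nothing.

# ASSUMPTION: node list derived from "maps100[2-4]" in the task title;
# "all nodes" in the real cluster may include more hosts.
HOSTS = ["maps1002", "maps1003", "maps1004"]
GRANTS_SCRIPT = "/usr/local/bin/maps-grants.cql"


def recovery_plan(hosts, grants_script=GRANTS_SCRIPT):
    """Return the ordered list of (host, command) steps (hypothetical helper)."""
    if not hosts:
        raise ValueError("at least one host is required")
    # Step 1: repair system_auth on every node so each replica holds the auth data.
    plan = [(host, "nodetool repair system_auth") for host in hosts]
    # Step 2: recreate users and permissions from the grants script.
    # ASSUMPTION: replayed once with cqlsh -f on the first host, credentials omitted.
    plan.append((hosts[0], f"cqlsh -f {grants_script}"))
    return plan


for host, command in recovery_plan(HOSTS):
    print(f"{host}: {command}")
```

Repairing before replaying the grants keeps the order listed above, so the recreated roles and permissions are written against replicas that are already in sync.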

This looks very much like T157354; we fell into the same trap.

Change 486074 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] Revert "cache::upload: depool kartotherian in eqiad"

https://gerrit.wikimedia.org/r/486074

Change 486074 merged by Gehel:
[operations/puppet@production] Revert "cache::upload: depool kartotherian in eqiad"

https://gerrit.wikimedia.org/r/486074

Joe triaged this task as High priority. Feb 7 2019, 12:07 PM
Gehel claimed this task.

The incident is documented at https://wikitech.wikimedia.org/wiki/Incident_documentation/20190122-maps. This is a known issue (T157354).

A better procedure will be tested and documented as part of the reimage of the codfw maps cluster.