Details
Related Objects
- Mentioned In
- T140825: Throttle compaction throughput limit in line with instance count
rOPUPd3426d6a1918: Increase permissions validity on RESTBase cluster
rOPUP24a6a09aa566: Increase permissions validity on RESTBase cluster
rOPUPe5af8aa4638d: Raise Cassandra auth caching to 10 minutes
rOPUPffc4dfdd119f: Raise Cassandra auth caching to 10 minutes
rOPUP5e60425168d1: Increase AQS default auth permission caching to 30s.
rOPUP29b9a241e809: Add the permissions_validity_in_ms among the configurable parameters
rOPUP9de686731de0: Add the permissions_validity_in_ms among the configurable parameters
rOPUPd1358aa5c0f8: Add the permissions_validity_in_ms among the configurable parameters
rOPUPb17328da25e7: Add the permissions_validity_in_ms among the configurable parameters - Mentioned Here
- T140825: Throttle compaction throughput limit in line with instance count
Event Timeline
Noticed this error from logstash:
Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 1 responses. at org.apache.cassandra.auth.Auth.selectUser(Auth.java:276) ~[apache-cassandra-2.1.13.jar:2.1.13] at org.apache.cassandra.auth.Auth.isSuperuser(Auth.java:97) ~[apache-cassandra-2.1.13.jar:2.1.13] at org.apache.cassandra.auth.AuthenticatedUser.isSuper(AuthenticatedUser.java:50) ~[apache-cassandra-2.1.13.jar:2.1.13] at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:67) ~[apache-cassandra-2.1.13.jar:2.1.13] at org.apache.cassandra.auth.PermissionsCache$1.load(PermissionsCache.java:124) ~[apache-cassandra-2.1.13.jar:2.1.13] at org.apache.cassandra.auth.PermissionsCache$1.load(PermissionsCache.java:121) ~[apache-cassandra-2.1.13.jar:2.1.13] at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3524) ~[guava-16.0.jar:na] at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2317) ~[guava-16.0.jar:na] at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2280) ~[guava-16.0.jar:na] at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2195) ~[guava-16.0.jar:na] ... 23 common frames omitted
Seems to be an issue with the Auth table. From https://issues.apache.org/jira/browse/CASSANDRA-7894 it seems that we could tune the permissions_validity_in_ms parameter.
Change 301083 had a related patch set uploaded (by Elukey):
Add the permissions_validity_in_ms among the configurable parameters
Change 301083 merged by Elukey:
Add the permissions_validity_in_ms among the configurable parameters
Change 301365 had a related patch set uploaded (by Elukey):
Increase AQS default auth permission caching to 30s.
Mentioned in SAL [2016-07-27T13:29:43Z] <elukey> Restart Cassandra on aqs100[123] to apply the latest configuration (T140869)
Next step is to wait some rounds of loading jobs to see if this change had the desired impact.
Change 301780 had a related patch set uploaded (by Elukey):
Raise Cassandra auth caching to 10 minutes
Mentioned in SAL [2016-07-29T10:23:52Z] <elukey> restarting cassandra on aqs100[123] to apply the latest config (https://gerrit.wikimedia.org/r/#/c/301780/1 - T140869)
So far the results are really good, both latencies (especially per article mean, p75 and p99) and READ timeouts decreased a lot since we applied the patch (~27 Jul at 13 UTC):
This seems to indicate that we are going in the right direction. I increased the auth caching to 10 minutes to see if we can get even more performance improvements.
@elukey Interesting. We are seeing a few of these in the RESTBase cluster too, 32 over the last 12 hours, all on the same host. I'm not sure this is having the same impact there, but permissions changes are rare enough that I think I'll similarly raise the validity period for that cluster just in case.
Thanks for the heads up!
Change 301878 had a related patch set uploaded (by Eevans):
Increase permissions validity on RESTBase cluster
Mentioned in SAL [2016-08-04T17:08:34Z] <urandom> T140825,T140869: Restarting Cassandra, restbase1007-a.eqiad.wmnet
Mentioned in SAL [2016-08-04T17:14:51Z] <urandom> T140825,T140869: Restarting Cassandra, restbase1007-b.eqiad.wmnet
Mentioned in SAL [2016-08-04T17:17:37Z] <urandom> T140825,T140869: Restarting Cassandra, restbase1007-c.eqiad.wmnet
Mentioned in SAL [2016-08-04T19:37:28Z] <urandom> T140825,T140869: Restarting Cassandra, restbase1010-a.eqiad.wmnet
Mentioned in SAL [2016-08-04T19:40:07Z] <urandom> T140825,T140869: Restarting Cassandra, restbase1010-b.eqiad.wmnet
Mentioned in SAL [2016-08-04T19:42:22Z] <urandom> T140825,T140869: Restarting Cassandra, restbase1010-c.eqiad.wmnet
Mentioned in SAL [2016-08-04T20:03:38Z] <urandom> T140825,T140869: Performing Cassandra instance rolling restart of restbase1011.eqiad.wmnet
Mentioned in SAL [2016-08-04T20:10:55Z] <urandom> T140825,T140869: Cassandra instance restarts complete: restbase1011.eqiad.wmnet
Mentioned in SAL [2016-08-04T20:13:20Z] <urandom> T140825,T140869: Performing rolling restart of codfw Cassandra instances
Mentioned in SAL [2016-08-04T21:14:18Z] <urandom> T140825,T140869: Rolling restart of codfw Cassandra instances complete
Mentioned in SAL [2016-08-04T21:26:10Z] <urandom> T140825,T140869: Rolling restart of Cassandra instances, eqiad Rack `b'
Mentioned in SAL [2016-08-04T21:48:28Z] <urandom> T140825,T140869: Rolling restart of Cassandra instances, eqiad Rack `b', complete
Mentioned in SAL [2016-08-04T21:50:14Z] <urandom> T140825,T140869: Rolling restart of Cassandra instances, eqiad Rack `d'

