Investigate why cassandra per-article-daily oozie jobs fail regularly
Closed, Resolved · Public · 5 Estimated Story Points

Event Timeline

JAllemandou renamed this task from Investigate why cassandra per-article-daily oozie jobs fail regularly to 5Investigate why cassandra per-article-daily oozie jobs fail regularly.Jul 20 2016, 8:42 AM
JAllemandou renamed this task from 5Investigate why cassandra per-article-daily oozie jobs fail regularly to Investigate why cassandra per-article-daily oozie jobs fail regularly.
JAllemandou moved this task from Next Up to In Progress on the Analytics-Kanban board.
JAllemandou set the point value for this task to 5.

Reloading of 2016-07-19 worked, and 2016-07-20 then worked straight away. Weird.

Noticed this error from logstash:

Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 1 responses.
	at org.apache.cassandra.auth.Auth.selectUser(Auth.java:276) ~[apache-cassandra-2.1.13.jar:2.1.13]
	at org.apache.cassandra.auth.Auth.isSuperuser(Auth.java:97) ~[apache-cassandra-2.1.13.jar:2.1.13]
	at org.apache.cassandra.auth.AuthenticatedUser.isSuper(AuthenticatedUser.java:50) ~[apache-cassandra-2.1.13.jar:2.1.13]
	at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:67) ~[apache-cassandra-2.1.13.jar:2.1.13]
	at org.apache.cassandra.auth.PermissionsCache$1.load(PermissionsCache.java:124) ~[apache-cassandra-2.1.13.jar:2.1.13]
	at org.apache.cassandra.auth.PermissionsCache$1.load(PermissionsCache.java:121) ~[apache-cassandra-2.1.13.jar:2.1.13]
	at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3524) ~[guava-16.0.jar:na]
	at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2317) ~[guava-16.0.jar:na]
	at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2280) ~[guava-16.0.jar:na]
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2195) ~[guava-16.0.jar:na]
	... 23 common frames omitted

Seems to be an issue with reads of the Auth table (the system_auth keyspace). From https://issues.apache.org/jira/browse/CASSANDRA-7894 it seems that we could tune the permissions_validity_in_ms parameter so that permissions are re-read from Cassandra less often.
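For context, permissions_validity_in_ms lives in cassandra.yaml and controls how long authorization results are cached before the system_auth tables are queried again; in Cassandra 2.1 the default is 2000 ms. A minimal sketch of the setting (the raised value shown here is illustrative, not the value we settled on yet):

    # cassandra.yaml (Cassandra 2.1)
    # How long permissions are cached before being re-read from the
    # system_auth tables (default: 2000 ms). Raising it reduces auth
    # reads on the hot path, at the cost of permission changes taking
    # longer to propagate to connected clients.
    permissions_validity_in_ms: 30000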

Change 301083 had a related patch set uploaded (by Elukey):
Add the permissions_validity_in_ms among the configurable parameters

https://gerrit.wikimedia.org/r/301083

Change 301083 merged by Elukey:
Add the permissions_validity_in_ms among the configurable parameters

https://gerrit.wikimedia.org/r/301083

Change 301365 had a related patch set uploaded (by Elukey):
Increase AQS default auth permission caching to 30s.

https://gerrit.wikimedia.org/r/301365

Change 301365 merged by Elukey:
Increase AQS default auth permission caching to 30s.

https://gerrit.wikimedia.org/r/301365

Mentioned in SAL [2016-07-27T13:29:43Z] <elukey> Restart Cassandra on aqs100[123] to apply the latest configuration (T140869)

Next step is to wait for a few rounds of loading jobs to see whether this change has the desired impact.

Change 301780 had a related patch set uploaded (by Elukey):
Raise Cassandra auth caching to 10 minutes

https://gerrit.wikimedia.org/r/301780

Change 301780 merged by Elukey:
Raise Cassandra auth caching to 10 minutes

https://gerrit.wikimedia.org/r/301780

Mentioned in SAL [2016-07-29T10:23:52Z] <elukey> restarting cassandra on aqs100[123] to apply the latest config (https://gerrit.wikimedia.org/r/#/c/301780/1 - T140869)

So far the results are really good: both latencies (especially the per-article mean, p75 and p99) and READ timeouts have decreased a lot since we applied the patch (~27 Jul at 13:00 UTC):

Screen Shot 2016-07-29 at 1.08.10 PM.png (attached screenshot)

Screen Shot 2016-07-29 at 1.08.02 PM.png (attached screenshot)

This seems to indicate that we are going in the right direction. I increased the auth caching to 10 minutes to see if we can get even more performance improvements.
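For reference, a sketch of the resulting cassandra.yaml value on the AQS hosts after this change (the progression of values comes from the changes above; the exact puppet/hiera key that renders this setting is not shown here):

    # permissions_validity_in_ms on aqs100[123]:
    #   2000 ms   - Cassandra 2.1 default, before any change
    #   30000 ms  - first bump to 30 seconds (change 301365)
    #   600000 ms - current value, 10 minutes (change 301780)
    permissions_validity_in_ms: 600000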

@elukey Interesting. We are seeing a few of these in the RESTBase cluster too, 32 over the last 12 hours, all on the same host. I'm not sure this is having the same impact there, but permissions changes are rare enough that I think I'll similarly raise the validity period for that cluster just in case.

Thanks for the heads up!

Change 301878 had a related patch set uploaded (by Eevans):
Increase permissions validity on RESTBase cluster

https://gerrit.wikimedia.org/r/301878

Change 301878 merged by Elukey:
Increase permissions validity on RESTBase cluster

https://gerrit.wikimedia.org/r/301878

Mentioned in SAL [2016-08-04T17:14:51Z] <urandom> T140825,T140869: Restarting Cassandra, restbase1007-b.eqiad.wmnet

Mentioned in SAL [2016-08-04T17:17:37Z] <urandom> T140825,T140869: Restarting Cassandra, restbase1007-c.eqiad.wmnet

Mentioned in SAL [2016-08-04T19:37:28Z] <urandom> T140825,T140869: Restarting Cassandra, restbase1010-a.eqiad.wmnet

Mentioned in SAL [2016-08-04T19:40:07Z] <urandom> T140825,T140869: Restarting Cassandra, restbase1010-b.eqiad.wmnet

Mentioned in SAL [2016-08-04T19:42:22Z] <urandom> T140825,T140869: Restarting Cassandra, restbase1010-c.eqiad.wmnet

Mentioned in SAL [2016-08-04T20:03:38Z] <urandom> T140825,T140869: Performing Cassandra instance rolling restart of restbase1011.eqiad.wmnet

Mentioned in SAL [2016-08-04T20:10:55Z] <urandom> T140825,T140869: Cassandra instance restarts complete: restbase1011.eqiad.wmnet

Mentioned in SAL [2016-08-04T20:13:20Z] <urandom> T140825,T140869: Performing rolling restart of codfw Cassandra instances

Mentioned in SAL [2016-08-04T21:14:18Z] <urandom> T140825,T140869: Rolling restart of codfw Cassandra instances complete

Mentioned in SAL [2016-08-04T21:26:10Z] <urandom> T140825,T140869: Rolling restart of Cassandra instances, eqiad Rack `b'

Mentioned in SAL [2016-08-04T21:48:28Z] <urandom> T140825,T140869: Rolling restart of Cassandra instances, eqiad Rack `b', complete

Mentioned in SAL [2016-08-04T21:50:14Z] <urandom> T140825,T140869: Rolling restart of Cassandra instances, eqiad Rack `d'