
Aberrant load on instances involved in recent bootstrap
Closed, Resolved, Public

Description

As part of T179422: Reshape RESTBase Cassandra clusters, restbase2002-{a,b} were bootstrapped into the Cassandra 3.x cluster. One or both of these bootstraps put the nodes they streamed from into a state of high load, eventually culminating in instance outages.

Restarting the instances resolved the outages, but the aberrant utilization continues.


19:19 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-a CQL 10.192.16.165:9042 on restbase2002 is OK: TCP OK - 0.036 second response time on 10.192.16.165 port 9042
19:57 -stashbot:#wikimedia-operations- T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422
20:00 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-b SSL 10.192.16.166:7001 on restbase2002 is OK: SSL OK - Certificate restbase2002-b valid until 2018-08-17 16:11:45 +0000 (expires in 275 days)
20:01 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-b service on restbase2002 is OK: OK - cassandra-b is active
23:01  <urandom> !log Decommissioning Cassandra, restbase1014-a.eqiad.wmnet (T179422)
23:01 -stashbot:#wikimedia-operations- T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422
23:52 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-b CQL 10.192.16.166:9042 on restbase2002 is OK: TCP OK - 0.036 second response time on 10.192.16.166 port 9042
03:19 -icinga-wm:#wikimedia-operations- PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
03:22 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 7.129 second response time on 10.192.16.162 port 9042
05:36 -icinga-wm:#wikimedia-operations- PROBLEM - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
05:38 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is OK: TCP OK - 7.340 second response time on 10.192.16.164 port 9042
06:11 -icinga-wm:#wikimedia-operations- PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
06:14 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 3.068 second response time on 10.192.16.162 port 9042
06:38 -icinga-wm:#wikimedia-operations- PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
06:39 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 1.068 second response time on 10.192.16.162 port 9042
07:19 -icinga-wm:#wikimedia-operations- PROBLEM - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
07:20 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.164 port 9042
07:41 -icinga-wm:#wikimedia-operations- PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
07:42 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 3.055 second response time on 10.192.16.162 port 9042
07:56 -icinga-wm:#wikimedia-operations- PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
07:58 -icinga-wm:#wikimedia-operations- PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.162 and port 9042: Connection refused
07:59 -icinga-wm:#wikimedia-operations- PROBLEM - cassandra-a service on restbase2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
08:04 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-a service on restbase2001 is OK: OK - cassandra-a is active
08:07 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-a valid until 2018-08-17 16:11:39 +0000 (expires in 275 days)
08:08 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.162 port 9042

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2017-11-15T09:19:33Z] <godog> restart cassandra on restbase2001-c - T180568

Mentioned in SAL (#wikimedia-operations) [2017-11-15T09:32:36Z] <godog> restart cassandra on restbase2002-b - T180568

Eevans triaged this task as High priority.Nov 15 2017, 3:27 PM
Eevans renamed this task from cassandra unresponsive on restbase2001-c to Abberant load on instances involved in recent bootstrap.Nov 15 2017, 3:37 PM
Eevans added a project: User-Eevans.
Eevans updated the task description.

Elevated load and GC time began when 2002-b started bootstrapping (after the bootstrap of 2002-a), but extended well beyond the completion of the bootstrap (> 10 hours).

Screenshot-2017-11-15 Grafana - Host overview.png

Screenshot-2017-11-15 Grafana - Cassandra.png

Pending compactions increased during the bootstraps (as expected), but have not dropped back to normal levels (a quick way to check this on an instance is sketched after the screenshots below).

Screenshot-2017-11-15 Grafana - Cassandra(1).png

Screenshot-2017-11-15 Grafana - cassandra-threadpools.png
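Pending and in-progress compaction tasks can also be checked directly on an instance with nodetool; a minimal sketch (on these multi-instance hosts the command needs to target the right instance, e.g. via its JMX port or a per-instance wrapper, which is an assumption here):

# pending and in-progress compactions, human-readable sizes
nodetool compactionstats -H
# CompactionExecutor pending tasks are also visible in the thread-pool stats
nodetool tpstats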

Eevans moved this task from Backlog to In-Progress on the User-Eevans board.
ema renamed this task from Abberant load on instances involved in recent bootstrap to Aberrant load on instances involved in recent bootstrap.Nov 15 2017, 3:53 PM

Mentioned in SAL (#wikimedia-operations) [2017-11-15T19:01:55Z] <urandom> Restarting Cassandra instances on restbase2001.codfw.wmnet (T180568)

Mentioned in SAL (#wikimedia-operations) [2017-11-15T20:52:52Z] <urandom> Restarting restbase2002-a.codfw.wmnet (T180568)

I found the following keyspaces (some of which serve quite high traffic) to be erroneously configured to use leveled compaction. Some of them have had a considerably elevated SSTable count since the time of the bootstrap (a query for auditing the configured strategy is sketched after the list).

commons_T_parsoid__ng_4iO7iu2BPZmvx5_7kxNNNqVP04
commons_T_parsoid__ngwEOh2HlqL6lr8QbNVKv1eSoWSp0
enwiki_T_mobile__ng3HeqOXXmkYfPizz4RPUR4OLXLds
enwiki_T_mobile__ngR6XB1sh6_FFo_mfX4oZA56vpD_w
others_T_parsoid__ng_4iO7iu2BPZmvx5_7kxNNNqVP04
others_T_parsoid__ngwEOh2HlqL6lr8QbNVKv1eSoWSp0
wikipedia_T_mobile__ng3HeqOXXmkYfPizz4RPUR4OLXLd
wikipedia_T_mobile__ngR6XB1sh6_FFo_mfX4oZA56vpD_
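One way to audit which tables are configured for leveled compaction is to query the schema tables; a minimal sketch, assuming cqlsh can connect locally (production auth/SSL flags omitted):

cqlsh -e "SELECT keyspace_name, table_name, compaction FROM system_schema.tables;" | grep -i leveled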

I attempted an ALTER of the last two in this list (the wikipedia mobileapps tables) as a test, but due to the load they were under, the schemas remained in disagreement on the affected nodes for an extended period, so I restarted them (2001-{a,b,c} and 2002-b). After the restarts, the schema agreed, and the load returned to normal.
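For reference, the change itself is a per-table ALTER of the compaction strategy; a minimal sketch against one of the keyspaces above (the table name "data" follows the usual RESTBase convention and is an assumption here):

cqlsh> ALTER TABLE "wikipedia_T_mobile__ngR6XB1sh6_FFo_mfX4oZA56vpD_".data
   ... WITH compaction = {'class': 'SizeTieredCompactionStrategy'};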

It isn't entirely clear whether it was the ALTER or the restart that cleared the high utilization, but I strongly suspect it was the former, since these instances were restarted earlier today and immediately returned to a state of high load.
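Schema (dis)agreement of the kind described above can be confirmed per instance; a minimal sketch (again assuming the command is pointed at the right instance on these multi-instance hosts):

# lists the schema version UUID(s) and which endpoints hold each;
# more than one version means the cluster is in disagreement
nodetool describecluster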

Mentioned in SAL (#wikimedia-operations) [2017-11-15T22:10:59Z] <urandom> Setting keyspaces erroneously configured for leveled compaction, to use size-tiered (T180568)

Eevans claimed this task.

These nodes have been stable for some time now, so I'll close this ticket to signal that it should no longer be considered a blocker to T179421: Migrate revisions and restrictions from legacy to new storage.