As part of T179422: Reshape RESTBase Cassandra clusters, restbase2002-{a,b} were bootstrapped into the Cassandra 3.x cluster. One or both of these bootstraps put the nodes they were bootstrapping from into a state of high load, eventually culminating in instance outages.
Restarting the instances resolved the outages, but the aberrant utilization continues.
19:19 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-a CQL 10.192.16.165:9042 on restbase2002 is OK: TCP OK - 0.036 second response time on 10.192.16.165 port 9042
19:57 -stashbot:#wikimedia-operations- T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422
20:00 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-b SSL 10.192.16.166:7001 on restbase2002 is OK: SSL OK - Certificate restbase2002-b valid until 2018-08-17 16:11:45 +0000 (expires in 275 days)
20:01 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-b service on restbase2002 is OK: OK - cassandra-b is active
23:01 <urandom> !log Decommissioning Cassandra, restbase1014-a.eqiad.wmnet (T179422)
23:01 -stashbot:#wikimedia-operations- T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422
23:52 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-b CQL 10.192.16.166:9042 on restbase2002 is OK: TCP OK - 0.036 second response time on 10.192.16.166 port 9042
03:19 -icinga-wm:#wikimedia-operations- PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
03:22 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 7.129 second response time on 10.192.16.162 port 9042
05:36 -icinga-wm:#wikimedia-operations- PROBLEM - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
05:38 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is OK: TCP OK - 7.340 second response time on 10.192.16.164 port 9042
06:11 -icinga-wm:#wikimedia-operations- PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
06:14 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 3.068 second response time on 10.192.16.162 port 9042
06:38 -icinga-wm:#wikimedia-operations- PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
06:39 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 1.068 second response time on 10.192.16.162 port 9042
07:19 -icinga-wm:#wikimedia-operations- PROBLEM - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
07:20 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.164 port 9042
07:41 -icinga-wm:#wikimedia-operations- PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
07:42 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 3.055 second response time on 10.192.16.162 port 9042
07:56 -icinga-wm:#wikimedia-operations- PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
07:58 -icinga-wm:#wikimedia-operations- PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.162 and port 9042: Connection refused
07:59 -icinga-wm:#wikimedia-operations- PROBLEM - cassandra-a service on restbase2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
08:04 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-a service on restbase2001 is OK: OK - cassandra-a is active
08:07 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-a valid until 2018-08-17 16:11:39 +0000 (expires in 275 days)
08:08 -icinga-wm:#wikimedia-operations- RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.162 port 9042