Alert on abnormal storage growth patterns
Open, Low, Public

Assigned To

None

Authored By

	Eevans
	Nov 12 2015, 11:22 PM

Description

Recently, we encountered a bug that caused a title to be re-rendered on every NRPE health check. As the row associated with this title grew wider and wider, read latency increased, as did memory allocation for the affected queries, eventually culminating in Cassandra OOM exceptions. There have been similar bugs in the past. We should invest effort in proactively alerting on such changes to storage.

Metrics of interest:

Row size (tricky if we allow rows to grow unbounded; a static threshold is probably not sufficient)
Column count (same as above, a static threshold will probably not work)
Tombstones (can be grokked from logstash)
Others?
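
Since a static threshold probably won't work for row size or column count, one option is to alert on growth relative to a metric's own recent history. A minimal sketch in Python, assuming we already have periodic samples of the metric for a given table or partition; the function names, the window of 6 samples, and the 2x ratio are all illustrative assumptions, not an existing check:

```python
from statistics import median


def growth_ratio(samples, window=6):
    """Return the latest sample divided by the median of the `window`
    samples that precede it, or None if that cannot be computed."""
    if len(samples) < window + 1:
        return None  # not enough history to establish a baseline
    baseline = median(samples[-(window + 1):-1])
    if baseline <= 0:
        return None  # avoid dividing by an empty or zero baseline
    return samples[-1] / baseline


def should_alert(samples, window=6, max_ratio=2.0):
    """True when the latest sample exceeds `max_ratio` times the baseline.

    `max_ratio` is a hypothetical tuning knob; a real check would need
    values chosen from observed growth on the cluster.
    """
    ratio = growth_ratio(samples, window)
    return ratio is not None and ratio > max_ratio


if __name__ == "__main__":
    steady = [100, 102, 101, 103, 102, 104, 105]
    runaway = [100, 102, 101, 103, 102, 104, 400]
    print(should_alert(steady))
    print(should_alert(runaway))
```

Using the median of the preceding window as the baseline lets rows grow without bound over time while still flagging a sudden jump, and keeps a single earlier spike from shifting the baseline much.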

References:

Related Objects

Mentioned In: T116861: Investigate OOM and elevated read latencies on 1007
Mentioned Here: T105509: Secondary updates create hundreds of unnecessary writes per second
T116739: restbase endpoints health checks timing out
T116861: Investigate OOM and elevated read latencies on 1007

Event Timeline

Eevans created this task. (Nov 12 2015, 11:22 PM)

Eevans raised the priority of this task to Medium.

Eevans updated the task description.

Eevans added a project: RESTBase.

Eevans subscribed.

Restricted Application added a subscriber: Aklapper. (Nov 12 2015, 11:22 PM)

Eevans mentioned this in T116861: Investigate OOM and elevated read latencies on 1007. (Nov 12 2015, 11:26 PM)

fgiunchedi subscribed. (Nov 16 2015, 10:43 AM)

Eevans renamed this task from "alert on abnormal storage growth patterns" to "Alert on abnormal storage growth patterns". (Apr 29 2016, 8:39 PM)

Eevans added a project: Cassandra.

Eevans moved this task from Backlog to Next on the Cassandra board. (Aug 15 2016, 8:19 PM)

GWicke added a project: Services. (Oct 12 2016, 11:19 PM)

Eevans moved this task from Next to Backlog on the Cassandra board. (Nov 29 2016, 9:30 PM)

GWicke moved this task from Backlog to later on the Services board. (Jul 11 2017, 8:03 PM)

GWicke edited projects, added Services (later); removed Services.

GWicke moved this task from later to designing on the Services board.

GWicke edited projects, added Services (designing); removed Services (later).

mobrovac added a project: Platform Team Legacy (Designing). (Dec 20 2018, 12:55 PM)

hnowlan moved this task from Backlog to In-Progress on the Cassandra board. (Aug 26 2021, 1:02 PM)

Eevans moved this task from In-Progress to Backlog on the Cassandra board. (May 6 2022, 4:05 PM)

Eevans lowered the priority of this task from Medium to Low. (Sep 19 2023, 8:04 PM)
