Recently, we encountered a bug that caused a title to be re-rendered on each NRPE health check. As the row associated with this title grew wider and wider, read latency increased, as did memory allocation for the affected queries, eventually culminating in Cassandra OOM exceptions. There have been similar bugs in the past as well. We should invest effort in proactively alerting on such changes to storage.
Metrics of interest:
- Row size (tricky if we allow rows to grow unbounded; a static threshold is probably not sufficient)
- Column count (same as above, a static threshold will probably not work)
- Tombstones (can be grokked from logstash)
- Others?
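Since a static threshold is probably not sufficient for row size or column count, one option is to alert on growth relative to a recent baseline. A minimal sketch, assuming we already collect periodic samples of a metric per row or table; the function name `growth_alert` and the `window` and `ratio` defaults are hypothetical and would need tuning against real data:

```python
from statistics import median


def growth_alert(samples, window=5, ratio=2.0):
    """Return True if the latest sample is at least `ratio` times the
    median of the `window` samples that precede it.

    `samples` is a chronological list of measurements (e.g. row size in
    bytes or column count). All names and defaults here are illustrative
    assumptions, not an existing alerting API.
    """
    # Not enough history to form a baseline yet.
    if len(samples) < window + 1:
        return False

    # Baseline: median of the preceding window, excluding the latest sample.
    baseline = median(samples[-(window + 1):-1])
    latest = samples[-1]

    # Avoid dividing by zero for empty rows; any growth from zero alerts.
    if baseline <= 0:
        return latest > 0

    return latest / baseline >= ratio
```

A relative check like this tolerates rows that are legitimately large, but it can miss slow, steady growth; a longer window or a check for sustained monotonic increase may be needed for that case.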
References: