When the Elasticsearch servers run out of heap memory they start intermittently triggering our current latency and old-gc/hr alerts, but those aren't particularly actionable because they can trigger for many other reasons. Reviewing our Elasticsearch Memory dashboards for instances that have recently had trouble, a few metrics look like they would more clearly identify an instance that needs to be rebooted, and possibly temp-banned from the cluster so its shards rebalance.
- p95 latency for all requests made between cirrus and elastic. When this alerts it tells us something might be wrong, but nothing about what.
- > 100 old GCs/hour. The current problem servers hold a steady state around 20-25 old GCs/hour and never trigger the alert. Should the threshold be lower?
Some metrics we could think about using instead:
- JVM Heap - survivor pool goes from varying up to a couple hundred MB to holding a solid value of 0. A possible alert: the survivor pool has held close to 0 for the last N hours.
- JVM Heap - old pool goes from a 1+ GB sawtooth to a ~10 MB sawtooth, almost flatlining against the max value. A possible alert: max-min over the last N hours is less than X MB.
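The two proposed conditions above could be evaluated as simple window checks over the pool metrics. A minimal sketch, assuming samples are `(timestamp, bytes)` pairs already fetched for the last N hours; the function names and thresholds here are illustrative placeholders, not from any existing alerting code:

```python
# Hypothetical checks for the two proposed heap-pool alerts.
# Samples are (timestamp, bytes) tuples covering the last N hours.

def survivor_pool_stuck(samples, near_zero_bytes=1 * 1024 * 1024):
    """Alert if the survivor pool has held close to 0 for the whole window.

    A healthy survivor pool varies up to a couple hundred MB; a stuck
    instance flatlines at ~0.
    """
    return all(value <= near_zero_bytes for _, value in samples)


def old_pool_flatlined(samples, min_sawtooth_bytes=100 * 1024 * 1024):
    """Alert if the old pool sawtooth has collapsed (max-min < X MB).

    A healthy old pool sawtooths over 1+ GB; a struggling one oscillates
    in a ~10 MB band pinned near the max heap size.
    """
    values = [value for _, value in samples]
    return (max(values) - min(values)) < min_sawtooth_bytes
```

Either check alone should be a strong signal, but requiring both to fire over the same window would further reduce false positives.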