
Alert when a JVM hits more than 100 old GC ops/hour
Closed, Resolved, Public

Description

Cloudelastic JVMs are suffering from erratic GC behavior that slows down the whole cluster and therefore slows consumption of the production MW JobQueues.

We should alert when the number of old GC operations hits a critical threshold; 100 old GC operations per hour seems a good value for raising a critical alert.
The prometheus metric is elasticsearch_jvm_gc_collection_seconds_count{gc="old"} (used in https://grafana.wikimedia.org/d/000000462/elasticsearch-memory)
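
As an illustration of the proposed condition, here is a minimal Python sketch. It is not the Icinga check from the linked patch. It assumes the counter has already been scraped into (timestamp, value) samples; the function names, the sample format, and the reset handling are hypothetical. The idea is close to what PromQL's increase() over a 1h range computes, without the extrapolation Prometheus applies.

```python
from typing import Sequence, Tuple

# Hypothetical sample format: (unix timestamp in seconds, counter value)
Sample = Tuple[float, float]


def counter_increase(samples: Sequence[Sample]) -> float:
    """Sum the increase of a counter across ordered samples, tolerating resets."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # The counter dropped, so assume the JVM restarted and the
            # counter started again from zero.
            total += curr
    return total


def old_gc_alert(
    samples: Sequence[Sample],
    now: float,
    window_s: float = 3600.0,   # one hour, as proposed in the task
    critical: float = 100.0,    # threshold proposed in the task
) -> bool:
    """Return True when more than `critical` old GC runs occurred in the window."""
    recent = [s for s in samples if now - window_s <= s[0] <= now]
    return counter_increase(recent) > critical
```

The strict comparison follows the task title ("more than 100"), so exactly 100 collections in the window does not raise the alert.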

Event Timeline

dcausse created this task. Aug 29 2019, 7:10 AM
Restricted Application added a subscriber: Aklapper. Aug 29 2019, 7:10 AM
dcausse triaged this task as High priority. Aug 29 2019, 7:10 AM
dcausse moved this task from needs triage to Ops / SRE on the Discovery-Search board.
dcausse added a subscriber: Mathew.onipe.

Change 533189 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] icinga: add old JVM GC check

https://gerrit.wikimedia.org/r/533189

On another note, I think this check makes sense for other clusters as well.

Change 533189 merged by Gehel:
[operations/puppet@production] icinga: add old JVM GC check for elastic

https://gerrit.wikimedia.org/r/533189

debt closed this task as Resolved. Sep 5 2019, 6:24 PM