The Cloudelastic JVMs are suffering from erratic GC behavior that slows down the whole cluster and, in turn, slows consumption of the production MW JobQueues.
We should alert when the number of old-generation GC operations exceeds a critical threshold; 100 old GC operations per hour seems a good value for raising a critical alert.
The Prometheus metric is elasticsearch_jvm_gc_collection_seconds_count{gc="old"} (used in https://grafana.wikimedia.org/d/000000462/elasticsearch-memory).
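
As a rough sketch of the alert condition, the expression below uses the metric named above and the proposed threshold of 100 old GCs per hour. It assumes the metric is a monotonically increasing counter and that the alert should be evaluated per time series (i.e. per JVM) rather than summed across the cluster. Any label selector needed to restrict it to Cloudelastic hosts is left out, since the label names are not given here.

```
# Number of old-generation collections over the last hour, per series.
# increase() accounts for counter resets (e.g. a JVM restart).
# The threshold of 100 is the value proposed in this task, not a tuned one.
increase(elasticsearch_jvm_gc_collection_seconds_count{gc="old"}[1h]) > 100
```

If a cluster-wide figure turns out to be more useful, the same expression could be wrapped in sum(), but the threshold would then likely need revisiting.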