From https://wikitech.wikimedia.org/wiki/Incident_documentation/20150615-Elasticsearch in part:
- Icinga should detect gc death spirals
(P782)
- Icinga should monitor the state of a node within the cluster itself and not just overall cluster health
(https://github.com/elastic/elasticsearch/issues/6801)
- Icinga should probably alert on yellow cluster mode as well
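
The checks listed above could be sketched roughly as follows. This is only an illustration, not an existing Icinga plugin: the function names and the 0.3 / 0.6 thresholds are hypothetical. It assumes the cluster status comes from the `status` field of `GET /_cluster/health` (queried per node rather than only once for the whole cluster), and that old-generation GC time comes from `jvm.gc.collectors.old.collection_time_in_millis` in `GET /_nodes/stats/jvm`, sampled twice. Exit codes follow the usual Nagios/Icinga plugin convention.

```python
# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def health_exit_code(status):
    """Map the cluster health status string to a plugin exit code.

    Yellow is reported as WARNING instead of being treated as healthy.
    """
    return {"green": OK, "yellow": WARNING, "red": CRITICAL}.get(status, UNKNOWN)


def gc_time_fraction(prev_ms, curr_ms, interval_s):
    """Fraction of the sampling interval a node spent in old-generation GC.

    prev_ms and curr_ms are two samples of the cumulative GC time counter.
    Returns None if the interval is invalid or the counter went backwards
    (for example after a node restart).
    """
    if interval_s <= 0 or curr_ms < prev_ms:
        return None
    return (curr_ms - prev_ms) / (interval_s * 1000.0)


def gc_exit_code(fraction, warn=0.3, crit=0.6):
    """Turn a GC time fraction into an exit code.

    The warn/crit thresholds are hypothetical and would need tuning.
    """
    if fraction is None:
        return UNKNOWN
    if fraction >= crit:
        return CRITICAL
    if fraction >= warn:
        return WARNING
    return OK
```

A node that accumulated 30 seconds of old-generation GC time during a 60 second sampling window gives `gc_time_fraction(1000, 31000, 60) == 0.5`, which the hypothetical thresholds above would report as WARNING.
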
Conclusions
- ES monitoring does not reflect properly the state of the cluster: it does not warn in yellow state, and general health monitoring was not enough to detect this particular case
- ES topology could be improved, as suggested by several people: things like master nodes not being data nodes, and maybe decoupling more wiki searches?
- Difficulty of testing ES java configurations, such as gc settings
- Ganglia tie-in for ES stats is error-prone and gets in the way during an outage