
Make icinga monitoring more relevant
Closed, Duplicate (Public)

Description

From https://wikitech.wikimedia.org/wiki/Incident_documentation/20150615-Elasticsearch in part:

  • Icinga should detect GC death spirals

(P782)

  • Icinga should monitor the state of a node within the cluster itself and not just overall cluster health

(https://github.com/elastic/elasticsearch/issues/6801)

  • Icinga should probably alert on yellow cluster mode as well
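A minimal sketch of what the last two bullets could look like as an Icinga/Nagios-style check, not the check actually deployed. It assumes the node answers on `http://localhost:9200` and that querying `_cluster/health` with `local=true` on every node is an acceptable way to get each node's own view of the cluster; the function names (`health_to_nagios`, `fetch_health`, `main`) are made up for illustration.

```python
import json
import urllib.request

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def health_to_nagios(health):
    """Map a parsed _cluster/health response to (exit_code, message).

    Yellow is reported as WARNING instead of being treated as healthy.
    """
    status = health.get("status")
    detail = "status={} nodes={} unassigned_shards={}".format(
        status, health.get("number_of_nodes"), health.get("unassigned_shards"))
    if status == "green":
        return OK, "OK - " + detail
    if status == "yellow":
        return WARNING, "WARNING - " + detail
    if status == "red":
        return CRITICAL, "CRITICAL - " + detail
    return UNKNOWN, "UNKNOWN - " + detail


def fetch_health(base_url="http://localhost:9200", local=True, timeout=10):
    """Fetch cluster health from one node.

    base_url is an assumption. With local=True the node answers from its
    own cluster state rather than asking the elected master, so a node
    that has dropped out of the cluster can be noticed individually.
    """
    url = base_url + "/_cluster/health" + ("?local=true" if local else "")
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def main():
    """Run the check once; a node that does not answer is UNKNOWN."""
    try:
        code, message = health_to_nagios(fetch_health())
    except Exception as exc:  # connection refused, timeout, bad JSON
        code, message = UNKNOWN, "UNKNOWN - cannot query node: {}".format(exc)
    print(message)
    return code
```

Run per node (for example via NRPE) and exit with `sys.exit(main())`. Whether an unreachable node should be UNKNOWN or CRITICAL is a policy choice left open here.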

Conclusions

  • ES monitoring does not properly reflect the state of the cluster: it does not warn in yellow state, and general health monitoring was not enough to detect this particular case
  • ES topology could be improved, as suggested by several people: things like master nodes not being data nodes, and maybe decoupling more wiki searches?
  • Difficulty of testing ES Java configurations, such as GC settings
  • Ganglia tie-in for ES stats is error-prone and gets in the way during an outage
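One possible way to catch a GC death spiral, sketched under assumptions: sample each node's cumulative old-generation GC time twice and alert when the JVM spent most of the interval collecting. The field path follows the layout of `_nodes/stats/jvm` (`jvm.gc.collectors.old.collection_time_in_millis` and `jvm.timestamp`); the collector name `old`, the helper names, and the 0.5 / 0.8 thresholds are illustrative guesses, not values from the incident report.

```python
# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def sample_from_node_stats(node_stats):
    """Extract (timestamp_ms, cumulative_old_gc_ms) from one node's entry
    in a _nodes/stats/jvm response (layout assumed, see lead-in)."""
    jvm = node_stats["jvm"]
    old = jvm["gc"]["collectors"]["old"]
    return jvm["timestamp"], old["collection_time_in_millis"]


def gc_time_fraction(prev, curr):
    """Fraction of wall-clock time spent in old-gen GC between two samples.

    Returns None when the counter went backwards, which is taken to mean
    the JVM restarted between the samples.
    """
    elapsed_ms = curr[0] - prev[0]
    if elapsed_ms <= 0:
        raise ValueError("samples must be taken at increasing timestamps")
    gc_ms = curr[1] - prev[1]
    if gc_ms < 0:
        return None
    return gc_ms / elapsed_ms


def classify(fraction, warn=0.5, crit=0.8):
    """Turn a GC time fraction into (exit_code, message).

    warn and crit are hypothetical thresholds to be tuned per cluster.
    """
    if fraction is None:
        return UNKNOWN, "UNKNOWN - GC counter reset, JVM probably restarted"
    message = "{:.0%} of the interval spent in old-gen GC".format(fraction)
    if fraction >= crit:
        return CRITICAL, "CRITICAL - " + message
    if fraction >= warn:
        return WARNING, "WARNING - " + message
    return OK, "OK - " + message
```

The check has to keep the previous sample between runs (a small state file would do), since the counters are cumulative since JVM start.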

Event Timeline

chasemp raised the priority of this task to Medium.
chasemp updated the task description.

This seems mostly (but not entirely) a duplicate of T133844: Improve Elasticsearch icinga alerting; I'm going to merge these two tasks together. If someone disagrees, feel free to unmerge and specify how they're different. :-)