Currently, the check_elasticsearch_shards icinga command is used to alert on Elasticsearch outages. This check has several limitations:
- it runs against each Elasticsearch node, but reports on the global cluster state, so a failure sends a flood of alerts (one per node)
- it alerts on the percentage of shards in error, as a workaround to avoid warning during a reindex
A better solution would be:
- run the check once against the service (search.svc.[codfw|eqiad].wmnet) instead of on every host
- since we access indexes by alias, check that every index that has an alias is green
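
A rough sketch of what such a check could look like, to make the proposal concrete. It relies on two standard Elasticsearch endpoints, `/_cat/aliases?format=json` and `/_cluster/health?level=indices`. The function names, the timeout, and the output messages are made up for illustration, and this is not the existing check_elasticsearch_shards code.

```python
import json
import urllib.request


def fetch_json(base_url, path, timeout=10):
    """GET a JSON document from the Elasticsearch HTTP API."""
    with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
        return json.load(resp)


def non_green_aliased_indices(aliases, health):
    """Return {index: status} for every aliased index that is not green.

    aliases: rows returned by GET /_cat/aliases?format=json
    health:  body returned by GET /_cluster/health?level=indices
    """
    aliased = {row["index"] for row in aliases}
    statuses = health.get("indices", {})
    bad = {}
    for name in sorted(aliased):
        # An aliased index missing from the health report is reported as unknown.
        status = statuses.get(name, {}).get("status", "unknown")
        if status != "green":
            bad[name] = status
    return bad


def main(base_url):
    """Run the check once against the service and return an icinga exit code."""
    aliases = fetch_json(base_url, "/_cat/aliases?format=json")
    health = fetch_json(base_url, "/_cluster/health?level=indices")
    bad = non_green_aliased_indices(aliases, health)
    if bad:
        details = ", ".join(f"{name}={status}" for name, status in bad.items())
        print(f"CRITICAL - aliased indices not green: {details}")
        return 2  # CRITICAL
    print("OK - all aliased indices are green")
    return 0  # OK
```

It could be called as `main("http://search.svc.eqiad.wmnet:9200")`; the scheme and port here are assumptions. Indexes without an alias (for example one being built during a reindex) are ignored, so no percentage threshold is needed.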