Improve Elasticsearch icinga alerting
Closed, Declined (Public)

Description

Currently, the check_elasticsearch_shards icinga command is used to alert on Elasticsearch outages. This check has several limitations:

  • It runs against each Elasticsearch node but reports on the global cluster state, so a single failure sends a flood of alerts
  • It alerts only when a given percentage of shards is in error, as a way to avoid warning during a reindex

A better solution would be:

  • not run the check on all hosts, but against the service (search.svc.[codfw|eqiad].wmnet)
  • we access indexes by alias, so we can check that all indexes that have an alias are green
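The alias-based check proposed above could be sketched roughly as follows. This is only an illustration, not the check that was deployed: it assumes the service answers plain HTTP on port 9200 (the task names only the host), and the function names and exit-code mapping are hypothetical. It relies on the standard Elasticsearch `_aliases` and `_cluster/health?level=indices` endpoints.

```python
import json
from urllib.request import urlopen

# Assumption: the task only names the service host; port 9200 and plain HTTP are guesses.
BASE_URL = "http://search.svc.eqiad.wmnet:9200"


def fetch_json(url, timeout=10):
    """Fetch a URL and decode its JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def non_green_aliased_indices(aliases, health):
    """Return {index: status} for every aliased index that is not green.

    `aliases` is the body of GET /_aliases and `health` the body of
    GET /_cluster/health?level=indices. Indices without any alias
    (for example one that is still being rebuilt) are ignored.
    """
    indices_health = health.get("indices", {})
    problems = {}
    for index, info in aliases.items():
        if not info.get("aliases"):
            continue  # nothing points at this index, so it is not serving traffic
        status = indices_health.get(index, {}).get("status", "unknown")
        if status != "green":
            problems[index] = status
    return problems


def check(base_url=BASE_URL):
    """Query the service once and return a (exit_code, message) pair (hypothetical helper)."""
    aliases = fetch_json(base_url + "/_aliases")
    health = fetch_json(base_url + "/_cluster/health?level=indices")
    problems = non_green_aliased_indices(aliases, health)
    if problems:
        details = ", ".join(
            f"{name}={status}" for name, status in sorted(problems.items())
        )
        return 2, "CRITICAL - aliased indices not green: " + details
    return 0, "OK - all aliased indices are green"
```

Because the check queries the service address once rather than every node, a cluster-wide failure would produce a single alert, and indices without an alias would not trigger it during a reindex.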

Event Timeline

Gehel renamed this task from Improve Elasticsearch icinga altering to Improve Elasticsearch icinga alerting. Apr 28 2016, 1:28 PM

I'd say this is fairly low priority. Elasticsearch already has good monitoring, so improving it significantly is not urgent. I'd like to keep this task for my next "kitchen hackathon", as it seems to be a good introduction to our infrastructure.

Gehel lowered the priority of this task from Medium to Low. May 9 2016, 11:52 AM

Keeping track of curl elastic1030.eqiad.wmnet:9200/_cluster/stats | jq .indices.completion.size_in_bytes could help diagnose issues with the completion suggester rebuild.
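A minimal sketch of tracking that value over time, under stated assumptions: the host, port, and JSON path come from the command above, while the function names and the idea of comparing two successive samples are hypothetical illustrations, not an existing check.

```python
import json
from urllib.request import urlopen

# Host, port and path are taken from the curl command above.
STATS_URL = "http://elastic1030.eqiad.wmnet:9200/_cluster/stats"


def completion_size_bytes(stats):
    """Extract the same field as `jq .indices.completion.size_in_bytes`."""
    return stats["indices"]["completion"]["size_in_bytes"]


def relative_change(previous, current):
    """Return the fractional change between two samples, or None if there is no baseline."""
    if previous == 0:
        return None
    return (current - previous) / previous


def sample(url=STATS_URL, timeout=10):
    """Fetch cluster stats once and return the completion size in bytes."""
    with urlopen(url, timeout=timeout) as response:
        return completion_size_bytes(json.load(response))
```

Recording `sample()` at regular intervals and looking at `relative_change` between consecutive values would make a sudden drop or jump around a suggester rebuild easy to spot.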

Gehel raised the priority of this task from Low to High. Aug 18 2016, 2:50 PM

This is generating unwanted noise; let's raise the priority.

Raising this task's priority:

Getting a storm of alerts like the one we get now on cluster failures is distracting, especially when the failure is caused by other major events at the datacenter (such as network maintenance like today's, or random network failures) that also need troubleshooting and debugging. We've seen it twice now during the cr1-eqiad/cr2-eqiad JunOS upgrades. Fortunately, this was planned work, so we were around and aware of what the root cause might have been.

Moreover, if we had a single service check like the one this task suggests, we could also make it paging, which would improve our response times (for example on weekends).

Change 305519 had a related patch set uploaded (by Gehel):
Elasticsearch - check shards via the service, not via each individual node

https://gerrit.wikimedia.org/r/305519

Change 305519 merged by Gehel:
elasticsearch - check shards via the service, not via each individual node

https://gerrit.wikimedia.org/r/305519

@EBernhardson does not see as many alerts any more.

@Gehel Do you consider this resolved? If not, can you give details on some of the outstanding work? Thanks!

The first part of checking cluster state against the service address (search.svc...) is done, but the check of aliases is not there yet.

Deskana lowered the priority of this task from High to Low. May 4 2017, 5:25 PM
Deskana moved this task from needs triage to search-icebox on the Discovery-Search board.

This remains a valid issue, but has not been touched in a while. Changing priority accordingly.