Improve Elasticsearch icinga alerting
Closed, Declined · Public

Assigned To

None

Authored By

	Gehel
	Apr 28 2016, 8:26 AM

Description

Currently, the check_elasticsearch_shards icinga command is used to notify of Elasticsearch outages. This check has several limitations:

It runs against each Elasticsearch node but reports on the global cluster state, so a failure sends a flood of alerts.
It raises an alert only when a given percentage of shards is in error, as a way to avoid warning during a reindex.
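
To make the second limitation concrete, here is a minimal sketch of a percentage-based shard check. It is an illustration only, not the actual check_elasticsearch_shards code: the function name `shard_check` and both thresholds are assumptions, while the shard counters are fields of the `_cluster/health` response.

```python
def shard_check(health, warn_pct=1.0, crit_pct=5.0):
    """Classify cluster state by the percentage of inactive shards.

    `health` is the parsed JSON body of GET /_cluster/health.
    The thresholds are hypothetical and only illustrate the idea.
    Returns an (exit_code, message) pair using the usual Icinga
    plugin codes: 0 = OK, 1 = WARNING, 2 = CRITICAL.
    """
    inactive = health["initializing_shards"] + health["unassigned_shards"]
    total = health["active_shards"] + inactive
    pct = 100.0 * inactive / total if total else 0.0
    if pct >= crit_pct:
        return 2, f"CRITICAL: {pct:.1f}% of shards inactive"
    if pct >= warn_pct:
        return 1, f"WARNING: {pct:.1f}% of shards inactive"
    return 0, f"OK: {pct:.1f}% of shards inactive"
```

With thresholds like these, the few initializing shards created by a reindex stay below the warning level, but so does a small index that is entirely unavailable.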

A better solution would be:

Do not run the check on every host; run it once against the service (search.svc.[codfw|eqiad].wmnet).
We access indexes by alias, so we can check that all indexes that have an alias are green.
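
The alias-based check proposed above could look roughly like the following sketch. It assumes the JSON output of the `_cat/aliases` and `_cat/indices` APIs (requested with `format=json`); the function names, the port in the usage note, and the index names used in any example are hypothetical, and this is not the patch that was later merged.

```python
import json
from urllib.request import urlopen


def fetch_cat(base_url, endpoint):
    """Fetch a _cat endpoint ("aliases" or "indices") as parsed JSON."""
    with urlopen(f"{base_url}/_cat/{endpoint}?format=json", timeout=10) as resp:
        return json.load(resp)


def non_green_aliased_indices(aliases, indices):
    """Return sorted (index, health) pairs for aliased indices that are not green."""
    aliased = {row["index"] for row in aliases}
    return sorted(
        (row["index"], row.get("health") or "unknown")
        for row in indices
        if row["index"] in aliased and row.get("health") != "green"
    )


def icinga_status(problems):
    """Map the list of problem indices to an Icinga (exit_code, message) pair."""
    if not problems:
        return 0, "OK: all aliased indices are green"
    details = ", ".join(f"{name}={health}" for name, health in problems)
    return 2, f"CRITICAL: aliased indices not green: {details}"
```

A single run against the service address would then be something like `icinga_status(non_green_aliased_indices(fetch_cat(url, "aliases"), fetch_cat(url, "indices")))`, where `url` is assumed to be of the form `http://search.svc.eqiad.wmnet:9200`. Indices without an alias, such as one still being built during a reindex, are ignored.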

Details

	Subject	Repo	Branch	Lines +/-
	elasticsearch - check shards via the service, not via each individual node	operations/puppet	production	+22 -2


Related Objects

Status	Assigned	Task
Resolved	Gehel	T109089 EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade)
Duplicate	None	T109117 Make icinga monitoring more relevant
Declined	None	T133844 Improve Elasticsearch icinga alerting

Event Timeline

Gehel created this task. Apr 28 2016, 8:26 AM

Restricted Application added a subscriber: Aklapper. Apr 28 2016, 8:26 AM

fgiunchedi triaged this task as Medium priority. Apr 28 2016, 10:47 AM

Gehel renamed this task from Improve Elasticsearch icinga altering to Improve Elasticsearch icinga alerting. Apr 28 2016, 1:28 PM

Gehel added a parent task: T109117: Make icinga monitoring more relevant.

@Gehel How do you recommend prioritising this?

I'd say this is fairly low priority. Elasticsearch has good monitoring, so it is not urgent to improve it significantly. I'd like to keep this task for my next "kitchen hackathon", as it seems to be a good introduction to our infrastructure.

Gehel lowered the priority of this task from Medium to Low. May 9 2016, 11:52 AM

Gehel added a project: good first task. Aug 2 2016, 9:24 AM

Restricted Application added a subscriber: TerraCodes. Aug 2 2016, 9:24 AM

Keeping track of curl elastic1030.eqiad.wmnet:9200/_cluster/stats | jq .indices.completion.size_in_bytes could help diagnose issues with the completion suggester rebuild.
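
For reference, the same value can be read without jq. This is a minimal sketch: the function names are hypothetical, and plain HTTP on port 9200 is assumed from the curl command above.

```python
import json
from urllib.request import urlopen


def fetch_cluster_stats(base_url):
    """Return the parsed JSON body of GET /_cluster/stats."""
    with urlopen(f"{base_url}/_cluster/stats", timeout=10) as resp:
        return json.load(resp)


def completion_size_bytes(stats):
    """Extract the same field as jq's .indices.completion.size_in_bytes."""
    return stats["indices"]["completion"]["size_in_bytes"]
```

For example, `completion_size_bytes(fetch_cluster_stats("http://elastic1030.eqiad.wmnet:9200"))` returns the size as an integer, which could be recorded over time to see how the value moves around a suggester rebuild.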

This is generating unwanted noise, let's raise the priority

Raising this task's priority:

Getting a storm of alerts like the one we get now on cluster failures is distracting, especially when the failure is caused by other major events at the datacenter (such as network maintenance like today, or random network failures) that also need troubleshooting/debugging. We've seen it twice now during the cr1-eqiad/cr2-eqiad JunOS upgrades and, fortunately, this was planned work, so we were around and aware of what the root cause might have been.

Moreover, if we had one service check like the task above suggests, we could also make it paging, thus improving our response times (for example on weekends).

Change 305519 had a related patch set uploaded (by Gehel):
Elasticsearch - check shards via the service, not via each individual node

https://gerrit.wikimedia.org/r/305519

gerritbot added a project: Patch-For-Review. Aug 18 2016, 3:31 PM

Gehel mentioned this in rOPUPb1fded6f6fbb: Elasticsearch - check shards via the service, not via each individual node. Aug 18 2016, 3:33 PM

Change 305519 merged by Gehel:
elasticsearch - check shards via the service, not via each individual node

https://gerrit.wikimedia.org/r/305519

Gehel mentioned this in T124542: Setup icinga alerts for discovery services. Sep 22 2016, 2:06 PM

@EBernhardson does not see as many alerts any more.

@Gehel Do you consider this resolved? If not, can you give details on some of the outstanding work? Thanks!

The first part of checking cluster state against the service address (search.svc...) is done, but the check of aliases is not there yet.

• Deskana mentioned this in T109117: Make icinga monitoring more relevant. Nov 3 2016, 10:23 PM

• Deskana merged a task: T109117: Make icinga monitoring more relevant.

• Deskana added subscribers: • chasemp, dcausse, Krenair.

This remains a valid issue, but has not been touched in a while. Changing priority accordingly.

Framawiki moved this task from Backlog to Doing on the good first task board. Dec 2 2017, 1:34 PM

debt removed a project: Patch-For-Review. Jun 14 2018, 5:14 PM

• Phabricator_maintenance moved this task from Doing to Backlog on the good first task board. Sep 24 2018, 9:54 AM

• Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. Jan 26 2019, 8:38 PM

EBernhardson moved this task from search-icebox to Ops / SRE on the Discovery-Search board. Feb 14 2019, 10:10 PM

Gehel closed this task as Declined. Sep 8 2020, 7:09 PM
