
reconfigure Icinga alert for elasticsearch_shard_size to reduce false positive alerts
Closed, Resolved · Public

Description

Following the recent reindexing, we observed that Icinga has been raising false positives for this check, for example during segment merges. We should reconfigure Icinga to reduce these false alerts.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Oct 4 2018, 2:10 AM

Checking Icinga again, I can see that something is wrong with the check timing: the check keeps running every two minutes or so.

Mathew.onipe triaged this task as Normal priority. · Oct 4 2018, 2:18 AM
Mathew.onipe added a subscriber: fgiunchedi.
Restricted Application edited projects, added Discovery-Search; removed Discovery-Search (Current work). · Oct 4 2018, 7:29 AM

Change 464570 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] icinga::monitor::elasticsearch: throttle alerts notification for check_elasticsearch_shard_size

https://gerrit.wikimedia.org/r/464570

Change 464570 merged by Gehel:
[operations/puppet@production] icinga::monitor::elasticsearch: throttle alerts notifications

https://gerrit.wikimedia.org/r/464570

After watching the trend of this check for about a week, I found that shard sizes for wikis like enwiki, wikidatawiki and cebwiki regularly grow beyond the warning threshold but never reach the critical threshold, and some of them then drop back below the warning threshold.

The throttling was clearly a good idea, but I suggest we also raise the warning and critical thresholds. Currently, warning is 35 GB and critical is 50 GB. I suggest we make warning 50 GB and critical 60 GB, so that if any index hits the warning threshold and stays there for a while (a week), an in-place reindexing should follow immediately.
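To illustrate the effect of the proposed change, here is a minimal sketch of how a warning/critical threshold pair classifies shard sizes, using the usual Nagios/Icinga plugin exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL). It is not the actual check plugin: the function names, index names and sizes are hypothetical, and treating "GB" as 1024^3 bytes is an assumption.

```python
from enum import IntEnum

# Assumption: "GB" in the thresholds means 1024**3 bytes.
GIB = 1024 ** 3


class Status(IntEnum):
    """Standard Nagios/Icinga plugin exit codes."""
    OK = 0
    WARNING = 1
    CRITICAL = 2


def classify_shard_size(size_bytes, warning_gb, critical_gb):
    """Classify one shard size against the warning/critical thresholds."""
    if size_bytes >= critical_gb * GIB:
        return Status.CRITICAL
    if size_bytes >= warning_gb * GIB:
        return Status.WARNING
    return Status.OK


def worst_status(shard_sizes, warning_gb, critical_gb):
    """Return the most severe status across all shards (OK if there are none)."""
    return max(
        (classify_shard_size(size, warning_gb, critical_gb)
         for size in shard_sizes.values()),
        default=Status.OK,
    )


# Hypothetical index names and shard sizes, for illustration only.
sizes = {
    "enwiki_content": 42 * GIB,
    "cebwiki_content": 37 * GIB,
    "smallwiki_content": 3 * GIB,
}

print(worst_status(sizes, warning_gb=35, critical_gb=50).name)  # WARNING
print(worst_status(sizes, warning_gb=50, critical_gb=60).name)  # OK
```

With the current thresholds (35/50) the hypothetical 42 GB shard keeps the check in WARNING; with the proposed thresholds (50/60) the same data is OK, and a warning would only appear once a shard reaches 50 GB.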

Gehel added a comment. · Oct 15 2018, 8:19 AM

I think the proposal makes sense. This check is here so that we don't forget to reshard when needed, but there isn't a hard limit on the maximum shard size (well, there is the overall disk space, but we're going to be in trouble well before that). The main goal is to get a low-priority alert when things are climbing too high, and "too high" isn't well defined. So we have some latitude as to what limit we want to set.

The main point is that we should ensure that this check does not flap too much, and does not alert us too early.

In short: I think W=50GB and C=60GB is fine.

Change 467322 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] elasticsearch: modify thresholds for icinga check shard size plugin

https://gerrit.wikimedia.org/r/467322

Change 467322 merged by Gehel:
[operations/puppet@production] elasticsearch: modify thresholds for icinga check shard size plugin

https://gerrit.wikimedia.org/r/467322

debt closed this task as Resolved. · Oct 19 2018, 2:43 PM