Alert when elasticsearch has shards larger than a maximum size
Closed, Resolved, Public

Description

We have a documented rule that shards on the cirrus cluster should be at most 30GB. When shards grow beyond this limit, relocating them becomes complicated, and we should increase the number of shards for the affected index. We have been surprised a few times by shards growing to as much as 70GB.
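
As a rough illustration of the rule above, the smallest shard count that keeps every shard at or under the limit can be estimated from the total index size. This is only a sketch: the function name `min_shard_count` is hypothetical, and it assumes data is spread evenly across shards, which real indices only approximate.

```python
import math


def min_shard_count(index_size_gb: float, max_shard_gb: float = 30.0) -> int:
    """Estimate the fewest shards that keep each shard <= max_shard_gb.

    Assumes an even distribution of data across shards (an assumption,
    not something the task states).
    """
    if index_size_gb < 0 or max_shard_gb <= 0:
        raise ValueError("index size must be >= 0 and the limit must be > 0")
    # An index always has at least one primary shard.
    return max(1, math.ceil(index_size_gb / max_shard_gb))


if __name__ == "__main__":
    # Hypothetical 350 GB index checked against the 30 GB rule.
    print(min_shard_count(350))
```

For a hypothetical 350 GB index this gives 12 shards, since 350 / 30 rounds up to 12.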

An Icinga check running at low frequency (once per day is enough) would help identify those shards in a timely fashion. This check should have warning and critical thresholds, and it should report the indices that have shards over the limit.
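
The check described above could be sketched roughly as follows. This is not the script from the patch: `evaluate_shards`, the tuple layout, and the message wording are assumptions made for illustration. The exit codes follow the usual Nagios/Icinga plugin convention (0 OK, 1 WARNING, 2 CRITICAL), and "over the limit" is interpreted here as strictly greater than the threshold.

```python
from typing import Iterable, List, Tuple

# Conventional Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

# (index name, shard number, size in GB); this layout is an assumption.
Shard = Tuple[str, int, float]


def evaluate_shards(
    shards: Iterable[Shard], warning_gb: float, critical_gb: float
) -> Tuple[int, str]:
    """Return (exit_code, message) listing shards above the warning threshold."""
    # Keep only shards over the warning threshold, largest first.
    oversized: List[Shard] = sorted(
        (s for s in shards if s[2] > warning_gb),
        key=lambda s: s[2],
        reverse=True,
    )
    if not oversized:
        return OK, "OK - all shards are within the size limit"
    # The largest shard decides the overall status.
    if oversized[0][2] > critical_gb:
        status, label = CRITICAL, "CRITICAL"
    else:
        status, label = WARNING, "WARNING"
    details = ", ".join(
        f"{index}:{shard} (size={size:g}gb)" for index, shard, size in oversized
    )
    return status, f"{label} - {details}"


if __name__ == "__main__":
    # Made-up sample data, not real cluster output.
    sample = [("index_a", 0, 49.0), ("index_b", 1, 27.0), ("index_c", 0, 10.0)]
    code, message = evaluate_shards(sample, warning_gb=25, critical_gb=40)
    print(message)
    raise SystemExit(code)
```

In a real check the shard sizes would have to be fetched from the cluster first, for example from Elasticsearch's `_cat/shards` API with `format=json` and `bytes=gb`. How the actual script obtains them is not stated in this task.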

Event Timeline

Restricted Application added a subscriber: Aklapper.
Gehel triaged this task as Medium priority. Sep 5 2018, 8:18 AM

Change 458891 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] elasticsearch shard size check * Checks shard size and sends alert if more than 30gb.

https://gerrit.wikimedia.org/r/458891

Output of testing the shard size check script on relforge:

onimisionipe@relforge1001:~/tests$ python3 shard_el.py --shard-size-warning 25 --shard-size-critical 40
CRITICAL - stas_wikidata_test:6 (size=49gb), stas_wikidata_test:5 (size=49gb), stas_wikidata_test:4 (size=49gb), stas_wikidata_test:3 (size=49gb), stas_wikidata_test:2 (size=49gb), stas_wikidata_test:1 (size=49gb), stas_wikidata_test:0 (size=49gb), commons_image_quality:14 (size=30gb), commons_image_quality:11 (size=30gb), commons_image_quality:10 (size=30gb), commons_image_quality:8 (size=29gb), commons_image_quality:2 (size=29gb), commons_image_quality:7 (size=27gb), commons_image_quality:6 (size=27gb), commons_image_quality:3 (size=27gb), commons_image_quality:0 (size=27gb), commons_image_quality:5 (size=26gb)

Change 458891 merged by Gehel:
[operations/puppet@production] Elasticsearch shard size check

https://gerrit.wikimedia.org/r/458891