Create Icinga check for failed shard allocation
Closed, Resolved, Public

Description

Following https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles-prometheus?panelId=64&fullscreen&orgId=1&from=now-30d&to=now&var-cluster=eqiad&var-smoothing=1&var-exported_cluster=search&edit, we discovered that a shard had been unassigned since 02/12/2018.

We should have Icinga alert us whenever a shard stays unassigned like this.
Querying /_cluster/allocation/explain should give us what we need. The check should run at low frequency. I'm proposing:
Freq: Daily or 24h
retry: 1h
tries: 3
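
For illustration, a minimal sketch of what such a check could look like, using only the Python standard library. This is not the code from the patches below. It assumes that a GET on /_cluster/allocation/explain with no request body returns HTTP 200 describing an unassigned shard when one exists, and HTTP 400 when there is nothing to explain. The function names (evaluate, fetch, main) and the default URL are hypothetical.

```python
import json
import urllib.error
import urllib.request

# Standard Nagios/Icinga plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def evaluate(status, body):
    """Map an allocation/explain response to (exit_code, message)."""
    if status == 200:
        # A 200 response describes a shard that could not be allocated.
        info = body.get("unassigned_info", {})
        message = "CRITICAL: shard {} of index {} is {} (reason: {})".format(
            body.get("shard"),
            body.get("index"),
            body.get("current_state", "unassigned"),
            info.get("reason", "unknown"),
        )
        return CRITICAL, message
    if status == 400:
        # Assumption: with an empty request, a 400 means that there are
        # no unassigned shards to explain.
        return OK, "OK: no unassigned shards to explain"
    return UNKNOWN, "UNKNOWN: unexpected HTTP status {}".format(status)


def fetch(base_url, timeout=10):
    """Query the explain endpoint and return (http_status, parsed_json)."""
    url = base_url.rstrip("/") + "/_cluster/allocation/explain"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, json.load(resp)
    except urllib.error.HTTPError as err:
        # Error responses normally still carry a JSON body.
        try:
            return err.code, json.load(err)
        except ValueError:
            return err.code, {}


def main(argv):
    """Run the check once; print a status line and return the exit code."""
    base_url = argv[1] if len(argv) > 1 else "http://localhost:9200"
    try:
        code, message = evaluate(*fetch(base_url))
    except OSError as err:  # connection refused, timeout, DNS failure, ...
        code, message = UNKNOWN, "UNKNOWN: {}".format(err)
    print(message)
    return code
```

To use it as a plugin, a wrapper script would end with `sys.exit(main(sys.argv))`. The proposed interval, retry interval, and number of tries belong in the Icinga service definition rather than in the script itself.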

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change 482297 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] Elasticsearch failed shard allocation check

https://gerrit.wikimedia.org/r/482297

Change 482297 merged by Gehel:
[operations/puppet@production] Elasticsearch failed shard allocation check

https://gerrit.wikimedia.org/r/482297

This check has been deployed for the main cirrus clusters (eqiad+codfw).

We still need to add it for:

  • psi / omega cirrus clusters
  • relforge
  • logstash

Change 484679 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] icinga: enable check for psi and omega cluster

https://gerrit.wikimedia.org/r/484679

Change 484685 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] icinga: enable check for logstash

https://gerrit.wikimedia.org/r/484685

Change 484679 merged by Gehel:
[operations/puppet@production] icinga: enable check for psi and omega clusters

https://gerrit.wikimedia.org/r/484679

Change 488453 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] elasticsearch: fixed duplicated check description

https://gerrit.wikimedia.org/r/488453

Change 488485 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] icinga: enable check for psi and omega clusters

https://gerrit.wikimedia.org/r/488485

Change 488485 merged by Gehel:
[operations/puppet@production] icinga: enable check for psi and omega clusters

https://gerrit.wikimedia.org/r/488485

Change 489154 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] icinga: enable check for psi and omega clusters

https://gerrit.wikimedia.org/r/489154

Change 489154 merged by Gehel:
[operations/puppet@production] icinga: enable check for psi and omega clusters

https://gerrit.wikimedia.org/r/489154

Change 489765 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] elasticsearch: unassigned shard icinga check

https://gerrit.wikimedia.org/r/489765

Change 489765 merged by Gehel:
[operations/puppet@production] elasticsearch: unassigned shard icinga check

https://gerrit.wikimedia.org/r/489765

Change 484685 abandoned by Mathew.onipe:
icinga: enable check for logstash

Reason:
was merged here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/489765

https://gerrit.wikimedia.org/r/484685

Change 488453 abandoned by Gehel:
elasticsearch: fixed duplicated check description

Reason:
already refactored

https://gerrit.wikimedia.org/r/488453