
Monitor circuit breaker exceptions in Elasticsearch
Closed, Resolved | Public | 3 Estimated Story Points

Description

Elasticsearch 7 has expanded circuit breakers that reject updates and search requests that would exceed memory limits. We should have a Prometheus metric that monitors these rejections, and an alert so we know when requests are being rejected.

AC:

  • metric available in prometheus
  • alert is raised when circuit breakers are rejecting too many requests; the alert threshold is tuned to a sensible value
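
The second criterion can be illustrated with a minimal sketch, assuming the metric is a cumulative counter of rejected requests that is sampled periodically. The names `Sample`, `rejections_per_minute`, and `should_alert`, and the default threshold of 5 rejections per minute, are hypothetical. They are not taken from the patches on this task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One scrape of a rejection counter (hypothetical shape)."""
    timestamp_s: float  # scrape time, in seconds
    rejections: int     # cumulative count of rejected requests


def rejections_per_minute(older: Sample, newer: Sample) -> float:
    """Average rejection rate between two samples, per minute."""
    elapsed = newer.timestamp_s - older.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    increase = newer.rejections - older.rejections
    if increase < 0:
        # The counter went down, so assume it was reset (for example by a
        # restart) and count only what accumulated since the reset.
        increase = newer.rejections
    return increase * 60.0 / elapsed


def should_alert(older: Sample, newer: Sample,
                 threshold_per_minute: float = 5.0) -> bool:
    """True when the rejection rate exceeds the (assumed) threshold."""
    return rejections_per_minute(older, newer) > threshold_per_minute
```

Tuning then amounts to picking `threshold_per_minute` and the sampling window so that occasional rejections do not fire the alert while a sustained burst does.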

Event Timeline

Gehel triaged this task as High priority. (Sep 5 2022, 3:25 PM)
Gehel moved this task from needs triage to Current work on the Discovery-Search board.
Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.
Gehel set the point value for this task to 3.

Change 830239 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Add error classification group for memory errors

https://gerrit.wikimedia.org/r/830239

Change 830240 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] Add alert for CirrusSearch reported memory issues

https://gerrit.wikimedia.org/r/830240

Checked the errors we logged in the beta cluster during the initial 7.10 deployment. They are all quite consistent and seem to be parsed appropriately by the existing request error handlers. The attached patches classify these as a new type of failure, and then alert based on the recorded count of that failure type.

I've also updated the related Grafana dashboard so it will show this error type.
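
As a rough illustration of the classification step described in this comment, the sketch below maps a parsed Elasticsearch error response to a failure group. It is not the CirrusSearch implementation (which is PHP); the group names `memory_issues`, `other`, and `no_error` and the helper names are assumptions made for the example. The only details relied on from Elasticsearch are the `error.type` and `error.root_cause[].type` fields of an error body and the `circuit_breaking_exception` type named in the patches.

```python
from typing import Any, Dict, Iterator

# Error types treated as memory-related in this sketch (assumed set).
MEMORY_ERROR_TYPES = frozenset({"circuit_breaking_exception"})


def iter_error_types(response: Dict[str, Any]) -> Iterator[str]:
    """Yield the top-level error type, then each root cause type."""
    error = response.get("error")
    if not isinstance(error, dict):
        return
    top = error.get("type")
    if isinstance(top, str):
        yield top
    for cause in error.get("root_cause") or []:
        if isinstance(cause, dict) and isinstance(cause.get("type"), str):
            yield cause["type"]


def classify(response: Dict[str, Any]) -> str:
    """Return a hypothetical failure group name for a parsed response."""
    if "error" not in response:
        return "no_error"
    if any(t in MEMORY_ERROR_TYPES for t in iter_error_types(response)):
        return "memory_issues"
    return "other"


# Example payload in the general shape of a circuit breaker rejection.
example = {
    "error": {
        "type": "circuit_breaking_exception",
        "reason": "[parent] Data too large",
        "root_cause": [{"type": "circuit_breaking_exception"}],
    },
    "status": 429,
}
```

Counting how often `classify` returns the memory group gives the kind of per-failure-type count that an alert can be based on.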

Change 830239 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Add error classification group for memory errors

https://gerrit.wikimedia.org/r/830239

Change 865165 had a related patch set uploaded (by Ryan Kemper; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Correct classification of circuit_breaking_exception

https://gerrit.wikimedia.org/r/865165

Change 865165 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Correct classification of circuit_breaking_exception

https://gerrit.wikimedia.org/r/865165

Change 830240 merged by Ryan Kemper:

[operations/puppet@production] Add alert for CirrusSearch reported memory issues

https://gerrit.wikimedia.org/r/830240