
Fix "requests triggering circuit breakers" Elastic alert
Open, In Progress, Medium, Public

Description

Per IRC conversation with @fgiunchedi, this Graphite-based alert has been in UNKNOWN state for the last 22 days as of this writing.

Creating this ticket to:

  • Investigate the root cause of the missing metrics.
  • Fix or remove this alert.

Event Timeline

bking changed the task status from Open to In Progress. Jan 24 2024, 3:40 PM
bking claimed this task.

Per a 1:1 with @dcausse, the health check's metric "MediaWiki.CirrusSearch.eqiad.backend_failure.memory_issue.count" has aged out of Graphite, as the alert conditions haven't been met for more than 6 months.

However, we still need to alert on these conditions. David suggested switching to a logstash alert, which seems reasonable. We will discuss this further with @EBernhardson and the rest of the team before taking action. In the meantime, I've silenced this alert for the next 21 days.

Thank you for the quick update. From a quick look, I concur that switching to logstash-based metrics/alerts is the right thing to do here; I believe the switch will actually cover two alerts from modules/role/manifests/elasticsearch/alerts.pp:

monitoring::graphite_threshold { 'search_backend_failure_count':
monitoring::graphite_threshold { 'search_backend_memory_issue_count':

The other alerts in that file, by contrast, are based on MediaWiki metrics themselves rather than on "events" we can infer from logs, is that correct?

Gehel removed bking as the assignee of this task. Jan 30 2024, 4:44 PM

Per IRC conversation with cwhite: "We usually generate metrics from logs then alert on those metrics. An example can be found in alerts/team-sre/mediawiki.yaml log_mediawiki_servergroup_level_channel_doc_count"
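
For reference, a minimal sketch of what such a log-derived alert rule could look like in the alerts repository, following the pattern cwhite describes. The label names (channel, level), thresholds, team, and annotations below are assumptions for illustration only, not the actual rule:

groups:
  - name: cirrussearch_backend_failures
    rules:
      - alert: CirrusSearchBackendFailuresHigh
        # Rate of MediaWiki log events counted by servergroup/level/channel,
        # derived from logs as in alerts/team-sre/mediawiki.yaml.
        # The channel/level selectors and the threshold are assumptions.
        expr: sum(rate(log_mediawiki_servergroup_level_channel_doc_count{channel="CirrusSearch", level="ERROR"}[5m])) > 1
        for: 10m
        labels:
          team: search-platform
          severity: warning
        annotations:
          summary: Elevated rate of CirrusSearch backend failures in logs
          description: CirrusSearch is logging backend failures (e.g. memory issues / circuit breakers) at an elevated rate.

A rule along these lines could replace both search_backend_failure_count and search_backend_memory_issue_count, per the discussion above.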

Gehel triaged this task as Medium priority. Feb 9 2024, 1:37 PM