
Fix "requests triggering circuit breakers" Elastic alert
Open, In Progress, Medium, Public

Description

Per IRC conversation with @fgiunchedi, this Graphite-based alert has been in UNKNOWN state for the last 22 days as of this writing.

Creating this ticket to:

  • Investigate the root cause of the missing metrics.
  • Fix or remove this alert.

Event Timeline

bking changed the task status from Open to In Progress. Jan 24 2024, 3:40 PM
bking claimed this task.

Per a 1:1 with @dcausse, the health check's metric "MediaWiki.CirrusSearch.eqiad.backend_failure.memory_issue.count" has aged out of Graphite, as the alert conditions haven't been met for more than 6 months.

However, we still need to alert on these conditions. David suggested switching to a logstash alert, which seems reasonable. We will discuss this further with @EBernhardson and the rest of the team before taking action. In the meantime, I've silenced this alert for the next 21 days.

Thank you for the quick update. From a quick look, I concur that switching to logstash-based metrics/alerts is the right thing to do here; I believe the switch will actually cover two alerts from modules/role/manifests/elasticsearch/alerts.pp:

monitoring::graphite_threshold { 'search_backend_failure_count':
monitoring::graphite_threshold { 'search_backend_memory_issue_count':

The other alerts in that file, by contrast, are based on MediaWiki metrics themselves rather than on "events" we can infer from logs, is that correct?

Gehel removed bking as the assignee of this task. Jan 30 2024, 4:44 PM

Per IRC conversation with cwhite: "We usually generate metrics from logs then alert on those metrics. An example can be found in alerts/team-sre/mediawiki.yaml log_mediawiki_servergroup_level_channel_doc_count"
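
For reference, a minimal sketch of what such a log-derived alert rule could look like in the alerts repository, following the pattern cwhite describes. The label names (channel, level), thresholds, team, and annotations below are assumptions for illustration only, not the actual rule:

groups:
  - name: cirrussearch_backend_failures
    rules:
      - alert: CirrusSearchBackendFailuresHigh
        # Rate of MediaWiki log events counted by servergroup/level/channel,
        # derived from logs as in alerts/team-sre/mediawiki.yaml.
        # The channel/level selectors and the threshold are assumptions.
        expr: sum(rate(log_mediawiki_servergroup_level_channel_doc_count{channel="CirrusSearch", level="ERROR"}[5m])) > 1
        for: 10m
        labels:
          team: search-platform
          severity: warning
        annotations:
          summary: Elevated rate of CirrusSearch backend failures in logs
          description: CirrusSearch is logging backend failures (e.g. memory issues / circuit breakers) at an elevated rate.

A rule along these lines could replace both search_backend_failure_count and search_backend_memory_issue_count, per the discussion above.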

Gehel triaged this task as Medium priority. Feb 9 2024, 1:37 PM