Check SLO impact of Elastic cluster rolling restarts/mitigate if necessary
Open, High, Public

Description

We ran a cluster-wide restart of the eqiad production Elastic cluster yesterday, which triggered a few "CirrusBackendErrorRateTooHigh" alerts (Dashboard link). We just created this alert in T363609, as part of a post-incident follow-up. We therefore believe that these restarts were always disruptive; we just weren't monitoring their impact.

The alert triggers when more than 0.1% of requests to Elastic from the MW app servers fail over a 5m period, and we think it is a valuable alert. We intend to define SLOs around this value, so we need to either confirm that rolling operations won't blow through our SLO budget, or adjust our SLO metrics. If rolling operations are too disruptive, we need to figure out a way to reduce their impact. Creating this ticket to:

  • Confirm or deny that we do plan on setting an availability SLO target around the metric "more than 0.1% of requests to Elastic from the MW app servers fail over a 5m period". @RKemper should be able to confirm/deny.
  • If it's too disruptive, figure out a way to make rolling restarts less disruptive.
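For reference, the alert condition described above can be sketched as a simple ratio check. This is only an illustration of the "more than 0.1% of requests fail over a 5m period" rule, not the actual alert definition; the function name and the idea of passing in pre-aggregated counts for one 5-minute window are assumptions made for the example.

```python
# Illustrative only: not the real alert rule.
# Threshold taken from the description: more than 0.1% of requests failing.
ALERT_THRESHOLD = 0.001


def backend_error_rate_too_high(failed, total, threshold=ALERT_THRESHOLD):
    """Return True if the failure ratio for one window exceeds the threshold.

    `failed` and `total` are hypothetical request counts aggregated over a
    single 5-minute window.
    """
    if total == 0:
        # No traffic in the window: nothing to alert on.
        return False
    return failed / total > threshold


if __name__ == "__main__":
    # Hypothetical counts: 250 failures out of 100,000 requests is 0.25%.
    print(backend_error_rate_too_high(250, 100_000))
```

Note that the comparison is strict, so a window sitting exactly at 0.1% does not fire.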

Event Timeline

Gehel triaged this task as High priority. Fri, May 17, 12:43 PM
Gehel moved this task from Incoming to Scratch on the Data-Platform-SRE board.
Gehel moved this task from Scratch to 2024.05.06 - 2024.05.26 on the Data-Platform-SRE board.

Confirm or deny that we do plan on setting an availability SLO target around the metric "more than 0.1% of requests to Elastic from the MW app servers fail over a 5m period".

No, our SLO would be around quarterly (90 day) availability, and it would be something like >= 99% or >= 99.5% of requests succeeding in a quarter. We won't be setting SLOs on short time windows, although we can certainly have normal alerts. Most recently we adjusted the alert to fire if > 0.5% of envoy upstream requests fail for at least 5 minutes: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1031543.
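To get a feel for how much of a quarterly budget a rolling restart could consume, here is a rough back-of-the-envelope sketch. It assumes uniform traffic across the quarter, and the restart duration and error rate in the example are made-up placeholders, not measurements from the eqiad restart. The function name is hypothetical.

```python
def error_budget_fraction_used(slo_target, window_days, incident_minutes,
                               incident_error_rate):
    """Fraction of the error budget consumed by one disruptive event.

    Assumes request volume is uniform over the window, so that time can
    stand in for request counts. All inputs are illustrative.
    """
    window_minutes = window_days * 24 * 60
    # Budget expressed as "fully failed minutes" allowed in the window.
    budget_minutes = (1.0 - slo_target) * window_minutes
    # An event failing a fraction of requests burns that fraction of its duration.
    spent_minutes = incident_error_rate * incident_minutes
    return spent_minutes / budget_minutes


if __name__ == "__main__":
    # Hypothetical restart: 60 minutes at a 1% error rate.
    for target in (0.99, 0.995):
        used = error_budget_fraction_used(target, 90, 60, 0.01)
        print(f"target={target}: {used:.4%} of quarterly budget")
```

Under these placeholder numbers a single restart uses on the order of 0.1% of a 99.5% quarterly budget; real figures would need the measured error rate and duration from the dashboard.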

I still have a TODO on my end to add placeholder text for the availability SLO proposal to the WDQS SLO documentation; I'll comment back here when that is in place.