We ran through a cluster-wide restart of the eqiad production Elastic cluster yesterday, which triggered a few "CirrusBackendErrorRateTooHigh" alerts (Dashboard link). This alert was only recently created, in T363609, as part of a post-incident follow-up. We therefore believe that these restarts have always been disruptive and that we simply weren't monitoring their impact.
The alert triggers when more than 0.1% of requests to Elastic from the MW app servers fail over a 5m period, and we do think this is a valuable alert. We intend to define SLOs around this value, so we need to either confirm that rolling operations won't blow through our SLO budget or adjust our SLO metrics. If rolling operations are too disruptive, we need to figure out a way to reduce their impact. Creating this ticket to:
- Confirm whether we plan to set an availability SLO target around the alert condition "more than 0.1% of requests to Elastic from the MW app servers fail over a 5m period". @RKemper should be able to confirm/deny.
- If rolling restarts turn out to be too disruptive for that target, figure out a way to make them less disruptive.
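
To make the budget question concrete, here is a rough sketch of the arithmetic, not an agreed SLO definition. Only the 0.1% threshold and the 5m window come from the alert. The 30-day period, the 99.9% target, and the idea of counting "bad" 5m windows are assumptions for illustration, and the function names are hypothetical.

```python
import math

WINDOW_MINUTES = 5       # from the alert definition
ERROR_THRESHOLD = 0.001  # 0.1% of requests failing, from the alert definition


def is_bad_window(failed: int, total: int, threshold: float = ERROR_THRESHOLD) -> bool:
    """Return True when more than `threshold` of requests in one window failed."""
    if total == 0:
        return False  # no traffic means nothing failed
    return failed / total > threshold


def allowed_bad_windows(slo_target: float, period_days: int = 30) -> int:
    """Number of 5m windows that may breach the threshold within one period.

    `slo_target` (e.g. 0.999) and `period_days` are assumed values,
    not something this ticket has decided.
    """
    total_windows = period_days * 24 * 60 // WINDOW_MINUTES
    # round() guards against float noise before flooring
    return math.floor(round(total_windows * (1 - slo_target), 9))


def budget_consumed(bad_windows: int, slo_target: float, period_days: int = 30) -> float:
    """Fraction of the period's budget used by `bad_windows` breaching windows."""
    budget = allowed_bad_windows(slo_target, period_days)
    if budget == 0:
        return float("inf") if bad_windows else 0.0
    return bad_windows / budget


if __name__ == "__main__":
    # Hypothetical: one restart causes 3 breaching windows under a 99.9% target.
    print(allowed_bad_windows(0.999))
    print(budget_consumed(3, 0.999))
```

Under these assumed numbers, a 30-day period contains 8640 five-minute windows, so a 99.9% target leaves room for 8 breaching windows. Counting how many windows a single rolling restart actually breaches on the dashboard would tell us how many restarts per period fit within the budget.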