Check SLO impact of Elastic cluster rolling restarts/mitigate if necessary
Open, High, Public

Assigned To

None

Authored By

	bking
	Tue, May 14, 1:53 PM

Description

We ran through a cluster-wide restart of the eqiad production Elastic cluster yesterday, which triggered a few "CirrusBackendErrorRateTooHigh" alerts (Dashboard link). We only recently created this alert in T363609, as part of a post-incident follow-up. We therefore believe that these restarts were always disruptive; we just weren't monitoring their impact.

The alert triggers when more than 0.1% of requests to Elastic from the MW app servers fail over a 5m period, so we do think this is a valuable alert. We intend to define SLOs around this value, so we need to either confirm that rolling operations won't blow through our SLO budget or adjust our SLO metrics. If the restarts are too disruptive, we need to figure out a way to reduce their impact. Creating this ticket to:

Confirm or deny that we do plan on setting an availability SLO target around the metric "more than 0.1% of requests to Elastic from the MW app servers fail over a 5m period". @RKemper should be able to confirm/deny.
If the rolling restarts are too disruptive, figure out a way to make them less disruptive.
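
For illustration, the alert condition described above can be sketched as plain arithmetic over the failed and total request counts in one 5m window. This is only a sketch: the real alert lives in the alerting configuration, and the function names and the default threshold parameter here are hypothetical, chosen to match the 0.1% figure quoted in this task.

```python
def error_ratio(failed: int, total: int) -> float:
    """Fraction of failed requests in a window; 0.0 when there was no traffic."""
    return failed / total if total else 0.0


def alert_fires(failed: int, total: int, threshold: float = 0.001) -> bool:
    """True when the failure ratio in the window exceeds the threshold.

    threshold=0.001 corresponds to the 0.1% figure in the description
    (hypothetical parameter name; not taken from the real alert rule).
    """
    return error_ratio(failed, total) > threshold
```

With these definitions, 2 failures out of 1,000 requests in the window would fire the alert, while exactly 1 out of 1,000 would not, since the condition is "more than 0.1%".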

Related Objects

Mentioned Here: T363609: Elasticsearch: Alert on upstream errors for MW API

Event Timeline

bking created this task. Tue, May 14, 1:53 PM

Restricted Application added a subscriber: Aklapper. Tue, May 14, 1:53 PM

bking updated the task description. Tue, May 14, 4:46 PM

Gehel triaged this task as High priority. Fri, May 17, 12:43 PM

Gehel moved this task from Incoming to Scratch on the Data-Platform-SRE board.

Gehel moved this task from Scratch to 2024.05.06 - 2024.05.26 on the Data-Platform-SRE board.

Gehel edited projects, added Data-Platform-SRE (2024.05.06 - 2024.05.26); removed Data-Platform-SRE.

Gehel edited projects, added Data-Platform-SRE (2024.05.27 - 2024.06.16); removed Data-Platform-SRE (2024.05.06 - 2024.05.26). Fri, May 24, 12:20 PM

Gehel moved this task from needs triage to Ops / SRE on the Discovery-Search board. Mon, May 27, 12:47 PM

bking updated the task description. Tue, May 28, 7:09 PM

Confirm or deny that we do plan on setting an availability SLO target around the metric "more than 0.1% of requests to Elastic from the MW app servers fail over a 5m period".

No, our SLO would be around quarterly (90-day) availability, and it would be something like >= 99% or 99.5% of requests succeeding in a quarter. We won't be setting SLOs on short time windows, although we can certainly have normal alerts. Most recently we adjusted the alert to fire if more than 0.5% of envoy upstream requests fail for at least 5 minutes: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1031543.

I still have a TODO on my end to add placeholder text for the availability SLO proposal to the WDQS SLO documentation; I'll comment back here when that is in place.
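
To relate a single rolling restart to a quarterly availability target like the ones mentioned above, the error-budget arithmetic can be sketched as follows. This is a rough sketch under stated assumptions: the function names are hypothetical, and the request counts would have to come from real traffic data, which this task does not provide.

```python
def error_budget(total_requests: int, slo_target: float) -> float:
    """Number of requests allowed to fail in the SLO window.

    slo_target is the required success fraction, e.g. 0.99 or 0.995.
    """
    return total_requests * (1.0 - slo_target)


def budget_fraction_used(failed_requests: int, total_requests: int,
                         slo_target: float) -> float:
    """Share of the window's error budget consumed by failed_requests."""
    budget = error_budget(total_requests, slo_target)
    # A 100% target leaves no budget, so any failure exhausts it.
    return failed_requests / budget if budget else float("inf")
```

Under this model, the question in this task becomes how many requests fail during one rolling restart, multiplied by the number of restarts expected per quarter, compared against the quarterly budget for the chosen target.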
