Adopt SLIs / SLOs for sessionstore
Open, Stalled, Low, Public

Assigned To

None

Authored By

	akosiaris
	Jun 29 2020, 12:40 PM

Description

Followup from https://wikitech.wikimedia.org/wiki/Incident_documentation/20200611-sessionstore%2Bkubernetes.

Sessionstore is critical for our infrastructure, as became evident from the sessionstore incident. We should adopt industry best practices for managing this service, including SLIs/SLOs. It's conceivable that, had we had SLIs and SLOs (and proper alerting on them), we could have prevented the incident in the first place.

Given the nature of the service and our current state of SLI/SLO adoption, we can probably stall this for a while. Filing this task so that it doesn't get forgotten.

kask is stateless in itself, but the sessionstore service, of which kask is one component, is stateful. This makes it a bit more difficult to set meaningful SLIs and SLOs for it, so we should first gain some confidence with other services.

Related Objects

Mentioned In: T274665: Design and implement SLO Dashboard tooling
Mentioned Here: T254916: Define base set of SLOs covering API Gateway

Event Timeline

akosiaris created this task. Jun 29 2020, 12:40 PM

Restricted Application added a subscriber: Aklapper. Jun 29 2020, 12:40 PM

akosiaris changed the task status from Open to Stalled. Jun 29 2020, 12:40 PM

akosiaris triaged this task as Low priority.

@akosiaris Thanks for putting this together. We're very excited to have SLOs around all our services. For the API Gateway work, Hugh, Giuseppe, Wolfgang and I discussed having a basic set of SLOs in place for Gateway and ideally for one of the services that back it under T254916.

The initial areas we were going to look at would be throughput, latency, error budget and availability. I know the dimensions we consider will vary depending on the component or service we're covering but are there general guidelines, process or methodology we could already be reviewing?
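To make two of the dimensions named above concrete, here is a minimal sketch of how an availability SLI and the remaining error budget could be computed from request counts. The `Window` type, the function names and the 99.9% target are illustrative assumptions, not anything agreed on in this task.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts over one SLO window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def availability(w: Window) -> float:
    """Availability SLI: fraction of requests that did not fail."""
    if w.total_requests == 0:
        return 1.0  # no traffic, nothing failed
    return 1 - w.failed_requests / w.total_requests


def error_budget_remaining(w: Window, slo_target: float) -> float:
    """Fraction of the error budget left: 1.0 is untouched, <= 0 is exhausted."""
    allowed_failures = (1 - slo_target) * w.total_requests
    if allowed_failures == 0:
        return 0.0 if w.failed_requests else 1.0
    return 1 - w.failed_requests / allowed_failures


# Example with an assumed 99.9% target: 1,000,000 requests, 250 failures.
window = Window(total_requests=1_000_000, failed_requests=250)
sli = availability(window)                       # about 0.99975
budget = error_budget_remaining(window, 0.999)   # about 0.75 of the budget left
```

Throughput and latency would need their own SLIs (for example, the share of requests served under a latency threshold), which this sketch does not cover.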

In reply to @WDoranWMF's comment above (T256629#6264055):

That sounds great. I look forward to those discussions, as I expect the API gateway to be one of the most difficult services to adopt SLIs and SLOs for, due to its nature. For example, specifying a latency SLO on an endpoint that is powered by a dependent service immediately makes that SLO dependent on something outside of the API gateway's control, which makes it very difficult to come up with a proper SLO, especially if the dependent service doesn't have one.

The process/methodology is going to be an OKR of the ServiceOps team for the upcoming quarter, so it's still a WIP, but we'll make sure to update you on it.
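The latency dependency described above can be sketched with simple arithmetic. This is only an illustration under a strong assumption: the gateway adds at least a fixed overhead to every request it proxies, so its 99th-percentile latency cannot be lower than the backend's p99 plus that overhead. All names and numbers are hypothetical.

```python
def gateway_p99_floor_ms(backend_p99_ms: float, gateway_overhead_ms: float) -> float:
    """Lower bound on gateway p99, assuming a minimum per-request overhead."""
    return backend_p99_ms + gateway_overhead_ms


def latency_slo_attainable(threshold_ms: float,
                           backend_p99_ms: float,
                           gateway_overhead_ms: float) -> bool:
    """A p99 threshold below the floor cannot be met by the gateway alone."""
    return threshold_ms >= gateway_p99_floor_ms(backend_p99_ms, gateway_overhead_ms)


# Hypothetical numbers: backend p99 of 180 ms, gateway overhead of 15 ms.
ok_at_200 = latency_slo_attainable(200, backend_p99_ms=180, gateway_overhead_ms=15)
ok_at_190 = latency_slo_attainable(190, backend_p99_ms=180, gateway_overhead_ms=15)
```

Without a latency SLO on the backend, `backend_p99_ms` is an observation rather than a commitment, which is why a gateway-level target is hard to justify on its own.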

Awesome, thanks. Yeah, Giuseppe called out that complexity - we spoke about first looking at the services backing the gateway and deriving from there. When we get to the task it should be an open discussion so should be an interesting exercise.

eprodromou moved this task from Inbox to Tracking/Watching on the Platform Engineering board. Jul 7 2020, 8:53 PM

herron mentioned this in T274665: Design and implement SLO Dashboard tooling. Feb 12 2021, 7:25 PM

Volans added a project: SRE-Sprint-Week-Sustainability-March2023. Mar 21 2023, 11:22 AM

Volans moved this task from Backlog to Not an SRE issue on the SRE-Sprint-Week-Sustainability-March2023 board.
