Adopt SLIs / SLOs for sessionstore
Open, Stalled, Low, Public

Description

Followup from https://wikitech.wikimedia.org/wiki/Incident_documentation/20200611-sessionstore%2Bkubernetes.

Sessionstore is critical for our infrastructure, as became evident from the sessionstore incident. We should adopt industry best practices for managing this service, including SLIs/SLOs. It's conceivable that, had we had SLIs and SLOs (and proper alerting on them), we could have prevented the incident in the first place.

Given the nature of the service and our current state of SLI/SLO adoption, we can probably stall this for a while. Filing this task so that it doesn't get forgotten.

kask is stateless in itself, but the sessionstore service, of which kask is one component, is stateful. This makes it a tad more difficult to set meaningful SLIs and SLOs for it, so we should first gain some confidence with other services.

Event Timeline

akosiaris changed the task status from Open to Stalled. Jun 29 2020, 12:40 PM
akosiaris triaged this task as Low priority.

@akosiaris Thanks for putting this together. We're very excited to have SLOs around all our services. For the API Gateway work, Hugh, Giuseppe, Wolfgang and I discussed having a basic set of SLOs in place for Gateway and ideally for one of the services that back it under T254916.

The initial areas we were going to look at would be throughput, latency, error budget and availability. I know the dimensions we consider will vary depending on the component or service we're covering but are there general guidelines, process or methodology we could already be reviewing?
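
To make the availability, latency and error budget dimensions concrete, here is a minimal sketch of how they could be computed from request counts over one SLO window. Everything in it is an assumption for illustration: the `WindowCounts` type, the function names, the 99.9% target and the sample numbers are hypothetical and not taken from any existing service or dashboard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowCounts:
    """Request counts over one SLO window (hypothetical, not real data)."""
    total: int   # all requests served in the window
    errors: int  # requests counted as failures (e.g. 5xx responses)
    slow: int    # requests slower than the chosen latency threshold


def availability_sli(w: WindowCounts) -> float:
    """Fraction of requests that did not fail; an empty window counts as 1.0."""
    return 1.0 if w.total == 0 else (w.total - w.errors) / w.total


def latency_sli(w: WindowCounts) -> float:
    """Fraction of requests served within the latency threshold."""
    return 1.0 if w.total == 0 else (w.total - w.slow) / w.total


def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget still unspent.

    The budget is the allowed failure fraction (1 - slo). A negative
    result means the SLO has been violated for this window.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    budget = 1.0 - slo
    spent = 1.0 - sli
    return 1.0 - spent / budget


if __name__ == "__main__":
    # Assumed sample window and an assumed 99.9% availability target.
    window = WindowCounts(total=1_000_000, errors=400, slow=5_000)
    sli = availability_sli(window)
    print(f"availability SLI: {sli:.4%}")
    print(f"latency SLI: {latency_sli(window):.4%}")
    print(f"error budget remaining: {error_budget_remaining(sli, 0.999):.1%}")
```

Throughput is left out on purpose: it is usually tracked as a plain rate rather than as a good/total ratio, so it does not map onto an error budget in the same way.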

That sounds great. I look forward to those discussions, as I expect the API gateway to be one of the most difficult services for which to adopt SLIs and SLOs, due to its nature. For example, specifying a latency SLO on an endpoint that is powered by a dependent service immediately makes that SLO dependent on something outside the API gateway's control. That makes it very difficult to come up with a proper SLO, especially if the dependent service doesn't have one.

The process/methodology is going to be an OKR of the ServiceOps team for the upcoming quarter, so it's still a WIP, but we'll make sure to update you on it.

Awesome, thanks. Yeah, Giuseppe called out that complexity. We spoke about first looking at the services backing the gateway and deriving from there. When we get to the task it should be an open discussion, so it should be an interesting exercise.