|Resolved||None||T169242 Develop Page Content Service for Reading Clients|
|Resolved||• Mholloway||T229286 Resolve service instability due to excessive event loop blockage since starting PCS response pregeneration|
|Resolved||• Mholloway||T170455 Extract the feed endpoints from PCS into a new wikifeeds service|
|Resolved||• Mholloway||T233028 Define SLIs/SLOs for wikifeeds|
For starters, let me say that the service owners should be the ones setting the SLIs/SLOs, and those should be ones the team can commit to. They are also not set in stone, but can be amended to better reflect the present reality (e.g. in case the SLOs were set so optimistically that it's impossible to reach them, or so pessimistically that they are met with extreme ease despite prolonged outages of the service), as long as the changes are clearly communicated and advertised (updating the wikipage and sending an email should suffice).
That being said, SRE can and will help with suggestions about the good SLI/SLO candidates.
Before even setting up SLIs/SLOs, it's important to have the following 2 things in place:
- Contact details in case the service suffers an outage
- A person/team to be the service-owner (that can be the same as above)
Then a nice set of questions to answer to help guide the SLI/SLO selection process would be:
- What is the expected traffic of the system in requests per second? (that's a proposed SLI)
- Does it make sense to create an SLO based on that? e.g. if we end up normally serving 100 times less than the expected traffic, should it be considered an SLO violation and lead to some action? (e.g. the service is not popular enough and maybe it's not worth keeping around. The inverse might also be true, i.e. we are consistently serving more than what we budgeted for, so let's e.g. add capacity before it's too late.)
- What's the expected error rate? (proposed SLI again)
- Does it make sense to create an SLO based on that? e.g. error rate < 1/1000?
- What's the expected request latency? (again proposed SLI)
- Does it make sense to create an SLO based on that? e.g. 500ms < latency < 1500ms?
The 3 above (error rate, latency, traffic/throughput) are very common for user-facing systems. They are also already being measured and displayed on https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?refresh=1m&orgId=1 so they make very good candidates for SLIs/SLOs.
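To make the 3 candidates concrete, here is a minimal sketch of how they could be derived from raw request records and compared against objectives. The `Request` record shape, the choice of the 95th percentile for latency, the treatment of HTTP 5xx as errors and the default thresholds are all assumptions for illustration; the actual wikifeeds numbers come from the existing metrics pipeline, not from code like this.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical record shape)."""
    status: int         # HTTP status code returned to the client
    latency_ms: float   # time taken to serve the request


def compute_slis(requests: Sequence[Request], window_seconds: float) -> Dict[str, float]:
    """Compute traffic, error rate and p95 latency over one measurement window."""
    if not requests or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")
    n = len(requests)
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank percentile: ceil(0.95 * n), done in integer arithmetic.
    rank = (95 * n + 99) // 100
    return {
        "requests_per_second": n / window_seconds,
        "error_rate": errors / n,
        "p95_latency_ms": latencies[rank - 1],
    }


def check_slos(
    slis: Dict[str, float],
    max_error_rate: float = 0.001,        # the "1/1000" example from the discussion
    max_p95_latency_ms: float = 1500.0,   # assumed upper bound, for illustration
) -> Dict[str, bool]:
    """Return, per objective, whether the measured SLI meets it."""
    return {
        "error_rate": slis["error_rate"] < max_error_rate,
        "p95_latency_ms": slis["p95_latency_ms"] < max_p95_latency_ms,
    }
```

A traffic-based objective could be added to `check_slos` in the same way, with a lower and an upper bound on `requests_per_second`, if the service owner decides that one makes sense.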
Add to those any kind of service specific metrics that might exist, e.g. # of feed generations per day, latency to generate feed X etc. Anything that makes sense to the service owner is fair game, as long as it is measurable.
Those numbers should not be looked at in a point-in-time fashion but rather calculated over the course of a time period, probably a quarter in our case. That is, they should not be used to guide/gauge firefighting but rather to assess the overall operational nature of the service.
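The difference between a point-in-time reading and a per-period calculation can be sketched as follows. The daily `(total, failed)` request counts and the 1/1000 objective are made-up numbers for illustration, and the "error budget" helper is one common way of expressing the result, not something already agreed for wikifeeds.

```python
from typing import List, Tuple

DailyCounts = List[Tuple[int, int]]  # (total_requests, failed_requests) per day


def error_rate_over_period(daily_counts: DailyCounts) -> float:
    """Error rate over the whole period, weighted by the traffic of each day."""
    total = sum(t for t, _ in daily_counts)
    failed = sum(f for _, f in daily_counts)
    if total == 0:
        raise ValueError("no requests in the period")
    return failed / total


def error_budget_remaining(daily_counts: DailyCounts, max_error_rate: float) -> float:
    """Fraction of the allowed failures not yet used up (negative if overspent)."""
    total = sum(t for t, _ in daily_counts)
    failed = sum(f for _, f in daily_counts)
    allowed = max_error_rate * total
    if allowed == 0:
        raise ValueError("period has no error budget")
    return 1.0 - failed / allowed


# Hypothetical quarter: 89 ordinary days plus one day with a partial outage.
quarter = [(1_000_000, 100)] * 89 + [(1_000_000, 20_000)]

worst_day_rate = max(f / t for t, f in quarter)   # 0.02, far above 1/1000
quarter_rate = error_rate_over_period(quarter)    # 28,900 / 90,000,000
budget_left = error_budget_remaining(quarter, max_error_rate=0.001)
```

In this made-up quarter the outage day on its own is well above the 1/1000 threshold, yet the quarter as a whole stays within the objective with roughly two thirds of the budget left. That is the sense in which the SLO describes the overall operational nature of the service rather than individual incidents.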
Nitpick: I don't think we can define an SLA, as an agreement is between 2 parties and in this case there is no one to represent the "other" party, aka the users, and agree on the terms on their behalf. Not to mention the fact that SLAs tend to have (though not necessarily) a legally binding nature and, as such, legal repercussions. I know the term is used widely in the industry (I used to use it myself), but it turns out that for these types of public services it's not the best term, as due to its wide usage it's overloaded. There's a nice chapter about the distinction between the 3 (SLIs, SLOs and SLAs) in the Google SRE book. As far as I am concerned, as long as we end up with good enough SLOs and act on their consistent violations in order to resolve the root causes, we won't need an SLA.
In other news, you are absolutely right about this not having been around much yet; we are still adopting it. I'll be adding this to a wikitech page soon.
@Mholloway thanks for drafting this out at https://wikitech.wikimedia.org/wiki/Wikifeeds#Service_level_indicators/objectives_(SLIs/SLOs)
I believe you can remove the draft warning for the SLIs/SLOs section.