
Define SLIs/SLOs for link recommendation service
Closed, ResolvedPublic

Description

The timeouts are still an issue, but they're beta-specific and infrequent enough that we can live with them.

If that's fine by you, fine by me. The service level is up to the service owner anyway.

And now that I mention it, you should come up with SLIs/SLOs[1] for this service, both to communicate the expected level of service to the rest of the movement and to set expectations about what does and does not constitute an outage, so that SREs know when and how to react. SRE Service Ops will provide information and a walkthrough for this.

[1] https://sre.google/sre-book/service-level-objectives/

Event Timeline

kostajh added subscribers: sdkim, MMiller_WMF.

Kicking this back another week.

@MMiller_WMF, this is also a task for you to think about (and @sdkim might have input here too). AIUI, the main question is: what kind of guarantees around uptime should we make for the external traffic release of the link recommendation service (https://api.wikimedia.org/wiki/API_reference/Service/Link_recommendation)? I don't think we can make commitments about response times, as those vary significantly based on wiki size and article length.

As for the internal service, we want to ensure uptime sufficient for filling up the task pools. It's hard to say what that might entail yet, but we'll have a better idea after we've switched on refreshLinkRecommendations.php in production and gathered data via T278411: Statsd implementation of suggested edits task pool.

SRE Service Ops will provide information and a walk through for this.

@akosiaris scheduling a meeting with the relevant stakeholders might be difficult; can we do this asynchronously? Do you know of similar Wikimedia services that I could look to for an example of 1) what SLIs/SLOs to define and 2) how/where to document them?

Yes, definitely. In fact, I'd encourage it, with a fallback to a meeting for the parts that require high-bandwidth communication, if any arise.

Do you know of similar Wikimedia services that I could look to for an example of 1) what SLIs/SLOs to define and 2) how/where to document to them?

Yes to both. The umbrella page would be: https://wikitech.wikimedia.org/wiki/SLO, and a pretty recent example of a published set of SLOs is at: https://wikitech.wikimedia.org/wiki/SLO/API_Gateway

The worksheet template that should be used to run this is also linked from the umbrella page. Also adding @RLazarus, as they are probably the best person to talk to about questions regarding this. Adding @wkandek as well.

One thing I forgot to point out: given that the internal and external services have different audiences, it probably makes sense to come up with separate SLOs for each, as the requirements will differ.

@kostajh -- I've not been through a process to determine SLOs before, but I can express what I think would be appropriate from my perspective. Overall, I consider this API to be experimental, in the sense that we don't yet know whether it is truly useful for editing Wikipedia. We're going to learn about that through its usage in our team's feature. Therefore, I don't think it makes sense to strive for ambitious SLOs for external users.

  • For our internal usage, I think the SLOs can be up to the Growth team, and whatever satisfies our needs for the feature (this may change in future instantiations of the feature or on different platforms, such as Android).
  • For external usage, I recommend that we cleave to whatever the minimum recommendations are for our public APIs.

Tagging the Machine-Learning-Team as they are taking over maintenance of the Add-Link service.

To add to the above: our current process involves caching up to ~25,000 tasks on the MediaWiki side, per wiki. With that many items cached, it seems OK to take a few days to fix an outage affecting the internal release of the mwaddlink service, i.e. it doesn't have to happen immediately.

elukey subscribed.

@akosiaris getting back to this task after some time to set up expectations - is link recommendation handled/owned by service ops, or should the ML team own it? Just to understand where the work should be done :)

My understanding (T278083#7394333) is that Machine-Learning-Team are the stewards of the link recommendation service since October 2021.

@kostajh hi! We are helping in the training part of the pipeline, but we have no knowledge of the add a link service (that should be the serving part IIUC). I wasn't aware that we took over the maintenance of the whole service, do you have other context/pointers other than T278083#7394333?

Ah, I might be misremembering. I've asked @DMburugu to clarify. https://phabricator.wikimedia.org/project/profile/1114/ says "[Growth team is responsible for] the API, the application, the kubernetes deployment, and integration with MediaWiki via GrowthExperiments. Training of datasets is done by Machine Learning team." So let's go with that.

In that case, I guess it would be for the Growth team + SRE to decide what SLIs/SLOs this service should have. Or perhaps that is better done by the API Platform team? cc @VirginiaPoundstone

I think that it should be the Growth team's responsibility to set some target availability, and to then use it as a starting point for a conversation with SRE about adding an SLO for the service.

I'll start a draft document, and @DMburugu and I will circulate it when it's ready for review.

SLOs are agreements between three parties: the engineering team developing the service, the SRE team supporting the service, and the Product Owner of the service (who is ultimately responsible for it). For the first two, Growth + SRE (serviceops specifically) will do fine; who's the product owner? I guess @MMiller_WMF? Or @DMburugu?

I think it would be @KStoller-WMF as product manager of the Growth team. (@DMburugu is the engineering manager for the team.)

👍 I'm happy to review once we have an initial draft.
Or, @kostajh, please just let me know if you want me to take the lead on this.

I'll start a draft document, and @DMburugu and I will circulate it when it's ready for review.

Thanks! In case you haven't seen it, our SLOs follow a standard format -- there are some guidelines here that walk you through filling out the template.

No need to worry about writing up the whole SLO right now, and definitely no need to follow the template at this stage -- but, even if you're just putting together some notes on targets, the instructions prompt you with some questions that might be helpful. Ping me any time if I can help!

Thanks. I made a draft at https://wikitech.wikimedia.org/wiki/SLO/linkrecommendation. @RLazarus, maybe we could discuss what some realistic SLIs/SLOs would be here, on the talk page there, or on IRC. I think we want the availability SLI ("the percentage of all requests receiving a non-error response, defined as HTTP status code 200") to be somewhere around 99%, though 95% may be more realistic, and I may be missing other things we should have. Could you please look over the draft and let me know if there are other considerations, based on how I've described the service and its three distinct deployments (internal, external, dataset loading)?
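
(For illustration, a minimal sketch of how such an availability SLI could be computed from request counts; the helper name and the numbers below are hypothetical, not the production monitoring query.)

```python
# A minimal sketch (not the production monitoring) of the availability SLI
# described above: the share of requests answered with HTTP 200.
# The request counts below are hypothetical.

def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that received a non-error (HTTP 200) response."""
    if total_requests == 0:
        return 1.0  # no traffic, nothing to violate
    return good_requests / total_requests

sli = availability_sli(good_requests=9_940, total_requests=10_000)
print(f"{sli:.2%}")                      # 99.40%
print("meets 99% target:", sli >= 0.99)  # True
print("meets 95% target:", sli >= 0.95)  # True
```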

Good draft! I'll get back to you with some comments within a couple of days.

Looks good! Some high-level questions, based on what I understand from reading the draft and the Add_Link page.

GET and POST requests to the external service are nice-to-have. The POST requests to the internal service are what matter most to the Growth team.

Just to confirm, it sounds like only the internal service should be covered by the SLO. The external service—at least for now—isn't covered, meaning nobody's committing to fix it urgently if it breaks (but nothing's stopping us from adding an SLO for it later on). Is that right?

The latency is not that important to us, because the primary consumer [...] runs via cron

Great insight. It also suggests that linkrecommendations availability per se isn't the thing we care about: if the cron job gets a 500, it can just retry, and if it works the second time, nobody really minds, right? (At least, not in the same way that we care when we serve a 500 directly to the user.)

Instead, maybe the SLI that really matters is a freshness metric: something like, "X% of Special:Homepage requests are rendered from data that came from the linkrecommendations service less than Y minutes ago." That SLI is more complex to monitor, and it covers both the internal service and the maintenance script itself (e.g. we might miss the SLO if the internal service fails or if the cronjob breaks) but it corresponds more directly to the user impact.

In a multi-team environment with distributed responsibilities, it might be helpful to have that overarching freshness SLO, and specific SLOs for each component subsystem (like the backend availability SLO you propose here, plus something that covers the maintenance script success ratio, etc.) in order to help the teams keep track of what they expect of each other. In particular, the teams would agree on a linkrecommendations availability SLO calculated so that as long as we meet that target, then we're reasonably confident of meeting our overall freshness goal to satisfy the user. But when the same team is responsible for the whole system end-to-end, that kind of subsystem breakdown can add more complexity than is really justified.

So, my question to you is, first of all am I thinking about the service in basically the right way? And, if so, what's the most directly user-relevant SLI that we can report on, given the monitoring data we have available? (Often the way this goes is something like, "we care about metric X, but we can't get instrumentation for it directly, so we monitor metric Y as a proxy," and that's reasonable if it's where we're at.)
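
(To make the freshness idea above concrete, a rough sketch under stated assumptions: the request ages, the thresholds, and the `freshness_sli` helper are all hypothetical, since no such metric exists yet.)

```python
# Rough sketch of a freshness-style SLI: the share of Special:Homepage
# requests rendered from link-recommendation data younger than `max_age`.
# The ages and the 24-hour / 1-hour thresholds are hypothetical examples.

from datetime import timedelta
from typing import Iterable

def freshness_sli(data_ages: Iterable[timedelta], max_age: timedelta) -> float:
    """Fraction of requests whose backing data is younger than max_age."""
    ages = list(data_ages)
    if not ages:
        return 1.0
    return sum(1 for age in ages if age < max_age) / len(ages)

# Per-request age of the cached recommendations at render time (made up):
ages = [timedelta(minutes=m) for m in (5, 30, 240, 2000, 10)]
print(f"{freshness_sli(ages, max_age=timedelta(hours=24)):.0%}")  # 80%
print(f"{freshness_sli(ages, max_age=timedelta(hours=1)):.0%}")   # 60%
```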

Most engineers on the Growth team have experience deploying and troubleshooting issues. Add_Link is reasonably documented. That said, there are multiple moving parts and it takes a while to understand the full sequence of events.

This part's tricky: the intent of the question is to assess (very roughly) what kind of time it takes to fix problems after they appear in production, so that we can work out what kind of availability is achievable. We can base the answer on the Growth team's expertise—but then we're making the assumption that the Growth team will have to be involved in fixing those problems, even if they appear at 3 AM on a Saturday. Or we can work with the assumption that SREs will try our best to understand and fix whatever's broken, in which case we need to make this assessment based on how accessible the service is to someone who hasn't worked with it before.

In either case we don't need a precise estimate in minutes and seconds (that isn't how troubleshooting works!) but the point is to establish what kind of support model we're talking about here, and what expectations we can reasonably make as a result. @akosiaris probably has some thoughts on this; it'll need to be something the teams work out together.

These are all big questions! Let me know if you'd like to meet and discuss; I figure a real-time conversation might be the most efficient way to get into it, but I wanted to lay out some initial thoughts here.

@kostajh is on break right now so I'll answer where I can.

Just to confirm, it sounds like only the internal service should be covered by the SLO. The external service—at least for now—isn't covered, meaning nobody's committing to fix it urgently if it breaks (but nothing's stopping us from adding an SLO for it later on). Is that right?

Yes, we can add the SLO later on. Something to note is that the SLO is a living document and can be updated as the case demands. So we don't mind committing to some items now and then revising this down the road.

These are all big questions! Let me know if you'd like to meet and discuss; I figure a real-time conversation might be the most efficient way to get into it, but I wanted to lay out some initial thoughts here.

We're open to scheduling some time once Kosta returns to discuss this, and possibly update the doc in real time as we discuss.

Sounds good! Feel free to put something on the calendar when he's back.

Thanks for meeting! Summing up what we talked about, for the record:

  • Yes, the external service is not covered by the SLO, and we'll add some language to the document making that clear. Obviously that can change in the future as usage evolves.
  • We'll keep internal-service availability (i.e. 500s) as the only SLI for the short term. There are still some failure modes that wouldn't be covered: notably, if capacity issues prevented the cron job from completing in a timely manner as we scale it to include most Wikipedias, then link recommendations could be out of date even if the backend were 100% available. That's imperfect: ideally, the SLO should be violated if and only if the user experience is degraded. But the error rate is the best candidate SLI of the metrics we have available today, so we'll go forward with it.
  • There are other use cases for monitoring data freshness, so once that data becomes available, we might rework the SLO to be based on freshness. That would effectively cover the "cron slowness" scenario from the previous bullet, as well as the scenario where the cron fails to run (for any of the usual plausible reasons, like a config issue or a hardware failure). Broadly speaking, that would mean the SLO covers both the internal service and the dataset loader cron job, rather than just the internal service component as it does today.
  • The Growth team would be comfortable even if link recommendations are out of date by several days, so we're currently talking about a 95% availability SLO for the backend (that is, allowing for about 4½ days of downtime per quarter; see the back-of-the-envelope arithmetic after this list). We're agreed that at that level, paging alerts aren't appropriate, and SRE support can be limited to working hours.
  • @kostajh will update the SLO document, but all the major questions are now pretty much resolved.
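
(Back-of-the-envelope arithmetic behind the "about 4½ days" figure, assuming a ~91-day quarter and treating downtime as the only source of errors; a sketch, not policy.)

```python
# Allowed downtime per quarter for a given availability target,
# assuming a ~91-day quarter.

def allowed_downtime_days(target: float, quarter_days: float = 91.0) -> float:
    return (1.0 - target) * quarter_days

for target in (0.95, 0.99, 0.999):
    print(f"{target:.1%} -> {allowed_downtime_days(target):.2f} days/quarter")

# 95.0% -> 4.55 days/quarter  (the "about 4½ days" mentioned above)
# 99.0% -> 0.91 days/quarter
# 99.9% -> 0.09 days/quarter
```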

Kosta, let me know if I missed anything; otherwise, I'm happy to take one more look after the edits are in, and then we can call this complete.

  • There are other use cases for monitoring data freshness, so once that data becomes available, we might rework the SLO to be based on freshness. That would effectively cover the "cron slowness" scenario from the previous bullet, as well as the scenario where the cron fails to run (for any of the usual plausible reasons, like a config issue or a hardware failure). Broadly speaking, that would mean the SLO covers both the internal service and the dataset loader cron job, rather than just the internal service component as it does today.

✅ The use case is described a bit in T316079#8534093.

That looks good to me. I'll update the document soon™. (Hopefully by end of week.)

I've updated the document, is anything else needed in order to resolve this task?

Change 913691 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/grafana-grizzly@master] Add linkrecommendation SLO dashboard

https://gerrit.wikimedia.org/r/913691

https://gerrit.wikimedia.org/r/913691 will add an SLO dashboard giving us a quarterly view, and then I think we're all set! I'll take care of getting that patch shepherded through, so claiming the task for the last bit of work.

Change 913691 merged by RLazarus:

[operations/grafana-grizzly@master] Add linkrecommendation SLO dashboard

https://gerrit.wikimedia.org/r/913691

The dashboard is done: https://grafana.wikimedia.org/d/slo-Linkrecommendation/linkrecommendation-slo-s?from=1677628800000&to=1685577540000

To use it, make sure the time picker is set to an SLO reporting quarter (e.g. that link goes to the current quarter, 2023-03-01 through 2023-05-31).

Then the percentages on the left are the error budget remaining -- it starts at 100% at the beginning of the quarter, and if it falls below 0% at the end, the SLO is violated. Right now, two-thirds of the way through the reporting period, we would be right on track if we had about 33% of budget remaining, and in fact we have plenty more than that.

The graph on the right is just the SLI we're using to calculate that budget, in this case the same error rate you're tracking elsewhere. Spikes represent outages, so it can be useful to identify when the error budget was spent (and to cross-check against your other dashboards if you like, to make sure we're measuring the thing we think we're measuring).
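
(For readers less familiar with error budgets, a sketch of how the left-hand percentage can be read; the formula below is an assumption for illustration, and the real query lives in the grafana-grizzly dashboard definition.)

```python
# Sketch of "error budget remaining": the SLO target implies a maximum number
# of errors for the quarter-to-date traffic; the panel shows how much of that
# allowance is still unspent. The counts below are hypothetical.

def error_budget_remaining(errors: int, total: int, target: float = 0.95) -> float:
    allowed_errors = (1.0 - target) * total
    if allowed_errors == 0:
        return 1.0
    return 1.0 - errors / allowed_errors

# 1,000,000 requests so far this quarter, 3,000 of them errors:
# a 95% SLO allows 50,000 errors, so 94% of the budget remains.
print(f"{error_budget_remaining(errors=3_000, total=1_000_000):.0%}")  # 94%
```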

@kostajh One last question: I followed the same pattern we've used with other services and broke this out into separate numbers for eqiad and codfw... but come to think of it, for linkrecommendations that might have been wrong, and the metric we care about is actually just the total number of errors divided by the total number of requests, without regard for which data center they came from. Does that sound right to you?

I'm not sure. I think we only care about the numbers from the active datacenter, e.g. at the moment codfw could be completely broken and it wouldn't necessarily be a problem from an SRE perspective. (It would be something the Growth team manages as part of chores, and we'd want to fix it, of course.)

I'm a bit surprised the percentages aren't higher. Is it possible the MariaDB read-only errors (T308133#8788082) are included in this count? Those should go away when we take care of T334928: linkrecommendation: Cron job should only run with eqiad deployment but I'm unsure of the schedule for implementing that change.

Okay, cool -- I'll switch the dashboard over to a combined metric today. (Optional statistical sidebar: One nice thing about a combined "sum of good requests / sum of total requests" metric is, it naturally weights the two data centers by their traffic -- meaning when one DC is inactive, and so doesn't get any requests, it doesn't affect the SLO outcome. That makes it the right choice for many services where the user doesn't really care where the server is, and means that for an active/passive service we get one continuous timeseries across DC switchovers.)

And don't worry about the current percentages, they're in good shape: remember, those are error budget remaining. So a big green 90% at the end of the quarter doesn't mean we're serving 10% errors; it means we're serving only a tenth as many errors as the SLO allows. Anything over zero is healthy, so if we're on track to end at 80-90% there's nothing to worry about.

On MariaDB read-only errors -- the underlying monitoring is just based on non-2xx responses from the internal linkrecommendation service, same as the written SLO. So as long as MariaDB errors in the cron job don't lead to 500s from the internal service (sounds like they don't?) they don't show up on that dashboard.
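
(A tiny sketch, with hypothetical counts, of why the combined good-over-total metric naturally ignores an idle data center.)

```python
# Summing good and total requests across data centers before dividing weights
# each DC by its traffic, so a passive DC with no requests has no effect on
# the result. The counts are hypothetical.

def combined_availability(per_dc: dict[str, tuple[int, int]]) -> float:
    """per_dc maps a data-center name to (good_requests, total_requests)."""
    good = sum(g for g, _ in per_dc.values())
    total = sum(t for _, t in per_dc.values())
    return good / total if total else 1.0

# eqiad active and healthy, codfw passive (no traffic, so no influence):
print(f"{combined_availability({'eqiad': (99_500, 100_000), 'codfw': (0, 0)}):.2%}")
# 99.50%
```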

Ack, thanks for explaining.

Ah, right, the errors (example) are associated with the cron job (which isn't part of the SLI) and those indeed don't log as 500s.

Change 916680 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/grafana-grizzly@master] Combine linkrecommendation SLO metrics into one cross-datacenter value

https://gerrit.wikimedia.org/r/916680

Change 916680 merged by RLazarus:

[operations/grafana-grizzly@master] Combine linkrecommendation SLO metrics into one cross-datacenter value

https://gerrit.wikimedia.org/r/916680

Done, looking good!