
[L] Define SLOs/SLAs for image-suggestions pipelines
Closed, Resolved, Public

Description

We need to define and monitor some measures of what it means for the image-suggestions pipelines to be running OK.

At the moment we're monitoring and alerting on:

  • too long since data pipeline metrics were written (data pipeline metrics get written at the end of the pipeline process, so this means the pipeline has failed somewhere along the way)
  • a large change in the number of suggestions
  • a failure to push pipeline metrics (this has never happened)
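As a rough illustration of the first two alert conditions above, here is a minimal sketch. The threshold values and function names are hypothetical placeholders; the task does not state the real thresholds or how the existing alerts are implemented.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds, chosen only for illustration.
MAX_METRICS_AGE = timedelta(days=8)
MAX_RELATIVE_CHANGE = 0.25


def metrics_are_stale(last_written, now, max_age=MAX_METRICS_AGE):
    """True if pipeline metrics have not been written for longer than max_age.

    Metrics are written at the end of a run, so stale metrics imply that
    the pipeline failed somewhere along the way.
    """
    return now - last_written > max_age


def count_changed_too_much(previous, current,
                           max_relative_change=MAX_RELATIVE_CHANGE):
    """True if the number of suggestions moved by more than the allowed fraction."""
    if previous == 0:
        # No baseline to compare against: treat any suggestions as a change.
        return current != 0
    return abs(current - previous) / previous > max_relative_change


if __name__ == "__main__":
    now = datetime(2023, 6, 14, tzinfo=timezone.utc)
    print(metrics_are_stale(now - timedelta(days=10), now))
    print(count_changed_too_much(1000, 700))
```

The third condition (a failure to push metrics) depends on the metrics backend and is not covered by this sketch.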

If we get alerted, we have a runbook.

We also have agreement on allowed revert rates on images that have been added because of our suggestions.

What we don't have is:

  • agreement on how old the suggestions are allowed to be (e.g. suggestions must be less than a week old for 50 weeks of the year)
  • agreement on expected/allowed rejection rates for suggestions (from Growth's tooling)
  • agreement on uptime for the suggestions API
  • anything else?

SLO breakdown

We depend on the following teams and need to figure out what guarantees they can make. In order of priority:

  1. Data Engineering
    • upstream data dependencies - missing wmf.wikidata_item_page_link snapshots have been the most frequent and critical issue so far
    • data pipeline infrastructure
  2. Search - discovery.cirrus_index_without_content/cirrus_replica=codfw snapshots
  3. Enterprise - HTML dumps

The following teams depend on us and we need to know their minimum requirements:

  • Search
  • Growth
  • Android
  • ourselves - notifications & MediaSearch
  • iOS - maybe; we should check whether they have something that depends on us

Event Timeline

MarkTraceur renamed this task from Define SLOs/SLAs for image-suggestions pipelines to [L] Define SLOs/SLAs for image-suggestions pipelines. Jun 14 2023, 4:57 PM

Update: the SLA proposal we're currently working with is the following.

  • The Structured Content team will ensure that at any given point, image suggestions provided by our API will be, at maximum, 21 days old.
  • The Structured Content team will ensure that, in any given one-week period, the internal image recommendation API will have, at minimum, 95% uptime, allowing for a maximum 8.4 hours of downtime.

I've shared this with a few people already but now would like wider feedback on it, and if nobody objects to this proposal, it will become our agreement with our downstream clients.

If either of these points is not being met, resolving any issue causing downtime would immediately become our highest priority.
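For reference, the arithmetic behind the two proposed targets can be sketched as follows: 5% of a 168-hour week is 8.4 hours. The function names and inputs are hypothetical, and the sketch assumes that suggestion age is measured from the timestamp of the newest suggestion dataset, which the proposal does not specify.

```python
from datetime import datetime, timedelta, timezone

MAX_SUGGESTION_AGE = timedelta(days=21)   # from the proposal
MIN_WEEKLY_UPTIME_PERCENT = 95            # from the proposal
WEEK = timedelta(weeks=1)


def weekly_downtime_budget(min_uptime_percent=MIN_WEEKLY_UPTIME_PERCENT):
    """Maximum downtime allowed in one week for the given uptime target."""
    return WEEK * (100 - min_uptime_percent) / 100


def sla_breaches(newest_suggestion_at, downtime_this_week, now):
    """Return a list describing which of the two targets are currently missed."""
    breaches = []
    if now - newest_suggestion_at > MAX_SUGGESTION_AGE:
        breaches.append("suggestions older than 21 days")
    if downtime_this_week > weekly_downtime_budget():
        breaches.append("weekly uptime below 95%")
    return breaches


if __name__ == "__main__":
    now = datetime(2023, 7, 1, tzinfo=timezone.utc)
    print(weekly_downtime_budget())
    print(sla_breaches(now - timedelta(days=22), timedelta(hours=9), now))
```

An empty list means both targets are being met; any entry would trigger the escalation described above.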

The likely impact is ultimately pretty low, but this gives us a clear line that, if crossed, escalates the situation automatically. We have had no issues causing significant downtime of the API, and only a few issues causing multiple weeks of disruption in suggestion generation.

Thanks everyone for your patience and input on this matter!

OK, once again and for the final time, the SLA we'll be working with is the following:

  • The Structured Content team will ensure that at any given point, image suggestions provided by our API will be, at maximum, 21 days old.
  • The Structured Content team will ensure that, in any given one-week period, the internal image recommendation API will have, at minimum, 95% uptime, allowing for a maximum 8.4 hours of downtime.

If either of these points is not being met, resolving any issue causing downtime would immediately become our highest priority.

Thanks to the folks who have helped with this process, and let us know if anything changes that would require a modification to this!