[L] Define SLOs/SLAs for image-suggestions pipelines
Closed, Resolved · Public

Authored By

	Cparle
	Jun 13 2023, 12:59 PM

Description

We need to define and monitor some measures of what it means for the image-suggestions pipelines to be running OK.

At the moment we're monitoring and alerting on:

- too long since data pipeline metrics were written (data pipeline metrics get written at the end of the pipeline process, so this means the pipeline has failed somewhere along the way)
- a large change in the number of suggestions
- a failure to push pipeline metrics (this has never happened)
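
The first two checks in the list above could be sketched roughly as follows. This is an illustration only, not the alerting rules actually deployed: the function names, the 8-day staleness limit, and the 20% change threshold are all assumptions made up for the example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds; the task does not state the real values.
MAX_METRICS_AGE = timedelta(days=8)
MAX_RELATIVE_CHANGE = 0.20


def metrics_are_stale(last_written: datetime, now: datetime) -> bool:
    """True if pipeline metrics have not been written recently enough.

    Metrics are written at the end of a run, so stale metrics imply
    that the pipeline failed somewhere along the way.
    """
    return now - last_written > MAX_METRICS_AGE


def suggestion_count_changed_too_much(previous: int, current: int) -> bool:
    """True if the suggestion count moved by more than the allowed fraction."""
    if previous == 0:
        # Any non-zero count after an empty run is treated as a large change.
        return current != 0
    return abs(current - previous) / previous > MAX_RELATIVE_CHANGE
```

Either function returning True would correspond to an alert that sends us to the runbook.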

If we get alerted, we have a runbook.

We also have agreement on allowed revert rates on images that have been added because of our suggestions.

What we don't have is:

- agreement on how old the suggestions are allowed to be (e.g. suggestions must be less than a week old for 50 weeks in the year)
- agreement on expected/allowed rejection rates for suggestions (from Growth's tooling)
- agreement on uptime for the suggestions API
- anything else?
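
To make the freshness example concrete ("less than a week old for 50 weeks in the year", i.e. roughly 96% of weeks), here is a minimal sketch of how such an objective could be evaluated. The function name and the idea of recording one worst-case suggestion age per week are assumptions for illustration, not something we have agreed on.

```python
from datetime import timedelta

MAX_AGE = timedelta(weeks=1)  # from the example in the description
REQUIRED_WEEKS = 50           # from the example in the description


def meets_freshness_objective(weekly_max_ages: list[timedelta]) -> bool:
    """Check a year's worth of weekly observations against the objective.

    weekly_max_ages is assumed to hold, for each week of the year,
    the age of the oldest suggestions served during that week.
    """
    compliant_weeks = sum(1 for age in weekly_max_ages if age < MAX_AGE)
    return compliant_weeks >= REQUIRED_WEEKS
```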

SLO breakdown

We depend on the following teams and need to figure out what guarantees they can make. In order of priority:

Data Engineering
- upstream data dependencies - missing wmf.wikidata_item_page_link snapshots have been the most frequent and critical issue so far
- data pipeline infrastructure
  - an-airflow1004.eqiad.wmnet Airflow machine
  - DAGs repo
  - GitLab CI tools
Search - discovery.cirrus_index_without_content/cirrus_replica=codfw snapshots
Enterprise - HTML dumps

The following teams depend on us and we need to know their minimum requirements:

Search
Growth
Android
ourselves - notifications & MediaSearch
iOS - maybe, we should check whether they have something

Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T340437 [EPIC] Data pipelines maintenance
		Resolved		MarkTraceur	T338949 [L] Define SLOs/SLAs for image-suggestions pipelines

Event Timeline

Cparle created this task. Jun 13 2023, 12:59 PM

Restricted Application added a subscriber: Aklapper. Jun 13 2023, 12:59 PM

CBogen added a project: Structured-Data-Backlog. Jun 13 2023, 1:09 PM

AUgolnikova-WMF moved this task from Triage to Current Work on the Structured-Data-Backlog board. Jun 14 2023, 12:28 PM

AUgolnikova-WMF edited projects, added Structured-Data-Backlog (Current Work); removed Structured-Data-Backlog.

AUgolnikova-WMF moved this task from Incoming to Ready for Estimation on the Structured-Data-Backlog (Current Work) board.

MarkTraceur renamed this task from Define SLOs/SLAs for image-suggestions pipelines to [L] Define SLOs/SLAs for image-suggestions pipelines. Jun 14 2023, 4:57 PM

MarkTraceur moved this task from Ready for Estimation to Ready for Development on the Structured-Data-Backlog (Current Work) board.

mfossati added parent tasks: T296814: [EPIC] Article-level image suggestions data pipeline, T311814: [EPIC] Section-level image suggestions data pipeline. Jun 15 2023, 10:20 AM

AUgolnikova-WMF edited projects, added Structured-Data-Backlog; removed Structured-Data-Backlog (Current Work). Jun 19 2023, 4:36 PM

AUgolnikova-WMF mentioned this in T340437: [EPIC] Data pipelines maintenance. Jun 26 2023, 11:42 AM

AUgolnikova-WMF added a parent task: T340437: [EPIC] Data pipelines maintenance. Jun 26 2023, 11:46 AM

AUgolnikova-WMF removed parent tasks: T311814: [EPIC] Section-level image suggestions data pipeline, T296814: [EPIC] Article-level image suggestions data pipeline. Jun 26 2023, 11:50 AM

AUgolnikova-WMF moved this task from Triage to SDAW Search Improvements on the Structured-Data-Backlog board. Jun 26 2023, 4:32 PM

AUgolnikova-WMF moved this task from SDAW Search Improvements to Current Work on the Structured-Data-Backlog board.

AUgolnikova-WMF edited projects, added Structured-Data-Backlog (Current Work); removed Structured-Data-Backlog.

mfossati mentioned this in T345141: No ALIS for 2023-08-14 snapshot. Sep 1 2023, 4:35 PM

I suggest taking this incident into account: T345141: No ALIS for 2023-08-14 snapshot

mfossati mentioned this in T345188: Add Image: all wikis ran out of image recommendations. Sep 8 2023, 8:46 AM

KStoller-WMF subscribed. Sep 8 2023, 3:00 PM

AUgolnikova-WMF moved this task from Ready for Development to Epics on the Structured-Data-Backlog (Current Work) board. Oct 3 2023, 3:59 PM

AUgolnikova-WMF moved this task from Epics to Ready for Development on the Structured-Data-Backlog (Current Work) board.

MarkTraceur claimed this task. Feb 20 2024, 4:35 PM

MarkTraceur moved this task from Ready for Development to Doing on the Structured-Data-Backlog (Current Work) board.

mfossati updated the task description. Mar 1 2024, 1:04 PM

mfossati updated the task description.

Update: as of now, the proposed SLA we're working with is the following.

The Structured Content team will ensure that at any given point, image suggestions provided by our API will be, at maximum, 21 days old.
The Structured Content team will ensure that, in any given one-week period, the internal image recommendation API will have, at minimum, 95% uptime, allowing for a maximum 8.4 hours of downtime.
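
For anyone checking the arithmetic behind the 8.4 hours: a week has 168 hours, and 5% of that is the allowed downtime. A minimal sketch, with function names invented for illustration:

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours in a one-week period
UPTIME_TARGET = 0.95     # from the proposed SLA


def downtime_budget_hours(window_hours: float = HOURS_PER_WEEK,
                          target: float = UPTIME_TARGET) -> float:
    """Maximum downtime allowed in the window: 168 h * (1 - 0.95) is about 8.4 h."""
    return window_hours * (1 - target)


def uptime_met(downtime_hours: float) -> bool:
    """True if the observed weekly downtime stays within the budget."""
    return downtime_hours <= downtime_budget_hours()
```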

I've shared this with a few people already but now would like wider feedback on it, and if nobody objects to this proposal, it will become our agreement with our downstream clients.

If either of these points is not being met, resolving any issue causing downtime would become our highest priority immediately.

The likely impact is ultimately pretty low, but this gives us a clear line that, if crossed, escalates the situation automatically. We have had no issues causing significant downtime of the API, and only a few issues causing multiple weeks of disruption in suggestion generation.

Thanks everyone for your patience and input on this matter!

KStoller-WMF added subscribers: Sgs, DMburugu. Mar 12 2024, 3:59 PM

MarkTraceur moved this task from Doing to Code Review on the Structured-Data-Backlog (Current Work) board. Mar 18 2024, 4:10 PM

OK, once again and for the final time, the SLA we'll be working with is the following:

The Structured Content team will ensure that at any given point, image suggestions provided by our API will be, at maximum, 21 days old.

The Structured Content team will ensure that, in any given one-week period, the internal image recommendation API will have, at minimum, 95% uptime, allowing for a maximum 8.4 hours of downtime.

If either of these points is not being met, resolving any issue causing downtime would become our highest priority immediately.
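
As a rough illustration of how the two commitments and the escalation rule fit together, something like the following could be used. It is a sketch under assumptions: the function name is hypothetical, and measuring suggestion age from the time the newest suggestions were generated is my reading of "21 days old", not an agreed definition.

```python
from datetime import datetime, timedelta, timezone

MAX_SUGGESTION_AGE = timedelta(days=21)               # from the SLA
MAX_WEEKLY_DOWNTIME = timedelta(hours=8, minutes=24)  # 8.4 hours, from the SLA


def sla_breaches(newest_suggestions_at: datetime,
                 weekly_downtime: timedelta,
                 now: datetime) -> list[str]:
    """Return the names of the SLA points currently not being met."""
    breaches = []
    if now - newest_suggestions_at > MAX_SUGGESTION_AGE:
        breaches.append("freshness")
    if weekly_downtime > MAX_WEEKLY_DOWNTIME:
        breaches.append("uptime")
    return breaches
```

Any non-empty result would mean that fixing the underlying issue becomes our highest priority.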

Thanks to the folks who have helped with this process, and let us know if anything changes that would require a modification to this!
