
Define Product-Level Service Level Objectives (SLOs) for Experimentation Lab
Closed, Resolved | Public | 8 Estimated Story Points

Description

A subtask of T382107: [Epic] Establish Product-Level Service Level Objectives (SLOs) for Experimentation Lab. This task tracks the definition of the product-level SLOs.

User Story

As a Product Manager,
I want to define clear, measurable SLO targets for availability, latency, error rates, and data quality,
So that we can ensure the Experimentation Lab meets customer expectations and maintain high product reliability

Acceptance criteria
  • Define availability metric/uptime target with specified measurement window (with performance)
  • Define error rate metrics (with engineering manager)
    • Define error categories (5xx, 4xx, validation errors)
    • Specify error rate targets (<0.1% for critical paths)
  • Define latency metrics (with engineering manager)
    • Define p50, p95, p99 targets for each critical endpoint (with measurement points)
  • Define data quality metrics (with data quality engineer / engineering manager?)
    • Data quality dimensions framework (with data scientist)
    • Define completeness metrics (>99.5%) with specified accuracy targets
    • Define freshness requirements
  • Thresholds align with customer expectations
  • Thresholds take business impact into account
  • Thresholds are technically feasible
  • Thresholds include error budgets
  • Thresholds are reviewed by key stakeholders
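
To make the arithmetic behind these criteria concrete, here is a minimal sketch of how the latency percentiles, the error budget, and the completeness check could be computed. It only illustrates the calculations; the function names, the nearest-rank percentile method, and the sample numbers are assumptions made for this example and do not come from the SLO draft. The 0.1% and 99.5% thresholds are the ones listed above.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile (one common definition, assumed here)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Rank of the smallest value that covers p percent of the samples.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def budget_remaining(failed, total, max_error_rate):
    """Fraction of the error budget left in the measurement window.

    The budget is the number of failures the target allows, i.e.
    max_error_rate * total. A negative result means the budget is overspent.
    """
    allowed = max_error_rate * total
    return 1 - failed / allowed


def meets_completeness(received, expected, target=0.995):
    """True when the share of expected records that arrived meets the target."""
    return received / expected >= target


# Hypothetical numbers, for illustration only.
latencies_ms = list(range(1, 101))  # 100 samples: 1 ms .. 100 ms
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))

# 250 failures out of 1,000,000 requests against a 0.1% error rate target
# leaves three quarters of the budget unspent.
remaining = budget_remaining(failed=250, total=1_000_000, max_error_rate=0.001)

complete = meets_completeness(received=9_960, expected=10_000)
```

Whichever percentile definition and measurement window the team picks, writing them down next to each target keeps the dashboards and the SLO document comparable.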

Event Timeline

VirginiaPoundstone added a subscriber: RLazarus.

Hello @RLazarus do you have specific guidance for PMs when defining the SLO targets that I should consider here? Also, any feedback on how the task and epic are defined is appreciated. :)

I'm trying but failing to find time to update the SLO draft. If I'm not done with it by the end of this week, I'll need some help.

Milimetric set the point value for this task to 8. May 27 2025, 12:04 PM

Draft SLO published at https://wikitech.wikimedia.org/wiki/SLO/Experimentation_lab

This needs updates and then review.

@phuedx: can you own the Operational section? Assuming you agree with the rest of what we wrote above, and the SLIs that we're tracking, we need dashboards that actually track that stuff and alerts to go somewhere.

After that, anyone and everyone should review as this is a team commitment.