
Defined and measured SLO for every production service - COMPLETE
Closed, Resolved (Public)

Event Timeline

calbon renamed this task from proposed: observability to Proposed Goal: One SLO for every important service. Jul 12 2023, 2:25 PM
calbon renamed this task from Proposed Goal: One SLO for every important service to Proposed Goal: Defined and measured SLO for every production service.
calbon renamed this task from Proposed Goal: Defined and measured SLO for every production service to Goal: Defined and measured SLO for every production service. Jul 12 2023, 2:54 PM
calbon renamed this task from Goal: Defined and measured SLO for every production service to Defined and measured SLO for every production service. Jul 17 2023, 7:28 PM

Update:

  • Identified basic SLIs (latency and the share of HTTP 2xx responses) and their thresholds. New services will start from a 95% baseline, to be refined towards 99%+ if needed, since new services tend to be more experimental and take time to tune.
  • For Revscoring and RR Language Agnostic we'll start from 98%, since those services are meant to be more stable and less likely to receive experimental changes.
  • Following SRE's suggestion, we created a formal SLO [[ SLO/Lift Wing - Wikitech | page ]], plus an ad-hoc Grafana dashboard built from their templates.
  • The Grizzly templates are almost ready to go; they are currently under code review.
  • Ores Legacy is next.

Once the code review is merged, we should be good. New services should have an SLO of at least 95% availability and at least 95% of requests served under a latency threshold. We can refine these targets over time.
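
As a rough illustration of what those two targets mean in practice, here is a minimal Python sketch that checks a batch of requests against them. This is not how the Grafana dashboards or Grizzly templates compute anything; the `Request` type, the `sli_report` function, and the 500 ms latency threshold in the usage note are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code of the response
    latency_ms: float  # observed request latency in milliseconds


def sli_report(requests, latency_threshold_ms, target=0.95):
    """Compute availability and latency SLIs and compare both to one target.

    Availability is the share of HTTP 2xx responses; the latency SLI is the
    share of requests at or below latency_threshold_ms (an assumed value,
    the task does not name a concrete threshold).
    """
    total = len(requests)
    if total == 0:
        raise ValueError("no requests in the window")

    ok = sum(1 for r in requests if 200 <= r.status < 300)
    fast = sum(1 for r in requests if r.latency_ms <= latency_threshold_ms)

    availability = ok / total
    latency_sli = fast / total
    return {
        "availability": availability,
        "latency": latency_sli,
        "availability_met": availability >= target,
        "latency_met": latency_sli >= target,
    }
```

For example, `sli_report(reqs, latency_threshold_ms=500.0)` on a window where 96 of 100 requests returned 2xx but only 94 were under the threshold would meet the availability target and miss the latency one.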

The SLO and error budget are calculated over a time window of 3 months, which lags one month behind the end of the quarter to allow us to plan for the next quarter.
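
For readers less familiar with error budgets, the arithmetic can be sketched as follows. This is a minimal sketch, assuming a request-based budget (allowed failures = (1 - target) x total requests in the window); the function name `error_budget` and the request counts in the usage note are made up for illustration and are not taken from our dashboards.

```python
def error_budget(total_requests, failed_requests, slo_target):
    """Return (allowed_failures, remaining_failures, remaining_fraction).

    Assumes a request-based error budget over a single reporting window:
    with a target of 0.95, up to 5% of requests may fail before the SLO
    is violated for that window.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")

    allowed = (1.0 - slo_target) * total_requests
    remaining = allowed - failed_requests
    # A target of exactly 1.0 leaves no budget at all; avoid dividing by zero.
    remaining_fraction = remaining / allowed if allowed > 0 else 0.0
    return allowed, remaining, remaining_fraction
```

With a 95% target and 1,000,000 requests in the window, the budget is about 50,000 failed requests; if 20,000 have failed so far, roughly 60% of the budget is left. A negative remaining value means the budget for that window is already spent.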

calbon renamed this task from Defined and measured SLO for every production service to Defined and measured SLO for every production service - COMPLETE. Sep 26 2023, 2:30 PM

@calbon: If this task is complete and there is nothing else to do in it, then please set the task status to resolved. Workboards are not the only entry point to work to do (= tasks): there is also the task search, and a search for open tasks should list only non-completed tasks, not tasks which seem to be done (if I understand correctly). In addition, metrics and statistics become incorrect when tickets with no further action expected are left open for no obvious reason. Thanks for your understanding! Also, the workboard column name "Complete Q3 2022/23" implies Jan-Mar 2023; is that a typo?