Define SLOs and error budget for WDQS
Open, Medium, Public

Description

Following the 2020-07-23 WDQS Outage, it was recognized that WDQS exposes a public endpoint where overly expensive queries can compromise service availability by knocking WDQS instances offline. We therefore need to define SLOs and an error budget for the service, and publicize them.

  • SLOs defined and reviewed
  • Error budget defined and reviewed by broader SRE team
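
For reference, a minimal sketch of how an error budget follows from an availability SLO over one reporting window. The function name `error_budget`, the 99% target, and the request counts are hypothetical placeholders for illustration, not values agreed for WDQS.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compute the error budget implied by an availability SLO for one window.

    slo_target is the fraction of requests that must succeed (e.g. 0.99).
    All inputs here are illustrative; real values would come from monitoring.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the number of requests allowed to fail without breaching the SLO.
    allowed_failures = (1.0 - slo_target) * total_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining": allowed_failures - failed_requests,
        "consumed_fraction": failed_requests / allowed_failures,
    }


# Hypothetical window: 1,000,000 queries, 2,500 failures, 99% target.
budget = error_budget(0.99, 1_000_000, 2_500)
print(budget)
```

With these placeholder numbers, roughly a quarter of the budget is consumed. A negative `remaining` value would mean the SLO was breached for that window.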

Event Timeline

Gehel updated the task description.
Gehel triaged this task as Medium priority. Sep 8 2020, 7:13 PM

Note that a similar discussion was already started on T199228. It was closed, waiting for architectural changes to be implemented first.

Will this ticket be resolved by the dashboard built for https://phabricator.wikimedia.org/T293027 and a final SLO value we settle on?