
Create dashboards for Search SLOs
Closed, ResolvedPublic5 Estimated Story Points

Description

Dashboard link => https://grafana-rw.wikimedia.org/d/xiWr1c5Iz/search-slos?forceLogin&forceLogin&orgId=1&from=now-1y&to=now


All SLOs defined in T335498 are measured and exposed as dashboards so that we can ensure that SLOs are met or that actions are taken to meet them.

Note that at the moment we have defined the SLIs (what we want to measure), but the SLOs themselves (the level we want to achieve) aren't entirely settled. We will define them after seeing the current measurements. They will probably take a form like "latency is below 100 milliseconds for the 95th percentile of requests 99.9% of the time over 3 months" (numbers subject to change). It would be useful to have those numbers as parameters in the dashboards so that we can play with them until we decide on a final SLO.
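To make the parameterized form above concrete, here is a minimal sketch of what such an SLO check computes. The function name, the 100 ms threshold, and the 99.9% target are placeholders taken from the example wording, not final SLO numbers.

```python
# Hypothetical sketch of the parameterized SLO check described above.
# threshold_ms and target_fraction are placeholders, not final SLO numbers.

def slo_compliance(window_p95_latencies_ms, threshold_ms=100, target_fraction=0.999):
    """Given one p95 latency reading per measurement window, return the
    fraction of windows under threshold_ms and whether that fraction
    meets the target (e.g. 99.9% of the time over 3 months)."""
    good = sum(1 for p95 in window_p95_latencies_ms if p95 < threshold_ms)
    fraction_good = good / len(window_p95_latencies_ms)
    return fraction_good, fraction_good >= target_fraction

# Example: 998 of 1000 windows within threshold -> 99.8%, below the 99.9% target.
fraction, met = slo_compliance([50] * 998 + [150] * 2)
```

Exposing `threshold_ms` and `target_fraction` as dashboard variables is what lets us play with candidate SLO values before committing to one.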

Our approach to SLOs is documented on Wikitech. Dashboards are usually implemented as Grizzly dashboards. In this case the data is likely to come from the search satisfaction schema or web request logs, so it might be easier to track them as Superset dashboards, but this makes integration with other SLOs and reporting more complex.

AC

  • Decision is made on where to create dashboards (superset / grafana grizzly dashboards)
  • Dashboards exist for all the SLOs defined in T335498

Event Timeline

Gehel set the point value for this task to 5.Jun 26 2023, 3:52 PM
Gehel triaged this task as High priority.Oct 11 2023, 8:39 AM

Balthazar and I met last week. We took a look at the temporary dashboard and outlined some alerting threshold values for the various SLIs:

Alert thresholds

full text latency alert threshold around 4000 ms https://grafana-rw.wikimedia.org/d/H6f-bA7Sk/rkemper-search-sli-test?forceLogin&orgId=1&from=1693181669857&to=1698905807314&viewPanel=14

autocomplete latency p95 (normally oscillates between 300-525 ms across the day) => >1000 ms for a short time OR >600 ms for 30 minutes https://grafana.wikimedia.org/d/H6f-bA7Sk/rkemper-search-sli-test?forceLogin&from=1693181669857&to=1698905807314&orgId=1&viewPanel=16

Search preview p95 > 15,000 ms for a short time, or search preview p95 > 10,000 ms for a longer period of time (possibly even a tighter threshold) https://grafana.wikimedia.org/d/H6f-bA7Sk/rkemper-search-sli-test?forceLogin&from=1694613913811&to=1698905807314&orgId=1&viewPanel=4

MediaSearch latency p95: Start with anything > 10,000 https://grafana.wikimedia.org/d/H6f-bA7Sk/rkemper-search-sli-test?forceLogin&from=1693181669857&to=1698905807314&orgId=1&viewPanel=8
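Several of the thresholds above combine a high "spike" value for a short time with a lower "sustained" value over a longer window (e.g. autocomplete: >1000 ms briefly OR >600 ms for 30 minutes). A rough sketch of that two-tier logic, using the autocomplete numbers as illustrative values:

```python
# Sketch of the two-tier alert logic discussed above (brief spike vs
# sustained breach). The autocomplete numbers are used as example values;
# the function and its defaults are illustrative, not an actual alert rule.

def should_alert(p95_series_ms, spike_ms=1000, sustained_ms=600, sustained_points=30):
    """p95_series_ms: one p95 reading per minute, most recent last.
    Alert on any reading above spike_ms, or when the last sustained_points
    readings all exceed sustained_ms."""
    if any(p > spike_ms for p in p95_series_ms):
        return True
    recent = p95_series_ms[-sustained_points:]
    return len(recent) == sustained_points and all(p > sustained_ms for p in recent)
```

In Grafana/Alertmanager terms this corresponds to two alert rules on the same series with different thresholds and `for:` durations.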

Next steps

We were thinking of the thresholds in terms of alerting, but we also need a quarterly SLO metric and accompanying Grizzly dashboards. So the next step is to use the above thresholds as a guide to craft 90-day SLO versions of them, probably along the lines of "for 99% of the quarter, full text latency is < 4000 ms", etc.
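Translating an alerting threshold into a 90-day SLO like the one above amounts to measuring how much of the quarter stayed under the threshold and comparing against the objective. A small sketch, with illustrative names and numbers (4000 ms / 99% from the example above, not final values):

```python
# Rough sketch of a 90-day SLO computation, e.g. "full text latency p95
# < 4000 ms for 99% of the quarter". Names and defaults are illustrative.

def error_budget(samples_ms, threshold_ms=4000, objective=0.99):
    """samples_ms: one p95 reading per measurement interval over the quarter.
    Returns (achieved fraction, remaining error budget as a fraction of
    the quarter; negative means the SLO was missed)."""
    good = sum(1 for s in samples_ms if s < threshold_ms)
    achieved = good / len(samples_ms)
    budget_total = 1.0 - objective   # e.g. 1% of the quarter may exceed the threshold
    budget_used = 1.0 - achieved
    return achieved, budget_total - budget_used
```

Framing it as a remaining error budget also gives us something to alert on before the quarterly objective is actually missed.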

Talked to @EBernhardson last week, and one thing we were uncertain of is whether it makes sense to set SLOs on metrics such as MediaSearch latency p95: since that metric represents the actual time to render the view, not just the Elasticsearch backend response time, it's possible for the SLO to be missed without anything being wrong with the Search services in particular. After talking with @Gehel, one thing we discussed is that some of these SLOs should be set in terms of what user experience we feel is acceptable; that way we'll have a metric/objective we can point to if, say, some change elsewhere in the stack leads to slowdowns. It's analogous to the role unit tests play in refactoring: they allow you to make changes while validating that the changes didn't break something.

So with that in mind, we'll just want to set reasonable user-centric values for the various SLOs, and make it clear in our documentation that there are different possible factors that could lead to the SLOs being missed rather than it all being down to the response time of our Elasticsearch backend.

Change 977770 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/grafana-grizzly@master] Search: add new SLOs

https://gerrit.wikimedia.org/r/977770

Change 977770 abandoned by Ryan Kemper:

[operations/grafana-grizzly@master] Search: add new SLOs

Reason:

template is built for prometheus queries; we'll do manual dashboard

https://gerrit.wikimedia.org/r/977770

Finished adding the SLO dashboards to https://grafana-rw.wikimedia.org/d/H6f-bA7Sk/rkemper-search-sli-test?orgId=1&from=now-90d&to=now. Remaining steps:

  • Move dashboard to have a proper name rather than rkemper test
  • Go over documentation again
  • Make a new ticket to track future work to switch over to prometheus metrics and make an actual grizzly dashboard. This is blocked on the work to switch mediawiki metrics from graphite to prometheus; we should talk to o11y to get a rough idea of the timeline.

Made various improvements to the dashboard: collated the SLIs into a single row, added threshold markers for every SLI, added y-axis labelling, and added a soft max of 600 ms to autocomplete latency, since Grafana was setting the y-axis max below 600 due to no data points existing >= 600.

Copied over the dashboard to https://grafana-rw.wikimedia.org/d/xiWr1c5Iz/search-slos?orgId=1

A few comments:

  • It is not super clear what exactly we are measuring for each of the 4 metrics (full text search, autocomplete, search preview and media search). Adding an info popup (like for the SLOs) would help, or a text panel with an overall explanation. I'd like to have a short description of the feature we are measuring (maybe just a link to the wiki page of the feature if it exists) and a short description of how we compute the metric. I think that for most of them it is the time between the user action and the start of painting in the browser, but I'm not entirely sure. Autocomplete is probably different.
  • Minor comment: the description of the SLOs says "within the time window". This window is always 30 days. This could be confused with the window of the overall dashboard (which is 90 days by default).
  • It is not clear when we are breaking our SLOs. We should probably have a line on the SLOs (just like on the raw metrics) that shows our target. We can start with a target that we are currently reaching.
  • We should add a link to https://wikitech.wikimedia.org/wiki/SLO/Search for a more in depth description of the SLOs.

Added new threshold markers at 95% for the 4 SLO graphs. We may want to revise the % SLO upwards, but let's stick with 95% for now until we get another quarter of data.

Left to do:

  • short description and/or link to the wiki page of the feature we are measuring
  • Consider changing time window defaults

Added further context to the SLI section of the documentation explaining what each query type actually means. I believe there are no more outstanding TODOs on this task.

Search Preview is still marked as "todo" in the documentation linked above.

Filled out the search preview SLI info in the documentation, and also updated the existing SLIs to make it more clear what exactly is being measured (page render, etc).