Create dashboards for Search SLOs
Closed, Resolved · Public · 5 Estimated Story Points

Assigned To

Authored By

	Gehel
	Jun 2 2023, 8:25 AM

Description

Dashboard link => https://grafana-rw.wikimedia.org/d/xiWr1c5Iz/search-slos?forceLogin&forceLogin&orgId=1&from=now-1y&to=now

All SLOs defined in T335498 are measured and exposed as dashboards so that we can ensure that SLOs are met or that actions are taken to meet them.

Note that at the moment, we have defined the SLIs (what we want to measure), but the SLOs themselves (the levels we want to achieve) aren't entirely clear. We will define them after seeing the current measurements. They will probably be defined as "latency is below 100 milliseconds for the 95th percentile of requests 99.9% of the time over 3 months" (numbers subject to change). It would be useful to have those numbers as parameters in the dashboards so that we can play with them until we decide on a final SLO.
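To make the shape of such an SLO concrete, here is a minimal sketch of the check with the numbers kept as parameters. It assumes the SLI is available as a list of per-interval p95 latency readings; the function names and the sample data are hypothetical, and the 100 ms / 99.9% defaults are only the placeholder numbers from the description above.

```python
from typing import Sequence


def slo_compliance(p95_samples_ms: Sequence[float], latency_threshold_ms: float) -> float:
    """Fraction of sampled intervals whose p95 latency was below the threshold."""
    if not p95_samples_ms:
        raise ValueError("no samples in the window")
    good = sum(1 for value in p95_samples_ms if value < latency_threshold_ms)
    return good / len(p95_samples_ms)


def slo_met(
    p95_samples_ms: Sequence[float],
    latency_threshold_ms: float = 100.0,  # placeholder from the task description
    target: float = 0.999,                # placeholder from the task description
) -> bool:
    """True if the share of good intervals reaches the target."""
    return slo_compliance(p95_samples_ms, latency_threshold_ms) >= target


# Hypothetical data: 1000 p95 readings over the window, two of them slow.
samples = [80.0] * 998 + [150.0, 220.0]
print(slo_compliance(samples, 100.0))                 # 0.998
print(slo_met(samples))                               # False
print(slo_met(samples, latency_threshold_ms=250.0))   # True
```

Keeping the threshold and target as arguments mirrors the request for dashboard parameters: the same data can be re-evaluated against different candidate SLOs.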

Our approach to SLOs is documented on Wikitech. Dashboards are usually implemented as Grizzly dashboards. In this case, the data is likely to come from the search satisfaction schema or web request logs, so it might be easier to track them as Superset dashboards, but that would make integration with other SLOs and reporting more complex.

A decision is made on where to create the dashboards (Superset or Grafana Grizzly)
Dashboards exist for all the SLOs defined in T335498

Details

	Subject	Repo	Branch	Lines +/-
	Search: add new SLOs	operations/grafana-grizzly	master	+34 -0


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Gehel	T335576 [Epic] Search SLOs
		Resolved		RKemper	T338009 Create dashboards for Search SLOs

Event Timeline

Gehel created this task. · Jun 2 2023, 8:25 AM

Restricted Application added a subscriber: Aklapper. · Jun 2 2023, 8:25 AM

Gehel mentioned this in T335499: Ensure that we collect appropriate data for Search platform SLIs. · Jun 2 2023, 8:27 AM

Gehel updated the task description. · Jun 2 2023, 8:31 AM

Gehel set the point value for this task to 5. · Jun 26 2023, 3:52 PM

Gehel added a project: Data-Platform-SRE. · Jun 27 2023, 12:58 PM

Gehel moved this task from Incoming to Ready for Dev -- SRE/Ops on the Discovery-Search (Current work) board. · Jul 17 2023, 3:40 PM

Gehel moved this task from Incoming to Ready for Work on the Data-Platform-SRE board. · Aug 29 2023, 1:36 PM

Gehel moved this task from Ready for Work to Quarterly Goals on the Data-Platform-SRE board. · Oct 11 2023, 8:29 AM

Gehel triaged this task as High priority. · Oct 11 2023, 8:39 AM

Balthazar and I met last week. We took a look at the temporary dashboard and outlined some alerting threshold values for the various SLIs:

Alert thresholds

Full-text latency: alert threshold around 4000 ms https://grafana-rw.wikimedia.org/d/H6f-bA7Sk/rkemper-search-sli-test?forceLogin&orgId=1&from=1693181669857&to=1698905807314&viewPanel=14

Autocomplete latency p95 (normally oscillates between 300 and 525 ms over the day): alert if > 1000 ms for a short time OR > 600 ms for 30 minutes https://grafana.wikimedia.org/d/H6f-bA7Sk/rkemper-search-sli-test?forceLogin&from=1693181669857&to=1698905807314&orgId=1&viewPanel=16

Search preview p95: alert if > 15,000 for a short time or > 10,000 for a longer period of time (possibly an even tighter threshold) https://grafana.wikimedia.org/d/H6f-bA7Sk/rkemper-search-sli-test?forceLogin&from=1694613913811&to=1698905807314&orgId=1&viewPanel=4

MediaSearch latency p95: start with anything > 10,000 https://grafana.wikimedia.org/d/H6f-bA7Sk/rkemper-search-sli-test?forceLogin&from=1693181669857&to=1698905807314&orgId=1&viewPanel=8
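The two-tier rule for autocomplete (a brief spike above a high limit, or a sustained period above a lower one) could be sketched roughly as follows. This is only an illustration: the function name is hypothetical, "short time" is interpreted here as "the most recent sample", and a 5-minute sampling interval is assumed. None of this reflects how alerting is actually configured.

```python
import math
from typing import Sequence


def should_alert(
    samples_ms: Sequence[float],
    sample_interval_min: int = 5,   # assumed spacing between p95 samples
    spike_ms: float = 1000.0,       # "short time" limit from the list above
    sustained_ms: float = 600.0,    # sustained limit from the list above
    sustained_min: int = 30,
) -> bool:
    """Alert on a spike in the latest sample or on a sustained breach."""
    if not samples_ms:
        return False
    # Tier 1: the latest sample alone exceeds the spike limit.
    if samples_ms[-1] > spike_ms:
        return True
    # Tier 2: every sample covering the last `sustained_min` minutes is too high.
    needed = math.ceil(sustained_min / sample_interval_min)
    recent = samples_ms[-needed:]
    return len(recent) >= needed and all(value > sustained_ms for value in recent)


print(should_alert([400.0] * 10))                # False: normal range
print(should_alert([400.0] * 9 + [1200.0]))      # True: spike
print(should_alert([400.0] * 4 + [650.0] * 6))   # True: 30 minutes above 600 ms
print(should_alert([400.0] * 5 + [650.0] * 5))   # False: only 25 minutes above 600 ms
```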

Next steps

We were thinking of the thresholds in terms of alerting, but we also need a quarterly SLO metric and accompanying Grizzly dashboards. So the next step is to use the above thresholds as a guide but craft 90-day SLO versions of those thresholds, probably something like "for 99% of the quarter, full-text latency is < 4000 ms".

Talked to @EBernhardson last week. One thing we were uncertain about is whether it makes sense to set SLOs on metrics such as MediaSearch latency p95. That metric represents the actual time to render the view, not just the Elasticsearch backend response time, so the SLO can be missed even when nothing is wrong with Search services in particular. After talking with @Gehel, one thing we discussed is that some of these SLOs should be set in terms of what user experience we feel is acceptable; that way we'll have a metric/objective that we can point to if, say, some change elsewhere in the stack leads to slowdowns. It's somewhat analogous to the role unit tests play in refactoring: they let you make changes while being able to validate that the changes didn't break something.
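As a rough worked example of what a 90-day target implies, the sketch below converts a target percentage into the amount of time the metric may spend above its threshold. The helper name is hypothetical, and a quarter is assumed to be exactly 90 days.

```python
def error_budget_hours(target: float, window_days: int = 90) -> float:
    """Hours within the window during which the SLI may be out of bounds."""
    return (1.0 - target) * window_days * 24


# "For 99% of the quarter" leaves roughly a day of slack.
print(round(error_budget_hours(0.99), 1))    # 21.6
# A stricter 99.9% target leaves only a couple of hours.
print(round(error_budget_hours(0.999), 2))   # 2.16
```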

So with that in mind, we'll just want to set reasonable user-centric values for the various SLOs, and make it clear in our documentation that there are different possible factors that could lead to the SLOs being missed rather than it all being down to the response time of our Elasticsearch backend.

RKemper claimed this task. · Nov 14 2023, 7:22 PM

Change 977770 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/grafana-grizzly@master] Search: add new SLOs

https://gerrit.wikimedia.org/r/977770

gerritbot added a project: Patch-For-Review. · Nov 27 2023, 6:52 PM

Gehel moved this task from Quarterly Goals to 2023.12.01 - 2023.12.31 on the Data-Platform-SRE board. · Dec 12 2023, 9:30 AM

Gehel edited projects, added Data-Platform-SRE (2023.12.01 - 2023.12.31); removed Data-Platform-SRE.

Gehel moved this task from Backlog to In Progress on the Data-Platform-SRE (2023.12.01 - 2023.12.31) board.

Gehel edited projects, added Data-Platform-SRE (2024.01.01 - 2024.01.21); removed Data-Platform-SRE (2023.12.01 - 2023.12.31). · Dec 19 2023, 4:46 PM

Gehel moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.01.01 - 2024.01.21) board.

Change 977770 abandoned by Ryan Kemper:

[operations/grafana-grizzly@master] Search: add new SLOs

Reason:

template is built for prometheus queries; we'll do manual dashboard

https://gerrit.wikimedia.org/r/977770

Maintenance_bot removed a project: Patch-For-Review. · Jan 8 2024, 10:30 PM

Finished adding the SLO dashboards to https://grafana-rw.wikimedia.org/d/H6f-bA7Sk/rkemper-search-sli-test?orgId=1&from=now-90d&to=now. Remaining steps:

Rename the dashboard to something proper rather than "rkemper test"
Go over documentation again
Make a new ticket to track future work to switch over to Prometheus metrics and make an actual Grizzly dashboard. This is blocked on the work to switch MediaWiki metrics from Graphite to Prometheus; we should talk to o11y to get a rough idea of the timeline.

Gehel moved this task from Ready for Dev -- SRE/Ops to DPE-SRE on the Discovery-Search (Current work) board. · Jan 16 2024, 3:19 PM

Made various improvements to the dashboard: collated the SLIs into a single row, added threshold markers for every SLI, added y-axis labelling, and added a soft max of 600 ms to autocomplete latency, since Grafana was setting the y-axis max below 600 because no data points >= 600 existed.
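For reference, the panel settings described above (axis label, soft max, and a threshold line) might look roughly like the fragment below, written as a Python dict that serialises to Grafana panel JSON. The field names (`axisSoftMax`, `axisLabel`, `thresholdsStyle`, `thresholds.steps`) are my understanding of Grafana's time series panel schema and should be checked against the Grafana version in use; the panel title, label text, and the choice of 600 as the threshold value are assumptions for illustration, not a copy of the real dashboard.

```python
import json

# Hypothetical fragment of a time series panel definition.
autocomplete_panel = {
    "type": "timeseries",
    "title": "Autocomplete latency p95",  # assumed title
    "fieldConfig": {
        "defaults": {
            "unit": "ms",
            "custom": {
                "axisLabel": "latency (ms)",          # y-axis labelling
                "axisSoftMax": 600,                   # axis reaches at least 600 even with no data there
                "thresholdsStyle": {"mode": "line"},  # draw thresholds as lines on the graph
            },
            "thresholds": {
                "mode": "absolute",
                "steps": [
                    {"color": "green", "value": None},  # base step
                    {"color": "red", "value": 600},     # assumed threshold marker
                ],
            },
        }
    },
}

print(json.dumps(autocomplete_panel, indent=2))
```

A soft max, unlike a hard max, still lets the axis grow if values exceed 600, so spikes remain visible while the threshold line is always on screen.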

Copied over the dashboard to https://grafana-rw.wikimedia.org/d/xiWr1c5Iz/search-slos?orgId=1

Gehel edited projects, added Data-Platform-SRE (2024.01.22 - 2024.02.11); removed Data-Platform-SRE (2024.01.01 - 2024.01.21). · Jan 22 2024, 1:42 PM

Gehel moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.01.22 - 2024.02.11) board. · Jan 22 2024, 1:43 PM

Finished the documentation. With the new dashboard up in https://grafana-rw.wikimedia.org/d/xiWr1c5Iz/search-slos?orgId=1, this work is complete.

RKemper mentioned this in T355589: Migrate Search SLOs to prometheus based metrics. · Jan 22 2024, 7:28 PM

A few comments:

It is not entirely clear what exactly we are measuring for each of the 4 metrics (full text search, autocomplete, search preview and media search). Adding an info popup (like for the SLOs) would help, or a text panel with an overall explanation. I'd like to have a short description of the feature we are measuring (maybe just a link to the wiki page of the feature if it exists), and a short description of how we compute the metric. I think that for most of them, it is the time between user action and start of painting in the browser, but I'm not entirely sure. Autocomplete is probably different.
Minor comment: the description of the SLOs says "within the time window". This window is always 30 days. This could be confused with the window of the overall dashboard (which is 90 days by default).
It is not clear when we are breaking our SLOs. We should probably have a line on the SLOs (just like on the raw metrics) that shows our target. We can start with a target that we are currently reaching.
We should add a link to https://wikitech.wikimedia.org/wiki/SLO/Search for a more in depth description of the SLOs.
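To illustrate the difference between the 30-day SLO window and the 90-day dashboard range, and what a target line on the SLO graph would flag, here is a small sketch. It assumes the SLI has already been reduced to a per-day fraction of "good" measurements; the function name and the data are hypothetical, and the 0.95 target is only an example value.

```python
from typing import List, Sequence


def rolling_compliance(daily_good_fraction: Sequence[float], window_days: int = 30) -> List[float]:
    """Mean good fraction over each full trailing window of `window_days` days."""
    result = []
    for end in range(window_days, len(daily_good_fraction) + 1):
        window = daily_good_fraction[end - window_days:end]
        result.append(sum(window) / window_days)
    return result


# Hypothetical 90 days of data with a 5-day bad stretch (days 40 to 44).
daily = [1.0] * 90
for day in range(40, 45):
    daily[day] = 0.0

target = 0.95  # example target line
series = rolling_compliance(daily)
breaches = [value for value in series if value < target]

print(len(series))               # 61 points: one per full 30-day window
print(round(min(series), 3))     # 0.833: worst 30-day window
print(len(breaches) > 0)         # True: some windows fall below the target
```

Each point on the SLO graph summarises only its own trailing 30 days, regardless of how wide the dashboard's time range is, which is why the two windows are easy to confuse.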

Added new threshold markers at 95% for the 4 SLO graphs. We may want to revise the % SLO upwards, but let's stick with 95% for now until we get another quarter of data.

Left to do:

short description and/or link to the wiki page of the feature we are measuring
Consider changing time window defaults

Gehel edited projects, added Data-Platform-SRE (2024.02.12 - 2024.03.03); removed Data-Platform-SRE (2024.01.22 - 2024.02.11). · Feb 9 2024, 10:46 AM

Gehel moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.02.12 - 2024.03.03) board.

Gehel edited projects, added Data-Platform-SRE (2024.03.04 - 2024.03.24); removed Data-Platform-SRE (2024.02.12 - 2024.03.03). · Mar 1 2024, 4:00 PM

Gehel moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.03.04 - 2024.03.24) board. · Mar 1 2024, 4:21 PM

RKemper updated the task description. · Mar 4 2024, 7:37 PM

Gehel edited projects, added Data-Platform-SRE (2024.03.25 - 2024.04.14); removed Data-Platform-SRE (2024.03.04 - 2024.03.24). · Mar 22 2024, 8:45 AM

Gehel moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.03.25 - 2024.04.14) board. · Mar 22 2024, 8:45 AM

Gehel edited projects, added Data-Platform-SRE (2024.04.15 - 2024.05.05); removed Data-Platform-SRE (2024.03.25 - 2024.04.14). · Apr 15 2024, 12:39 PM

Gehel moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.04.15 - 2024.05.05) board.

Added further context to the SLI section of the documentation explaining what each query type actually means. I believe there are no more outstanding TODOs on this task.

Search Preview is still marked as "todo" in the documentation linked above.

Filled out the search preview SLI info in the documentation, and also updated the existing SLIs to make it clearer what exactly is being measured (page render, etc).

Gehel closed this task as Resolved. · May 1 2024, 12:21 PM

Gehel moved this task from Needs Review to Done on the Data-Platform-SRE (2024.04.15 - 2024.05.05) board.
