
Ensure that we collect appropriate data for Search platform SLIs
Closed, ResolvedPublic8 Estimated Story Points

Description

Make sure that all SLIs (see T335498) are being measured, so that we can ensure that we are meeting operational service level objectives. This task is about metric collection, creating the dashboards will follow in T338009.

We need some discussion on how to collect those metrics. Should we reuse the Search Satisfaction Schema? Should we use statsv (see also T315091#8311847)? As an example, there is a pageview grafana dashboard that could be helpful.

Summary of the chosen SLOs:
Latency

  • Special:Search latency
    • The amount of time it takes to return search results for a query. This includes all extra search features:
      • sister search
      • did you mean
  • MediaSearch latency
    • The amount of time it takes MediaSearch to return media results
  • Autocomplete latency
    • The amount of time it takes to return article suggestions based on autocompleted strings in the search bar
  • Search preview latency
    • The amount of time it takes for the search preview and its elements to respond to user actions
  • NTH: Bot latency
    • What is a reasonable amount of latency for bots? What is the best way to measure this?

Updates

  • Search update lag
    • The amount of time it takes for updates/edits to wikis to be reflected in search results – i.e. how long do I have to wait before I can search for something I just changed?

AC:

  • instrumentation exists for all SLIs defined in T335498

Event Timeline

Gehel triaged this task as High priority. May 1 2023, 8:08 AM
Gehel moved this task from needs triage to Current work on the Discovery-Search board.
Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.
Gehel renamed this task from Create instrumentation and monitoring for Search platform SLOs to Create instrumentation for Search platform SLOs. Jun 2 2023, 8:17 AM
Gehel updated the task description.
Gehel updated the task description.
Gehel renamed this task from Create instrumentation for Search platform SLOs to Ensure that we collect appropriate data for Search platform SLIs. Jun 2 2023, 8:27 AM
Gehel updated the task description.
Gehel set the point value for this task to 8. Jun 26 2023, 3:46 PM

While some of these metrics are available from the SearchSatisfaction (and potentially mediasearch_interaction) EventLogging schemas, it's not clear we actually want to source the data from those events. Per the linked dashboarding ticket, our happy path of defining SLOs in Grizzly will require emitting individual metrics directly. That leaves a question: do we re-implement all of this in a new data collection process, or do we integrate metrics collection into the code that already emits search satisfaction events?

The search satisfaction schema was heavily iterated on and used for A/B testing in the past. As a result, the data collection has many awkward parts: it was modified as needed for each test over time, and since we rarely wanted to invest significant effort refactoring the data collection for a single test, many modifications are bolted on in awkward places. We could go through and simplify away the things we suspect we no longer need. Alternatively, we could start over with something simple, direct, and focused on the task at hand. Building new data collection gives us the opportunity to shed old baggage, but would we really be able to retire the old data collection?


The amount of time it takes to return search results for a query.

Currently collected as the msToDisplayResults field of the SearchSatisfaction schema.

MediaSearch latency

I reviewed usage of $log (the MediaSearch JS abstraction for event logging) in the MediaSearch repo and don't see this metric collected anywhere, though it could be somewhere I'm simply not noticing.

Autocomplete latency

Currently collected as the msToDisplayResults field of the SearchSatisfaction schema.

The amount of time it takes for the search preview and its elements to respond to user actions

Unknown; needs investigation.

NTH: Bot latency

I can't think of a way to measure latency as seen by bots; they won't be sending us any metrics. We can pull the latency they observe out of the mediawiki_cirrussearch_request events, but we have to decide on which mechanism to use for deciding what counts as a bot. Options include:

  • Whatever the webrequest pipeline uses (at one time this was user-agent analysis; it may be more complex these days)
  • Inverting heuristics we've used before to filter out bots (mostly request counts per day per IP). Probably too many false positives.
  • Considering all requests from public clouds to be bots. No idea what percentage of bots this would capture, but it might be a reasonable proxy.
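As a rough illustration of the first option, a minimal user-agent heuristic could look like the sketch below. The patterns are purely illustrative; the real webrequest pipeline's bot classification is more involved, and this function is hypothetical, not existing code.

```javascript
// Hypothetical sketch: classify traffic as bot-like by user agent string.
// Patterns are illustrative only, not the actual webrequest pipeline rules.
function looksLikeBot( userAgent ) {
    // Match self-identifying crawlers and common HTTP client libraries.
    return /bot|crawler|spider|curl|wget|python-requests/i.test( userAgent || '' );
}
```

Note this only catches well-behaved, self-identifying clients; it would miss any bot that spoofs a browser user agent, which is why the volume-based and public-cloud heuristics above are worth considering alongside it.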

Search update lag

In T320408 David added a metric to track this, visible at https://grafana.wikimedia.org/d/8xDerelVz/search-update-lag-slo?orgId=1

Change 939776 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/MediaSearch@master] Add timing metrics around search results

https://gerrit.wikimedia.org/r/939776

Change 939776 merged by jenkins-bot:

[mediawiki/extensions/MediaSearch@master] Add timing metrics around search results

https://gerrit.wikimedia.org/r/939776

Change 946979 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/core@master] ooui: SearchInputWidget must send request start events

https://gerrit.wikimedia.org/r/946979

Change 946979 merged by jenkins-bot:

[mediawiki/core@master] ooui: SearchInputWidget must send request start events

https://gerrit.wikimedia.org/r/946979

Change 948166 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/SearchVue@master] Track how long it takes to open a preview

https://gerrit.wikimedia.org/r/948166

Change 948645 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/WikimediaEvents@master] Add Search SLI tracking

https://gerrit.wikimedia.org/r/948645

Change 948645 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] Add Search SLI tracking

https://gerrit.wikimedia.org/r/948645

Change 948166 merged by jenkins-bot:

[mediawiki/extensions/SearchVue@master] Track how long it takes to open a preview

https://gerrit.wikimedia.org/r/948166

Change 948645 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] Add Search SLI tracking

https://gerrit.wikimedia.org/r/948645

WikimediaEvents/searchSli.js
$( function () {
    const took = performance.timing.loadEventEnd - performance.timing.navigationStart;
    mw.track( 'timing.Search.FullTextResults', took );
} );
The $() function stands for "dom-ready"; in other words, it is a reliable, memory-fired callback for the DOMContentLoaded event. This fires significantly before the window.onload event and thus before the loadEventEnd metric is defined. Even something like $(window).on('load') is insufficient since, as the name suggests, loadEventEnd is only marked once "the load event has ended".

The descriptions at T335499 and T335498 do not specify the intended meaning of timing.Search.FullTextResults, i.e. what it should reflect and thus what part of loadEventEnd is your signal, and what part is compromise/noise. Depending on what the intended signal is, a better fitting metric might exist.

As-is, this computation likely yields negative numbers, which could heavily skew the statsd metric in question.

The only reliable way to obtain loadEventEnd is from one setTimeout(,0) tick after window.onload. And, given regular browser events do not have firing memory, you have to do the equivalent of what $() does underneath, which is to check document.readyState to determine if the event has already fired. Summarised from the equivalent code in Navigation Timing:

Navigation Timing extension
if ( document.readyState === 'complete' ) { setTimeout( cb ); }
else {
  window.addEventListener( 'load', function () { setTimeout( cb ); } );
}
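Putting the two pieces together, a corrected instrument might look like the sketch below. This is only a sketch mirroring the original snippet's use of the (deprecated) performance.timing API; the timing computation is factored into a pure function so it can be exercised outside a browser.

```javascript
// Sketch only: compute load time from a Navigation Timing (Level 1) record.
// Pure function so it can be tested without a browser environment.
function computeLoadTime( timing ) {
    return timing.loadEventEnd - timing.navigationStart;
}

// Run cb one tick after window.onload, whether or not 'load' already fired.
function onLoadComplete( cb ) {
    if ( document.readyState === 'complete' ) {
        // 'load' already fired; wait one tick so loadEventEnd is defined.
        setTimeout( cb );
    } else {
        window.addEventListener( 'load', function () { setTimeout( cb ); } );
    }
}

// In the instrument itself (browser only):
// onLoadComplete( function () {
//     mw.track( 'timing.Search.FullTextResults', computeLoadTime( performance.timing ) );
// } );
```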

Alternatively, you may want to use the domInteractive metric. This measures from navigation start, through time to first byte, time to last byte, and the browser parsing the HTML into a fully formed DOM; the metric is marked at that point, just before DOMContentLoaded/$() is invoked. loadEventEnd additionally waits for all $() handlers, all window.onload handlers, all async JS to finish downloading and executing, and for below-the-fold images to finish downloading and rendering. Whether that is desired, I don't know. Note that the sister instrument WikimediaEvents/readingDepth.js uses domInteractive.
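For reference, in Navigation Timing Level 2 the timestamps on the navigation entry are already relative to navigation start, so domInteractive can be read directly without subtraction. A minimal sketch (the helper is hypothetical; entries would come from performance.getEntriesByType('navigation') in a browser):

```javascript
// Sketch: read domInteractive from a Navigation Timing Level 2 entry.
// Level 2 timestamps are already relative to navigation start.
function getDomInteractive( navEntries ) {
    const entry = navEntries && navEntries[ 0 ];
    return entry ? entry.domInteractive : null;
}

// Browser usage (not run here):
// const ms = getDomInteractive( performance.getEntriesByType( 'navigation' ) );
// if ( ms !== null ) { mw.track( 'timing.Search.FullTextResults', ms ); }
```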

Alternatively, if visual rendering is what you're after, getEntries('paint') and first-contentful-paint may be of interest. This is the main KPI we recommend at https://wikitech.wikimedia.org/wiki/MediaWiki_Engineering/Guides/Frontend_performance_practices.
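As a rough sketch of that approach (the helper is hypothetical; in a browser the entries would come from performance.getEntriesByType('paint'), and a PerformanceObserver would be needed to catch the entry if it hasn't fired yet):

```javascript
// Sketch: extract the first-contentful-paint timestamp from paint entries,
// as returned by performance.getEntriesByType( 'paint' ) in a browser.
function getFirstContentfulPaint( paintEntries ) {
    const entry = ( paintEntries || [] ).find( function ( e ) {
        return e.name === 'first-contentful-paint';
    } );
    // startTime is relative to navigation start, in milliseconds.
    return entry ? entry.startTime : null;
}
```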

Change 962755 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/extensions/WikimediaEvents@master] searchSli: Fix broken 'performance' check and update to NT Level 2

https://gerrit.wikimedia.org/r/962755

Change 962755 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] searchSli: Fix broken 'performance' check and update to NT Level 2

https://gerrit.wikimedia.org/r/962755

Thanks for the feedback! I have to admit I hadn't thought too closely about loadEventEnd vs other options. I'll review the linked documentation and check with my teammates what we think the best option is. It sounds at first like first-contentful-paint might be best, sticking with the suggested best practices.

Change 963122 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/MediaSearch@master] search sli: Fix broken 'performance' check and update to NT level 2

https://gerrit.wikimedia.org/r/963122

Change 963122 merged by jenkins-bot:

[mediawiki/extensions/MediaSearch@master] search sli: Fix broken 'performance' check and update to NT level 2

https://gerrit.wikimedia.org/r/963122

There is a suggestion to change the collection method to something more appropriate.

The current implementation uses document loaded rather than paint complete, but that's good enough for our use case.