
[SPIKE] Search instrumentation for a/b testing of search widget move
Closed, Resolved (Public)

Description

Background

We would like to measure the effects of the search changes in two parts: first, moving the search to a more prominent location within the header of the page; second, updating the functionality of the search widget. This task was created to explore previous search instrumentation and gauge whether it can be reused for these purposes.

Metrics

  • Search sessions initiated
  • Search sessions shown (search results shown to the user)
  • Search sessions completed

Questions

Note: if not, we'll probably have to do a comparative analysis of before and after the change and/or on wikis with similar search patterns. Also, since the current schema is deactivated, we might be unable to get year-over-year (YoY) comparisons.

Event Timeline

ovasileva triaged this task as Medium priority. May 4 2020, 10:04 AM
ovasileva updated the task description.

Can we reuse https://meta.wikimedia.org/wiki/Schema:Search?

Yes. However, that instrumentation hasn't been active since before October 2019 (see T233614#5566168) and I'm not sure where the implementation was so it'd be non-trivial to resurrect it.

Schema:MobileWebSearch, a mobile-specific clone of that schema, is still active. We could clone and modify the implementation to emit Search events.

FYI both Schema:Search and Schema:MobileWebSearch will also allow you to answer "How many search sessions are started per browser session?"
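
As a rough illustration of that last point, search sessions per browser session could be counted from the event stream along these lines. This is only a Python sketch over plain dicts: the field names `user_session` and `search_session` are placeholders invented for the example (not the schemas' actual property names), and it assumes that a `session-start` action marks the start of a search session.

```python
from collections import defaultdict


def search_sessions_per_browser_session(events):
    """Count distinct search sessions started within each browser session."""
    started = defaultdict(set)
    for event in events:
        # Only the event that opens a search session is counted.
        if event["action"] == "session-start":
            started[event["user_session"]].add(event["search_session"])
    return {user_session: len(ids) for user_session, ids in started.items()}


# Hypothetical events: browser session u1 starts two searches, u2 starts one.
events = [
    {"user_session": "u1", "search_session": "s1", "action": "session-start"},
    {"user_session": "u1", "search_session": "s1", "action": "impression-results"},
    {"user_session": "u1", "search_session": "s2", "action": "session-start"},
    {"user_session": "u2", "search_session": "s3", "action": "session-start"},
]
counts = search_sessions_per_browser_session(events)
```

Browser sessions without any search never emit an event, so a count like this describes searching sessions only; it cannot by itself give the share of all browser sessions that include a search.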

Is it possible to structure an a/b test for:
Moving search to the header

Moving the search widget to the header necessarily requires changes in HTML structure and associated styles. The complexity of the experimental setup depends on whether we limit the scope of our experiment:

Logged-in users only

Yes. Since requests from logged-in users are always served by the application servers, we're free to send buckets of users different treatments with different HTML and different styles, and the responses will always be fresh.

We can assume that user IDs are randomly distributed and therefore bucket users based on them, e.g.

$featureManager->registerFeature( 'NewSearchTreatment', $user->isLoggedIn() && $user->getID() % 2 === 0 );

All users

Yes, but it won't be as simple as the above. Requests from logged-out users are mostly served by the edge caches. In the best case a response is fresh, but in the worst case a response can be 4 days old (see https://wikitech.wikimedia.org/wiki/Varnish#TTL). We could either:

  1. Modify the client-side code to bucket the user and then render the appropriate search widget treatment. Delays and flashes of content may alter the way users interact with the widget(s), so they will need to be minimised.
  2. Enable the instrumentation for all users for one week to establish a baseline. Afterwards, enable the new treatment for all users for two weeks – enough time for all users to be delivered the new treatment and to account for variations in behaviour across the week
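
To make the bucketing step in option 1 a bit more concrete, here is a minimal sketch of deterministic bucketing from a session token. It is written in Python purely to show the logic (the real thing would run in the browser), and the function name, the experiment name, and the idea of hashing a per-session token are all assumptions made for illustration, not an existing MediaWiki API.

```python
import hashlib


def bucket_for(token, experiment, buckets=("control", "treatment")):
    """Deterministically map a session token to one of the buckets."""
    # Salting with the experiment name keeps assignments independent
    # between experiments that reuse the same token.
    digest = hashlib.sha256(f"{experiment}:{token}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer and reduce it
    # modulo the number of buckets.
    index = int.from_bytes(digest[:8], "big") % len(buckets)
    return buckets[index]


# Hypothetical token and experiment name.
first = bucket_for("session-token-123", "search-widget-move")
second = bucket_for("session-token-123", "search-widget-move")
```

Because the bucket depends only on the token and the experiment name, the same user sees the same treatment on every page view for as long as the token persists, which also lets the instrumentation report the bucket without storing it separately.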

Swapping out the search widgets

While moving the search widget does require changes to HTML structure and associated styles, swapping out the implementation and/or treatment of the widget doesn't. Moreover, the widget executes only on the client.

We're free to bucket all users and deliver the associated treatment.

phuedx removed phuedx as the assignee of this task. May 5 2020, 4:20 PM

I'd be happy to go into more technical detail if others feel that it's necessary.

I think this task has some potential overlap with T249366 so it may be worth linking.

As I understand it, this task focuses on evaluating the move of search (before and after) in the header. This may require changes to the existing JavaScript experience.

T249366 focuses on the new Vue JavaScript search form only.

@Niedzielski - apologies for opening another task for this. We can either merge them and write up some notes on how we would a/b test the swapping of the widgets, or keep that task open to address later on. As the new search form is JS only, I think the second question is already more straightforward for both logged-in users and anons, so either approach seems fine to me. In other words, do we need a spike on how to do that, or can we go directly into the details of implementation with some analysis? If we don't need a spike, I think we can go ahead and create the following tasks:

  • Set up the schema itself as a desktop clone of Schema:MobileWebSearch - metrics are the same in both
  • a/b test of the move
  • analysis of the move
  • a/b test of the widget
  • analysis of the widget

@Niedzielski, @phuedx, @Mayakp.wiki, @MNeisler - does that sound reasonable (and sorry for the mass ping!)?

This is fuzzy to me but here's my understanding of this task from the description and @phuedx's comments above.


These are the A/B tests wanted:

  • Moving the search to a more prominent location within the header of the page.
  • Updating the functionality of the search widget.

That means that for each test, users are bucketed both in terms of tracking and in terms of implementation. The bucketing will happen in Vector. Sam has added details on how to bucket logged-in users that make sense to me. The all-users test looks more involved unless we go with option 2 mentioned in the comment.


It's my understanding that Sam is suggesting a clone of Schema:MobileWebSearch as the schema. Within that schema, I'm assuming the mapping of metrics to requirements is:

  • Search sessions initiated => "session-start"
  • Search sessions shown (search results shown to the user) => "impression-results"
  • Search sessions completed => "click-result" or "hide-search-suggestions"
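
Under that assumed mapping, the three metrics could be derived from raw events roughly as follows. This is a Python sketch, not an agreed analysis method: the action names are the ones listed above, while the `search_session` field name is a placeholder invented for the example.

```python
from collections import defaultdict

# Either action is treated as completing a search session (assumed mapping).
COMPLETION_ACTIONS = {"click-result", "hide-search-suggestions"}


def summarize_sessions(events):
    """Count search sessions that were initiated, shown results, and completed."""
    actions_by_session = defaultdict(set)
    for event in events:
        actions_by_session[event["search_session"]].add(event["action"])

    all_actions = actions_by_session.values()
    return {
        "initiated": sum(1 for a in all_actions if "session-start" in a),
        "shown": sum(1 for a in all_actions if "impression-results" in a),
        "completed": sum(1 for a in all_actions if a & COMPLETION_ACTIONS),
    }


# Hypothetical events for three search sessions.
events = [
    {"search_session": "s1", "action": "session-start"},
    {"search_session": "s1", "action": "impression-results"},
    {"search_session": "s1", "action": "click-result"},
    {"search_session": "s2", "action": "session-start"},
    {"search_session": "s2", "action": "impression-results"},
    {"search_session": "s3", "action": "session-start"},
]
summary = summarize_sessions(events)
```

Counting per session rather than per event means that repeated result impressions within one session are counted once, which seems closer to what the metrics describe.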

Schema:Search looks pretty similar. I wasn't sure if Sam was suggesting cloning Schema:MobileWebSearch instead because of active policy or another reason.

All of this is reported for JavaScript only.


These are the questions and work I'm aware of:

  • Vector needs to be modified to bucket users and swap the implementations for both the movement and widget tests.
  • Schema:MobileWebSearch needs to be cloned so that the new search and the old search can use it?
  • Is the old search currently instrumented with any schema though? I haven't looked.
  • The new search doesn't exist yet so it will need the instrumentation. I think T249366 could be revised to focus on the instrumentation in the new implementation.
  • Do we need to evaluate the different search API requests? The old search will be using the MediaWiki Action API, the new search will be using the Core Platform Team's REST API.
  • Do we need any X-Analytics headers for skin version (e.g., 1 (or not present) = Legacy, 2 = Latest) or search version (e.g., search=1 and search=2)? It's possible that a test wiki may have the latest version of the skin deployed but the new Vue.js search disabled, at least initially.

a/b test of the move
analysis of the move
a/b test of the widget
analysis of the widget

Makes sense. The A/B tests will take some time to run and will need to be analyzed by @Mayakp.wiki and @MNeisler.

These are the questions and work I'm aware of:

  • Vector needs to be modified to bucket users and swap the implementations for both the movement and widget tests.
  • Schema:MobileWebSearch needs to be cloned so that the new search and the old search can use it?
  • Is the old search currently instrumented with any schema though? I haven't looked.

No. There is https://meta.wikimedia.org/wiki/Schema:Search, but it's been disabled for a while, so if we decide to a/b test only logged-in users for the move, we'll have to build up the schema at least a little ahead of time to get a baseline for anons.

  • The new search doesn't exist yet so it will need the instrumentation. I think T249366 could be revised to focus on the instrumentation in the new implementation.

Makes sense, let's do that.

  • Do we need to evaluate the different search API requests? The old search will be using the MediaWiki Action API, the new search will be using the Core Platform Team's REST API.

I guess that would be covered by the a/b test on the new vs old search? Or do you mean something more specific?

  • Do we need any X-Analytics headers for skin version (e.g., 1 (or not present) = Legacy, 2 = Latest) or search version (e.g., search=1 and search=2)? It's possible that a test wiki may have the latest version of the skin deployed but the new Vue.js search disabled, at least initially.

That's a good point. @Mayakp.wiki, @MNeisler - any thoughts on this?

a/b test of the move
analysis of the move
a/b test of the widget
analysis of the widget

Makes sense. The A/B tests will take some time to run and will need to be analyzed by @Mayakp.wiki and @MNeisler.

I'll set these up as a baseline and we can create the remainder as details become clearer.

  • Is the old search currently instrumented with any schema though? I haven't looked.

No. There is https://meta.wikimedia.org/wiki/Schema:Search, but it's been disabled for a while, so if we decide to a/b test only logged-in users for the move, we'll have to build up the schema at least a little ahead of time to get a baseline for anons.

Apologies! I am fully confused so I've re-read the above and started digging into the implementations. Here is my current understanding:

  • The new search doesn't exist yet so it will need the instrumentation. I think T249366 could be revised to focus on the instrumentation in the new implementation.

Makes sense, let's do that.

I've tweaked the task to focus on the new implementation only. Further edits welcome!

  • Do we need to evaluate the different search API requests? The old search will be using the MediaWiki Action API, the new search will be using the Core Platform Team's REST API.

I guess that would be covered by the a/b test on the new vs old search? Or do you mean something more specific?

I wasn't very clear. I meant that it's my understanding that request traffic can be monitored. I wasn't sure if there were requirements around that for new or old APIs.

Is the old search currently instrumented with any schema though? I haven't looked.

I think the old search implementation actually uses searchSatisfaction. Should we use that instead? I'm very confused!

Yes, desktop search is currently instrumented with searchSatisfaction, which appears to still be active and recording a high number of events. It’s a little more complex than we need but includes all the fields we'd need to calculate the proposed metrics.
Per discussions with @ovasileva today, I'll follow up with the maintainers of this schema to see if there are any issues with us reusing it or if they recommend making a clone of MobileWebSearch instead.

Do we need any X-Analytics headers for skin version (e.g., 1 (or not present) = Legacy, 2 = Latest) or search version (e.g., search=1 and search=2)? It's possible that a test wiki may have the latest version of the skin deployed but the new Vue.js search disabled, at least initially.

I think adding a field to the search schema that records which skin version is in use would be sufficient for calculating the metrics we are interested in here. The X-Analytics headers could be used to obtain some view-based metrics for each skin type, but querying that data from webrequest is time-intensive and minimally useful due to the 90-day data retention guidelines (unless there is another use case I'm not thinking of).

@MNeisler - would it be okay if I assigned this to you for the time being? It seems the open questions are:

  • Should we duplicate the mobile schema or can we use searchSatisfaction?
  • If we're using search satisfaction, can we add a field for skin version?

Met with @mpopov yesterday about the status of searchSatisfaction.

Summary of notes below:

  • Confirmed that searchSatisfaction is active and maintained by the Search platform team.
  • It includes a number of features that might be useful for our analysis, such as:
    • In addition to the mwSessionId, it includes a unique searchSessionId which identifies a user performing searches within a 10-minute timespan and persists longer than the mwSessionId.
    • deterministic bucketing for AB tests and ability to specify sampling rates on a per wiki basis.
    • scroll tracking
  • It currently only records events for desktop and non-Minerva skins. It was not instrumented on Minerva due to performance concerns. See rEWMV958aad0ebd6d3a69ef342094b9bba94f5de60b1a and https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/WikimediaEvents/+/master/extension.json#92. It is a large amount of code, and if we end up reusing it, we will need to make sure any changes do not cause similar performance concerns.
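
For readers unfamiliar with per-wiki sampling, the general idea can be sketched as below. This is a generic Python illustration of deterministic sampling with a per-wiki rate; it is not the searchSatisfaction or WikimediaEvents implementation, and the function names, wiki names, and rates are made up for the example.

```python
import hashlib


def unit_interval(token):
    """Map a token to a stable float in [0, 1)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    # Keep 53 bits so the result is exactly representable and strictly below 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def in_sample(token, wiki, sampling_rates, default_rate=0.0):
    """Decide, deterministically, whether a session on a given wiki is sampled."""
    rate = sampling_rates.get(wiki, default_rate)
    return unit_interval(token) < rate


# Hypothetical per-wiki rates: sample 10% of sessions on one wiki, all on another.
rates = {"enwiki": 0.1, "testwiki": 1.0}
sampled = in_sample("session-token-123", "testwiki", rates)
```

A wiki that is missing from the rate table falls back to `default_rate`, so in this sketch a wiki is only instrumented when it has been configured explicitly.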

@EBernhardson - Any concerns with the web team using searchSatisfaction to measure the effects of the planned search changes? If we reuse, we would like to add a field to the schema to track skin version if possible.