
Run an A/B test using suggestions generated using glent Method 1
Open, Needs Triage · Public

Description

User Story: As a developer, I want to know that Glent Method 1 gives suggestions that users will use so that we can deploy it to production with confidence.

Enable an A/B test using M1 so that we can compare the number of suggestions and clickthrough rates against M0 and the phrase suggester in order to make a decision.

Acceptance Criteria:

  • We have enough data from the A/B test to be able to decide whether Method 1 is worth deploying.

Event Timeline

We are waiting to discuss T262845 and determine whether that ticket needs to be completed before this one.

Change 672564 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[mediawiki/extensions/CirrusSearch@master] Add fallback profile including glent m1

https://gerrit.wikimedia.org/r/672564

Change 672565 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[operations/mediawiki-config@master] Add Cirrus testing profile for glent m1

https://gerrit.wikimedia.org/r/672565

Change 672566 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[mediawiki/extensions/WikimediaEvents@master] Turn on glent m1 AB test

https://gerrit.wikimedia.org/r/672566

The above set of patches should start the test. The first two we should merge and deploy soon-ish; before actually starting the test we will want to run some test queries against prod and make sure everything looks as we expect.

Change 672564 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Add fallback profile including glent m1

https://gerrit.wikimedia.org/r/672564

Change 672565 merged by jenkins-bot:
[operations/mediawiki-config@master] Add Cirrus testing profile for glent m1

https://gerrit.wikimedia.org/r/672565

Mentioned in SAL (#wikimedia-operations) [2021-03-16T23:31:58Z] <krinkle@deploy1002> Synchronized wmf-config/InitialiseSettings.php: I1ca4f30c2, T262612 (duration: 00m 57s)

Change 672825 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[mediawiki/extensions/CirrusSearch@wmf/1.36.0-wmf.35] Add fallback profile including glent m1

https://gerrit.wikimedia.org/r/672825

Change 672825 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@wmf/1.36.0-wmf.35] Add fallback profile including glent m1

https://gerrit.wikimedia.org/r/672825

Change 672566 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Turn on glent m1 AB test

https://gerrit.wikimedia.org/r/672566

Change 674115 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[mediawiki/extensions/WikimediaEvents@wmf/1.36.0-wmf.35] Turn on glent m1 AB test

https://gerrit.wikimedia.org/r/674115

Change 674115 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@wmf/1.36.0-wmf.35] Turn on glent m1 AB test

https://gerrit.wikimedia.org/r/674115

Mentioned in SAL (#wikimedia-operations) [2021-03-22T23:18:56Z] <ebernhardson@deploy1002> Synchronized php-1.36.0-wmf.35/extensions/WikimediaEvents/modules/ext.wikimediaEvents/searchSatisfaction.js: T262612: Start glent m1 ab test (duration: 01m 53s)

Test is started; results will be found in the Superset Search Query Suggestions dashboard. Data is loaded into this dashboard daily, with the prior day's data arriving around 3:00 UTC. The test will be run for 7 days; assuming data collection looks reasonable, that means turning it off next Monday.

Data is suspicious. The mismatch bucket, which contains searches where the testing bucket reported by the backend differs from what the frontend expected, is 44% of all search requests. The backend aggregation looks to be a bit optimistic here as well: the reported bucket is whichever test it saw first (unordered) on a per-session basis, rather than whether that particular search reported a mismatch. I'm currently testing a patch that will mark a full session as mismatched if any event in the session is a mismatch, which should give a better idea of the scope of the issue.

Counting mismatches as any query in a session that contains mismatched events, we have 42% of sessions and 52% of search requests falling into the mismatched bucket. In some testing in an incognito window, by the time I figured out how to set the breakpoint inside searchSatisfaction.js my subTest was already set to mismatched. Clearly we need to dig into the search satisfaction tracking and figure out what's going on here if we want to have usable A/B test results.
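The session-level counting that the patch aims for can be sketched roughly as follows. This is a hypothetical Python sketch; the field names are illustrative, not the real event schema.

```python
# Sketch: a whole session counts as mismatched if *any* of its events
# reported the 'mismatch' bucket; otherwise the session keeps its
# (single) reported bucket.
from collections import defaultdict

def bucket_sessions(events):
    """events: iterable of dicts with 'session_id' and 'bucket' keys."""
    session_buckets = defaultdict(set)
    for e in events:
        session_buckets[e['session_id']].add(e['bucket'])
    return {
        sid: 'mismatch' if 'mismatch' in buckets else next(iter(buckets))
        for sid, buckets in session_buckets.items()
    }

events = [
    {'session_id': 'a', 'bucket': 'control'},
    {'session_id': 'a', 'bucket': 'mismatch'},
    {'session_id': 'b', 'bucket': 'glent_m01'},
]
session_labels = bucket_sessions(events)  # session 'a' is entirely mismatched
```

Counting sessions (or their search requests) by these labels is what yields session-level percentages like the 42%/52% above.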

I've been able to enter a mismatched state from incognito windows multiple times now, but it's not clear what the trigger is. It seems we have two options: we could try to fix the frontend bucketing, or fall back to the bucketing we previously implemented in the backend. For some reason I can't remember, we quickly transitioned from the backend doing bucketing to doing the bucketing inside the frontend browser code. Perhaps the problem was that the only way to thread arbitrary extra data like a bucket through API responses is to inject it into headers, or something like that (but we figured out something, which is how opensearch can still tell search satisfaction which implementation returned results).

That particular backend bucketing code looks straightforward enough; it might be plausible to use here.

Looking at this from more of an events/stats perspective, what can we see that is different in the mismatched sessions? I first noticed that for automatically rewritten queries, mismatched sessions only see 10% with interaction, while the control and test buckets are around 30%. Similarly, mismatch sessions are only rewriting 45% of zero-results queries, while the test and control buckets are seeing closer to 60% rewrite rates.

Poking at the collected data for one day, the number of searches per session looks consistent, but plotting the number of sessions by session length is suspicious, with a big bump (on a log scale!) for sessions ending at around 100 seconds in the mismatch bucket.
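The session-length check above amounts to bucketing session durations on a log scale and counting sessions per bin, then looking for anomalies like the ~100-second bump. A rough sketch with made-up sample durations (the binning scheme is an assumption, not the actual dashboard query):

```python
# Sketch: count sessions per logarithmic duration bin to spot bumps
# in the session-length distribution.
import math
from collections import Counter

def length_histogram(durations_seconds, bins_per_decade=5):
    """Count sessions per log10 duration bin (bins_per_decade per decade)."""
    counts = Counter()
    for d in durations_seconds:
        if d <= 0:
            continue  # skip zero/negative durations
        bin_idx = math.floor(math.log10(d) * bins_per_decade)
        counts[bin_idx] += 1
    return counts

# Illustrative data: a cluster of sessions ending near 100 seconds
hist = length_histogram([5, 8, 95, 100, 104, 110, 900])
```

A smooth decline in these counts is what we would expect; a bin with an outsized count (as in the mismatch bucket) is the anomaly described above.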

There is something categorically different about the sessions in the mismatch bucket, but it's really not clear what that might be.

Took a quick look when I saw that frwiki has the new search widget enabled but dewiki/enwiki do not. Looking at the data, it seems frwiki is heavily affected (~20% of sessions have an event in mismatch or invalid, as opposed to 1-2% for other wikis):

(period 2021-03-23T00:00:00 to 2021-03-23T04:00:00)

+------+--------------+----------------+-----------+
|  wiki|total_sessions|invalid_sessions|pct_invalid|
+------+--------------+----------------+-----------+
|dewiki|        874925|           16805|       1.92|
|enwiki|       4442523|           98517|       2.22|
|frwiki|         33366|            6423|      19.25|
+------+--------------+----------------+-----------+

I think one of the reasons bucketing was done in the frontend was to better detect the search session boundaries; doing this on the backend without per-identity state, you would have to set arbitrary boundaries, I think.

Related to frwiki, the autocomplete events are completely off compared to what we usually get (7x more fulltext events than autocomplete), suggesting that the new widget is to blame. I'd suggest ignoring frwiki for this test.

> Quickly looked when I saw that frwiki has the new search widget enabled but not dewiki/enwiki. Looking at the data it seems frwiki is heavily affected (~20% of the sessions have an event in mismatch or invalid as opposed to 1%/2% for other wikis):
>
> (period 2021-03-23T00:00:00 to 2021-03-23T04:00:00)
>
> +------+--------------+----------------+-----------+
> |  wiki|total_sessions|invalid_sessions|pct_invalid|
> +------+--------------+----------------+-----------+
> |dewiki|        874925|           16805|       1.92|
> |enwiki|       4442523|           98517|       2.22|
> |frwiki|         33366|            6423|      19.25|
> +------+--------------+----------------+-----------+

It looks like this came from event.searchsatisfaction? Suspecting that it also includes autocomplete; when only considering full text (discovery.search_satisfaction_daily) I get a breakdown like:

+--------------------+------+-------------------------------+
|              bucket|  wiki|count(DISTINCT searchSessionId)|
+--------------------+------+-------------------------------+
|T262612_glent_m01...|dewiki|                          29713|
|T262612_glent_m01...|dewiki|                          29647|
|            mismatch|dewiki|                          27817|
|T262612_glent_m01...|enwiki|                         113547|
|T262612_glent_m01...|enwiki|                         112501|
|            mismatch|enwiki|                         132518|
|T262612_glent_m01...|frwiki|                            939|
|T262612_glent_m01...|frwiki|                            860|
|            mismatch|frwiki|                          54936|
+--------------------+------+-------------------------------+

frwiki is clearly broken, but en and de are also showing huge numbers of mismatches. I suppose in a way this makes sense, because we don't detect mismatch in the autocomplete tracking; that only gets set if the user lands on Special:Search.

Overall the frwiki thing does seem like a major bug we will have to deal with, but I'm not sure we can still trust the outputs of the remaining wikis.

> Poking at the collected data for one day, the number of searches per session looks consistent but plotting the number of sessions by session length is suspicious with a big bump (on log scale!) in it for sessions ending at around 100 seconds in the mismatch bucket.

Not sure what these are still, but looking by wiki I can see that this bump comes exclusively from enwiki; fr and de have nice smooth declines in the session length metric. I don't really know that the decline should be smooth, but the others are and it seems likely it would be.

David suggested that starting a session without going through autocomplete could perhaps be a source of problems. Looking specifically into sessions starting on enwiki and dewiki, of the sessions that have mismatch events, ~75% have a mismatch as the first event we see. Of the sessions that have an autocomplete dt prior to a fulltext dt (filtering ac-only), it looks like only 13% transition into the mismatch state.

So almost certainly we should be looking closer at session starts. Perhaps we could also look at handling the case when a new session starts after the previous session times out.
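The session-start check above can be sketched as: of the sessions that contain a mismatch event at all, what fraction have the mismatch as their very first event? A hypothetical Python sketch; the (session_id, timestamp, bucket) tuple shape is illustrative, not the real schema.

```python
# Sketch: fraction of mismatch-containing sessions whose *first* event
# is already in the mismatch bucket.
from collections import defaultdict

def mismatch_first_rate(events):
    """events: iterable of (session_id, timestamp, bucket) tuples."""
    sessions = defaultdict(list)
    for sid, ts, bucket in events:
        sessions[sid].append((ts, bucket))
    mismatched = first = 0
    for evs in sessions.values():
        evs.sort()  # order events by timestamp
        if any(b == 'mismatch' for _, b in evs):
            mismatched += 1
            if evs[0][1] == 'mismatch':
                first += 1
    return first / mismatched if mismatched else 0.0

sample = [('a', 1, 'mismatch'), ('a', 2, 'control'),
          ('b', 1, 'control'), ('b', 2, 'mismatch'),
          ('c', 1, 'control')]
rate = mismatch_first_rate(sample)  # 0.5: only session 'a' starts mismatched
```

A high value here (like the ~75% observed) points at session initialization rather than mid-session transitions.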

I took a sample of 30 complete sessions, joined against cirrus backend logs for information like referers and actual query strings. I reviewed this for inconsistencies, then tried to calculate some ballpark stats about how prevalent those inconsistencies are in the full dataset of sessions that are in a mismatch state. The most common are:

  • Search queries that contain sourceid=Mozilla-search in the query string
  • We aren't currently testing autocomplete, but worth noting most autocomplete requests don't seem to include the cirrusUserTesting parameter
  • Searches referred from www.wikipedia.org
  • Sessions where the skin autocomplete correctly submits user testing, but then submits with fulltext=1 (so Special:Search, basically) without the cirrusUserTesting parameter

The first part to clean up, perhaps the most obvious, is sessions that start "somewhere else". We can't have any guarantee that users landing on Special:Search will have come from an approved place that sets our testing query parameters. I'm going to spend a little more time thinking through/testing how we can do bucket assignment in the backend and have the frontend only report the chosen buckets. The first place of guaranteed interaction is the backend code; trying to do this in the frontend seems only to be error prone.

Tracing things back through git history, I end up at 1fcba848 from T121542, which added the trigger functionality. The commit message justifies adding A/B testing to the frontend because textcat was going to need some special query parameters, and this allowed the frontend to provide the single testing parameter instead of using various cirrus debug query params. I found this a bit unclear, so I pieced together a bit more of the history.

Reading deeper into the ticket and the ancient WikimediaEvents git history, it looks like at the time we ran tests in the backend only. When we did A/B testing prior to that with only the backend component, we analyzed results by joining cirrus logs + webrequest logs (the same data that feeds mjolnir today). This strategy couldn't capture clickthroughs to other subdomains, so we adjusted the frontend logging we already had to also do A/B testing. Unfortunately I couldn't find anything written down on exactly why we put the bucketing in the frontend instead of having the backend report the bucket to the frontend.

> I think one of the reasons bucketing was done in the frontend was to better detect the search session boundaries; doing this on the backend without per-identity state you would have to set arbitrary boundaries, I think.

I'm thinking the backend wouldn't know anything about the session; it would assign a random but constant bucket using the user identity as a seed. This would happen for all requests regardless of source. For the frontend code, I'm thinking we can make a request once at the beginning of a session to ask the API what bucket we are in, and then store that in the session much like we store the bucketing decision now. Likely we can fit that data into a cirrus-config-dump API response, perhaps with an API flag that returns only that piece of information.
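The stateless assignment idea can be sketched like this: hash the user identity together with the test name, and take the hash modulo the number of buckets. A minimal Python sketch, assuming a hash-based scheme; the bucket names and hashing details are assumptions, not the actual Cirrus implementation.

```python
# Sketch: stable, stateless bucket assignment seeded by user identity.
# The same identity always lands in the same bucket, with no per-session
# state kept on the backend.
import hashlib

def assign_bucket(identity, test_name, buckets=('control', 'glent_m01')):
    # Seed the hash with the test name so different tests bucket independently.
    digest = hashlib.sha256(f'{test_name}:{identity}'.encode()).hexdigest()
    return buckets[int(digest, 16) % len(buckets)]
```

Because the assignment is a pure function of (identity, test), every request from the same identity gets the same bucket regardless of which entry point (index.php, api.php) it arrives through.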

This doesn't fully work though; in particular, autocomplete and morelike requests are cached in edge POPs. We can potentially ignore morelike, but autocomplete only works because we set cirrusUserTesting=control as part of the request. With this set, responses are cached per bucket. If we were to stop sending cirrusUserTesting and let the backend vary autocomplete parameters, users would receive cached responses for the wrong bucket.

I don't have a solution for the autocomplete problem. Perhaps we need a hybrid solution where buckets are constantly assigned from the backend, and Special:Search auto-magically uses the bucket, but autocomplete requests will still have to include the query string parameter. This feels messy, will ponder more.

> I don't have a solution for the autocomplete problem. Perhaps we need a hybrid solution where buckets are constantly assigned from the backend, and Special:Search auto-magically uses the bucket, but autocomplete requests will still have to include the query string parameter. This feels messy, will ponder more.

The problem is that as long as the backend is calculating the bucket, we have to somehow run one request through the caches and down to PHP for every user. This is compounded by the significant number of sessions that only interact with autocomplete. If we wait, as we currently do, until the user interacts with the search box to initialize the session, we would have to round-trip to Virginia to get the bucket before we can issue the first autocomplete request. If we pre-initialize the session to know the bucket before the user shows intent to type, we have to do that for basically every potential user, which seems like a non-starter.

What we could do is allow sessions to start without a bucket, and then add it to the logging once we know what it is. This doesn't prevent the frontend from sampling, but it does mean that unsampled requests will still get results modified by buckets the backend enabled for them.

  • index.php requests get automatic bucket assignment, ignores cirrusUserTesting query param. The response already includes CirrusSearchBackendUserTests in the javascript config when CirrusSearch is invoked. afaik none of the ways to invoke cirrussearch through index.php are cacheable, but we should verify that is true and perhaps set an explicit no-cache flag on the output.
  • api.php requests get no automatic bucket assignment. Continues to apply cirrusUserTesting param.
  • Add CirrusSearchBackendUserTests, mimicking the js config, to the cirrus-config-dump api call. Add a compact option that only returns this value. This call will explicitly enable the auto-assignment of buckets, since api.php doesn't by default.
  • When initializing a session from Special:Search store the reported bucket in session state for use in future autocomplete requests
  • When initializing a session from autocomplete, send the request to cirrus-config-dump in parallel with the autocomplete requests. Log events with some not-initialized-yet marker. Once the config request returns, store the bucket in the same browser-side session state.

The not-initialized-yet marker will make analysis a bit more annoying; depending on the purpose, you have to decide between discarding those events or backfilling the subTest from later in the session. It seems plausible we can backfill in most occasions, but it's an extra step. There are separate complications: autocomplete doesn't seem to be attaching cirrusUserTesting to autocomplete requests, and frwiki with the vue.js refresh isn't sending cirrusUserTesting anywhere (in a quick test; more rigorous analysis could be performed). Basically it's a bit of work, but without first fixing autocomplete data collection it's hard to verify through data that it works as intended.
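The backfill step mentioned above could look roughly like this: events logged with the not-initialized-yet marker get their subTest filled in from the first initialized event later in the same session. A hypothetical sketch; the marker value and field names are illustrative, not the real schema.

```python
# Sketch: backfill the subTest of marker-tagged events from the first
# event in the session whose bucket is known.
PENDING = 'bucket-pending'  # hypothetical not-initialized-yet marker

def backfill_subtest(session_events):
    """session_events: time-ordered list of dicts with a 'subTest' key."""
    # Find the first event whose bucket is known.
    known = next((e['subTest'] for e in session_events
                  if e['subTest'] != PENDING), None)
    if known is None:
        return session_events  # nothing to backfill; caller may discard
    return [dict(e, subTest=known) if e['subTest'] == PENDING else e
            for e in session_events]
```

Sessions where no event ever reports a bucket would have to be discarded, which is the analysis annoyance noted above.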

Change 676158 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/WikimediaEvents@master] Revert "Turn on glent m1 AB test"

https://gerrit.wikimedia.org/r/676158

Realized the Superset dashboard doesn't break down any stats by wiki, and adding that isn't particularly easy. The most important stat is probably the prevalence of mismatch sessions; here is a quick breakdown from hive for a single day:

+---------------------------+------+-------------------------------+
|                     bucket|  wiki|count(DISTINCT searchSessionId)|
+---------------------------+------+-------------------------------+
|  T262612_glent_m01:control|dewiki|                          29713|
|T262612_glent_m01:glent_m01|dewiki|                          29647|
|                   mismatch|dewiki|                          27817|
|  T262612_glent_m01:control|enwiki|                         113547|
|T262612_glent_m01:glent_m01|enwiki|                         112501|
|                   mismatch|enwiki|                         132518|
|  T262612_glent_m01:control|frwiki|                            939|
|T262612_glent_m01:glent_m01|frwiki|                            860|
|                   mismatch|frwiki|                          54936|
+---------------------------+------+-------------------------------+

Change 676158 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] Revert "Turn on glent m1 AB test"

https://gerrit.wikimedia.org/r/676158

Change 676350 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/WikimediaEvents@wmf/1.36.0-wmf.37] Revert "Turn on glent m1 AB test"

https://gerrit.wikimedia.org/r/676350

Change 676351 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/WikimediaEvents@wmf/1.36.0-wmf.36] Revert "Turn on glent m1 AB test"

https://gerrit.wikimedia.org/r/676351

Change 676351 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@wmf/1.36.0-wmf.36] Revert "Turn on glent m1 AB test"

https://gerrit.wikimedia.org/r/676351

Change 676350 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@wmf/1.36.0-wmf.37] Revert "Turn on glent m1 AB test"

https://gerrit.wikimedia.org/r/676350

Mentioned in SAL (#wikimedia-operations) [2021-04-01T23:32:29Z] <thcipriani@deploy1002> Synchronized php-1.36.0-wmf.37/extensions/WikimediaEvents/modules/ext.wikimediaEvents/searchSatisfaction.js: Backport: [[gerrit:676350|Revert "Turn on glent m1 AB test"]] T262612 (duration: 00m 58s)

I'm finally catching up. Wow! It's nice to read through your thought process and experiments, Erik. Things will be well documented this time! Thanks for digging into this and coming up with alternatives. Would putting everything in the backend solve the Vue.js problem, or would the frontend still need some tweaking to do the right thing?

> Would putting everything in the backend solve the Vue.js problem, or would the frontend still need some tweaking to do the right thing?

For the SERP page, moving bucket selection to the backend will at least resolve the mismatched buckets. It won't resolve autocomplete data collection, though. All API requests will still need to explicitly provide the bucket in the URL due to response caching. We also pass info about the algorithm used back from the API requests to tracking through headers; it was difficult in the old implementation to get access to the actual headers, so we will have to figure that out here as well.