
Run an A/B test using suggestions generated using glent Method 1
Open, Needs Triage · Public

Description

User Story: As a developer, I want to know that Glent Method 1 gives suggestions that users will use so that we can deploy it to production with confidence.

Enable an A/B test using M1 so that we can compare the number of suggestions and clickthrough rates against M0 and the phrase suggester in order to make a decision.

Acceptance Criteria:

  • We have enough data from the A/B test to be able to decide whether Method 1 is worth deploying.

Event Timeline

We are waiting to discuss T262845 and determine whether that ticket needs to be completed before this one.

Change 672564 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[mediawiki/extensions/CirrusSearch@master] Add fallback profile including glent m1

https://gerrit.wikimedia.org/r/672564

Change 672565 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[operations/mediawiki-config@master] Add Cirrus testing profile for glent m1

https://gerrit.wikimedia.org/r/672565

Change 672566 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[mediawiki/extensions/WikimediaEvents@master] Turn on glent m1 AB test

https://gerrit.wikimedia.org/r/672566

The above set of patches should start the test. The first two we should merge and deploy soon-ish; before actually starting the test we will want to run some test queries against prod and make sure everything looks as we expect.

Change 672564 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Add fallback profile including glent m1

https://gerrit.wikimedia.org/r/672564

Change 672565 merged by jenkins-bot:
[operations/mediawiki-config@master] Add Cirrus testing profile for glent m1

https://gerrit.wikimedia.org/r/672565

Mentioned in SAL (#wikimedia-operations) [2021-03-16T23:31:58Z] <krinkle@deploy1002> Synchronized wmf-config/InitialiseSettings.php: I1ca4f30c2, T262612 (duration: 00m 57s)

Change 672825 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[mediawiki/extensions/CirrusSearch@wmf/1.36.0-wmf.35] Add fallback profile including glent m1

https://gerrit.wikimedia.org/r/672825

Change 672825 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@wmf/1.36.0-wmf.35] Add fallback profile including glent m1

https://gerrit.wikimedia.org/r/672825

Change 672566 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Turn on glent m1 AB test

https://gerrit.wikimedia.org/r/672566

Change 674115 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[mediawiki/extensions/WikimediaEvents@wmf/1.36.0-wmf.35] Turn on glent m1 AB test

https://gerrit.wikimedia.org/r/674115

Change 674115 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@wmf/1.36.0-wmf.35] Turn on glent m1 AB test

https://gerrit.wikimedia.org/r/674115

Mentioned in SAL (#wikimedia-operations) [2021-03-22T23:18:56Z] <ebernhardson@deploy1002> Synchronized php-1.36.0-wmf.35/extensions/WikimediaEvents/modules/ext.wikimediaEvents/searchSatisfaction.js: T262612: Start glent m1 ab test (duration: 01m 53s)

Test is started; results will be found in the Superset Search Query Suggestions dashboard. Data is loaded into this dashboard daily, with the prior day's data arriving around 3:00 UTC. The test will be run for 7 days; assuming data collection looks reasonable, that means turning it off next Monday.

Data is suspicious. The mismatch bucket, which contains searches where the testing bucket reported by the backend differs from what the frontend expected, is 44% of all search requests. The backend aggregation looks to be a bit optimistic here as well: the reported bucket is whichever test it saw first (unordered) on a per-session basis, rather than whether that particular search reported a mismatch. I'm currently testing a patch that will mark a full session as mismatched if any event in the session is a mismatch, which should give a better idea of the scope of the issue.

Counting mismatches as any query in a session that contains mismatched events, we have 42% of sessions and 52% of search requests falling into the mismatched bucket. In some testing in an incognito window, by the time I figured out how to set the breakpoint inside searchSatisfaction.js my subTest was already set to mismatched. Clearly we need to dig into the search satisfaction tracking and figure out what's going on here if we want to have usable A/B test results.
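The session-level counting that the patch aims for can be sketched roughly as follows. This is a hypothetical Python sketch; the field names are illustrative, not the real event schema.

```python
# Sketch: a whole session counts as mismatched if *any* of its events
# reported the 'mismatch' bucket; otherwise the session keeps its
# (single) reported bucket.
from collections import defaultdict

def bucket_sessions(events):
    """events: iterable of dicts with 'session_id' and 'bucket' keys."""
    session_buckets = defaultdict(set)
    for e in events:
        session_buckets[e['session_id']].add(e['bucket'])
    return {
        sid: 'mismatch' if 'mismatch' in buckets else next(iter(buckets))
        for sid, buckets in session_buckets.items()
    }

events = [
    {'session_id': 'a', 'bucket': 'control'},
    {'session_id': 'a', 'bucket': 'mismatch'},
    {'session_id': 'b', 'bucket': 'glent_m01'},
]
session_labels = bucket_sessions(events)  # session 'a' is entirely mismatched
```

Counting sessions (or their search requests) by these labels is what yields session-level percentages like the 42%/52% above.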

I've been able to enter a mismatched state from incognito windows multiple times now, but it's not clear what the trigger is. It seems we have two options: we could try to fix the frontend bucketing, or fall back to the bucketing we previously implemented in the backend. For some reason I can't remember, we quickly transitioned from the backend doing bucketing to doing the bucketing inside the frontend browser code. Perhaps the problem was that the only way to thread arbitrary extra data like a bucket through API responses is to inject it into headers, or something like that (but we figured out something, which is how opensearch can still tell search satisfaction which implementation returned results).

That particular backend bucketing code looks straightforward enough; it might be plausible to use here.

Looking at this from more of an events/stats perspective, what can we see that is different in the mismatched sessions? I first noticed that for automatically rewritten queries, mismatched sessions only see 10% with interaction, while the control and test buckets are around 30%. Similarly, mismatch sessions are only rewriting 45% of zero-results queries, while the test and control buckets are seeing closer to 60% rewrite rates.

Poking at the collected data for one day, the number of searches per session looks consistent, but plotting the number of sessions by session length is suspicious, with a big bump (on a log scale!) for sessions ending at around 100 seconds in the mismatch bucket.
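The session-length check above amounts to bucketing session durations on a log scale and counting sessions per bin, then looking for anomalies like the ~100-second bump. A rough sketch with made-up sample durations (the binning scheme is an assumption, not the actual dashboard query):

```python
# Sketch: count sessions per logarithmic duration bin to spot bumps
# in the session-length distribution.
import math
from collections import Counter

def length_histogram(durations_seconds, bins_per_decade=5):
    """Count sessions per log10 duration bin (bins_per_decade per decade)."""
    counts = Counter()
    for d in durations_seconds:
        if d <= 0:
            continue  # skip zero/negative durations
        bin_idx = math.floor(math.log10(d) * bins_per_decade)
        counts[bin_idx] += 1
    return counts

# Illustrative data: a cluster of sessions ending near 100 seconds
hist = length_histogram([5, 8, 95, 100, 104, 110, 900])
```

A smooth decline in these counts is what we would expect; a bin with an outsized count (as in the mismatch bucket) is the anomaly described above.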

There is something categorically different about the sessions in the mismatch bucket, but it's really not clear what that might be.

Took a quick look when I saw that frwiki has the new search widget enabled but dewiki/enwiki do not. Looking at the data, it seems frwiki is heavily affected (~20% of sessions have an event in mismatch or invalid, as opposed to 1-2% for other wikis):

(period 2021-03-23T00:00:00 to 2021-03-23T04:00:00)

+------+--------------+----------------+-----------+
|  wiki|total_sessions|invalid_sessions|pct_invalid|
+------+--------------+----------------+-----------+
|dewiki|        874925|           16805|       1.92|
|enwiki|       4442523|           98517|       2.22|
|frwiki|         33366|            6423|      19.25|
+------+--------------+----------------+-----------+

I think one of the reasons bucketing was done in the frontend was to better detect the search session boundaries; doing this on the backend without per-identity state, you would have to set arbitrary boundaries, I think.

Related to frwiki, the autocomplete events are completely off compared to what we usually get (7x more fulltext events than autocomplete), suggesting that the new widget is to blame. I'd suggest ignoring frwiki for this test.

> Quickly looked when I saw that frwiki has the new search widget enabled but not dewiki/enwiki. Looking at the data it seems frwiki is heavily affected (~20% of the sessions have an event in mismatch or invalid as opposed to 1%/2% for other wikis):
>
> (period 2021-03-23T00:00:00 to 2021-03-23T04:00:00)
>
> +------+--------------+----------------+-----------+
> |  wiki|total_sessions|invalid_sessions|pct_invalid|
> +------+--------------+----------------+-----------+
> |dewiki|        874925|           16805|       1.92|
> |enwiki|       4442523|           98517|       2.22|
> |frwiki|         33366|            6423|      19.25|
> +------+--------------+----------------+-----------+

It looks like this came from event.searchsatisfaction? Suspecting that it also includes autocomplete; when only considering full text (discovery.search_satisfaction_daily) I get a breakdown like:

+--------------------+------+-------------------------------+
|              bucket|  wiki|count(DISTINCT searchSessionId)|
+--------------------+------+-------------------------------+
|T262612_glent_m01...|dewiki|                          29713|
|T262612_glent_m01...|dewiki|                          29647|
|            mismatch|dewiki|                          27817|
|T262612_glent_m01...|enwiki|                         113547|
|T262612_glent_m01...|enwiki|                         112501|
|            mismatch|enwiki|                         132518|
|T262612_glent_m01...|frwiki|                            939|
|T262612_glent_m01...|frwiki|                            860|
|            mismatch|frwiki|                          54936|
+--------------------+------+-------------------------------+

frwiki is clearly broken, but en and de are also showing huge numbers of mismatches. I suppose in a way this makes sense, because we don't detect mismatch in the autocomplete tracking; that only gets set if the user lands on Special:Search.

Overall the frwiki thing does seem like a major bug we will have to deal with, but I'm not sure we can still trust the outputs of the remaining wikis.

> Poking at the collected data for one day, the number of searches per session looks consistent but plotting the number of sessions by session length is suspicious with a big bump (on log scale!) in it for sessions ending at around 100 seconds in the mismatch bucket.

Not sure what these are still, but looking by wiki I can see that this bump comes exclusively from enwiki; fr and de have nice smooth declines in the session length metric. I don't really know that the decline should be smooth, but the others are and it seems likely it would be.

David suggested that starting a session without going through autocomplete could perhaps be a source of problems. Looking specifically into sessions starting on enwiki and dewiki, of the sessions that have mismatch events, ~75% have a mismatch as the first event we see. Of the sessions that have an autocomplete dt prior to a fulltext dt (filtering ac-only), it looks like only 13% transition into the mismatch state.

So almost certainly we should be looking closer at session starts. Perhaps we could also look at handling the case when a new session starts after the previous session times out.
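The session-start check above can be sketched as: of the sessions that contain a mismatch event at all, what fraction have the mismatch as their very first event? A hypothetical Python sketch; the (session_id, timestamp, bucket) tuple shape is illustrative, not the real schema.

```python
# Sketch: fraction of mismatch-containing sessions whose *first* event
# is already in the mismatch bucket.
from collections import defaultdict

def mismatch_first_rate(events):
    """events: iterable of (session_id, timestamp, bucket) tuples."""
    sessions = defaultdict(list)
    for sid, ts, bucket in events:
        sessions[sid].append((ts, bucket))
    mismatched = first = 0
    for evs in sessions.values():
        evs.sort()  # order events by timestamp
        if any(b == 'mismatch' for _, b in evs):
            mismatched += 1
            if evs[0][1] == 'mismatch':
                first += 1
    return first / mismatched if mismatched else 0.0

sample = [('a', 1, 'mismatch'), ('a', 2, 'control'),
          ('b', 1, 'control'), ('b', 2, 'mismatch'),
          ('c', 1, 'control')]
rate = mismatch_first_rate(sample)  # 0.5: only session 'a' starts mismatched
```

A high value here (like the ~75% observed) points at session initialization rather than mid-session transitions.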

I took a sample of 30 complete sessions, joined against cirrus backend logs for information like referers and actual query strings. I reviewed this for inconsistencies, then tried to calculate some ballpark stats about how prevalent those inconsistencies are in the full dataset of sessions that are in a mismatch state. The most common are:

  • Search queries that contain sourceid=Mozilla-search in the query string
  • We aren't currently testing autocomplete, but worth noting most autocomplete requests don't seem to include the cirrusUserTesting parameter
  • Searches referred from www.wikipedia.org
  • Sessions where the skin autocomplete correctly submits user testing, but then submits with fulltext=1 (so Special:Search, basically) without the cirrusUserTesting parameter

The first part to clean up, perhaps the most obvious, is sessions that start "somewhere else". We can't have any guarantee that users landing on Special:Search will have come from an approved place that sets our testing query parameters. I'm going to spend a little more time thinking through/testing how we can do bucket assignment in the backend and have the frontend only report the chosen buckets. The first place of guaranteed interaction is the backend code; trying to do this in the frontend seems only to be error prone.

Tracing things back through git history, I end up at 1fcba848 from T121542, which added the trigger functionality. The commit message justifies adding A/B testing to the frontend because textcat was going to need some special query parameters, and this allowed the frontend to provide the single testing parameter instead of using various cirrus debug query params. I found this a bit unclear, so I pieced together a bit more of the history.

Reading deeper into the ticket and the ancient WikimediaEvents git history, it looks like at the time we ran tests in the backend only. When we did A/B testing prior to that with only the backend component, we analyzed results by joining cirrus logs + webrequest logs (the same data that feeds mjolnir today). This strategy couldn't capture clickthroughs to other subdomains, so we adjusted the frontend logging we already had to also do A/B testing. Unfortunately I couldn't find anything written down on exactly why we put the bucketing in the frontend instead of having the backend report the bucket to the frontend.

> I think one of the reasons bucketing was done in the frontend was to better detect the search session boundaries; doing this on the backend without per-identity state you would have to set arbitrary boundaries, I think.

I'm thinking the backend wouldn't know anything about the session; it would assign a random but constant bucket using the user identity as a seed. This would happen for all requests regardless of source. For the frontend code, I'm thinking we can make a request once at the beginning of a session to ask the API what bucket we are in, and then store that in the session much like we store the bucketing decision now. Likely we can fit that data into a cirrus-config-dump API response, perhaps with an API flag that returns only that piece of information.
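The stateless assignment idea can be sketched like this: hash the user identity together with the test name, and take the hash modulo the number of buckets. A minimal Python sketch, assuming a hash-based scheme; the bucket names and hashing details are assumptions, not the actual Cirrus implementation.

```python
# Sketch: stable, stateless bucket assignment seeded by user identity.
# The same identity always lands in the same bucket, with no per-session
# state kept on the backend.
import hashlib

def assign_bucket(identity, test_name, buckets=('control', 'glent_m01')):
    # Seed the hash with the test name so different tests bucket independently.
    digest = hashlib.sha256(f'{test_name}:{identity}'.encode()).hexdigest()
    return buckets[int(digest, 16) % len(buckets)]
```

Because the assignment is a pure function of (identity, test), every request from the same identity gets the same bucket regardless of which entry point (index.php, api.php) it arrives through.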

This doesn't fully work though; in particular, autocomplete and morelike requests are cached in edge POPs. We can potentially ignore morelike, but autocomplete only works because we set cirrusUserTesting=control as part of the request. With this set, responses are cached per bucket. If we were to stop sending cirrusUserTesting and let the backend vary autocomplete parameters, users would receive cached responses for the wrong bucket.

I don't have a solution for the autocomplete problem. Perhaps we need a hybrid solution where buckets are constantly assigned from the backend, and Special:Search auto-magically uses the bucket, but autocomplete requests will still have to include the query string parameter. This feels messy, will ponder more.

> I don't have a solution for the autocomplete problem. Perhaps we need a hybrid solution where buckets are constantly assigned from the backend, and Special:Search auto-magically uses the bucket, but autocomplete requests will still have to include the query string parameter. This feels messy, will ponder more.

The problem is that as long as the backend is calculating the bucket, we have to somehow run one request through the caches and down to PHP for every user. This is compounded by the significant number of sessions that only interact with autocomplete. If we wait, as we currently do, until the user interacts with the search box to initialize the session, we would have to round-trip to Virginia to get the bucket before we can issue the first autocomplete request. If we pre-initialize the session to know the bucket before the user shows intent to type, we have to do that for basically every potential user, which seems like a non-starter.

What we could do is allow sessions to start without a bucket, and then add it to the logging once we know what it is. This doesn't prevent the frontend from sampling, but it does mean that unsampled requests will still get results modified by buckets the backend enabled for them.

  • index.php requests get automatic bucket assignment, ignores cirrusUserTesting query param. The response already includes CirrusSearchBackendUserTests in the javascript config when CirrusSearch is invoked. afaik none of the ways to invoke cirrussearch through index.php are cacheable, but we should verify that is true and perhaps set an explicit no-cache flag on the output.
  • api.php requests get no automatic bucket assignment. Continues to apply cirrusUserTesting param.
  • Add CirrusSearchBackendUserTests, mimicking the js config, to the cirrus-config-dump api call. Add a compact option that only returns this value. This call will explicitly enable the auto-assignment of buckets, since api.php doesn't by default.
  • When initializing a session from Special:Search store the reported bucket in session state for use in future autocomplete requests
  • When initializing a session from autocomplete, send the request to cirrus-config-dump in parallel with the autocomplete requests. Log events with some not-initialized-yet marker. Once the config request returns, store the bucket in the same browser-side session state.

The not-initialized-yet marker will make analysis a bit more annoying; depending on the purpose, you have to decide between discarding those events or backfilling the subTest from later in the session. It seems plausible we can backfill in most occasions, but it's an extra step. There are separate complications: autocomplete doesn't seem to be attaching cirrusUserTesting to autocomplete requests, and frwiki with the vue.js refresh isn't sending cirrusUserTesting anywhere (in a quick test; more rigorous analysis could be performed). Basically it's a bit of work, but without first fixing autocomplete data collection it's hard to verify through data that it works as intended.
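The backfill step mentioned above could look roughly like this: events logged with the not-initialized-yet marker get their subTest filled in from the first initialized event later in the same session. A hypothetical sketch; the marker value and field names are illustrative, not the real schema.

```python
# Sketch: backfill the subTest of marker-tagged events from the first
# event in the session whose bucket is known.
PENDING = 'bucket-pending'  # hypothetical not-initialized-yet marker

def backfill_subtest(session_events):
    """session_events: time-ordered list of dicts with a 'subTest' key."""
    # Find the first event whose bucket is known.
    known = next((e['subTest'] for e in session_events
                  if e['subTest'] != PENDING), None)
    if known is None:
        return session_events  # nothing to backfill; caller may discard
    return [dict(e, subTest=known) if e['subTest'] == PENDING else e
            for e in session_events]
```

Sessions where no event ever reports a bucket would have to be discarded, which is the analysis annoyance noted above.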

Change 676158 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/WikimediaEvents@master] Revert "Turn on glent m1 AB test"

https://gerrit.wikimedia.org/r/676158

Realized the Superset dashboard doesn't break down any stats by wiki, and adding that isn't particularly easy. The most important stat is probably the prevalence of mismatch sessions; here is a quick breakdown from hive for a single day:

+---------------------------+------+-------------------------------+
|                     bucket|  wiki|count(DISTINCT searchSessionId)|
+---------------------------+------+-------------------------------+
|  T262612_glent_m01:control|dewiki|                          29713|
|T262612_glent_m01:glent_m01|dewiki|                          29647|
|                   mismatch|dewiki|                          27817|
|  T262612_glent_m01:control|enwiki|                         113547|
|T262612_glent_m01:glent_m01|enwiki|                         112501|
|                   mismatch|enwiki|                         132518|
|  T262612_glent_m01:control|frwiki|                            939|
|T262612_glent_m01:glent_m01|frwiki|                            860|
|                   mismatch|frwiki|                          54936|
+---------------------------+------+-------------------------------+

Change 676158 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] Revert "Turn on glent m1 AB test"

https://gerrit.wikimedia.org/r/676158

Change 676350 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/WikimediaEvents@wmf/1.36.0-wmf.37] Revert "Turn on glent m1 AB test"

https://gerrit.wikimedia.org/r/676350

Change 676351 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/WikimediaEvents@wmf/1.36.0-wmf.36] Revert "Turn on glent m1 AB test"

https://gerrit.wikimedia.org/r/676351

Change 676351 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@wmf/1.36.0-wmf.36] Revert "Turn on glent m1 AB test"

https://gerrit.wikimedia.org/r/676351

Change 676350 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@wmf/1.36.0-wmf.37] Revert "Turn on glent m1 AB test"

https://gerrit.wikimedia.org/r/676350

Mentioned in SAL (#wikimedia-operations) [2021-04-01T23:32:29Z] <thcipriani@deploy1002> Synchronized php-1.36.0-wmf.37/extensions/WikimediaEvents/modules/ext.wikimediaEvents/searchSatisfaction.js: Backport: [[gerrit:676350|Revert "Turn on glent m1 AB test"]] T262612 (duration: 00m 58s)

I'm finally catching up. Wow! It's nice to read through your thought process and experiments, Erik. Things will be well documented this time! Thanks for digging into this and coming up with alternatives. Would putting everything in the backend solve the Vue.js problem, or would the frontend still need some tweaking to do the right thing?

> Would putting everything in the backend solve the Vue.js problem, or would the frontend still need some tweaking to do the right thing?

For the SERP page, moving bucket selection to the backend will at least resolve the mismatched buckets. It won't resolve autocomplete data collection, though. All API requests will still need to explicitly provide the bucket in the URL due to response caching. We also pass info about the algorithm used back from the API requests to tracking through headers; it was difficult in the old implementation to get access to the actual headers, so we will have to figure that out here as well.