The plan is to test 5 buckets:
**bm25:control**: identical to what we serve to our users today, plus an artificial latency to compensate for the fact that we run the other buckets in another datacenter.
Discernatron nDCG@5 score: 0.2772
**bm25:allfield**: here we use the same query builder as bm25:control, but we switched the similarity function to BM25 and use a weighted sum for the incoming-links query-independent factor. We expect this bucket to behave poorly in terms of click-through compared to the control group.
We test this to confirm our assumption that the current query builder and the allfield approach are not designed for the BM25 similarity.
Discernatron nDCG@5 score: 0.2689
**bm25:inclinks**: here we switch to a per-field query builder using only incoming links as a query-independent factor. This is the best contender according to Discernatron. We expect an increase in click-through because it tends to rank obvious matches in the top 3.
Discernatron nDCG@5 score: 0.3362
**bm25:inclinks_pv**: similar to bm25:inclinks, but with pageviews added as an additional query-independent factor; the weight for pageviews is still very low compared to incoming links. This test is mostly to see how pageviews could affect the ranking. We expect a very minimal difference in behavior compared to bm25:inclinks.
Discernatron nDCG@5 score: 0.3359
**bm25:inclinks_pv_rev**: similar to bm25:inclinks_pv, with an additional field to track typos in the first 2 characters. Today the "did you mean" suggestion engine is unable to suggest a fix for the query "qlbert einstein". We expect a slight decrease in the zero-result rate and hopefully an increase in click-through rate. This test is added to measure the benefit of such a field; did-you-mean suggestions are not great, and the question here is: will this increase noise and produce more annoying suggestions, or will it help our users?
Discernatron nDCG@5 score: 0.3359 (this feature can't really be tested with Discernatron today)
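To illustrate why a reversed field can help with the "qlbert einstein" case, here is a deliberately simplified sketch (not the actual CirrusSearch suggester): many suggesters require the first characters of a term to match exactly before tolerating edits, so a typo in the first 2 characters is fatal on the forward field but harmless once the term is reversed.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def suggest(term, vocab, prefix_len=2, max_edits=1):
    """Toy suggester: the first `prefix_len` chars must match exactly,
    then up to `max_edits` edits are allowed in the remainder."""
    return [w for w in vocab
            if w[:prefix_len] == term[:prefix_len]
            and levenshtein(w[prefix_len:], term[prefix_len:]) <= max_edits]

vocab = ["albert"]
# Forward field: the typo sits in the first 2 chars, so no suggestion.
print(suggest("qlbert", vocab))                               # []
# Reversed field: the typo moves to the end, inside the edit budget.
rev_vocab = [w[::-1] for w in vocab]
print([w[::-1] for w in suggest("qlbert"[::-1], rev_vocab)])  # ['albert']
```

The names and parameters here are invented for illustration; the point is only that reversing the indexed term shifts leading-edge typos into the region where edits are tolerated.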
Overall we should see a slight decrease in ZRR (zero-result rate) for buckets 3/4/5 because of the new query builder. ZRR should be almost identical between buckets 1 and 2; if it's not, it's either a sampling issue or an inconsistency between the two clusters.
We hope to see an increase in click-through for 3/4/5 due to the per-field scoring approach; if that's not the case, it probably means that the tuning done with Discernatron is not appropriate when applied to real-world usage.
Finally, we'd like to confirm that we can trust the nDCG scores as a measure for offline testing by seeing the same variation in click-through rate between buckets.
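For reference, nDCG@5 (the offline metric quoted for each bucket above) can be computed as in this generic sketch; the exact gain and discount variant Discernatron uses may differ.

```python
import math

def dcg(relevances):
    # Discounted cumulative gain with a log2 rank discount
    # (rank 0 gets discount log2(2) = 1).
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg_at_k(relevances, k=5):
    """relevances: graded relevance of the returned results, in ranked order.
    Normalizes DCG@k by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal else 0.0

print(ndcg_at_k([3, 2, 2, 1, 0]))  # ideal ordering -> 1.0
print(ndcg_at_k([3, 2, 0, 1, 2]))  # same results, worse order -> below 1.0
```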
| bucket | cluster | similarity | builder | QI factors | QI method | boost templates | title+redirects ngrams | DYM reverse field |
|--------|---------|------------|---------|------------|-----------|-----------------|------------------------|-------------------|
| bm25:control | eqiad | Lucene TF-IDF | QueryString allfield | incoming links | (similarity+phraseboost)*log10(qi+2) | yes | no | no |
| bm25:allfield | codfw | BM25 | QueryString allfield | incoming links | (similarity+phraseboost) + ∑(weight*satu(qi factor)) | no | no | no |
| bm25:inclinks | codfw | BM25 | per field | incoming links | (similarity+phraseboost) + ∑(weight*satu(qi factor)) | no | yes | no |
| bm25:inclinks_pv | codfw | BM25 | per field | incoming links, pageviews | (similarity+phraseboost) + ∑(weight*satu(qi factor)) | no | yes | no |
| bm25:inclinks_pv_rev | codfw | BM25 | per field | incoming links, pageviews | (similarity+phraseboost) + ∑(weight*satu(qi factor)) | no | yes | yes |
satu is: satu(value) = valueᵃ / (valueᵃ + kᵃ)
a and k are constants:
| bucket | inclinks weight | inclinks k | inclinks a | pageviews weight | pageviews k | pageviews a |
|--------|-----------------|------------|------------|------------------|-------------|-------------|
Pageviews are stored in the index as weeklypageviews/sum(pageviews for the project); the values are very low, which is why the pageviews k is so low.
The weights cannot be compared with each other when the query builder differs: bm25:allfield works on a single field while bm25:inclinks uses a per-field approach, so scores from text features are higher.
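The rescore arithmetic described above can be sketched as follows; all weights and constants here are made-up placeholders, not the production values from the table.

```python
def satu(value, k, a=1.0):
    # Saturation function: value^a / (value^a + k^a), bounded in [0, 1).
    return value ** a / (value ** a + k ** a)

def rescore(similarity, phrase_boost, qi_factors):
    """(similarity + phraseboost) + sum(weight * satu(qi)), as used by the
    bm25:* buckets. qi_factors is a list of (value, weight, k, a) tuples."""
    return (similarity + phrase_boost) + sum(
        w * satu(v, k, a) for v, w, k, a in qi_factors)

# Hypothetical numbers: 12000 incoming links, and a tiny normalized
# pageview share (weekly views / project total) with an accordingly tiny k.
score = rescore(similarity=18.5, phrase_boost=2.0,
                qi_factors=[(12000, 10.0, 30, 1.0),    # incoming links
                            (3e-6, 1.0, 8e-6, 1.0)])   # pageviews
print(round(score, 3))  # 30.748
```

Because satu is bounded, a page with millions of incoming links cannot contribute more than `weight` to the final score, unlike the control's unbounded log10 multiplier.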
After I mentioned that I've been using string edit distance and hierarchical clustering to group queries together, Erik suggested that I also look at search results in CirrusSearchRequestSet in Hive. The idea is that queries that share a result are probably reformulations.
To that end, I've imported TSS2 data into Hive and have the following JOIN: https://phabricator.wikimedia.org/P4095 (Noting this for future reference.)
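The shared-result heuristic can be sketched like this (the row shape is an illustrative stand-in for the Hive join output, not the actual CirrusSearchRequestSet schema):

```python
# Toy rows: (session, query, set of top result page ids).
rows = [
    ("s1", "qlbert einstein", {101, 102}),
    ("s1", "albert einstein", {101, 103}),
    ("s1", "weather today",   {500}),
]

def group_by_shared_result(rows):
    """Queries that share any result page id end up in the same group;
    such groups are likely chains of reformulations."""
    groups = []  # list of (page_id set, query set)
    for _, query, pages in rows:
        merged_queries = {query}
        merged_pages = set(pages)
        rest = []
        for gp, gq in groups:
            if gp & merged_pages:          # overlap -> merge the groups
                merged_queries |= gq
                merged_pages |= gp
            else:
                rest.append((gp, gq))
        groups = rest + [(merged_pages, merged_queries)]
    return [q for _, q in groups]

print(group_by_shared_result(rows))
# "qlbert einstein" and "albert einstein" share page 101, so they group
# together; "weather today" stays on its own.
```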
Second draft up on https://wikimedia-research.github.io/Discovery-Search-Test-BM25/ (thanks @TJones for the color-coding-in-text idea!)
(Trying something new this time. I put all the feedback into a - [x] task-list format. Changes between 1st and 2nd drafts are documented in https://github.com/wikimedia-research/Discovery-Search-Test-BM25/blob/master/docs/CHANGELOG.md)
@dcausse Could you please explain the difference between the all-field query builder and the per-field query builder?
The allfield approach combines raw term frequency and field weights at index time: it creates an artificial field where the content of each field is copied n times:
- A word in the title is copied 20 times
- A word in a redirect is copied 15 times
At query time we query this single weighted field.
The per field builder approach combines scores of individual fields at query time.
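One way to picture the difference, as a deliberately simplified sketch (not the actual CirrusSearch query builders): the all-field builder bakes weights in at index time by repeating tokens, while the per-field builder keeps fields separate and weights their scores at query time.

```python
def build_all_field(doc, weights):
    """Index-time weighting: copy each field's tokens `weight` times
    into one artificial field, which is then scored on its own."""
    tokens = []
    for field, weight in weights.items():
        tokens.extend(doc.get(field, "").split() * weight)
    return tokens

def tf(term, tokens):
    # Naive term-frequency "score" standing in for the similarity.
    return tokens.count(term)

def per_field_score(doc, weights, term, score_fn):
    """Query-time weighting: score each field separately, then combine
    the weighted per-field scores."""
    return sum(w * score_fn(term, doc.get(f, "").split())
               for f, w in weights.items())

doc = {"title": "albert einstein", "text": "physicist born in ulm"}
weights = {"title": 20, "text": 1}

all_field = build_all_field(doc, weights)
print(tf("albert", all_field))                      # 20: weight baked in
print(per_field_score(doc, weights, "albert", tf))  # 20: weight applied at query time
```

With this naive linear tf the two approaches coincide, but BM25 saturates term frequency, so the index-time copies stop adding up linearly; that is one way to see why the all-field approach is ill-suited to the BM25 similarity.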
Comments on the report:
BM25 is a similarity function of the TF-IDF family; it'd be more exact to state that we ran a test to measure the difference between Okapi BM25 and the Lucene classic similarity.
Excellent use of color—even beyond what I was suggesting. Love it!
One minor technical glitch: under Background | PaulScore, the link to "PaulScore Definition" doesn't work. I tried it in Chrome, Safari, and Firefox. The TOC link works, though, which is weird.