
Analyze results of BM25 AB test
Closed, Resolved · Public · 6 Estimated Story Points

Event Timeline

debt triaged this task as Medium priority. Aug 25 2016, 10:11 PM

The plan is to test 5 buckets:

  1. bm25:control

Identical to what we serve to our users today, plus an artificial latency to compensate for the fact that we run the other buckets in another datacenter.
discernatron nDCG@5 score: 0.2772

  2. bm25:allfield

Here we use the same query builder as bm25:control, but we switched the similarity function to BM25 and to a weighted sum for the incoming-links query-independent factor. We expect this bucket to behave poorly in terms of click-through compared to the control group.
We test this to confirm our assumption that the current query builder and the allfield approach are not designed for the BM25 similarity.
discernatron nDCG@5 score: 0.2689

  3. bm25:inclinks

Here we switch to a per-field query builder using only incoming links as a query-independent factor. This is the best contender according to Discernatron. We expect an increase in click-through because it tends to rank obvious matches in the top 3.
discernatron nDCG@5 score: 0.3362

  4. bm25:inclinks_pv

Similar to bm25:inclinks, but we added pageviews as an additional query-independent factor; the weight for pageviews is still very low compared to incoming links. This test is mostly to see how pageviews could affect the ranking. We expect a very minimal difference in behavior compared to bm25:inclinks.
discernatron nDCG@5 score: 0.3359

  5. bm25:inclinks_pv_reverse

Similar to bm25:inclinks_pv, with an additional field to catch typos in the first 2 characters. Today the "did you mean" suggestion engine is unable to suggest a fix for the query "qlbert einstein". We expect a slight decrease in zero-result rate and hopefully an increase in click-through rate. This test is added to measure the benefit of such a field; did-you-mean suggestions are not great today, and the question here is: will this increase noise and produce more annoying suggestions, or will it help our users?
discernatron nDCG@5 score: 0.3359 (this feature can't really be tested with Discernatron today)

Overall we should see a slight decrease in ZRR for buckets 3/4/5 because of the new query builder; ZRR should be almost identical between buckets 1 and 2, and if that's not the case it's either a sampling issue or an inconsistency between the two clusters.
We hope to see an increase in click-through for buckets 3/4/5 due to the per-field scoring approach; if that's not the case, it probably means that the tuning done with Discernatron is not appropriate when applied to real-world usage.
Finally, we'd like to confirm that we can trust the nDCG scores as a measure for offline testing by seeing the same variation in click-through rate between buckets.
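As a rough illustration of the offline metric used above, here is a minimal sketch of nDCG@5 over graded relevance labels. Discernatron's exact gain scale and normalization aren't shown in this task, so the grading values below are illustrative, not the production scorer.

```python
import math

def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain over the top-k results."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=5):
    """nDCG@k: DCG of the observed ranking divided by the DCG of the
    ideal (best possible) ordering of the same relevance labels."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# A perfectly ordered ranking scores 1.0; demoting good results lowers it.
ndcg_at_k([3, 2, 1, 0, 0])  # -> 1.0
ndcg_at_k([0, 1, 2, 3, 0])  # < 1.0
```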

Detailed features

| bucket | cluster | similarity | builder | QI factors | QI method | boost templates | title+redirects ngrams | DYM reverse field |
|---|---|---|---|---|---|---|---|---|
| bm25:control | eqiad | lucene tf/idf | QueryString allfield | incoming links | (similarity+phraseboost)*log10(qi+2) | yes | no | no |
| bm25:allfield | codfw | BM25 | QueryString allfield | incoming links | (similarity+phraseboost) + ∑(weight*satu(qi factor)) | no | no | no |
| bm25:inclinks | codfw | BM25 | per field | incoming links | (similarity+phraseboost) + ∑(weight*satu(qi factor)) | no | yes | no |
| bm25:inclinks_pv | codfw | BM25 | per field | incoming links, pageviews | (similarity+phraseboost) + ∑(weight*satu(qi factor)) | no | yes | no |
| bm25:inclinks_pv_rev | codfw | BM25 | per field | incoming links, pageviews | (similarity+phraseboost) + ∑(weight*satu(qi factor)) | no | yes | yes |

satu is: valueᵃ / (valueᵃ + kᵃ), where a and k are constants:

| bucket | inclinks weight | inclinks k | inclinks a | pageviews weight | pageviews k | pageviews a |

Pageviews are stored in the index as weekly pageviews / sum(pageviews for the project); the values are very low, which is why pageviews k is so low.
The weights cannot be compared with each other if the query builder is different: bm25:allfield works on a single field while bm25:inclinks is a per-field approach, thus scores from text features are higher.
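To make the scoring shape concrete, here is a minimal sketch of the satu saturation function and the QI method used by the BM25 buckets, as described above. The helper `rescore` and its tuple layout are illustrative names, not the actual CirrusSearch implementation.

```python
def satu(value, k, a):
    """Saturation function: value^a / (value^a + k^a).
    Grows toward 1 as value increases; k is the half-saturation point."""
    return value**a / (value**a + k**a)

def rescore(similarity, phraseboost, qi_factors):
    """Score shape for the BM25 buckets, per the table above:
    (similarity + phraseboost) + sum(weight * satu(qi factor)).
    qi_factors is a list of (value, weight, k, a) tuples, e.g. one
    for incoming links and one for normalized pageviews."""
    return (similarity + phraseboost) + sum(
        w * satu(v, k, a) for v, w, k, a in qi_factors)

satu(100, 100, 1)  # -> 0.5: a value equal to k yields half saturation
```

Because satu is bounded by 1, a popular page can never dominate the text score by more than its factor's weight, unlike the unbounded log10(qi+2) multiplier used by the control bucket.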

mpopov set the point value for this task to 6.
mpopov moved this task from Backlog to In progress on the Discovery-Analysis (Current work) board.

After I mentioned that I've been using string edit distance and hierarchical clustering to group queries together, Erik suggested that I also look at search results in CirrusSearchRequestSet in Hive. The idea is that queries that share a result are probably reformulations.
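For reference, the edit-distance grouping mentioned above can be sketched with the stdlib alone; this uses a union-find with a distance threshold as a simple stand-in for full hierarchical clustering, so the threshold and helper names are illustrative.

```python
from itertools import combinations

def levenshtein(a, b):
    """Classic dynamic-programming string edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def group_queries(queries, max_dist=3):
    """Single-linkage grouping: queries within max_dist edits of each
    other end up in the same cluster."""
    parent = list(range(len(queries)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in combinations(range(len(queries)), 2):
        if levenshtein(queries[i], queries[j]) <= max_dist:
            parent[find(i)] = find(j)
    clusters = {}
    for i, q in enumerate(queries):
        clusters.setdefault(find(i), []).append(q)
    return list(clusters.values())

group_queries(["albert einstein", "qlbert einstein", "bm25 ranking"])
# "qlbert einstein" clusters with "albert einstein"; "bm25 ranking" stands alone
```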

To that end, I've imported TSS2 data into Hive and have the following JOIN: (Noting this for future reference.)
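In miniature, the shared-result idea looks like this: invert the query-to-results mapping and flag queries that returned a common page. This is a toy stand-in for the actual Hive JOIN against CirrusSearchRequestSet; the page IDs and function name are made up for illustration.

```python
from collections import defaultdict

def reformulation_candidates(query_results):
    """query_results: query -> set of result page IDs. Any page returned
    for two or more distinct queries links those queries as likely
    reformulations of each other."""
    by_page = defaultdict(set)
    for query, pages in query_results.items():
        for page in pages:
            by_page[page].add(query)
    return [qs for qs in by_page.values() if len(qs) > 1]

reformulation_candidates({
    "qlbert einstein": {101, 205},
    "albert einstein": {101, 102},
    "bm25": {300},
})
# -> [{"qlbert einstein", "albert einstein"}]  (they share page 101)
```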

@mpopov Thanks for another great report! Comments sent by email.

@mpopov Great report!!! I've sent comments by email. :)

Second draft up on (thanks @TJones for the color-coding-in-text idea!)

(Trying something new this time. I put all the feedback into a - [x] task-list format. Changes between 1st and 2nd drafts are documented in

@dcausse Could you please explain the difference between the all-field query builder and the per-field query builder?

@mpopov sure,

The allfield approach combines raw term frequency and field weights at index time: it creates an artificial field where the content of each field is copied n times:

  • A word in the title is copied 20 times
  • A word in a redirect is copied 15 times

At query time we query this single pre-weighted field.

The per-field builder approach combines the scores of individual fields at query time.
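A minimal sketch of the contrast described above, using the 20x/15x copy counts from the comment; the scoring here is naive term counting rather than real Lucene tf-idf/BM25, so the numbers only illustrate where the weighting happens, not actual scores.

```python
# Toy document with two fields; copy counts / weights are illustrative.
doc = {"title": "albert einstein", "redirect": "einstein"}

def allfield_text(doc, copies={"title": 20, "redirect": 15}):
    """Index-time: build one artificial field where each source field's
    words are repeated, baking the field weights into term frequency."""
    words = []
    for field, text in doc.items():
        words.extend(text.split() * copies[field])
    return " ".join(words)

def per_field_score(doc, term, weights={"title": 20, "redirect": 15}):
    """Query-time: score each field separately (here: naive term counts)
    and combine the weighted per-field scores."""
    return sum(w * doc[field].split().count(term)
               for field, w in weights.items())

allfield_text(doc).count("einstein")  # 35: 20 title copies + 15 redirect copies
per_field_score(doc, "einstein")      # 35 too, but computed per field at query time
```

The end result differs in practice because index-time copying distorts term statistics (document length, IDF) in a way BM25's saturation was not designed for, which is what the bm25:allfield bucket is meant to demonstrate.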

Comments on the report:
BM25 is a similarity function of the TF-IDF family; it'd be more exact to state that we ran a test to measure the difference between Okapi BM25 and the Lucene classic similarity.

Excellent use of color—even beyond what I was suggesting. Love it!

One minor technical glitch: under Background | PaulScore, the link to "PaulScore Definition" doesn't work. I tried it in Chrome, Safari, and Firefox. The TOC link works, though, which is weird.

@dcausse @TJones thanks, guys!

Okay, final draft is up. Good job, everyone~