
Evaluate the phrase rescore
Closed, Resolved · Public

Description

Sometimes the phrase rescore seems to be overboosted and can cause the first search results page to be flooded by sub-optimal results.

  • The phrase rescore is applied to the all field, which means that if a category matches perfectly, all of its articles have a good chance of appearing on the first results page:
    • We should maybe add a new config option to set the list of fields to which the phrase rescore is applied (text only?)
    • We should review the boost value of 10, which seems very high. Titles are currently underboosted because of the all field, and a high boost for the phrase rescore does not really help.
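
To make the two proposals concrete, here is a minimal sketch of a phrase rescore clause in the Elasticsearch rescore format, with the field list and the boost as parameters. The function name, the default window size and the slop value are placeholders chosen for illustration; this is not the actual CirrusSearch code or configuration.

```python
def build_phrase_rescore(query_text, fields, boost=10.0, window_size=512, slop=1):
    """Build an Elasticsearch-style rescore clause restricted to `fields`.

    Hypothetical helper: only `boost=10.0` comes from the task description;
    `window_size` and `slop` defaults are placeholders.
    """
    if not fields:
        raise ValueError("at least one field is required")
    return {
        # Only the top `window_size` hits per shard are rescored.
        "window_size": window_size,
        "query": {
            # Weight of the original query score.
            "query_weight": 1.0,
            # Weight of the phrase match: this is the boost under review.
            "rescore_query_weight": boost,
            "rescore_query": {
                "multi_match": {
                    "query": query_text,
                    "type": "phrase",
                    # Restrict the phrase match to the configured fields
                    # instead of the catch-all field.
                    "fields": list(fields),
                    "slop": slop,
                }
            },
        },
    }


# Example: phrase rescore on the text field only, with a boost of 1.
rescore = build_phrase_rescore("phrase rescore", ["text"], boost=1.0)
```

With a clause shaped like this, limiting the rescore to text and lowering the boost are two independent knobs, which would make it easier to evaluate them separately.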

Event Timeline

dcausse created this task. Feb 25 2016, 1:32 PM
Restricted Application added a project: Discovery. Feb 25 2016, 1:32 PM
Restricted Application added subscribers: StudiesWorld, Aklapper.
dcausse added a comment. Edited Mar 9 2016, 1:51 PM

I used Erik's new engineScore to run some evaluations.
The score seems to confirm our assumption that a boost of 10 is too high.
The preferred value is around 1. The score also suggests increasing the window size to 650.
Full details: https://docs.google.com/document/d/1upbJo5fB0i2N8C7k_9ySS4iSnjLniRL1zgNon9BuQ7g/edit?usp=sharing
The impact is minimal from the point of view of this score, but I hope it will have a larger impact in practice.

I'd suggest either changing the value directly, without an A/B test, and collecting new metrics to re-run the engine score, or running an A/B test.
Since this score tends to prefer the current settings, I'm curious to see what happens if we re-run this test on data collected with cirrus configured with a phrase boost of 1. Will it again suggest decreasing the value?

We could run relcomp and extract the top 20 unordered to double-check that the impact is not too high, so that the change can be pushed to prod safely.
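
As a rough illustration of that check, the sketch below computes the share of queries whose top N results differ between two runs. It assumes "unsorted" means comparing the top N as sets and "sorted" means comparing them as ordered lists; the function name and the input shape (query mapped to a ranked list of page ids) are hypothetical and this is not relcomp's implementation.

```python
def top_n_differ_rates(baseline, candidate, n=20):
    """Return (unsorted_rate, sorted_rate) for the top `n` results.

    `baseline` and `candidate` map each query string to a ranked list of
    page ids. Only queries present in both runs are compared.
    """
    queries = baseline.keys() & candidate.keys()
    if not queries:
        raise ValueError("no queries in common")
    unsorted_diff = 0
    sorted_diff = 0
    for q in queries:
        a = list(baseline[q][:n])
        b = list(candidate[q][:n])
        # Unsorted: the two top-n sets contain different pages.
        if set(a) != set(b):
            unsorted_diff += 1
        # Sorted: different pages or the same pages in a different order.
        if a != b:
            sorted_diff += 1
    total = len(queries)
    return unsorted_diff / total, sorted_diff / total


# Tiny example with n=3: one reordered query, one with a new page.
base = {"q1": [1, 2, 3], "q2": [4, 5, 6], "q3": [7, 8, 9], "q4": []}
cand = {"q1": [1, 2, 3], "q2": [5, 4, 6], "q3": [7, 8, 10], "q4": []}
rates = top_n_differ_rates(base, cand, n=3)
```

Under these assumptions the unsorted rate can never exceed the sorted rate, since any change in the set of pages is also a change in the ordered list.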

As suggested by @TJones I ran another optimisation to make sure that there is no better point between 1 and 10.
According to the last graph it seems clear that the score prefers a lower phrase boost (google doc updated).

As suggested by @EBernhardson we should maybe run an A/B test.

Copying and pasting the analysis run by @TJones on the impact of the change.
The change in the engine score is very small, but other tools in the rel forge show a non-negligible impact.
I'd say this is good news, because the change has a non-negligible impact on the top 20: ~20% of the queries have different pages in the top 20.

I ran the default (10) vs wgCirrusSearchPhraseRescoreBoost of 1, and 0.1, and then for fun I compared 1 and 0.1. I was hoping there would be a reasonable gradient of change (i.e., 10 vs 1 + 1 vs 0.1 would be approximately 10 vs 0.1), but it looks like things get pretty mixed up at each step.

10 vs 1
Metrics:

Query Count: 1000
Zero Results: 24.3%
Top 3 Unsorted Results Differ: 12.2%
Top 3 Sorted Results Differ: 14.5%
Top 5 Unsorted Results Differ: 15.1%
Top 5 Sorted Results Differ: 20.5%
Top 20 Unsorted Results Differ: 20.4%
Top 20 Sorted Results Differ: 31.8%

10 vs 0.1
Metrics:

Query Count: 1000
Zero Results: 24.3%
Top 3 Unsorted Results Differ: 26.8%
Top 3 Sorted Results Differ: 28.8%
Top 5 Unsorted Results Differ: 29.6%
Top 5 Sorted Results Differ: 32.1%
Top 20 Unsorted Results Differ: 30.2%
Top 20 Sorted Results Differ: 34.2%

1 vs 0.1
Metrics:

Query Count: 1000
Zero Results: 24.3%
Top 3 Unsorted Results Differ: 22.9%
Top 3 Sorted Results Differ: 25.8%
Top 5 Unsorted Results Differ: 27.6%
Top 5 Sorted Results Differ: 30.6%
Top 20 Unsorted Results Differ: 28.9%
Top 20 Sorted Results Differ: 33.5%
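
To put a number on "mixed up at each step", the sketch below adds the two step-wise percentages and subtracts the direct 10 vs 0.1 figure for each metric. The percentages are copied from the three comparisons above; the dictionary layout and function name are only for illustration.

```python
# Percent of queries whose results differ, as
# (10 vs 1, 1 vs 0.1, 10 vs 0.1), copied from the metrics above.
DIFFS = {
    "Top 3 Unsorted": (12.2, 22.9, 26.8),
    "Top 3 Sorted": (14.5, 25.8, 28.8),
    "Top 5 Unsorted": (15.1, 27.6, 29.6),
    "Top 5 Sorted": (20.5, 30.6, 32.1),
    "Top 20 Unsorted": (20.4, 28.9, 30.2),
    "Top 20 Sorted": (31.8, 33.5, 34.2),
}


def additivity_gap(step_a, step_b, direct):
    """Percentage points by which the two steps overshoot the direct diff."""
    return round(step_a + step_b - direct, 1)


gaps = {name: additivity_gap(*values) for name, values in DIFFS.items()}
for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f} points")
```

A gap near zero would mean the two steps changed mostly different queries. A query that differs between 10 and 0.1 must differ in at least one step, so the gap cannot be negative; a positive gap counts queries that changed at both steps, plus any that changed and then reverted.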
Deskana closed this task as Resolved. Mar 24 2016, 4:33 PM
Deskana added a subscriber: Deskana.

Closing this one. In the end, this led to T129593.