
Evaluate precision of Discernatron results vs. our retrieval query
Closed, Declined · Public

Description

Discernatron data contains search results from multiple search engines. To evaluate what in our retrieval phase is too strict, we could check whether our retrieval query pulls in the results from those other search engines within the rescore window (top 8k? top 4k?). Possibly slice the data by aggregate user score on Discernatron.

Event Timeline

EDIT: Please note that all the data below is just made up to show the possible formats.

Since the search engines have different recall profiles—with one being famously expansive in its recall—it makes sense to score them independently. We also have the human-scored results for <150 queries, graded on a 0-4 scale. I'm a big fan of tables of numbers, but not everyone is, so feel free to consider paring this down significantly.

For human-scored results, stratify them by score: D0 = Discernatron score of 0-0.49, D1 = 0.5-1.49, and so on up to D4 = 3.5-4.0. See how many make it into a rescore window of size n ∈ {10, 100, 1024, 2048, 4096, 8192}—or whatever makes sense (a sketch of the computation follows below). 10 and 100 are there out of curiosity to see how the initial ranking does, and 1K, 2K, 4K, and 8K are plausible rescore window sizes. Results could look like this:

Window    D4     D3     D2     D1     D0
10        98%    88%    51%    18%     2%
100      100%    93%    85%    67%    37%
1024     100%    99%    94%    92%    79%
2048     100%   100%   100%   100%    95%
4096     100%   100%   100%   100%    98%
8192     100%   100%   100%   100%    99%

I'd interpret this as: we're getting almost all of the truly great results (D4) and lots of the good results (D3) in the top 10 before LTR, and almost all in the top 100. Almost everything that matters (D2 and up) is in the top 1K and everything remotely plausible (D1 and up) is in the top 2K, so 2K is a reasonable rescore window for preserving historical performance—that is, we aren't losing any known-good results, and of course there are plenty more potentially good results to go into the LTR rescorer.
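For what it's worth, here's a minimal sketch of how these numbers could be computed. All the names are hypothetical: `graded` would map each query to its Discernatron page scores, and `retrieved` would map each query to the (deep) ranked list returned by our retrieval query.

```
from collections import defaultdict

WINDOWS = [10, 100, 1024, 2048, 4096, 8192]

def bucket(score):
    """Map a 0-4 Discernatron score to D0..D4 (D0 = 0-0.49, D1 = 0.5-1.49, ...)."""
    return "D%d" % min(4, int(score + 0.5))

def bucketed_recall(graded, retrieved):
    """graded: query -> {page_id: score}; retrieved: query -> ranked page_id list."""
    hits = {w: defaultdict(int) for w in WINDOWS}   # hits[window][bucket]
    totals = defaultdict(int)                       # totals[bucket]
    for query, scores in graded.items():
        rank = {page: i for i, page in enumerate(retrieved.get(query, []))}
        for page, score in scores.items():
            b = bucket(score)
            totals[b] += 1
            for w in WINDOWS:
                if page in rank and rank[page] < w:
                    hits[w][b] += 1
    # Fraction of each bucket's graded results that land inside each window.
    return {w: {b: hits[w][b] / totals[b] for b in totals} for w in WINDOWS}
```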

Similarly, for a given search engine, S1 would be all #1 results across all queries, S3 would be all top-3 results, etc., for S5, S10, and S20. S10 and S20 are mostly for curiosity, since results beyond the top 5 are often much less important. (The same computation applies; see the sketch after the table.)

Window    S1     S3     S5     S10    S20
10        96%    45%    38%    30%    15%
100      100%    67%    47%    36%    23%
1024     100%    99%    94%    82%    78%
2048     100%   100%   100%    99%    96%
4096     100%   100%   100%   100%    98%
8192     100%   100%   100%   100%    98%

Again, a 1K rescore window would include almost everything that's known to be likely good, and 2K really covers everything.
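The earlier sketch adapts to the per-engine buckets; the only wrinkle is that the S buckets are cumulative (S1 ⊆ S3 ⊆ S5 ...), since S3 means "in that engine's top 3". Again, `engine_results` is a hypothetical name for one engine's ranked lists.

```
S_BUCKETS = [1, 3, 5, 10, 20]

def engine_recall(engine_results, retrieved, windows=(10, 100, 1024, 2048, 4096, 8192)):
    """engine_results: query -> one engine's ranked page_id list."""
    hits = {w: {k: 0 for k in S_BUCKETS} for w in windows}
    totals = {k: 0 for k in S_BUCKETS}
    for query, engine_top in engine_results.items():
        rank = {page: i for i, page in enumerate(retrieved.get(query, []))}
        for k in S_BUCKETS:
            for page in engine_top[:k]:   # cumulative: top-k includes top-(k-1)
                totals[k] += 1
                for w in windows:
                    if page in rank and rank[page] < w:
                        hits[w][k] += 1
    return {w: {k: hits[w][k] / totals[k] for k in S_BUCKETS if totals[k]}
            for w in windows}
```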

An alternative, which generates 50% more numbers (yay!), would be to show a diff table in addition to the before (baseline) and after (delta) tables. Here they are all combined into one table (a sketch for generating this layout follows after the table):

SE#1     | Baseline                      | Delta                         | Diff
Window   |  S1    S3    S5    S10   S20  |  S1    S3    S5    S10   S20  |  S1    S3    S5    S10   S20
10       |  96%   45%   38%   30%   15%  |  94%   45%   42%   33%   21%  |  -2%    0%   +4%   +3%   +6%
100      | 100%   67%   47%   36%   23%  |  99%   69%   58%   39%   24%  |  -1%   +2%  +11%   +3%   +1%
1024     | 100%   99%   94%   82%   78%  | 100%   99%   95%   80%   79%  |   0%    0%   +1%   -2%   +1%
2048     | 100%  100%  100%   99%   96%  | 100%   99%  100%   98%   94%  |   0%   -1%    0%   -1%   -2%
4096     | 100%  100%  100%  100%   98%  | 100%  100%  100%  100%   97%  |   0%    0%    0%    0%   -1%
8192     | 100%  100%  100%  100%   98%  | 100%  100%  100%  100%   98%  |   0%    0%    0%    0%    0%

Overall, that's a lot of numbers: 30 baseline + 30 delta + 30 diff = 90 for the Discernatron data, and then another 90 for each of three search engines, for a grand total of 360 numbers.
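If the per-window results are poured into pandas DataFrames (pandas is just an assumption here; any table library would do), the combined baseline/delta/diff layout is nearly a one-liner, since the diff is simply delta minus baseline:

```
import pandas as pd

def combined_table(baseline: pd.DataFrame, delta: pd.DataFrame) -> pd.DataFrame:
    """baseline/delta: indexed by window size, one column per bucket."""
    blocks = {"Baseline": baseline, "Delta": delta, "Diff": delta - baseline}
    return pd.concat(blocks, axis=1)  # block names become the outer column level
```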

I could imagine only looking at D3 and D4 (separately or combined), and possibly only S3, and only the 2K, 4K, and 8K rescore windows—taking it down to 6×3 = 18 numbers for Discernatron and 3×3 = 9 for each of the 3 search engines, for only 45 numbers—30 if you drop the diffs, 24 if you also combine D3 & D4 (which I did not):

Disc.    | Baseline       | Delta
Window   |  D4     D3     |  D4     D3
2048     | 100%   100%    | 100%    99%
4096     | 100%   100%    | 100%   100%
8192     | 100%   100%    | 100%   100%
S3       | SE#1 base  SE#1 delta | SE#2 base  SE#2 delta | SE#3 base  SE#3 delta
2048     |   100%        99%     |    97%        96%     |    85%        81%
4096     |   100%       100%     |    99%       100%     |    97%        96%
8192     |   100%       100%     |   100%       100%     |    99%        98%

Another option briefly discussed yesterday was color-coding the table cells. That might be overkill, or it might make the diff columns unnecessary even for large tables.

</2¢>

Huh, that's very interesting data! I'm surprised that we have such high recall on this data. We currently set the LTR rescore window to 1024, but since that window applies per shard, with 7 shards we see approximately the top 7k results, which should contain 98% of the top-20 results returned by other search engines, and 100% of the results (in this small set) that graders determined were even a little bit relevant to the query.

It certainly pushes back on my intuition that part of our problem is the recall phase of the query. There are certainly queries with misspellings and such that fail, but in the general case our recall into the set that LTR is able to rescore is quite good.

@EBernhardson—sorry, I should have clarified: that's not real data! I just made it up to show the format and the kinds of info we might see.

debt triaged this task as Medium priority. Nov 9 2017, 6:06 PM
debt moved this task from needs triage to Up Next on the Discovery-Search board.