
Compare search engine nDCG scores to human nDCG scores
Closed, Declined · Public

Description

Discernatron is awesome, and we've been able to get some predictive results for BM25 using only the 300 evaluations of 200 result sets we have so far. However, at current rates of participation, human evaluation won't scale to additional projects and languages. We want to see how well mature search engine results compare to human rankings on nDCG scores.

We could use an a priori scoring heuristic (e.g., the first through third results are "relevant", the fourth and fifth are "mostly relevant", the sixth through tenth are "possibly relevant", and all others are "irrelevant"), or use the current average relevance score for a given engine at a particular rank (e.g., suppose DDG on average scores 2.4 in first position, somewhere between "relevant" and "mostly relevant", 1.8 in second position, etc.).
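As a sketch of the comparison we have in mind, the a priori heuristic above can be turned into per-rank relevance grades and run through a standard nDCG computation. The grade values (3/2/1/0) and the `heuristic_score` helper below are illustrative assumptions, not an agreed-upon scale:

```python
import math

def dcg(relevances):
    # Discounted cumulative gain with the usual log2 position discount.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    # Normalize against the ideal (descending) ordering of the same grades.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def heuristic_score(rank):
    # Hypothetical mapping of the a priori heuristic from the description:
    # ranks 1-3 "relevant", 4-5 "mostly relevant", 6-10 "possibly relevant",
    # everything else "irrelevant".
    if rank <= 3:
        return 3
    if rank <= 5:
        return 2
    if rank <= 10:
        return 1
    return 0

# Grades for a ten-result page under the heuristic; since they are already in
# descending order, this "ideal" page scores an nDCG of exactly 1.0.
grades = [heuristic_score(r) for r in range(1, 11)]
print(ndcg(grades))
```

The same `ndcg` function would take human-assigned grades in result order, so the heuristic-based and human-based scores can be compared set by set across the ~200 result sets.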

While human-scored results have the highest quality, they are lacking in quantity. Computing and comparing nDCG scores for the ~200 result sets that have human ratings will let us ascertain whether these higher-quantity, engine-derived scores are of sufficient quality to serve as a proxy when no human annotation is available.

Event Timeline

debt triaged this task as Medium priority. Oct 13 2016, 1:06 AM
debt added a project: Discovery-ARCHIVED.
debt moved this task from needs triage to Up Next on the Discovery-Search board.
debt subscribed.

Declining this ticket, as we think we have a better solution in T147501.