
Build out a basic tool for running search queries and then recording human judgement of result relevance for use in offline engine scoring
Closed, Resolved · Public

Description

We need a way to run batches of queries offline and evaluate how a change to the engine affected our results. We have some basics built up in Relevance Lab, but we don't calculate any top-level numbers that give a feel for whether the results are really better, or just different.

One possibility is to use nDCG, but this requires a relevance grade from 0..3 for each result. We could build out something in Relevance Lab to let team members (people with access to PII queries) assign these 0..3 scores to results. It might take some time, but we would be able to build up a corpus of query -> result scorings for use in generating metrics.
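For reference, nDCG from 0..3 graded judgements can be computed roughly like this (a minimal sketch using the common 2^rel - 1 gain and log2 rank discount; function names are illustrative, not part of any existing tool):

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: gain = 2^rel - 1, discounted by
    # log2(rank + 1) with 1-based ranks (enumerate is 0-based, hence +2).
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg(relevances, k=None):
    # Normalise by the DCG of the ideal (descending) ordering so that
    # scores fall in [0, 1] and are comparable across queries.
    top = relevances[:k] if k else relevances
    ideal = sorted(relevances, reverse=True)[:k] if k else sorted(relevances, reverse=True)
    ideal_dcg = dcg(ideal)
    return dcg(top) / ideal_dcg if ideal_dcg > 0 else 0.0

# A perfectly ordered result list scores 1.0; a reversed one scores lower.
perfect = ndcg([3, 2, 1, 0])
reversed_order = ndcg([0, 1, 2, 3])
```

Averaging this per-query score over a query set would give the kind of top-level number the description asks for.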

We might have to put some thought into where to store this data so that we can use it when doing evaluations. Perhaps we can work out something with legal where queries are 'vetted' and available for public disclosure after they have gone through the process. The process would likely need an additional step where a reviewer can flag a query as PII and remove it from the query set? Not sure.

Event Timeline

Restricted Application added subscribers: StudiesWorld, Aklapper.

This is moving forward, there will be a site up at http://relevance.wmflabs.org for collecting human judgements. This will be open to anyone with a mediawiki.org user account. Currently working with legal on putting together the appropriate notices before making this live.

re: "One possibility is to use nDCG, but this requires having a measure of relevancy from 0..3 for the results."

nDCG can be computed with any number of relevance labels. E.g. http://research.microsoft.com/pubs/80252/fetterly.pdf uses five judgment labels: “Bad”, “Fair”, “Good”, “Excellent” and “Perfect”.

The number of labels is often a trade-off. People can label faster w/ binary labels (relevant vs. non-relevant), so you'll get more labels in your set. More judgement levels lead to finer results and are useful for finer-grained comparisons. I generally see 3-5 labels; the extreme of 2 is uncommon.
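To make the trade-off concrete, here is a small illustration (the helper is hypothetical, defined inline just for this example) of how collapsing graded judgements to binary can hide ranking differences that graded nDCG still detects:

```python
import math

def ndcg(rels):
    # Minimal nDCG helper (2^rel - 1 gain, log2 rank discount),
    # only for illustrating the effect of label granularity.
    dcg = lambda rs: sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rs))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal else 0.0

graded = [1, 3, 2]                             # 0..3 judgements, in ranked order
binary = [1 if r > 0 else 0 for r in graded]   # collapsed to relevant / not

# Under binary labels every result is "relevant", so the ranking looks
# perfect; graded labels reveal that the best result is not ranked first.
binary_score = ndcg(binary)
graded_score = ndcg(graded)
```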

Deskana assigned this task to EBernhardson.
Deskana subscribed.

Discernatron (https://discernatron.wmflabs.org) has been active for quite some time now. The Discernatron still needs work, but that should be tracked in other tasks; this task, creating the tool, is resolved.