
[EPIC] Estimate the "wasted" computational cost of recall- vs precision-focused configs
Closed, Resolved (Public)

Description

There's an ongoing discussion about whether we should favor recall (maximizing the possibility of getting a result) or precision (maximizing language ID accuracy) in language identification. See "Favoring Recall in Language Identification".

On the one hand, most poorly performing Wikipedia queries that are in an identifiable language are in the language of the Wikipedia where they were submitted, so sending those queries to other wikis may not get results. On the other hand, most poorly performing Wikipedia queries are not "in a language" at all (e.g., junk queries, or names of people, places, and products), so the effect of searching off-wiki may be dominated by those queries.

The task is to run some experiments with queries using various configurations to see how many of the searches that get run don't return anything.

The query sets would include a set of queries known to be in some language, and a random set of all queries (including non-language queries). The configurations would include the precision-focused language set (including the language of the wiki) and the recall-focused language set (excluding the language of the wiki), as well as allowing only one result from TextCat versus allowing multiple results from TextCat (probably just 2 to start). Queries are available from frwiki, eswiki, itwiki, and dewiki.

I should also generate a comparable set from enwiki and test that; earlier test sets for enwiki used only zero-results queries, not the broader class of poorly performing queries (those that have fewer than three results).
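
As a rough illustration of the experiment described above, here is a minimal Python sketch of how the "wasted" searches could be counted for one configuration. Everything in it is hypothetical: the `Config` class, the `wasted_searches` function, the `detect` and `count_results` callables, and the toy data stand in for TextCat and the real search backend, and filtering a ranked guess list by language set is only an approximation of restricting TextCat to a set of language models.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Config:
    """Hypothetical experiment configuration (not an actual TextCat setting)."""
    name: str
    languages: FrozenSet[str]  # languages the detector is allowed to return
    max_results: int           # how many detector guesses to act on


def wasted_searches(
    queries: Iterable[str],
    home_lang: str,
    config: Config,
    detect: Callable[[str], List[str]],        # stand-in for TextCat: ranked guesses
    count_results: Callable[[str, str], int],  # stand-in for searching another wiki
) -> Tuple[int, int]:
    """Return (extra searches run, extra searches that returned nothing)."""
    run = wasted = 0
    for query in queries:
        # Keep only guesses in the configured language set, then cap them.
        allowed = [g for g in detect(query) if g in config.languages]
        for lang in allowed[: config.max_results]:
            if lang == home_lang:
                continue  # already searched on the home wiki; no extra search
            run += 1
            if count_results(query, lang) == 0:
                wasted += 1
    return run, wasted


# Toy data, invented purely for illustration.
RANKED = {
    "chat noir": ["fr", "en"],
    "asdfgh": ["de", "it"],
    "black cat": ["en", "fr"],
}
HITS = {("black cat", "en"): 5}


def toy_detect(query: str) -> List[str]:
    return RANKED.get(query, [])


def toy_count(query: str, lang: str) -> int:
    return HITS.get((query, lang), 0)


PRECISION = Config("precision", frozenset({"fr", "en", "de", "it"}), max_results=1)
RECALL = Config("recall", frozenset({"en", "de", "it"}), max_results=1)
PRECISION_2 = Config("precision-2", PRECISION.languages, max_results=2)

for cfg in (PRECISION, RECALL, PRECISION_2):
    run, wasted = wasted_searches(RANKED, "fr", cfg, toy_detect, toy_count)
    print(f"{cfg.name}: {wasted} of {run} extra searches returned nothing")
```

Running the same function over both query sets (known-language and random) for each configuration would give comparable run and wasted counts per wiki.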

Event Timeline

debt renamed this task from Estimate the "wasted" computational cost of recall- vs precision-focused configs to [EPIC] Estimate the "wasted" computational cost of recall- vs precision-focused configs. Jun 16 2016, 10:06 PM
debt claimed this task.

Looks like these ideas have been covered in other tasks, so this one can be closed. Please re-open if this is incorrect.