[EPIC] Estimate the "wasted" computational cost of recall- vs precision-focused configs
Closed, ResolvedPublic

Assigned To

Authored By

	TJones
	May 23 2016, 7:37 PM

Description

There's an ongoing discussion about whether we should favor recall (maximizing the possibility of getting a result) or precision (maximizing language ID accuracy) in language identification. See "Favoring Recall in Language Identification".

On the one hand, most poor-performing Wikipedia queries that are in a language are in the language of the Wikipedia where they are submitted, so sending them to other wikis may not return results. On the other hand, most poor-performing Wikipedia queries are not "in a language" at all (e.g., junk queries, names of people, places, and products), so the effect of searching off-wiki may be dominated by those queries.

The task is to run some experiments with queries using various configurations to see how many queries are run that don't return anything.

The query sets would include a set of queries known to be in some language and a random set of all queries (including non-language queries). The configurations would vary two settings: the language set (precision-focused, which includes the language of the wiki, or recall-focused, which excludes it) and the number of results allowed from TextCat (only one, or multiple, probably just 2 to start). Queries are available from frwiki, eswiki, itwiki, and dewiki.
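As a rough illustration of the experiment described above, the sketch below enumerates the configuration grid and counts how many cross-wiki searches are run and how many come back empty. It is not the actual test harness: the `detect` and `search` callables, the configuration labels, and the rule that a query identified as the host wiki's language triggers no extra search are all assumptions made for the example.

```python
from itertools import product
from typing import Callable, Iterable, List, Tuple

# Hypothetical labels for the two settings being varied.
LANGUAGE_SETS = ("precision", "recall")  # with / without the wiki's own language
MAX_TEXTCAT_RESULTS = (1, 2)             # "probably just 2 to start"


def configurations() -> List[Tuple[str, int]]:
    """Every (language set, max TextCat results) pair to test."""
    return list(product(LANGUAGE_SETS, MAX_TEXTCAT_RESULTS))


def count_wasted(
    queries: Iterable[str],
    detect: Callable[[str], List[str]],  # hypothetical: query -> language codes
    search: Callable[[str, str], int],   # hypothetical: (language, query) -> hit count
    host_lang: str,
) -> Tuple[int, int]:
    """Return (searches_run, searches_wasted) for one configuration."""
    run = wasted = 0
    for query in queries:
        for lang in detect(query):
            if lang == host_lang:
                # Assumption: no cross-wiki search is needed for the host language.
                continue
            run += 1
            if search(lang, query) == 0:
                wasted += 1  # a search was run and returned nothing
    return run, wasted
```

Running `count_wasted` once per entry of `configurations()`, with a detector set up for that configuration, would give a comparable "wasted" count for each one.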

I should also generate a comparable set from enwiki and test that; earlier test sets for enwiki used only zero-results queries, not poor-performing queries (those with fewer than three results).
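The difference between the two enwiki sampling criteria can be sketched as follows. The threshold of three comes from the definition above; the function names and the `hit_counts` mapping (query text to number of results) are hypothetical and only illustrate the filter.

```python
from typing import Dict, List

# "Poor-performing" means fewer than three results, per the description.
POOR_PERFORMANCE_THRESHOLD = 3


def zero_results(hit_counts: Dict[str, int]) -> List[str]:
    """Queries that returned nothing (the criterion used by earlier enwiki sets)."""
    return [query for query, hits in hit_counts.items() if hits == 0]


def poor_performing(hit_counts: Dict[str, int]) -> List[str]:
    """Queries with fewer than three results (a superset of zero-results queries)."""
    return [
        query
        for query, hits in hit_counts.items()
        if hits < POOR_PERFORMANCE_THRESHOLD
    ]
```

Every zero-results query is also poor-performing, so a sample drawn with the second filter additionally includes queries that returned one or two results.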

Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T118278 [EPIC] Improve Language Identification for use in Cirrus Search
		Resolved		debt	T136034 [EPIC] Estimate the "wasted" computational cost of recall- vs precision-focused configs

Event Timeline

TJones created this task. May 23 2016, 7:37 PM

Restricted Application added a subscriber: Zppix. May 23 2016, 7:37 PM

TJones mentioned this in T121543: Do an A/B Tests on Other Wikis with TextCat for Language Identification. May 25 2016, 2:36 PM

TJones mentioned this in T134431: Re-Optimize Italian, German, Spanish, and French TextCat Languages by Recall. May 26 2016, 1:39 PM

debt subscribed. May 26 2016, 2:45 PM

debt added a project: Discovery-Search (Current work). May 31 2016, 10:10 PM

debt renamed this task from Estimate the "wasted" computational cost of recall- vs precision-focused configs to [EPIC] Estimate the "wasted" computational cost of recall- vs precision-focused configs. Jun 16 2016, 10:06 PM

Looks like these ideas have been in other tasks and this one can be closed. Please re-open if this is incorrect.

EBernhardson moved this task from Incoming to Needs Reporting on the Discovery-Search (Current work) board. May 6 2019, 4:06 PM
