Optimize TextCat maximum returned languages and results ratio
Closed, ResolvedPublic

Assigned To

Authored By

	TJones
	Oct 27 2016, 3:54 PM

Description

See notes on TextCat Internal Quality Control for explanations of maximum returned languages and results ratio.

Try different ranges for max returned languages and results ratio, other than the default 5 and 1.05. Should perhaps be set on a case-by-case basis. Look at numerical limits and proportional limits for max returned languages—some sets of languages are so small that it's hard to hit the limit.
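For readers unfamiliar with the two parameters, here is a minimal sketch of how they might interact. It is an illustration under stated assumptions, not the actual TextCat code: it assumes each language gets a distance-style score where lower is better, that any language scoring within the results ratio of the best score counts as a candidate, and that too many candidates means the input is treated as too ambiguous to identify. The function name `select_languages` and its arguments are hypothetical.

```python
def select_languages(scores, results_ratio=1.05, max_returned=5):
    """Pick candidate languages from a {language: score} mapping.

    Assumptions (not taken from the TextCat source): lower scores are
    better, and an over-long candidate list yields no answer at all.
    """
    if not scores:
        return []
    best = min(scores.values())
    cutoff = best * results_ratio
    # Keep every language whose score is within the ratio of the best one,
    # ordered from best to worst.
    candidates = sorted(
        (lang for lang, score in scores.items() if score <= cutoff),
        key=scores.get,
    )
    # Too many near-ties: treat the result as ambiguous and return nothing.
    if len(candidates) > max_returned:
        return []
    return candidates


example = {"en": 1000, "fr": 1040, "de": 1200}
print(select_languages(example))                  # default limits
print(select_languages(example, max_returned=1))  # no ambiguity allowed
```

With a small language set, a numerical limit of 5 is rarely reached, which is why the description also asks about proportional limits.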

If this is consistent across wiki test sets, determine a useful default. Otherwise, add as an option to be found at language optimization time. Update Perl and PHP versions of TextCat.

Related Objects

Status	Assigned	Task
Open	None	T118278 [EPIC] Improve Language Identification for use in Cirrus Search
Resolved	TJones	T140289 Investigate Improvements and Confidence Measures for TextCat Language Detection
Resolved	TJones	T149321 Optimize TextCat maximum returned languages and results ratio

Event Timeline

TJones created this task. Oct 27 2016, 3:54 PM

Restricted Application added a project: Discovery-Search. Oct 27 2016, 3:54 PM

Restricted Application added a subscriber: Aklapper.

debt moved this task from needs triage to Up Next on the Discovery-Search board. Oct 27 2016, 8:37 PM

TJones moved this task from Up Next to Current work on the Discovery-Search board. Nov 14 2016, 7:50 PM

TJones edited projects, added Discovery-Search (Current work); removed Discovery-Search.

TJones moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board.

I'm working on this TextCat task next because I was using these parameters as test cases for tool dev, and noticed that there seem to be some obvious potential improvements to be made here.

A write-up with full details is available.

Summary:

There's a very strong pressure to optimize the maximum returned languages to 1—no ambiguity allowed!
I didn't intend to consider model size, but I did, and bigger models do better, possibly to the point where we need to go beyond the 5K we currently support in production (I have up to 10K in dev, and 9K is the current best).
The optimal results ratio depends on the model size.

The improvement from optimizing maximum returned languages, results ratio, and model size averages 1.4% to 2.6% in F0.5 score across the nine wikis/corpora (depending on model size), with the poorer performing wikis generally getting bigger improvements. That may seem small, but the "poorly" performing wikis have F0.5 scores in the ~82-83% range, and improve to the ~86-88% range. An improvement from 82% to 88% closes a third of the gap to a perfect score (6 percentage points gained out of 18 possible), so it isn't trivial.
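The arithmetic behind that claim can be checked with a short sketch. The F-beta formula is the standard one (beta = 0.5 weights precision more heavily than recall). The helper names `f_beta` and `share_of_headroom` are hypothetical, and the precision/recall values in the usage lines are made up for illustration, not taken from the evaluation.

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score; beta < 1 favors precision over recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def share_of_headroom(before, after):
    """Fraction of the remaining gap to a perfect score (1.0) that was closed."""
    return (after - before) / (1.0 - before)


# Illustrative values only: high precision, lower recall.
print(round(f_beta(0.9, 0.6), 4))

# The case from the comment: 82% -> 88% F0.5.
print(round(share_of_headroom(0.82, 0.88), 3))
```

Measuring against the remaining headroom rather than the raw score makes gains on already-strong wikis and weaker wikis easier to compare.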

Other notes:

Re-verified that query-based models are helpful.
Accidentally discovered that adding an additional unknown n-gram penalty may improve results. (T151230)
Proportional limits don't matter, since it always optimizes downward until it rounds to 1, so no TextCat code needs to change.

Deployment will occur (T149324) after other related tasks are complete and the interplay between them is worked out.

TJones moved this task from not in use - please delete to Needs review on the Discovery-Search (Current work) board. Nov 21 2016, 7:13 PM

TJones mentioned this in T151230: Consider Additional Unknown n-gram Penalty. Nov 21 2016, 7:17 PM

EBernhardson moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board. Dec 8 2016, 11:48 PM

This is being deployed with the next TextCat deployment, so, resolving as there is no more work to be done.
