Optimize TextCat maximum returned languages and results ratio
Closed, Resolved · Public

Description

See notes on TextCat Internal Quality Control for explanations of maximum returned languages and results ratio.

Try different values for max returned languages and results ratio, other than the defaults of 5 and 1.05. These should perhaps be set on a case-by-case basis. Look at both numerical limits and proportional limits for max returned languages, since some sets of languages are so small that it's hard to hit the limit.

If the best values are consistent across wiki test sets, determine a useful default. Otherwise, add them as options to be determined at language optimization time. Update Perl and PHP versions of TextCat.
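For readers unfamiliar with the two parameters, here is a minimal Python sketch of how they could interact. It is not the TextCat code itself (which is Perl and PHP). It assumes that each candidate language gets a cost where lower is better, that the results ratio keeps every candidate whose cost is within that factor of the best cost, and that a result set larger than max returned languages is rejected as too ambiguous. The function and parameter names are hypothetical.

```python
def filter_candidates(scores, results_ratio=1.05, max_returned_languages=5):
    """Return language codes ordered best-first, or [] if too ambiguous.

    scores: dict mapping language code -> cost.
    Assumption: lower cost means a better match.
    """
    if not scores:
        return []

    # Rank candidates from best (lowest cost) to worst.
    ranked = sorted(scores.items(), key=lambda item: item[1])
    best_cost = ranked[0][1]

    # Keep every candidate whose cost is within results_ratio of the best.
    cutoff = best_cost * results_ratio
    close = [lang for lang, cost in ranked if cost <= cutoff]

    # Assumption: too many close candidates means "no confident answer".
    if len(close) > max_returned_languages:
        return []
    return close


if __name__ == "__main__":
    example = {"en": 1000, "fr": 1040, "de": 1200}
    print(filter_candidates(example))
    print(filter_candidates(example, max_returned_languages=1))
    print(filter_candidates(example, results_ratio=1.02, max_returned_languages=1))
```

Under these assumptions, the defaults return both `en` and `fr` for the example scores. A limit of 1 returns nothing unless the results ratio is tight enough to exclude `fr`.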

TJones created this task. Oct 27 2016, 3:54 PM
Restricted Application added a project: Discovery-Search. Oct 27 2016, 3:54 PM
Restricted Application added a subscriber: Aklapper.
debt moved this task from Needs triage to Up Next on the Discovery-Search board. Oct 27 2016, 8:37 PM
TJones moved this task from Up Next to Current work on the Discovery-Search board. Nov 14 2016, 7:50 PM
TJones moved this task from Backlog to In progress on the Discovery-Search (Current work) board.

I'm working on this TextCat task next because I was using these parameters as test cases for tool dev, and noticed that there seem to be some obvious potential improvements to be made here.

TJones added a comment. Edited Nov 21 2016, 7:13 PM

A write-up with full details is available.

Summary:

  • There's a very strong pressure to optimize the maximum returned languages to 1—no ambiguity allowed!
  • I didn't intend to consider model size, but I did, and bigger models do better, possibly to the point where we need to go beyond the 5K we currently support in production (I have up to 10K in dev, and 9K is the current best).
  • The optimal results ratio depends on the model size.

The improvement from optimizing maximum returned languages, results ratio, and model size averages 1.4% to 2.6% in F0.5 score across the nine wikis/corpora (depending on model size), with the poorer performing wikis generally getting bigger improvements. That may seem small, but the "poorly" performing wikis have F0.5 scores in the ~82-83% range, and improve to the ~86-88% range. An improvement from 82% to 88% is a third of the maximum possible improvement (6% improved/18% possible), so it isn't trivial.
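The arithmetic in the paragraph above can be checked with a short sketch. F0.5 is the standard F-beta score with beta = 0.5, which weights precision more heavily than recall. The helper names below are hypothetical, and the precision and recall values in the usage lines are made up purely for illustration.

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score; beta < 1 favors precision over recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def share_of_headroom(before, after, ceiling=1.0):
    """Fraction of the remaining possible improvement that was achieved."""
    if before >= ceiling:
        raise ValueError("no headroom left to improve")
    return (after - before) / (ceiling - before)


if __name__ == "__main__":
    # Illustrative values only: high precision, lower recall.
    print(round(f_beta(0.9, 0.6), 4))
    # 82% -> 88% closes 6 of the 18 remaining points, i.e. about a third.
    print(round(share_of_headroom(0.82, 0.88), 4))
```

Measuring the gain against the remaining headroom rather than the raw score is what makes a move from 82% to 88% look substantial: the closer a score is to 100%, the more each additional point is worth.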

Other notes:

  • Re-verified that query-based models are helpful.
  • Accidentally discovered that adding an additional unknown n-gram penalty may improve results. (T151230)
  • Proportional limits don't matter, since it always optimizes downward until it rounds to 1, so no TextCat code is changed.

Deployment will occur (T149324) after other related tasks are complete and the interplay between them is worked out.

debt closed this task as Resolved. Jan 31 2017, 5:53 PM
debt added a subscriber: debt.

This is being deployed with the next TextCat deployment, so, resolving as there is no more work to be done.