
Optimize TextCat maximum returned languages and results ratio
Closed, Resolved · Public


See notes on TextCat Internal Quality Control for explanations of maximum returned languages and results ratio.

Try different ranges for max returned languages and results ratio, other than the defaults of 5 and 1.05. These should perhaps be set on a case-by-case basis. Look at both numerical limits and proportional limits for max returned languages: some sets of languages are so small that it's hard to hit the limit.

If this is consistent across wiki test sets, determine a useful default. Otherwise, add as an option to be found at language optimization time. Update Perl and PHP versions of TextCat.
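For readers unfamiliar with the two parameters, here is a minimal Python sketch of how they are assumed to interact. It is not the TextCat code: the function name `pick_languages`, the score dictionary, and the "lower score is better, return nothing when too many languages are in range" behavior are assumptions made for illustration.

```python
def pick_languages(scores, results_ratio=1.05, max_returned_languages=5):
    """Return candidate languages, or [] if the result is too ambiguous.

    scores: dict mapping language code -> distance score (assumed: lower is better).
    results_ratio: a language is a candidate if its score is within
        best_score * results_ratio.
    max_returned_languages: if more candidates than this survive,
        the result is treated as too ambiguous and nothing is returned.
    """
    if not scores:
        return []
    ranked = sorted(scores.items(), key=lambda item: item[1])
    best_score = ranked[0][1]
    cutoff = best_score * results_ratio
    candidates = [lang for lang, score in ranked if score <= cutoff]
    if len(candidates) > max_returned_languages:
        return []
    return candidates


# Made-up scores, for illustration only.
example = {"en": 1000, "fr": 1040, "de": 1200}
print(pick_languages(example))                            # two candidates in range
print(pick_languages(example, max_returned_languages=1))  # too ambiguous
```

Under these assumptions, a small set of languages can rarely produce more than 5 candidates, which is why a fixed numerical limit may never be hit.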

Event Timeline

Restricted Application added a subscriber: Aklapper.

I'm working on this TextCat task next because I was using these parameters as test cases for tool dev, and noticed that there seem to be some obvious potential improvements to be made here.

A write-up with full details is available.


  • There's a very strong pressure to optimize the maximum returned languages to 1—no ambiguity allowed!
  • I didn't intend to consider model size, but I did, and bigger models do better, possibly to the point where we need to go beyond the 5K we currently support in production (I have up to 10K in dev, and 9K is the current best).
  • The optimal results ratio depends on the model size.

The improvement from optimizing maximum returned languages, results ratio, and model size averages 1.4% to 2.6% in F0.5 score across the nine wikis/corpora (depending on model size), with the poorer performing wikis generally getting bigger improvements. That may seem small, but the "poorly" performing wikis have F0.5 scores in the ~82-83% range, and improve to the ~86-88% range. An improvement from 82% to 88% is a third of the maximum possible improvement (6 points gained out of 18 possible), so it isn't trivial.
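To make the arithmetic concrete, here is a small Python sketch of the standard F-beta formula (with beta = 0.5, which weights precision more heavily than recall) and of the "share of possible improvement" calculation. The helper names are hypothetical, and the precision/recall pair passed to `f_beta` is illustrative, not taken from the write-up.

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score; beta < 1 favors precision over recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


def share_of_headroom(before, after, ceiling=1.0):
    """Fraction of the remaining possible improvement that was achieved."""
    return (after - before) / (ceiling - before)


# Illustrative precision/recall values, not measured ones.
print(round(f_beta(0.9, 0.6), 4))

# The 82% -> 88% case from the comment above: 6 points out of 18 possible.
print(round(share_of_headroom(0.82, 0.88), 3))
```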

Other notes:

  • Re-verified that query-based models are helpful.
  • Accidentally discovered that adding an additional unknown n-gram penalty may improve results. (T151230)
  • Proportional limits don't matter, since it always optimizes downward until it rounds to 1, so no TextCat code is changed.
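As a rough illustration of the proportional-limit point, the sketch below converts a proportion of the language set into a count. The function `proportional_limit`, the round-half-up rule, and the floor of 1 are all assumptions for illustration; none of this is TextCat code.

```python
def proportional_limit(proportion, n_languages):
    """Hypothetical proportional limit: a fraction of the language set,
    rounded half up, and never less than 1 (assumed floor)."""
    return max(1, int(proportion * n_languages + 0.5))


# With a small language set, most proportions collapse to a limit of 1.
for proportion in (0.1, 0.2, 0.4):
    print(proportion, proportional_limit(proportion, 3))
```

If optimization always pushes the limit down to 1, a proportional setting and a fixed numerical setting of 1 behave the same, which matches the conclusion that no code change is needed.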

Deployment will occur (T149324) after other related tasks are complete and the interplay between them is worked out.

debt added a subscriber: debt.

This is being deployed with the next TextCat deployment, so, resolving as there is no more work to be done.