Implement Ability to Compare TextCat Scores to Max Cost and Analyze Effect on Accuracy
Closed, Resolved · Public

Description

Evaluate whether we can get a precision boost from ignoring strings whose cost is close to the maximum possible cost (based on the number of ngrams). Strings that have few or no characters in common with a language model will get the maximum cost; for example, when running Arabic text through the English model, only the spaces are found in the model, so everything else gets the max cost. When running a language like Tamil against a collection of models that don't include Tamil, they generally all score close to the max cost. If even the best cost is close to the max cost, it's probably not a good match. Find a reasonable threshold (95% of max? 90%? 80%?) that improves precision.

Evaluate this as a score: is it predictive of general performance throughout the range, or only at the high end?

If this is consistent across wiki test sets, determine a useful default. Otherwise, add as an option to be found at language optimization time. Update Perl and PHP versions of TextCat.
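To make the proposal concrete, here is a minimal sketch of a Cavnar–Trenkle-style out-of-place cost with the proposed max-cost filter. The function names, the `max_rank` penalty, and the 0.9 threshold are all illustrative assumptions, not TextCat's actual API; the real Perl/PHP implementations differ in detail.

```python
# Hypothetical sketch of the max-cost threshold idea; names like
# `rank_model`, `identify`, and the 0.9 default are illustrative only.

def ngrams(text, n_max=5):
    """All character n-grams of the input, for n = 1..n_max."""
    grams = []
    for n in range(1, n_max + 1):
        grams += [text[i:i + n] for i in range(len(text) - n + 1)]
    return grams

def cost(text, rank_model, max_rank=1000):
    """Out-of-place cost: n-grams absent from the model get the
    maximum penalty, so unknown scripts approach the max cost."""
    return sum(rank_model.get(g, max_rank) for g in ngrams(text))

def identify(text, models, max_proportion=0.9, max_rank=1000):
    """Return the best-scoring language, or None when even the best
    cost is too close to the maximum possible cost to trust."""
    max_cost = len(ngrams(text)) * max_rank  # every n-gram unknown
    scored = {lang: cost(text, m, max_rank) for lang, m in models.items()}
    best_lang = min(scored, key=scored.get)
    if scored[best_lang] > max_proportion * max_cost:
        return None  # probably not a real match for any model
    return best_lang
```

With a toy English model, `identify("the", ...)` matches, while a string sharing no n-grams with any model hits the max cost and is rejected.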

TJones created this task. Oct 27 2016, 3:50 PM
Restricted Application added a project: Discovery-Search. Oct 27 2016, 3:50 PM
Restricted Application added a subscriber: Aklapper.
debt moved this task from Needs triage to Up Next on the Discovery-Search board. Oct 27 2016, 8:37 PM
TJones renamed this task from "compare TextCat scores to max cost" to "Implement Ability to Compare TextCat Scores to Max Cost and Analyze Effect on Accuracy". Dec 13 2016, 1:00 AM

First draft of the write-up is available.

I still need to implement this option in the PHP version and commit the Perl version.

Summary:

  • Setting a max proportion of max score (MPMS) has a non-negative effect on F0.5, at least with larger model sizes (e.g., 9K). The gain comes mostly from not allowing wrong-character-set queries to be assigned a language identification at all.
  • The effect on actual junk queries is fairly minimal unless MPMS is rather aggressive (0.75 or below). Despite some corner cases (most addressed by minimum input length), a lot of the worst junk queries will get no results on any wiki, so improperly identifying them as a language is not a huge problem.
  • We probably could get some F0.5 improvements by setting MPMS per language, but I'm generally trying to avoid doing so for all of the non–language-specific features, both to keep things simple and to avoid over-training on relatively small individual corpora. The one most obvious outlier is the Japanese Wikipedia corpus, which has many more Japanese and Chinese queries than any other corpus. The fact that those languages don't use a fairly constrained alphabet for writing could be relevant, so perhaps two settings, one for alphabets and one for non-alphabets, would be warranted. On the other hand, the total fluctuation is always less than 1.5%, and the Japanese corpus already has one of the highest baselines (> 95% F0.5).
  • I have some hope that a separately defined unknown-ngram penalty could help separate out junk queries.
  • As a score, MPMS isn't great. It roughly correlates with quality if broken into a small number of buckets, and junk queries tend to score higher/worse than non-junk queries.
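The "small number of buckets" observation above can be sketched as follows. The bucket edges and labels here are made up for illustration; the write-up does not specify them, only that MPMS roughly tracks quality when coarsely binned.

```python
# Illustrative MPMS-as-score sketch; bucket boundaries (0.5, 0.8)
# and labels are assumptions, not values from the actual analysis.

def mpms(best_cost, num_ngrams, max_rank=1000):
    """Best cost as a proportion of the maximum possible cost."""
    return best_cost / (num_ngrams * max_rank)

def quality_bucket(proportion):
    """Coarse buckets: MPMS only roughly correlates with quality,
    so a few bins are more honest than a continuous score."""
    if proportion < 0.5:
        return "likely good"
    elif proportion < 0.8:
        return "uncertain"
    return "likely junk"
```

Under this sketch, junk queries (which score higher/worse) would tend to land in the last bucket, matching the observation above.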

Deployment to prod will be part of T149324.

Perl version committed to GitHub.

PHP patch is waiting on T153105 to be done first.

Change 328197 had a related patch set uploaded (by Tjones):
Add support for filtering by max proportion of max possible score

https://gerrit.wikimedia.org/r/328197

Change 328197 merged by jenkins-bot:
Add support for filtering by max proportion of max possible score

https://gerrit.wikimedia.org/r/328197

debt closed this task as Resolved.Jan 31 2017, 5:58 PM
debt added a subscriber: debt.

This was already merged into the TextCat repository and will be deployed to production this week, but not turned on until next week (week of Feb 6).