Implement Ability to Compare TextCat Scores to Max Cost and Analyze Effect on Accuracy
Closed, Resolved · Public

Description

Evaluate whether we can get a precision boost from ignoring strings whose cost is close to the maximum possible cost (based on the number of ngrams). Strings that have few or no characters in common with a language model will get the maximum cost; for example, when running Arabic text through the English model, only the spaces are found in the model, so everything else gets the max cost. When running a language like Tamil against a collection of models that don't include Tamil, they generally all score close to the max cost. If even the best cost is close to the max cost, it's probably not a good match. Find a reasonable threshold (95% of max? 90%? 80%?) that improves precision.

Evaluate this as a score: is it predictive of general performance throughout the range, or only at the high end?

If this is consistent across wiki test sets, determine a useful default. Otherwise, add as an option to be found at language optimization time. Update Perl and PHP versions of TextCat.
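To make the proposal concrete, here is a minimal sketch of a Cavnar–Trenkle-style out-of-place cost with the proposed max-cost filter. The function names, the `max_rank` penalty, and the 0.9 threshold are all illustrative assumptions, not TextCat's actual API; the real Perl/PHP implementations differ in detail.

```python
# Hypothetical sketch of the max-cost threshold idea; names like
# `rank_model`, `identify`, and the 0.9 default are illustrative only.

def ngrams(text, n_max=5):
    """All character n-grams of the input, for n = 1..n_max."""
    grams = []
    for n in range(1, n_max + 1):
        grams += [text[i:i + n] for i in range(len(text) - n + 1)]
    return grams

def cost(text, rank_model, max_rank=1000):
    """Out-of-place cost: n-grams absent from the model get the
    maximum penalty, so unknown scripts approach the max cost."""
    return sum(rank_model.get(g, max_rank) for g in ngrams(text))

def identify(text, models, max_proportion=0.9, max_rank=1000):
    """Return the best-scoring language, or None when even the best
    cost is too close to the maximum possible cost to trust."""
    max_cost = len(ngrams(text)) * max_rank  # every n-gram unknown
    scored = {lang: cost(text, m, max_rank) for lang, m in models.items()}
    best_lang = min(scored, key=scored.get)
    if scored[best_lang] > max_proportion * max_cost:
        return None  # probably not a real match for any model
    return best_lang
```

With a toy English model, `identify("the", ...)` matches, while a string sharing no n-grams with any model hits the max cost and is rejected.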

TJones created this task. Oct 27 2016, 3:50 PM
Restricted Application added a project: Discovery-Search. Oct 27 2016, 3:50 PM
Restricted Application added a subscriber: Aklapper.
debt moved this task from Needs triage to Up Next on the Discovery-Search board. Oct 27 2016, 8:37 PM
TJones renamed this task from "compare TextCat scores to max cost" to "Implement Ability to Compare TextCat Scores to Max Cost and Analyze Effect on Accuracy". Dec 13 2016, 1:00 AM

First draft of the write-up is available.

I still need to implement this option in the PHP version and commit the Perl version.

Summary:

  • Setting a max proportion of max score (MPMS) has a non-negative effect on F0.5, at least with larger model sizes (e.g., 9K). The gain comes mostly from not allowing wrong-character-set queries to be assigned a language identification at all.
  • The effect on actual junk queries is fairly minimal unless MPMS is rather aggressive (0.75 or below). Despite some corner cases (most addressed by minimum input length), a lot of the worst junk queries will get no results on any wiki, so improperly identifying them as a language is not a huge problem.
  • We probably could get some F0.5 improvements by setting MPMS per language, but I'm generally trying to avoid doing so for all of the non–language-specific features, both to keep things simple and to avoid over-training on relatively small individual corpora. The one most obvious outlier is the Japanese Wikipedia corpus, which has many more Japanese and Chinese queries than any other corpus. The fact that those languages don't use a fairly constrained alphabet for writing could be relevant, so perhaps two settings, one for alphabets and one for non-alphabets, would be warranted. On the other hand, the total fluctuation is always less than 1.5%, and the Japanese corpus already has one of the highest baselines (> 95% F0.5).
  • I have some hope that a separately defined unknown-ngram penalty could help separate out junk queries.
  • As a score, MPMS isn't great. It roughly correlates with quality if broken into a small number of buckets, and junk queries tend to score higher/worse than non-junk queries.
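The "small number of buckets" observation above can be sketched as follows. The bucket edges and labels here are made up for illustration; the write-up does not specify them, only that MPMS roughly tracks quality when coarsely binned.

```python
# Illustrative MPMS-as-score sketch; bucket boundaries (0.5, 0.8)
# and labels are assumptions, not values from the actual analysis.

def mpms(best_cost, num_ngrams, max_rank=1000):
    """Best cost as a proportion of the maximum possible cost."""
    return best_cost / (num_ngrams * max_rank)

def quality_bucket(proportion):
    """Coarse buckets: MPMS only roughly correlates with quality,
    so a few bins are more honest than a continuous score."""
    if proportion < 0.5:
        return "likely good"
    elif proportion < 0.8:
        return "uncertain"
    return "likely junk"
```

Under this sketch, junk queries (which score higher/worse) would tend to land in the last bucket, matching the observation above.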

Deployment to prod will be part of T149324.

Perl version committed to GitHub.

PHP patch is waiting on T153105 to be done first.

Change 328197 had a related patch set uploaded (by Tjones):
Add support for filtering by max proportion of max possible score

https://gerrit.wikimedia.org/r/328197

Change 328197 merged by jenkins-bot:
Add support for filtering by max proportion of max possible score

https://gerrit.wikimedia.org/r/328197

debt closed this task as Resolved.Jan 31 2017, 5:58 PM
debt added a subscriber: debt.

This was already merged into the TextCat repository and will be deployed to production this week, but not turned on until next week (week of Feb 6).