Evaluate whether we can get a precision boost by ignoring strings whose cost is close to the maximum possible cost (which is determined by the number of n-grams in the string). Strings that have few or no characters in common with a language model get close to the maximum cost; for example, when running Arabic text through the English model, only the spaces are found in the model, so every other n-gram gets the maximum penalty. Similarly, when running a language like Tamil against a collection of models that doesn't include Tamil, all of the models score close to the maximum cost. If even the best cost is close to the maximum, the match is probably spurious. Find a reasonable threshold (95% of max? 90%? 80%?) that improves precision.
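The filtering idea above could be sketched roughly as follows. This is a simplified illustration, not TextCat's actual implementation: the `ngrams`, `cost`, and `detect` functions, the flat `max_penalty` for unseen n-grams, and the rank-based model dictionaries are all assumptions for the sake of the example.

```python
def ngrams(text, max_n=5):
    """All character n-grams of length 1..max_n (TextCat-style)."""
    return [text[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(text) - n + 1)]

def cost(text, model, max_penalty=1000):
    """Sum of rank penalties; n-grams unseen in the model get max_penalty."""
    return sum(model.get(g, max_penalty) for g in ngrams(text))

def detect(text, models, threshold=0.90, max_penalty=1000):
    """Return the best-matching language, or None when even the best
    cost is at least `threshold` times the maximum possible cost."""
    max_cost = len(ngrams(text)) * max_penalty
    best_lang, best_cost = min(
        ((lang, cost(text, m, max_penalty)) for lang, m in models.items()),
        key=lambda p: p[1])
    if best_cost >= threshold * max_cost:
        return None  # too close to max cost: probably no model fits
    return best_lang
```

With a hypothetical English-only model, a Greek string hits the ceiling (no n-gram is found, so the cost equals the maximum) and is rejected, while an English string scores well below the threshold.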
Evaluate this cost-to-maximum ratio as a score: is it predictive of general performance throughout its range, or only at the high end?
If a single threshold works consistently across the wiki test sets, determine a useful default. Otherwise, add it as an option whose value is found at language-optimization time. Update the Perl and PHP versions of TextCat.