Evaluate whether we get a useful precision boost from ignoring strings below a certain length. The results may be inconclusive because really short strings (e.g., one character) have already been excluded from the training data, but we may still get some benefit from ignoring shorter strings, which are more likely to be difficult or ambiguous.
|Allow specification of minimum length for classification||wikimedia/textcat||master||+72 -9|
|Open||None||T118278 [EPIC] Improve Language Identification for use in Cirrus Search|
|Resolved||TJones||T140289 Investigate Improvements and Confidence Measures for TextCat Language Detection|
|Resolved||TJones||T149318 Add support for limiting min input length for TextCat|
Ugh. I just stumbled across a terrible case of this. Searching for a quotation mark on the Japanese or English Wikipedia suggests a page on Hebrew Wikipedia!
I know why, and it's kinda funny, but this is why we can't have nice things.
It makes sense, in hindsight. On enwiki at least, most bits of punctuation only return one or two results, which means language ID gets run on them. Then it's more or less random which language shows up as a match. Many languages have articles on punctuation, so you get results.
This doesn't happen with strings of emoji, which may be identified as some language but get no hits in the other wiki.
One solution, as here, is to refuse to do language ID on short strings. Another is to not return stuff that's a poor match for all languages (like strings of emoji) as in T149320.
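To make the first option concrete, here is a minimal sketch of a length gate in front of a classifier. The names `identify_if_long_enough`, `min_length`, and `fake_classify` are made up for illustration and are not TextCat's actual API; the default of 0 is an assumption meaning "no minimum".

```python
from typing import Callable, Optional


def identify_if_long_enough(
    query: str,
    classify: Callable[[str], Optional[str]],
    min_length: int = 0,
) -> Optional[str]:
    """Return a language code, or None when the query is too short to trust."""
    text = query.strip()
    # With min_length = 0 (assumed default) the gate never fires.
    if len(text) < min_length:
        return None
    return classify(text)


def fake_classify(text: str) -> Optional[str]:
    """Stand-in classifier for the demo only; it always answers Hebrew."""
    return "he"


# A lone quotation mark is rejected once a minimum of 3 is configured.
print(identify_if_long_enough('"', fake_classify, min_length=3))
# A longer query is passed through to the classifier.
print(identify_if_long_enough("שלום עולם", fake_classify, min_length=3))
```

Returning `None` rather than a best guess lets the caller skip the cross-wiki search entirely, which is the behavior wanted for the quotation mark case.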
Another option I just thought of is not doing language ID if there is a result that is an exact title or redirect match for the query. That would fix this problem, but could prevent interesting cross-language matches for something that only has one article on the current wiki. I tried to find examples with place names on enwiki, but enwiki is too thorough. Could happen elsewhere, but I don't have an example in hand yet.
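A rough sketch of that exact-match check, under several assumptions: `should_run_language_id` and `normalize` are hypothetical helpers, `result_titles` is assumed to include redirect titles, the "one or two results" threshold is taken from the comment above, and the case and whitespace folding is a simplification rather than real title normalization.

```python
from typing import Iterable


def normalize(title: str) -> str:
    """Simplified comparison key: collapse whitespace and ignore case."""
    return " ".join(title.split()).casefold()


def should_run_language_id(
    query: str,
    result_titles: Iterable[str],
    max_results: int = 2,
) -> bool:
    """Decide whether a poorly performing query should go to language ID."""
    titles = list(result_titles)
    # Enough results already: no reason to look at other wikis.
    if len(titles) > max_results:
        return False
    wanted = normalize(query)
    # An exact title (or redirect) match means the user probably found
    # what they were looking for, so skip language ID.
    return all(normalize(title) != wanted for title in titles)


print(should_run_language_id('"', ['"']))
print(should_run_language_id("quotation", ["Quotation mark"]))
```

The trade-off is the one described above: a query with a single exact match on the current wiki would never get cross-language results under this rule.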
I found another corner case I hadn't considered before: if the string is made up entirely of non-word characters (like a number), then it is reduced to the empty string, which scores a perfect "0" for all language models! Usually ar (Arabic) is returned because it's first alphabetically.
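The effect is easy to reproduce with a toy rank-order n-gram classifier. This is only an illustration and not TextCat's code: the regex, the penalty value, the tiny `TOY_MODELS`, and the function names are all assumptions, and the alphabetical tie-break is imitated with `sorted`.

```python
import re
from collections import Counter
from typing import Dict, List, Optional

# Assumption: digits, underscores and punctuation all count as non-word.
NON_WORD = re.compile(r"[\W\d_]+")


def profile(text: str, max_n: int = 3) -> List[str]:
    """Return the input's n-grams, most frequent first."""
    counts: Counter = Counter()
    for word in NON_WORD.split(text.lower()):
        if not word:
            continue
        padded = f"_{word}_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common()]


def distance(input_profile: List[str], model: Dict[str, int], penalty: int = 1000) -> int:
    """Out-of-place distance; an empty profile sums to 0 for every model."""
    total = 0
    for rank, gram in enumerate(input_profile):
        total += abs(rank - model[gram]) if gram in model else penalty
    return total


def classify_naive(text: str, models: Dict[str, Dict[str, int]]) -> str:
    """Pick the lowest score; ties go to the alphabetically first language."""
    input_profile = profile(text)
    scores = {lang: distance(input_profile, m) for lang, m in models.items()}
    return min(sorted(scores), key=lambda lang: scores[lang])


def classify_guarded(text: str, models: Dict[str, Dict[str, int]]) -> Optional[str]:
    """Same as classify_naive, but refuse to answer when nothing is left."""
    if not profile(text):
        return None
    return classify_naive(text, models)


# Toy models (n-gram -> rank); real language models are far larger.
TOY_MODELS = {
    "en": {"_t": 0, "th": 1, "he": 2},
    "ar": {"ال": 0},
    "he": {"ה": 0},
}

print(classify_naive("14786430410001720363", TOY_MODELS))
print(classify_guarded("14786430410001720363", TOY_MODELS))
```

In this sketch the unguarded version returns `ar` for the all-digit query because every model ties at 0, while the guarded version returns `None` so no cross-wiki results are fetched.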
As an example, searching enwiki for 14786430410001720363 (part of a DOI number) gets results on arwiki. Ugh.
I'll fix this and upload another patch, and double check that it doesn't affect my earlier analysis in any meaningful way. (I predict no effect, but I'll make sure.)
I've changed the task title to reflect that this doesn't actually implement a minimum query length; it adds the infrastructure in the code so that a minimum query length can be added very easily in the future.