
Add support for limiting min input length for TextCat
Closed, ResolvedPublic

Description

Evaluate whether we get a useful precision boost from ignoring strings below a certain length. Results may be inconclusive because really short strings (e.g., one character) have already been excluded from the training data, but we may still get some benefit from ignoring shorter strings, which are more likely to be difficult or ambiguous.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Ugh. I just stumbled across a terrible case of this. Searching for a quotation mark on the Japanese or English Wikipedia suggests a page on Hebrew Wikipedia!

I know why, and it's kinda funny, but this is why we can't have nice things.

It makes sense, in hindsight. On enwiki at least, most bits of punctuation only return one or two results, which means language ID gets run on them. Then it's more or less random which language shows up as a match. Many languages have articles on punctuation, so you get results.

This doesn't happen with strings of emoji, which may be identified as some language but get no hits in the other wiki.

One solution, as here, is to refuse to do language ID on short strings. Another is to not return stuff that's a poor match for all languages (like strings of emoji) as in T149320.
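The first option, refusing to run language ID on short strings, could be sketched roughly as below. This is a minimal illustration, not the actual TextCat implementation: `detect` is a hypothetical stand-in for the real n-gram classifier, and the threshold value is just an example (the real minimum would be configurable).

```python
def detect(query):
    """Hypothetical stand-in for the real TextCat n-gram classifier."""
    return "en"  # placeholder result

def classify_language(query, min_length=3):
    # Skip language ID entirely for queries shorter than min_length.
    # Short, ambiguous strings (single punctuation marks, one-character
    # queries) are what trigger the bad cross-wiki suggestions above.
    if len(query) < min_length:
        return None
    return detect(query)
```

With a guard like this, searching for a lone quotation mark would simply skip language ID instead of returning an essentially random language.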

Another option I just thought of is not doing language ID if there is a result that is an exact title or redirect match for the query. That would fix this problem, but could prevent interesting cross-language matches for something that only has one article on the current wiki. I tried to find examples with place names on enwiki, but enwiki is too thorough. Could happen elsewhere, but I don't have an example in hand yet.

The initial write-up is available. The Perl version of TextCat has been updated; the PHP version still needs updating. Gerrit should post the patches here.

Change 323998 had a related patch set uploaded (by Tjones):
Allow specification of minimum length for classification

https://gerrit.wikimedia.org/r/323998

I found another corner case I hadn't considered before: if the string is made up entirely of non-word characters (like a number), then it is reduced to the empty string, which scores a perfect "0" for all language models! Usually ar (Arabic) is returned because it's first alphabetically.

As an example, searching enwiki for 14786430410001720363 (part of a DOI number) gets results on arwiki. Ugh.
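The corner case can be illustrated with a rough sketch. The names and the exact stripping regex here are assumptions for illustration; the point is that the minimum-length check has to be applied after preprocessing, because a digit-only query reduces to the empty string, which trivially matches every language model.

```python
import re

def prepare(query):
    # Simplified TextCat-style preprocessing: drop everything that is
    # not a letter (the real preprocessing also normalizes case and
    # whitespace). Digits and punctuation are stripped out.
    return re.sub(r"[\W\d_]+", "", query)

def should_classify(query, min_length=1):
    # Check the length AFTER stripping, so a query like a DOI fragment
    # (all digits) reduces to "" and is skipped rather than scoring a
    # perfect 0 against every model.
    return len(prepare(query)) >= min_length
```

Under this scheme, `should_classify("14786430410001720363")` is false, since the stripped query is empty.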

I'll fix this and upload another patch, and double check that it doesn't affect my earlier analysis in any meaningful way. (I predict no effect, but I'll make sure.)

Change 323998 merged by jenkins-bot:
Allow specification of minimum length for classification

https://gerrit.wikimedia.org/r/323998

Deskana renamed this task from "limit min input length for TextCat" to "Add support for limiting min input length for TextCat". Dec 9 2016, 3:27 PM
Deskana subscribed.

I've changed the task title to reflect that this doesn't actually implement a minimum query length, but it does add the infrastructure in the code so that a minimum query length can be added very easily in the future.