Add support for limiting min input length for TextCat
Closed, Resolved, Public

Assigned To

Authored By

	TJones
	Oct 27 2016, 3:49 PM

Description

Evaluate whether we get a useful precision boost from ignoring strings below a certain length. The evaluation may be inconclusive because really short strings (e.g., one character) have already been excluded from the training data, but we may still get some benefit from ignoring short strings, which are more likely to be difficult or ambiguous.

Details

	Subject	Repo	Branch	Lines +/-
	Allow specification of minimum length for classification	wikimedia/textcat	master	+72 -9


Related Objects

Status	Assigned	Task
Open	None	T118278 [EPIC] Improve Language Identification for use in Cirrus Search
Resolved	TJones	T140289 Investigate Improvements and Confidence Measures for TextCat Language Detection
Resolved	TJones	T149318 Add support for limiting min input length for TextCat

Event Timeline

TJones created this task. Oct 27 2016, 3:49 PM

Restricted Application added a project: Discovery-Search. Oct 27 2016, 3:49 PM

Restricted Application added a subscriber: Aklapper.

debt moved this task from needs triage to Up Next on the Discovery-Search board. Oct 27 2016, 8:37 PM

Ugh. I just stumbled across a terrible case of this. Searching for a quotation mark on the Japanese or English Wikipedia suggests a page on Hebrew Wikipedia!

I know why, and it's kinda funny, but this is why we can't have nice things.

whoa!

It makes sense, in hindsight. On enwiki at least, most bits of punctuation only return one or two results, which means language ID gets run on them. Then it's more or less random which language shows up as a match. Many languages have articles on punctuation, so you get results.

This doesn't happen with strings of emoji, which may be identified as some language but get no hits in the other wiki.

One solution, as here, is to refuse to do language ID on short strings. Another is to not return a language for strings that are a poor match for all languages (like strings of emoji), as in T149320.
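The first option, refusing to do language ID on short strings, could be sketched roughly as follows. This is only an illustration in Python, not the wikimedia/textcat code: the function name, the `classify` callback, and the `min_length` parameter are all hypothetical, and it assumes length is counted in characters after trimming whitespace.

```python
from typing import Callable, Optional


def classify_with_min_length(
    query: str,
    classify: Callable[[str], Optional[str]],
    min_length: int = 0,
) -> Optional[str]:
    """Return a language code, or None if the query is too short.

    `classify` stands in for the real language identifier, and
    `min_length` is a hypothetical threshold counted in characters.
    """
    text = query.strip()
    if len(text) < min_length:
        # Refuse to guess on short, likely ambiguous input.
        return None
    return classify(text)
```

With `min_length=0` (the default here) nothing is filtered, so the guard can be present without changing behavior until a threshold is chosen.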

Another option I just thought of is not doing language ID if there is a result that is an exact title or redirect match for the query. That would fix this problem, but could prevent interesting cross-language matches for something that only has one article on the current wiki. I tried to find examples with place names on enwiki, but enwiki is too thorough. Could happen elsewhere, but I don't have an example in hand yet.
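As a rough sketch of that last option: the check below skips language ID when there are too many results or when any result title (or redirect title) exactly matches the query. Everything here is an assumption for illustration, including the function name, the `few_results_threshold` value, and the choice to compare titles case-insensitively; it is not how CirrusSearch actually makes this decision.

```python
from typing import Iterable


def should_run_language_id(
    query: str,
    titles: Iterable[str],
    few_results_threshold: int = 2,
) -> bool:
    """Decide whether to run language ID for a query.

    `titles` holds the titles and redirect titles of the results on the
    current wiki. `few_results_threshold` is a hypothetical cutoff.
    """
    titles = list(titles)
    if len(titles) > few_results_threshold:
        # Enough results locally, so there is no need to look elsewhere.
        return False
    wanted = query.strip().casefold()
    # Skip language ID if any result is an exact title or redirect match.
    return all(title.strip().casefold() != wanted for title in titles)
```

Under these assumptions, a query for a quotation mark that returns an article titled with that same character would not trigger language ID, while a query with one unrelated result still would.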

TJones edited projects, added Discovery-Search (Current work); removed Discovery-Search. Nov 21 2016, 7:23 PM

TJones moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board.

An initial write-up is available. The Perl version of TextCat has been updated; the PHP version still needs the same updates. Gerrit should post the patches here.

Change 323998 had a related patch set uploaded (by Tjones):
Allow specification of minimum length for classification

https://gerrit.wikimedia.org/r/323998

gerritbot added a project: Patch-For-Review. Nov 29 2016, 12:27 AM

TJones moved this task from not in use - please delete to Needs review on the Discovery-Search (Current work) board. Nov 29 2016, 4:33 PM

TJones moved this task from Needs review to not in use - please delete on the Discovery-Search (Current work) board. Nov 30 2016, 7:39 PM

I found another corner case I hadn't considered before: if the string is made up entirely of non-word characters (like a number), then it is reduced to the empty string, which scores a perfect "0" for all language models! Usually ar (Arabic) is returned because it's first alphabetically.
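The corner case can be reproduced with a toy model. The sketch below is a simplified Python stand-in for TextCat, not its real code: the tiny language models, the n-gram sizes, the letter-only normalization, and all function names are assumptions made for illustration. It uses an out-of-place rank distance, so an input with no n-grams scores 0 against every model and the alphabetically first language wins the tie.

```python
from collections import Counter
from typing import Dict, List, Optional

# Toy ranked n-gram models; real models are far larger.
TOY_MODELS: Dict[str, List[str]] = {
    "en": ["e", "t", "th", "he", "h"],
    "ar": ["ا", "ل", "ال"],
    "he": ["ו", "י", "ה"],
}


def normalize(text: str) -> str:
    # Keep only letters and spaces (a simplification of real tokenization).
    return "".join(ch for ch in text if ch.isalpha() or ch.isspace()).strip()


def ranked_ngrams(text: str, max_n: int = 2) -> List[str]:
    counts = Counter(
        text[i:i + n] for n in range(1, max_n + 1) for i in range(len(text) - n + 1)
    )
    # Most frequent first; ties broken alphabetically for determinism.
    return [g for g, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def out_of_place(grams: List[str], model: List[str]) -> int:
    ranks = {gram: i for i, gram in enumerate(model)}
    # Unknown n-grams get a fixed maximum penalty.
    return sum(abs(i - ranks[g]) if g in ranks else len(model) for i, g in enumerate(grams))


def classify(text: str, models: Dict[str, List[str]], guard: bool = True) -> Optional[str]:
    cleaned = normalize(text)
    if guard and not cleaned:
        return None  # nothing left to classify, so do not guess
    grams = ranked_ngrams(cleaned)
    scores = {lang: out_of_place(grams, model) for lang, model in models.items()}
    # Iterating in alphabetical order means ties go to the first language.
    return min(sorted(models), key=lambda lang: scores[lang])
```

With `guard=False`, `classify("14786430410001720363", TOY_MODELS)` returns `"ar"` because every model scores 0; with the guard on, it returns `None`.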

As an example, searching enwiki for 14786430410001720363 (part of a DOI number) gets results on arwiki. Ugh.

I'll fix this and upload another patch, and double check that it doesn't affect my earlier analysis in any meaningful way. (I predict no effect, but I'll make sure.)

Change 323998 merged by jenkins-bot:
Allow specification of minimum length for classification

https://gerrit.wikimedia.org/r/323998

TJones moved this task from not in use - please delete to Needs Reporting on the Discovery-Search (Current work) board. Dec 9 2016, 2:45 PM

I've changed the task title to reflect that this doesn't actually implement a minimum query length, but it does add the infrastructure in the code such that a minimum query length can be added very easily in the future.

Deskana closed this task as Resolved. Dec 9 2016, 3:28 PM
