After mulling this over in the back of my mind for several weeks while working on other stuff, the idea has morphed into several ways to improve some of TextCat's shortcomings, as well as ways to report a somewhat meaningful score (which TextCat, by default, isn't really well-equipped to do).
So, I've converted this to an EPIC and I'll add a number of sub-tasks to it.
I've kept the original description below.
-=-=-=-=-=-
Some potential measures of confidence that could be applied to TextCat are laid out in "TextCat and Confidence"; briefly, we could consider the number of language guesses TextCat gives, compare the score of the second-best guess to the best guess, or compare a score to the theoretical worst possible score for a given string.
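To make those measures concrete, here's a minimal sketch (in Python, not the actual TextCat code) of what computing them per query might look like. The `scores` dict, `max_penalty`, and `margin` names are illustrative assumptions, following the usual TextCat convention that lower scores are better:

```
# A minimal sketch of the three candidate confidence measures, assuming
# the usual TextCat convention that lower scores are better (out-of-place
# distance). The `scores` dict, `max_penalty`, and `margin` values are
# illustrative assumptions, not the real TextCat API.

def confidence_measures(scores, text, max_penalty=10000, margin=0.05):
    """Compute the three confidence signals for one query string."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1])
    best_lang, best_score = ranked[0]

    # (1) Number of guesses: languages scoring within `margin` of the
    # best; more near-ties suggests less confidence.
    num_guesses = sum(1 for _, s in ranked if s <= best_score * (1 + margin))

    # (2) Second-best vs. best: a ratio near 1.0 means the top two
    # languages are nearly indistinguishable.
    runner_up_ratio = best_score / ranked[1][1] if len(ranked) > 1 else 0.0

    # (3) Best score vs. the theoretical worst score for this string:
    # if every n-gram were maximally out of place, the score would be
    # roughly (number of n-grams) * max_penalty; string length is used
    # here as a crude proxy for the n-gram count.
    worst_score = max(1, len(text)) * max_penalty
    worst_ratio = best_score / worst_score

    return {
        "best_lang": best_lang,
        "num_guesses": num_guesses,
        "runner_up_ratio": runner_up_ratio,
        "worst_ratio": worst_ratio,
    }
```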
It's possible to test any (or all) of those measures against existing query sets and try to find a threshold that separates better from worse language identification performance.
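A hedged sketch of what that threshold search could look like, assuming a hypothetical labeled set of `(measure_value, guessed_lang, true_lang)` tuples built from one of the corpora below:

```
# A sketch of threshold-finding against a labeled query set: sweep
# candidate cutoffs for one confidence measure and record how accuracy
# trades off against coverage (the fraction of queries still answered).
# `labeled_queries` is a hypothetical non-empty list of
# (measure_value, guessed_lang, true_lang) tuples; this assumes lower
# measure values mean higher confidence (flip the comparison otherwise).

def sweep_thresholds(labeled_queries, cutoffs):
    results = []
    for cutoff in cutoffs:
        kept = [(g, t) for m, g, t in labeled_queries if m <= cutoff]
        coverage = len(kept) / len(labeled_queries)
        accuracy = (sum(1 for g, t in kept if g == t) / len(kept)
                    if kept else 0.0)
        results.append((cutoff, coverage, accuracy))
    return results
```

The interesting cutoffs are the ones where accuracy improves noticeably while coverage doesn't crater; per the implementation notes below, the sweet spot may differ by wiki.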
Possible corpora for testing:
- Trey's language evaluation sets: the corpora are small, but we have a reasonable approximation of "the truth" for each identification.
- Mikhail's TextCat A/B test data: the corpora are probably larger (not sure on the breakdown by wiki), and much more data can be gathered relatively easily without a lot of manual work; evaluation could be based on clicks, which depend not only on language ID, but also on whether there's anything relevant in the interwiki results.
- Randomly selected queries: we could identify them by language, and the number of results from a cross-wiki search could be used as a proxy for language identification quality. Data is effectively unlimited and easy to get, though evaluation would be less direct.
Potential problems include:
- general sparsity of data for evaluation
- skews in the presence of click-worthy results by language (e.g., people searching in Spanish on enwiki get more/better results than people searching in German, even though language detection works equally well in either case)
Implementation notes (note to future selves):
- It seems likely that the confidence threshold should be set on a per-wiki basis, based on the concerns expressed in "TextCat and Confidence" and elsewhere. (Though it's possible that they could all work out to be really similar.)
- The confidence thresholding could happen internally to TextCat, based on a threshold parameter, since TextCat already fails to return a value when too many candidates are likely (by default: more than 5 candidates within 5% of the best score).
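For illustration, here's a rough sketch of how a per-wiki threshold could sit alongside that existing too-many-candidates check. The threshold values, wiki keys, and the choice of confidence measure (second-best/best ratio) are all hypothetical, not the actual TextCat implementation:

```
# A rough sketch of per-wiki thresholding folded into a TextCat-like
# classifier, next to the existing too-many-candidates check. The
# threshold values, wiki keys, and the choice of confidence measure
# (second-best/best ratio) are all hypothetical.

PER_WIKI_THRESHOLDS = {"enwiki": 0.05, "dewiki": 0.03}  # invented numbers
DEFAULT_THRESHOLD = 0.04

def classify(scores, wiki, max_candidates=5, tie_margin=0.05):
    """Return the best language guess, or None when not confident."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1])  # lower = better
    best_lang, best_score = ranked[0]

    # Existing behavior: give up when too many languages are nearly
    # tied with the best guess (by default, more than 5 within 5%).
    near_ties = sum(1 for _, s in ranked if s <= best_score * (1 + tie_margin))
    if near_ties > max_candidates:
        return None

    # Proposed addition: also give up when the confidence measure falls
    # below this wiki's threshold.
    confidence = (1 - best_score / ranked[1][1]) if len(ranked) > 1 else 1.0
    if confidence < PER_WIKI_THRESHOLDS.get(wiki, DEFAULT_THRESHOLD):
        return None

    return best_lang
```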