Investigate Improvements and Confidence Measures for TextCat Language Detection
Open, Normal, Public

Description

After mulling this over in the back of my mind for several weeks while working on other things, this task has morphed into a collection of ways to improve some of TextCat's shortcomings, as well as ways to report a somewhat meaningful score (which, by default, TextCat isn't really well-equipped to do).

So, I've converted this to an EPIC and I'll add a number of sub-tasks to it.

I've kept the original description below.

-=-=-=-=-=-
Some potential measures of confidence that could be applied to TextCat are laid out in "TextCat and Confidence"; briefly, we could consider the number of language guesses TextCat gives, compare the score of the second-best guess to the best guess, or compare a score to the theoretical worst possible score for a given string.
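As a rough illustration of the three measures above, here is a minimal sketch, assuming TextCat-style distance scores where lower is better (as in the out-of-place n-gram measure); the function and field names are illustrative, not TextCat's actual API:

```python
def confidence_signals(scores, worst_score):
    """Derive hypothetical confidence signals from TextCat-style results.

    `scores` maps language -> distance (lower is better); `worst_score`
    is the theoretical worst possible distance for the input string.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1])
    best_lang, best = ranked[0]
    signals = {
        # 1. How many languages TextCat returned as plausible guesses.
        "num_guesses": len(scores),
        # 2. Margin of the runner-up over the best guess (larger = clearer win).
        "second_best_ratio": ranked[1][1] / best if len(ranked) > 1 else float("inf"),
        # 3. How far the best score is from the worst possible score
        #    (closer to 1.0 = much better than the worst case).
        "worst_score_margin": 1 - best / worst_score,
    }
    return best_lang, signals
```

Any of these signals (or a combination) could then be compared against a tuned threshold to decide whether to trust the identification.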

It's possible to test any (or all) of those measures against existing query sets and look for a threshold that separates better from worse language identification performance.

Possible corpora for testing:

  • Trey's language evaluation sets: the corpora are small, but we have a reasonable approximation of "the truth" for each identification.
  • Mikhail's TextCat A/B test data: the corpora are probably larger (not sure of the breakdown by wiki), and much more data can be gathered relatively easily without manual work; evaluation could be based on clicks, which depend not only on language ID but also on whether there's anything relevant in the interwiki results.
  • Randomly selected queries: We could identify them by language, and the number of results from a cross-wiki search could be used as a proxy for language identification quality. Data is effectively unlimited and easy to get, though evaluation would be less direct.

Potential problems include:

  • general sparsity of data for evaluation
  • skews in the presence of click-worthy results by language (e.g., people searching in Spanish on enwiki get more/better results than people searching in German, even though language detection works equally well in either case)

Implementation notes (note to future selves):

  • It seems likely that the confidence threshold should be set on a per-wiki basis, based on the concerns expressed in "TextCat and Confidence" and elsewhere. (Though it's possible that they could all work out to be really similar.)
  • The confidence thresholding could happen internally to TextCat, based on a threshold parameter, since TextCat already fails to return a value when the result is too ambiguous (by default, when more than 5 candidate languages score within 5% of the best).
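The note above about thresholding internally to TextCat could be sketched as follows. This is an assumption-laden mock-up, not TextCat's real implementation: `min_ratio` is a hypothetical new parameter layered on top of the existing ambiguity rule, and the per-wiki tuning would amount to configuring these parameters differently per wiki.

```python
def detect_with_threshold(scores, max_candidates=5, proportion=1.05, min_ratio=1.1):
    """Sketch of confidence thresholding layered onto TextCat's ambiguity rule.

    `scores` maps language -> distance (lower is better). TextCat's default
    behavior is modeled by `max_candidates`/`proportion`: give up when more
    than 5 languages score within 5% of the best. `min_ratio` is the
    hypothetical added threshold: also give up when the runner-up's score
    is too close to the winner's.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1])
    best_lang, best = ranked[0]
    close = [lang for lang, s in ranked if s <= best * proportion]
    if len(close) > max_candidates:
        return None  # existing behavior: too many near-tied candidates
    if len(ranked) > 1 and ranked[1][1] / best < min_ratio:
        return None  # added rule: runner-up too close, low confidence
    return best_lang
```

With this shape, a per-wiki configuration would just pass different `min_ratio` (etc.) values for each wiki, whether or not they end up being similar in practice.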

Related Objects

TJones created this task. Jul 13 2016, 7:34 PM
Restricted Application added subscribers: Zppix, Aklapper. Jul 13 2016, 7:34 PM
debt triaged this task as Normal priority. Jul 14 2016, 10:20 PM
debt moved this task from Needs triage to This Quarter on the Discovery-Search board.

This sounds like something that is very good to do! :)

debt moved this task from This Quarter to Up Next on the Discovery-Search board. Aug 4 2016, 6:30 PM
debt raised the priority of this task from Normal to High.

Let's do this sooner rather than later as it might give good insight to other tickets! :)

debt added a comment. Aug 5 2016, 7:41 PM

comments based on a conversation with @TJones:

It’s straightforward to run a few tests using a few possible confidence measures. We have the language evaluation sets (small, but manually categorized) and @mpopov's A/B Test data (large but categories are only heuristic: clicked/unclicked).
If a confidence measure gives improved results for either case, it’s worth pursuing.
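Finding a usable cutoff from either evaluation set could look something like the sweep below. This is a hedged sketch, not the actual evaluation code: it assumes each labeled example has been reduced to a `(confidence, was_correct)` pair, and scores a threshold by how often "accept when confident" agrees with the label (rejecting a wrong guess counts as a good decision).

```python
def best_threshold(results, thresholds):
    """Pick the threshold that maximizes accept/reject accuracy.

    `results` is a list of (confidence, was_correct) pairs from a
    labeled evaluation corpus; `thresholds` is a list of candidate
    cutoffs to try.
    """
    def accuracy(t):
        # A decision is good when accepting (confidence >= t) matches
        # the guess actually being correct.
        good = sum(1 for conf, ok in results if (conf >= t) == ok)
        return good / len(results)

    return max(thresholds, key=accuracy)
```

The same sweep could be run per wiki to see whether the best thresholds really differ across wikis, as suspected in the implementation notes above.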

TJones claimed this task.
TJones renamed this task from Investigate Confidence Measures for TextCat Language Detection to Investigate Improvements and Confidence Measures for TextCat Language Detection. Oct 27 2016, 3:43 PM
TJones added a project: Epic.
TJones updated the task description.
TJones updated the task description. Jan 18 2017, 9:28 PM
TJones removed TJones as the assignee of this task. Feb 8 2017, 4:26 PM
Deskana moved this task from Current work to Later on the Discovery-Search board. Mar 16 2017, 10:24 PM
Deskana lowered the priority of this task from High to Normal.
Deskana added a subscriber: Deskana.

This task requires @TJones's attention, and given the effort we're devoting to language analyser work, which we've deemed more important, I'm moving this task out of the sprint to reflect that it won't be worked on soon.

mmodell moved this task from Needs triage to Later on the Discovery-Search board.