
Investigate Improvements and Confidence Measures for TextCat Language Detection
Closed, Resolved · Public


While I've been mulling this over in the back of my mind while working on other things for several weeks, it has morphed into several ways to improve some of TextCat's shortcomings, as well as ways to report a somewhat meaningful score (which TextCat by default isn't really well-equipped to do).

So, I've converted this to an EPIC and I'll add a number of sub-tasks to it.

I've kept the original description below.

Some potential measures of confidence that could be applied to TextCat are laid out in "TextCat and Confidence"; briefly, we could consider the number of language guesses TextCat gives, compare the score of the second-best guess to the best guess, or compare a score to the theoretical worst possible score for a given string.

It's possible to test any (or all) of those measures against existing query sets and try to find a threshold for better/worse language identification performance.
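To make the three measures concrete, here's a minimal sketch in Python. It assumes TextCat-style scores where *lower* is better (n-gram rank distance); the function name, score format, and `max_possible_score` parameter are illustrative, not TextCat's actual API.

```python
def confidence_measures(scores, max_possible_score):
    """Compute the three candidate confidence measures discussed above.

    scores: dict mapping language code -> TextCat distance (lower = better).
    max_possible_score: theoretical worst score for the input string.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1])
    best_lang, best = ranked[0]

    # Measure 1: number of language guesses (fewer = more confident).
    num_guesses = len(ranked)

    # Measure 2: how much worse is the runner-up? (bigger ratio = more confident)
    second_ratio = ranked[1][1] / best if len(ranked) > 1 else float('inf')

    # Measure 3: best score relative to the theoretical worst for this string
    # (closer to 0 = more confident).
    worst_ratio = best / max_possible_score

    return {
        'language': best_lang,
        'num_guesses': num_guesses,
        'second_to_best_ratio': second_ratio,
        'best_to_worst_ratio': worst_ratio,
    }

# Example with made-up scores: English wins, German and Dutch trail.
m = confidence_measures({'en': 1200, 'de': 1500, 'nl': 1550},
                        max_possible_score=4000)
```

Finding a usable threshold for any of these would then just be a matter of sweeping values over a labeled query set.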

Possible corpora for testing:

  • Trey's language evaluation sets: the corpora are small, but we have a reasonable approximation of "the truth" for each identification.
  • Mikhail's TextCat A/B test data: the corpora are probably larger (not sure of the breakdown by wiki), and much more data can be obtained relatively easily without a lot of manual work; evaluation could be based on clicks, which depend not only on language ID but also on whether there's anything relevant in the interwiki results.
  • Randomly selected queries: We could identify them by language, and the number of results from a cross-wiki search could be used as a proxy for language identification quality. Data is effectively unlimited and easy to get, though evaluation would be less direct.

Potential problems include:

  • general sparsity of data for evaluation
  • skews in the presence of click-worthy results by language (e.g., people searching in Spanish on enwiki get more/better results than people searching in German, even though language detection works equally well in either case)

Implementation notes (note to future selves):

  • It seems likely that the confidence threshold should be set on a per-wiki basis, based on the concerns expressed in "TextCat and Confidence" and elsewhere. (Though it's possible that they could all work out to be really similar.)
  • The confidence thresholding could happen inside TextCat itself, controlled by a threshold parameter, since TextCat already fails to return a value when there are too many close candidates (by default: more than 5 candidates within 5% of the best score).
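The second note above can be sketched as follows: keep the existing "too many near-tied candidates" check, and add a hypothetical per-wiki threshold on the runner-up's score. Parameter names and defaults here are assumptions for illustration, not TextCat's actual configuration.

```python
def classify(scores, max_candidates=5, margin=0.05, threshold=None):
    """Return the best language, or None if detection is too ambiguous.

    scores: dict of language code -> distance (lower = better).
    max_candidates/margin: models TextCat's existing behavior of failing
        when more than max_candidates languages score within `margin`
        of the best score.
    threshold: optional (e.g. per-wiki) cutoff requiring the second-best
        score to be at least `threshold` times the best score.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1])
    best_lang, best = ranked[0]

    # Existing behavior: too many near-tied candidates -> no answer.
    close = [lang for lang, s in ranked if s <= best * (1 + margin)]
    if len(close) > max_candidates:
        return None

    # Proposed addition: require the runner-up to be clearly worse.
    if threshold is not None and len(ranked) > 1:
        if ranked[1][1] / best < threshold:
            return None

    return best_lang
```

With `threshold=None` this reduces to the ambiguity check TextCat already performs, so a per-wiki threshold could be rolled out incrementally.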

Related Objects

Event Timeline

debt triaged this task as Medium priority. Jul 14 2016, 10:20 PM
debt moved this task from needs triage to This Quarter on the Discovery-Search board.

This sounds like something that is very good to do! :)

debt raised the priority of this task from Medium to High. Aug 4 2016, 6:30 PM
debt moved this task from This Quarter to Up Next on the Discovery-Search board.

Let's do this sooner rather than later as it might give good insight to other tickets! :)

comments based on a conversation with @TJones:

It’s straightforward to run a few tests using a few possible confidence measures. We have the language evaluation sets (small, but manually categorized) and @mpopov's A/B Test data (large but categories are only heuristic: clicked/unclicked).
If a confidence measure gives improved results for either case, it’s worth pursuing.

TJones renamed this task from Investigate Confidence Measures for TextCat Language Detection to Investigate Improvements and Confidence Measures for TextCat Language Detection. Oct 27 2016, 3:43 PM
TJones added a project: Epic.
TJones updated the task description.
TJones removed TJones as the assignee of this task. Feb 8 2017, 4:26 PM
Deskana lowered the priority of this task from High to Medium. Mar 16 2017, 10:24 PM
Deskana moved this task from Current work to search-icebox on the Discovery-Search board.
Deskana added a subscriber: Deskana.

This task requires the attention of @TJones, and given the efforts we're devoting to language analyser work which we've deemed more important, I'm moving this task out of the sprint to reflect the reality of the situation that this won't be worked on soon.

TJones claimed this task.

Closing this: after looking into it a while back, I decided that internal confidence scoring isn't really something TextCat can meaningfully do, and the easy improvements to the quality of TextCat's results have already been made.