After mulling this over in the back of my mind for several weeks while working on other stuff, the idea has morphed into several ways to improve some of TextCat's shortcomings, as well as ways to report a somewhat meaningful score (which TextCat, by default, isn't really well-equipped to do).
So, I've converted this to an EPIC and I'll add a number of sub-tasks to it.
I've kept the original description below.
-=-=-=-=-=-
Some potential measures of confidence that could be applied to TextCat are laid out in "TextCat and Confidence"; briefly, we could consider the number of language guesses TextCat gives, compare the score of the second-best guess to the best guess, or compare a score to the theoretical worst possible score for a given string.
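To make those measures concrete, here's a minimal sketch (in Python, not the actual TextCat code) of what computing them per query might look like. The `scores` dict, `max_penalty`, and `margin` names are illustrative assumptions, following the usual TextCat convention that lower scores are better:

```
# A minimal sketch of the three candidate confidence measures, assuming
# the usual TextCat convention that lower scores are better (out-of-place
# distance). The `scores` dict, `max_penalty`, and `margin` values are
# illustrative assumptions, not the real TextCat API.

def confidence_measures(scores, text, max_penalty=10000, margin=0.05):
    """Compute the three confidence signals for one query string."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1])
    best_lang, best_score = ranked[0]

    # (1) Number of guesses: languages scoring within `margin` of the
    # best; more near-ties suggests less confidence.
    num_guesses = sum(1 for _, s in ranked if s <= best_score * (1 + margin))

    # (2) Second-best vs. best: a ratio near 1.0 means the top two
    # languages are nearly indistinguishable.
    runner_up_ratio = best_score / ranked[1][1] if len(ranked) > 1 else 0.0

    # (3) Best score vs. the theoretical worst score for this string:
    # if every n-gram were maximally out of place, the score would be
    # roughly (number of n-grams) * max_penalty; string length is used
    # here as a crude proxy for the n-gram count.
    worst_score = max(1, len(text)) * max_penalty
    worst_ratio = best_score / worst_score

    return {
        "best_lang": best_lang,
        "num_guesses": num_guesses,
        "runner_up_ratio": runner_up_ratio,
        "worst_ratio": worst_ratio,
    }
```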
It's possible to test any (or all) of those measures against existing query sets and try to find a threshold that separates better from worse language identification performance.
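A hedged sketch of what that threshold search could look like, assuming a hypothetical labeled set of `(measure_value, guessed_lang, true_lang)` tuples built from one of the corpora below:

```
# A sketch of threshold-finding against a labeled query set: sweep
# candidate cutoffs for one confidence measure and record how accuracy
# trades off against coverage (the fraction of queries still answered).
# `labeled_queries` is a hypothetical non-empty list of
# (measure_value, guessed_lang, true_lang) tuples; this assumes lower
# measure values mean higher confidence (flip the comparison otherwise).

def sweep_thresholds(labeled_queries, cutoffs):
    results = []
    for cutoff in cutoffs:
        kept = [(g, t) for m, g, t in labeled_queries if m <= cutoff]
        coverage = len(kept) / len(labeled_queries)
        accuracy = (sum(1 for g, t in kept if g == t) / len(kept)
                    if kept else 0.0)
        results.append((cutoff, coverage, accuracy))
    return results
```

The interesting cutoffs are the ones where accuracy improves noticeably while coverage doesn't crater; per the implementation notes below, the sweet spot may differ by wiki.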
Possible corpora for testing:
- Trey's language evaluation sets: the corpora are small, but we have a reasonable approximation of "the truth" for each identification.
- Mikhail's TextCat A/B test data: the corpora are probably larger (not sure on the breakdown by wiki), and much more data can be gathered relatively easily without a lot of manual work; evaluation could be based on clicks, which depend not only on language ID, but also on whether there's anything relevant in the interwiki results.
- Randomly selected queries: we could identify them by language, and the number of results from a cross-wiki search could be used as a proxy for language identification quality. Data is effectively unlimited and easy to get, though evaluation would be less direct.
Potential problems include:
- general sparsity of data for evaluation
- skews in the presence of click-worthy results by language (e.g., people searching in Spanish on enwiki get more/better results than people searching in German, even though language detection works equally well in either case)
Implementation notes (note to future selves):
- It seems likely that the confidence threshold should be set on a per-wiki basis, based on the concerns expressed in "TextCat and Confidence" and elsewhere. (Though it's possible that they could all work out to be really similar.)
- The confidence thresholding could happen internally to TextCat, based on a threshold parameter, since TextCat already fails to return a value when too many candidates are likely (by default: more than 5 candidates within 5% of the best score).
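For illustration, here's a rough sketch of how a per-wiki threshold could sit alongside that existing too-many-candidates check. The threshold values, wiki keys, and the choice of confidence measure (second-best/best ratio) are all hypothetical, not the actual TextCat implementation:

```
# A rough sketch of per-wiki thresholding folded into a TextCat-like
# classifier, next to the existing too-many-candidates check. The
# threshold values, wiki keys, and the choice of confidence measure
# (second-best/best ratio) are all hypothetical.

PER_WIKI_THRESHOLDS = {"enwiki": 0.05, "dewiki": 0.03}  # invented numbers
DEFAULT_THRESHOLD = 0.04

def classify(scores, wiki, max_candidates=5, tie_margin=0.05):
    """Return the best language guess, or None when not confident."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1])  # lower = better
    best_lang, best_score = ranked[0]

    # Existing behavior: give up when too many languages are nearly
    # tied with the best guess (by default, more than 5 within 5%).
    near_ties = sum(1 for _, s in ranked if s <= best_score * (1 + tie_margin))
    if near_ties > max_candidates:
        return None

    # Proposed addition: also give up when the confidence measure falls
    # below this wiki's threshold.
    confidence = (1 - best_score / ranked[1][1]) if len(ranked) > 1 else 1.0
    if confidence < PER_WIKI_THRESHOLDS.get(wiki, DEFAULT_THRESHOLD):
        return None

    return best_lang
```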