User Story: As a search user, I want to get the same results for cross-language suggestions regardless of the case of the query, because that usually doesn't matter to me.
As noted below, searching for транзистор on English Wikipedia generates Russian cross-language suggestions, while searching for Транзистор does not (they only differ by the case of the first letter).
Language identification via TextCat is currently case-sensitive because the n-gram models were generated without case folding. That is a reasonable modeling choice: word-initial capitals often behave differently from word-final capitals, and some languages, like German, have distinctive capitalization patterns that can help identification.
However, a side effect is that strings that differ only by case can get different detection results, usually in the form of "no result" because one string is "too ambiguous" (i.e., there is more than one viable candidate).
It would be mostly straightforward to case-fold the existing models (merging n-gram counts) to generate case-insensitive models, but we would have to re-evaluate the models' effectiveness.
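To make the "merging n-gram counts" step concrete, here is a minimal Python sketch. It assumes a model can be represented as a plain mapping from n-gram to count; the names `case_fold_model` and `ranked` are hypothetical and are not part of TextCat. It uses `str.lower()` rather than full Unicode case folding so that n-gram lengths are preserved (full folding expands some characters, e.g. "ß" to "ss"); whether that is the right choice for every language would need checking.

```python
from collections import Counter


def case_fold_model(ngram_counts):
    """Merge counts of n-grams that differ only by case.

    `ngram_counts` is an assumed representation: a mapping of n-gram -> count.
    """
    folded = Counter()
    for ngram, count in ngram_counts.items():
        # lower() keeps the n-gram length stable for most scripts
        folded[ngram.lower()] += count
    return folded


def ranked(model):
    """Return (n-gram, count) pairs by descending count, ties broken by n-gram."""
    return sorted(model.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    toy_model = {"Тр": 3, "тр": 5, "ан": 2, "_T": 1, "_t": 4}
    for ngram, count in ranked(case_fold_model(toy_model)):
        print(ngram, count)
```

Because merging changes the relative ranks of n-grams, the folded models are not guaranteed to behave like the originals, which is why their effectiveness has to be re-evaluated.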
Acceptance Criteria:
- Survey of how often differently-cased versions of the same query (original, all lower, all upper, capitalized words) get different language ID results, using the current TextCat params, to get a sense of the scope of the problem.
- A review of any accuracy changes for case-folded TextCat models, using the currently optimized parameters.
- If the problem is large enough and the accuracy of case-folded models drops too much, we need a plan (i.e., a new sub-ticket) to re-optimize the TextCat params for the case-folded models, which are slightly lower-resolution but more consistent.
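The survey in the first criterion could be sketched roughly as follows. This is an illustration only: `detect` is a hypothetical callable that wraps whatever language identifier is under test and returns a language code, or None for "no result"; `case_variants` and `survey` are likewise made-up names, not existing tooling.

```python
def case_variants(query):
    """Build the four casings named in the survey criterion."""
    return {
        "original": query,
        "lower": query.lower(),
        "upper": query.upper(),
        # capitalize() upper-cases the first letter and lower-cases the rest
        "capitalized": " ".join(w.capitalize() for w in query.split(" ")),
    }


def survey(queries, detect):
    """Return the queries whose casings do not all get the same result.

    `detect` is an assumed interface: text -> language code or None.
    """
    inconsistent = []
    for query in queries:
        results = {name: detect(text) for name, text in case_variants(query).items()}
        if len(set(results.values())) > 1:
            inconsistent.append((query, results))
    return inconsistent
```

The share of a query sample that ends up in the returned list gives a rough sense of the scope of the problem. Running the same function against case-folded models should return an empty list as long as every query is lowercased before detection.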
Original Description:
It's an issue I found as I was reporting T270847 :)
If I search the article namespace of the English Wikipedia for "Транзистор", I find zero results in the main screen, and one result in the right-hand sister project sidebar: "транзистор" in the English Wiktionary. The word means "transistor" in several languages that are written in the Cyrillic alphabet, and note that the search string begins with an uppercase Cyrillic letter. The title of the Wiktionary result, which is found, is written with a lowercase letter.
If I search the article namespace of the English Wikipedia for "транзистор", which is the same word, but in all lowercase letters, then I get the same Wiktionary result in the sidebar, and also many results from the Russian Wikipedia (I'd also expect other languages, but that's another issue, T270847).
Searching probably shouldn't be case-sensitive, at least not in a case like this.