
Investigate ways to make language identification case-insensitive
Open, Low, Public

Description

User Story: As a search user, I want to get the same results for cross-language suggestions regardless of the case of the query, because that usually doesn't matter to me.

As noted below, searching for транзистор on English Wikipedia generates Russian cross-language suggestions, while searching for Транзистор does not (they only differ by the case of the first letter).

Language identification via TextCat is currently case-sensitive because the n-gram models were generated without case folding. This makes sense as a model because word-initial caps are different from word-final caps in many cases, and some languages, like German, have different patterns of capitalization that can help identification.

However, a side effect of that is that words that differ only by case can get different detection results—usually in the form of "no result" because one string is "too ambiguous" (i.e., there is more than one viable candidate).

It would be mostly straightforward to case-fold the existing models (merging n-gram counts) to generate case-insensitive models, but we would have to re-evaluate the models' effectiveness.
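The merging step could look roughly like the sketch below. It is an illustration only, not the actual TextCat code: it assumes a model can be loaded as a mapping from n-gram to count, and `case_fold_model` is a hypothetical helper name. It uses `str.lower()` rather than `str.casefold()` to avoid expansions such as ß becoming ss, which would change the n-grams themselves.

```python
from collections import Counter


def case_fold_model(ngram_counts):
    """Merge the counts of n-grams that differ only by case.

    `ngram_counts` is assumed to be a dict of n-gram -> count
    (a hypothetical in-memory form of a language model).
    Returns a dict ordered from most to least frequent n-gram.
    """
    folded = Counter()
    for ngram, count in ngram_counts.items():
        # Lowercase the n-gram and add its count to the merged entry.
        folded[ngram.lower()] += count
    # Re-rank, since merged counts can change the frequency order.
    return dict(folded.most_common())


# Toy counts, invented for illustration.
model = {"Тр": 3, "тр": 5, "ан": 4}
folded_model = case_fold_model(model)
```

With the toy counts above, "Тр" and "тр" merge into a single "тр" entry with a count of 8, which now outranks "ан". Rank changes like this are why the folded models would need re-evaluation.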

Acceptance Criteria:

  • Survey of how often differently-cased versions of the same query (original, all lower, all upper, capitalized words) get different language ID results, using the current TextCat params, to get a sense of the scope of the problem.
  • A review of any accuracy changes for case-folded TextCat models, using the currently optimized parameters.
  • If the problem is large enough and the accuracy of case-folded models drops too much, we need a plan (i.e., a new sub-ticket) to re-optimize the TextCat params for the case-folded and slightly lower-resolution but more consistent models.
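
The survey in the first criterion could be sketched as follows. This is only an outline under stated assumptions: `detect` stands in for whatever function wraps TextCat with the current parameters and returns a language code or None, and both helper names are hypothetical.

```python
def case_variants(query):
    """Return the four casings named in the acceptance criteria."""
    return {
        "original": query,
        "lower": query.lower(),
        "upper": query.upper(),
        # Capitalize each space-separated word; the rest is lowercased.
        "capitalized": " ".join(w.capitalize() for w in query.split(" ")),
    }


def results_differ(query, detect):
    """True if any two casings of `query` get different detection results.

    `detect` is an assumed callable: str -> language code or None.
    """
    results = {detect(variant) for variant in case_variants(query).values()}
    return len(results) > 1


def toy_detect(text):
    # Stand-in detector for illustration: only all-lowercase input
    # gets a result, loosely mimicking the example in this task.
    return "ru" if text == text.lower() else None
```

Running `results_differ` over a query sample and counting the True results would give a rough measure of how often case changes the outcome.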

Original Description:

It's an issue I found as I was reporting T270847 :)

If I search the article namespace of the English Wikipedia for "Транзистор", I find zero results in the main screen, and one result in the right-hand sister project sidebar: "транзистор" in the English Wiktionary. The word means "transistor" in several languages that are written in the Cyrillic alphabet, and note that the search string begins with an uppercase Cyrillic letter. The title of the Wiktionary result, which is found, is written with a lowercase letter.

If I search the article namespace of the English Wikipedia for "транзистор", which is the same word, but in all lowercase letters, then I get the same Wiktionary result in the sidebar, and also many results from the Russian Wikipedia (I'd also expect other languages, but that's another issue, T270847).

Searching probably shouldn't be case-sensitive, at least not in a case like this.

Event Timeline

Removing tag, as "New tasks should be brought to us via Discovery-Search instead, from where they will be triaged." Assuming this is about the CirrusSearch codebase - please add codebase project tags when possible, for those who don't know or care about teams (and please correct the codebase project tag, if wrong). Thanks!

I've addressed this in more detail over in T270847 since a lot of the background and context is the same—see my previous comment there.

The short version is that while search is not case sensitive, language detection is, and this is an example where case made a scoring difference that was right on a meaningful threshold. транзистор is identified as Russian, with nothing else close enough to be ambiguous. Транзистор scores just a little differently, and the Ukrainian score is close enough to the Russian score to count as ambiguous, so no results are shown.
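
To make the threshold effect concrete, here is a rough sketch. It assumes lower scores are better and that any language scoring within a fixed ratio of the best one counts as a competing candidate. The function name, the ratio value, and all scores are invented for illustration and are not the real TextCat settings or output.

```python
def pick_language(scores, ratio=1.05):
    """Return the best-scoring language, or None if the result is ambiguous.

    `scores` maps language code -> score, where lower is better
    (an assumption for this sketch). A result is ambiguous when more
    than one language scores within `ratio` of the best score.
    """
    best = min(scores.values())
    candidates = [lang for lang, s in scores.items() if s <= best * ratio]
    return candidates[0] if len(candidates) == 1 else None


# Invented scores: a small shift moves Ukrainian inside the threshold.
lowercase_scores = {"ru": 1000, "uk": 1100}
capitalized_scores = {"ru": 1000, "uk": 1040}

lowercase_result = pick_language(lowercase_scores)
capitalized_result = pick_language(capitalized_scores)
```

With these made-up numbers the first call picks "ru", while the second returns None because "uk" falls within 5% of the best score.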

If you are reporting this as a bug, then I'd say it is not a bug, it is working as designed, and we can close the ticket. If you'd like to treat this as a feature request to test rebuilding our models as case-insensitive, we could prioritize that and put it in our task queue. It would be below bringing language detection to other Wikipedias. And, of course, it might not be worth doing in the end because of the value of phonotactic hints and German capitalization—though it might well turn out to be a smidge better (and obviously less unclear to end users).

I'm also going to update the TextCat documentation to include more of this information (though a little less technically, probably) for future reference.

@Amire80, is language detection being case-sensitive enough of an explanation, and can we close this ticket? Overall the case-sensitivity is helpful, but there will always be edge cases where the case difference puts different words on different sides of a meaningful threshold. If not, that's fine and I'll update this ticket as a research/feature request.

Thanks again for the detailed explanation!

It's not as fundamentally important as the problem at T270847 because people probably search in lowercase most of the time ;)

Nevertheless, this behavior, in which there are zero results, even though there is an article with exactly this title, including the capitalization, is pretty odd, and I'd consider it a bug, even if it's far from being the most urgent one to fix.

TJones renamed this task from "Cross-wiki searching shows results from Russian with lowercase letters, but not with uppercase letters" to "Investigate ways to make language identification case-insensitive". Jan 5 2021, 8:45 PM
TJones updated the task description.

Making a note here for now regarding: "regardless of the case of the query, because that usually doesn't matter to me".
There are instances in English where cased searching could matter a lot: e.g. cat ~ CAT (scan); mass ~ MASS; garbage ~ Garbage. I'm not sure if/how this shows up in other languages, and what the implications are for cross-linguistic un/cased searching. It seems like at least for English Wikipedia, we choose the most popular/probable uncased search match, else go to disambiguation.

We may or may not need to take this into consideration, depending on how frequent this is and/or how impactful it is to users.

Making a note here for now regarding: "regardless of the case of the query, because that usually doesn't matter to me".
There are instances in English where cased searching could matter a lot: e.g. cat ~ CAT (scan)

Yeah, I thought of that and that's why I said "usually"—though it may not matter in this case. Exact matches on case change ranking, but not the number of results. And we have an algorithm (used in the "Go box" in the upper corner) that chooses when there are competing candidates for title matches.

FRED will give you the "FRED" page. Any other capitalization of fred (Fred, frEd, fReD) will give you the "Fred" page. These can be hard to find: "CAT" is a redirect to the "Cat" disambiguation page. MASS/Mass is a good example, though. And we can't distinguish garbage and Garbage on Wikipedia because almost everything has a capitalized first word, so we have "Garbage (band)".

Case-sensitive examples are much easier to find on Wiktionary, though, because they don't capitalize the first letter and you can usually find English/German cognates that only differ by capitalization, like arm/Arm (and initialism ARM, too, it turns out). Same Go box rules apply: exact matches get you "Arm" or "ARM" and anything else matches the default "arm".
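
The Go box behavior described above can be approximated with the sketch below. It is a simplification based only on the examples in this comment, not the actual implementation: `go_box_match` is a hypothetical name, and the default title is passed in because how the real algorithm chooses among competing candidates is not described here.

```python
def go_box_match(query, titles, default):
    """Pick a title for `query` among titles that differ only by case.

    An exact, case-sensitive match wins. Otherwise any case-insensitive
    match falls back to `default`. Returns None if nothing matches.
    """
    if query in titles:
        return query
    if query.lower() in {title.lower() for title in titles}:
        return default
    return None


# Wiktionary-style example from the comment above.
wiktionary_titles = {"arm", "Arm", "ARM"}
exact = go_box_match("ARM", wiktionary_titles, default="arm")
fallback = go_box_match("aRm", wiktionary_titles, default="arm")
```

Here "ARM" returns the "ARM" entry, while "aRm" has no exact match and falls back to "arm".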

CBogen moved this task from needs triage to Language Stuff on the Discovery-Search board.