Detect "wrong keyboard" queries for Russian/American keyboards on EN/RU Wikipedias
Open, High, Public

Description

Add support to English and/or Russian Wikipedia for detecting and converting queries typed in one language on the other language's keyboard.

Examples:

  • пукьфт сгшышту (Russian phonetic transliteration: "puk'ft sgshyshtu") looks like gibberish; converting from Russian to American keyboard gives german cuisine.
  • 'qatktdf ,fiyz looks like gibberish, but converting from American to Russian keyboard gives эйфелева башня, "Eiffel Tower".
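
For illustration, the two conversions above can be sketched as a plain character-for-character table lookup. This is a minimal Python sketch, not the code used in the experiments. It assumes the standard US QWERTY and Russian ЙЦУКЕН layouts, and the names US_KEYS, RU_KEYS, to_russian, and to_english are made up for the example. Note that э sits on the US apostrophe key, so the Latin-typed form of эйфелева starts with an apostrophe.

```python
# Keys of the US QWERTY layout (unshifted, then shifted), paired position by
# position with the standard Russian layout. Assumption: one common variant of
# each layout; other variants differ in a few keys.
US_KEYS = "`qwertyuiop[]asdfghjkl;'zxcvbnm,./" + '~QWERTYUIOP{}ASDFGHJKL:"ZXCVBNM<>?'
RU_KEYS = "ёйцукенгшщзхъфывапролджэячсмитьбю." + "ЁЙЦУКЕНГШЩЗХЪФЫВАПРОЛДЖЭЯЧСМИТЬБЮ,"

US_TO_RU = str.maketrans(US_KEYS, RU_KEYS)
RU_TO_US = str.maketrans(RU_KEYS, US_KEYS)


def to_russian(text: str) -> str:
    """Re-read text typed on a US layout as if the Russian layout had been active."""
    return text.translate(US_TO_RU)


def to_english(text: str) -> str:
    """Re-read text typed on a Russian layout as if the US layout had been active."""
    return text.translate(RU_TO_US)


if __name__ == "__main__":
    print(to_english("пукьфт сгшышту"))
    print(to_russian("'qatktdf ,fiyz"))
```

Characters that are not in the table (spaces, digits) pass through unchanged.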

More details and examples are here.

We can use TextCat language detection to detect these transliterable gibberish strings.

Additional requirements for successful implementation include (but are not limited to):

  • possibly more data analysis limited to poorly performing queries (the analysis above is on all queries, and so overestimates the cost).
  • more complex interaction with language detection, including paying attention to "second place" language results, filtering results after language detection (see notes with more details and examples above), and having differing behaviors for different languages (e.g., showing cross-wiki results for some languages, doing query rewrites or "did you mean" suggestions for other languages).
  • coming up with a mechanism for dealing with multiple suggestions (e.g., this plus a spelling correction); possibilities include some sort of confidence score from each suggester, hard-coded ordering, or a nice display of multiple suggestions.
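
To make the last point concrete, here is one possible shape for combining a per-suggester confidence score with a hard-coded ordering as a tie-breaker. It is only a sketch under assumptions: the Suggestion class, the suggester names, the 0.5 threshold, and the idea that confidences are comparable across suggesters are all hypothetical and not part of CirrusSearch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Suggestion:
    source: str        # which suggester produced it (hypothetical names)
    query: str         # the rewritten query being proposed
    confidence: float  # assumed comparable across suggesters, 0.0 to 1.0


# Hard-coded ordering, used only to break ties between equal confidences.
PRIORITY = ["wrong_keyboard", "spelling", "language_id"]
RANK = {name: i for i, name in enumerate(PRIORITY)}


def pick_suggestion(candidates: List[Suggestion],
                    min_confidence: float = 0.5) -> Optional[Suggestion]:
    """Return the most confident suggestion, or None if none is confident enough."""
    viable = [c for c in candidates if c.confidence >= min_confidence]
    if not viable:
        return None
    # Higher confidence wins; on a tie, the earlier entry in PRIORITY wins.
    return max(viable, key=lambda c: (c.confidence, -RANK.get(c.source, len(PRIORITY))))
```

Returning the whole sorted list instead of a single winner would cover the "nice display of multiple suggestions" option.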

Related:

TJones created this task. Jun 29 2016, 4:17 PM
debt triaged this task as Normal priority. Jul 1 2016, 4:41 PM
debt moved this task from Needs triage to This Quarter on the Discovery-Search board.
debt added a project: Discovery.
debt added a subscriber: debt. Aug 5 2016, 7:37 PM

Comments based on a conversation with @TJones:

This has a lot of promise. My estimate was that more than 1% of all queries on ruwiki are on the wrong keyboard. That’s a big impact for a single change. I’d like to spend a little more time looking at different keyboard layouts and see if there are any others that are easy to account for, and then figure out the best way to implement this in concert with the more usual method of language ID. Also, should this be a suggestion, or an automatic additional search, or what?

We’d need A/B tests to fully answer these questions once we get it started and in a good place to test.

debt raised the priority of this task from Normal to High. Aug 5 2016, 7:42 PM

The first question here would be: do we always try to detect wrong kbd or only when the search gets no/few results (aka searchTextSecondTry)?

Assuming it's the latter, we already have two avenues we can pursue when we do the second try - we can either go with a suggestion, or go with another language from the language detector. Now we will also have a third avenue. So how do we choose between them? Do we have a fixed order, a configurable order, or choose in some other way?

Or maybe we could make it part of the suggestion? Maybe elastic could suggest it depending on index configs? Is there some facility in ES to do this?

TJones added a comment. Edited Dec 1 2016, 1:27 AM

My experiment with this involved using language identification, so this would involve doing the language ID, seeing that the result is "LatinRussian" (or whatever we call it) and doing something different than searching on another wiki (like making a suggestion).

But as we add more options, we definitely need to figure out how to choose among them.
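
As a rough picture of that flow, the sketch below dispatches on the language ID label: a wrong-keyboard label leads to a suggestion built from the converted query, while an ordinary foreign-language label leads to a search on another wiki. The function name, the action names, and the label-to-language table are hypothetical; only the labels LatinRussian and CyrillicEnglish come from this thread.

```python
from typing import Optional, Tuple

# Wrong-keyboard labels and the language the user presumably meant to type.
WRONG_KEYBOARD_LABELS = {"LatinRussian": "ru", "CyrillicEnglish": "en"}


def plan_second_try(label: Optional[str], wiki_lang: str) -> Tuple[str, Optional[str]]:
    """Decide what to do with a poorly performing query, given its language ID label."""
    if label in WRONG_KEYBOARD_LABELS:
        # Convert the query with the matching keyboard mapping and suggest it.
        return ("suggest_converted_query", WRONG_KEYBOARD_LABELS[label])
    if label is not None and label != wiki_lang:
        # Ordinary language detection: show results from that language's wiki.
        return ("search_other_wiki", label)
    return ("no_action", None)
```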

So I assume we'd have to create some TextCat models for LatinRussian etc., since they don't currently exist?

TJones added a comment. Edited Dec 1 2016, 1:51 AM

I do have models for LatinRussian and CyrillicEnglish. They are pretty easy to generate, actually, since you can just get the mapping from one keyboard to the other and apply it to the original language model.
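
A minimal sketch of that generation step, assuming a model can be represented as a dictionary from n-gram to count (TextCat's actual model file format is not reproduced here). The sample counts are invented, the "_" word-boundary marker is an assumption, and remap_model is a made-up name; the table covers only lowercase letters of the standard Russian and US layouts.

```python
from typing import Dict

# Lowercase Russian letters and the US keys in the same physical positions
# (assumed standard layouts).
RU_LOWER = "йцукенгшщзхъфывапролджэячсмитьбю"
US_LOWER = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
RU_TO_US = str.maketrans(RU_LOWER, US_LOWER)


def remap_model(model: Dict[str, int], table: Dict[int, str]) -> Dict[str, int]:
    """Apply a keyboard mapping to every n-gram of a language model."""
    remapped: Dict[str, int] = {}
    for ngram, count in model.items():
        key = ngram.translate(table)
        # Merge counts in case two n-grams collapse onto the same key.
        remapped[key] = remapped.get(key, 0) + count
    return remapped


# Toy Russian model with invented counts.
russian_model = {"ст": 120, "но": 100, "_в": 80}
latin_russian_model = remap_model(russian_model, RU_TO_US)
```

Characters missing from the table, such as the boundary marker, are left as they are.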

The complication is that there are potentially multiple possible keyboard mappings for both languages. I only spent one day working on this, so I haven't tried to figure out what to do about that. Minor differences in keyboard layouts might not affect detection too much, but it would affect the mapping from the wrong character set back into the right one. A few messed up characters could make the transformed query useless.

I don't think we could distinguish USLatinRussian from UKLatinRussian. We could try both (or even more than two) mappings and then choose the mapping with the terms with the higher frequencies in Elastic, for example.
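
A sketch of what "try several mappings and keep the one with the more frequent terms" could look like. Everything here is an assumption for illustration: the two tables are toy stand-ins that differ in one key and are not the real US and UK layouts, and term_freq is a plain dictionary standing in for a frequency lookup against Elastic.

```python
from typing import Dict, Tuple


def best_conversion(query: str,
                    tables: Dict[str, Dict[int, str]],
                    term_freq: Dict[str, int]) -> Tuple[str, str]:
    """Convert the query with every table and keep the highest-scoring result."""
    def score(text: str) -> int:
        # Sum of known term frequencies; unknown terms count as zero.
        return sum(term_freq.get(word, 0) for word in text.split())

    candidates = {name: query.translate(table) for name, table in tables.items()}
    best = max(candidates, key=lambda name: score(candidates[name]))
    return best, candidates[best]


# Two hypothetical layouts that disagree on which key produces "э".
toy_tables = {
    "layout_a": str.maketrans({"n": "т", "j": "о", "'": "э"}),
    "layout_b": str.maketrans({"n": "т", "j": "о", "#": "э"}),
}
toy_freq = {"это": 5000}  # invented frequency
```

With these toy inputs, best_conversion("'nj", toy_tables, toy_freq) prefers layout_a, because only that table turns the query into a term with a non-zero frequency. The cost concern raised here still applies: each extra mapping means one more set of frequency lookups per query.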

It's a question of whether it makes a difference often enough to try to distinguish between keyboards, how expensive it is to do a freq check to distinguish them, etc. That's all stuff that needs a bit of research.

I think it's OK to start with two models, build the infrastructure around them, and then when it's working we can proceed to add more models and refine these models as we wish. I also think the best way to work with the models is to put them in a separate directory - either in TextCat or Cirrus, not sure - so as not to confuse them with "normal" models. Now that we can do multi-directory configs, it's easy :)

TJones added a comment. Dec 5 2016, 9:18 PM

> But as we add more options, we definitely need to figure out how to choose among them.

I've put together a straw man proposal for how to deal with suggestions, quotes, "wrong keyboard", and language ID, so that we can have a more coordinated conversation about all this. Please direct comments about the bigger picture to the talk page of the link above.

Kaganer added a subscriber: Kaganer. Jul 4 2017, 3:07 PM

Is there any progress on this task?
I have received many requests from users (on social networks) to add this feature to Russian Wikipedia. Today it is a very popular service offered by the leading search engines, but it is not provided by Wikipedia.

TJones updated the task description. Jul 5 2017, 1:52 PM
TJones added a comment. Jul 5 2017, 2:12 PM

> Is there any progress on this task?
> I have received many requests from users (on social networks) to add this feature to Russian Wikipedia. Today it is a very popular service offered by the leading search engines, but it is not provided by Wikipedia.

Our Russian-speaking colleagues on the Discovery team pointed this out, too, and I realized we could probably do it with the language detection we already had (TextCat), so I tested it and posted the results (links in the task description). This ticket was a placeholder to remind us to try to eventually work on it.

We already have the necessary core technology—the ability to detect the wrong keyboard and map from one keyboard to another. However, the infrastructure to support it is more complicated, and that's currently the blocking factor. Stas's comment above highlights the crux of the problem:

> The first question here would be: do we always try to detect wrong kbd or only when the search gets no/few results (aka searchTextSecondTry)?

> Assuming it's the latter, we already have two avenues we can pursue when we do the second try - we can either go with a suggestion, or go with another language from the language detector. Now we will also have a third avenue. So how do we choose between them? Do we have a fixed order, a configurable order, or choose in some other way?

> Or maybe we could make it part of the suggestion? Maybe elastic could suggest it depending on index configs? Is there some facility in ES to do this?

We've got two related tasks that further complicate this:

We could suggest queries without quotes, or automatically run queries without quotes if the quoted query gets no results. Do we support one, or both? Is it configurable by wiki? How does it interact with wrong keyboard detection, language detection, and "did you mean" spelling suggestions?

We started highlighting the issues in T156019, but haven't worked on it in a while because it's an ugly mess and other tasks have taken priority each quarter for the last year. We haven't forgotten, but there have been more pressing tasks and this one requires a lot of infrastructure changes that we haven't had the bandwidth to handle.

Many thanks for the explanation!

As an example of acceptable behavior, I would like to point to the gadget https://he.wikipedia.org/wiki/MediaWiki:Gadget-Dwim.js
I have now localized it for Russian - https://ru.wikipedia.org/wiki/User:Kaganer/Gadget-Dwim.js - and I hope that it can be enabled by default.

But this is a palliative, since the gadget works only in the "top right" search field, not in the main search field on Special:Search.

IKhitron added a subscriber: IKhitron.
TJones moved this task from Backlog to In progress on the Discovery-Search (Current work) board.
TJones claimed this task.

I've left some comments on the discussion page for the DWIM Gadget suggesting changes that should improve the performance of the gadget and make it work on the main search field on Special:Search.

More details: Some Russian letters map to punctuation on the US English keyboard, so toLowerCase() doesn't work (it won't map : to ;, for example). I suggested a change to just explicitly map all the characters (upper and lowercase) directly, and not go through a lowercase shortcut. Also, possibly because of a change related to OOUI, neither the Hebrew nor Russian version of DWIM worked on the main search input form on the Special:Search page. Adding another selector to the list of $searchBoxes solves that.
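
To show why the lowercase shortcut breaks, here is a small sketch of the two approaches. It is not the gadget's code: Python stands in for the gadget's JavaScript (str.lower() leaves punctuation unchanged, as toLowerCase() does), the function names are made up, and the tables cover only a handful of keys from the standard US and Russian layouts.

```python
# Lowercase-only table: the "shortcut" approach looks up the lowercased key
# and restores the case afterwards.
LOWER_ONLY = {";": "ж", "'": "э", ",": "б", ".": "ю", "[": "х", "]": "ъ", "a": "ф"}


def convert_via_lowercase(text: str) -> str:
    out = []
    for ch in text:
        mapped = LOWER_ONLY.get(ch.lower(), ch)
        out.append(mapped.upper() if ch.isupper() else mapped)
    return "".join(out)


# Explicit table: shifted keys are listed directly, so ":" maps to "Ж"
# without relying on any case conversion.
EXPLICIT = dict(LOWER_ONLY)
EXPLICIT.update({":": "Ж", '"': "Э", "<": "Б", ">": "Ю", "{": "Х", "}": "Ъ", "A": "Ф"})


def convert_explicit(text: str) -> str:
    return "".join(EXPLICIT.get(ch, ch) for ch in text)
```

Letters survive the shortcut because "A" lowercases to "a", but ":" has no lowercase form, so convert_via_lowercase(":") returns ":" unchanged while convert_explicit(":") returns "Ж".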

I don't have permissions to make any changes to the gadget, but I've contacted the user who enabled it and the user who ported it from Hebrew to Russian (who turns out to be @Kaganer, who is already subscribed to this ticket!).

I'm still looking at detecting wrong keyboards after the search has been issued and making suggestions at that time.