Detect "wrong keyboard" queries for Russian/American keyboards on EN/RU Wikipedias
Open, Medium, Public

Description

Add support to English and/or Russian Wikipedia for detecting and converting queries typed in one language on the other language's keyboard.

Examples:

  • пукьфт сгшышту (Russian phonetic transliteration: "puk'ft sgshyshtu") looks like gibberish; converting from Russian to American keyboard gives german cuisine.
  • 'qatktdf ,fiyz looks like gibberish, but converting from American to Russian keyboard gives эйфелева башня, "Eiffel Tower".

More details and examples are here.
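
As a rough illustration of the conversion in the examples above, here is a minimal Python sketch. It assumes the standard US QWERTY and standard Russian ЙЦУКЕН layouts, covers only the letter keys and their shifted forms, and ignores digit-row differences. The names US_KEYS, RU_KEYS, us_to_ru, and ru_to_us are made up for this sketch and are not taken from any existing code.

```python
# Characters produced by the same physical keys, listed in the same order for
# both layouts (assumed: standard US QWERTY and standard Russian ЙЦУКЕН).
# First the unshifted keys, then the shifted ones.
US_KEYS = "`qwertyuiop[]asdfghjkl;'zxcvbnm,./" + '~QWERTYUIOP{}ASDFGHJKL:"ZXCVBNM<>?'
RU_KEYS = "ёйцукенгшщзхъфывапролджэячсмитьбю." + "ЁЙЦУКЕНГШЩЗХЪФЫВАПРОЛДЖЭЯЧСМИТЬБЮ,"

US_TO_RU = str.maketrans(US_KEYS, RU_KEYS)
RU_TO_US = str.maketrans(RU_KEYS, US_KEYS)


def us_to_ru(text: str) -> str:
    """Reinterpret text typed on a US layout as if typed on a Russian layout."""
    return text.translate(US_TO_RU)


def ru_to_us(text: str) -> str:
    """Reinterpret text typed on a Russian layout as if typed on a US layout."""
    return text.translate(RU_TO_US)


print(ru_to_us("пукьфт сгшышту"))  # german cuisine
print(us_to_ru("'qatktdf ,fiyz"))  # эйфелева башня (the leading ' is the э key)
```

Characters outside the two tables (spaces, digits) pass through unchanged, which is why the space between the two words survives the conversion.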

We can use TextCat language detection to detect these transliteratable gibberish strings.

Additional requirements for successful implementation include (but are not limited to):

  • possibly more data analysis limited to poorly performing queries (the analysis above is on all queries, and so overestimates the cost).
  • more complex interaction with language detection, including paying attention to "second place" language results, filtering results after language detection (see notes with more details and examples above), and having differing behaviors for different languages (e.g., showing cross-wiki results for some languages, doing query rewrites or "did you mean" suggestions for other languages).
  • coming up with a mechanism for dealing with multiple suggestions (e.g., this plus a spelling correction); possibilities include some sort of confidence score from each suggester, hard-coded ordering, or a nice display of multiple suggestions.
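
One way to picture the "confidence score plus hard-coded ordering" option from the last bullet is the sketch below. Everything in it is hypothetical: the Suggestion type, the 0.0 to 1.0 confidence scale, the 0.5 threshold, and the order in FIXED_ORDER are placeholders for whatever the real suggesters would report.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical hard-coded order, used only to break ties between suggesters.
FIXED_ORDER = ["wrong_keyboard", "spelling", "language_id"]


@dataclass
class Suggestion:
    source: str        # which suggester produced this
    query: str         # the rewritten query it proposes
    confidence: float  # assumed to be comparable across suggesters (0.0 to 1.0)


def _order_rank(source: str) -> int:
    """Position in FIXED_ORDER; unknown suggesters sort last."""
    return FIXED_ORDER.index(source) if source in FIXED_ORDER else len(FIXED_ORDER)


def choose_suggestion(candidates: List[Suggestion],
                      min_confidence: float = 0.5) -> Optional[Suggestion]:
    """Pick the most confident suggestion; fall back to FIXED_ORDER on ties."""
    usable = [c for c in candidates if c.confidence >= min_confidence]
    if not usable:
        return None
    return max(usable, key=lambda c: (c.confidence, -_order_rank(c.source)))
```

The hard part this sketch skips is making confidence scores from different suggesters comparable at all, which is one reason a plain fixed order or a display of several suggestions may be easier to ship first.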

Event Timeline

debt triaged this task as Medium priority. Jul 1 2016, 4:41 PM
debt moved this task from needs triage to This Quarter on the Discovery-Search board.
debt added a project: Discovery-ARCHIVED.

Comments based on a conversation with @TJones:

This has a lot of promise. My estimate was that more than 1% of all queries on ruwiki are on the wrong keyboard. That’s a big impact for a single change. I’d like to spend a little more time looking at different keyboard layouts and see if there are any others that are easy to account for, and then figure out the best way to implement this in concert with the more usual method of language ID. Also, should this be a suggestion, or an automatic additional search, or what?

We’d need A/B tests to fully answer these questions once we get it started and in a good place to test.

debt raised the priority of this task from Medium to High. Aug 5 2016, 7:42 PM

The first question here would be: do we always try to detect wrong kbd or only when the search gets no/few results (aka searchTextSecondTry)?

Assuming it's the latter, we already have two avenues we can pursue when we do the second try: we can either go with the suggestion, or go with the other language from the language detector. Now we will also have a third avenue. So how do we choose between them? Do we use a fixed order, a configurable order, or choose some other way?

Or maybe we could make it part of the suggestion? Maybe elastic could suggest it depending on index configs? Is there some facility in ES to do this?
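
To make the fixed-versus-configurable ordering question concrete, here is a small sketch of a second-try loop driven by a configurable list. It only borrows the idea of a second try after a poor first search; the function name search_with_second_try, the rewriters mapping, the min_hits threshold, and the search callable are all assumptions for illustration and do not reflect how searchTextSecondTry is implemented.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A rewriter takes the original query and returns a new query, or None if it
# has nothing to offer (hypothetical interface).
Rewriter = Callable[[str], Optional[str]]
Search = Callable[[str], List[str]]


def search_with_second_try(query: str,
                           search: Search,
                           rewriters: Dict[str, Rewriter],
                           order: List[str],
                           min_hits: int = 1) -> Tuple[str, List[str], Optional[str]]:
    """Run the query; if it gets too few hits, try rewriters in the given order.

    Returns (query actually used, hits, name of the rewriter used or None).
    """
    hits = search(query)
    if len(hits) >= min_hits:
        return query, hits, None
    for name in order:
        rewritten = rewriters[name](query)
        if rewritten is None or rewritten == query:
            continue
        new_hits = search(rewritten)
        if len(new_hits) >= min_hits:
            return rewritten, new_hits, name
    return query, hits, None
```

With this shape, a "fixed order" is just a hard-coded order list and a "configurable order" is the same list read from per-wiki configuration, so the two options differ in where the list comes from rather than in the control flow.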

My experiment with this involved using language identification, so this would involve doing the language ID, seeing that the result is "LatinRussian" (or whatever we call it) and doing something different than searching on another wiki (like making a suggestion).
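
A sketch of what "doing something different" could look like once language ID returns a wrong-keyboard label. The labels LatinRussian and CyrillicEnglish are the ones mentioned in this thread; the action names and the plan_action function are invented for the example.

```python
from typing import Tuple

# Wrong-keyboard labels mapped to the language the user presumably meant.
WRONG_KEYBOARD_LABELS = {"LatinRussian": "ru", "CyrillicEnglish": "en"}


def plan_action(detected_label: str, wiki_lang: str) -> Tuple[str, str]:
    """Decide what to do with a language ID result (hypothetical action names)."""
    if detected_label in WRONG_KEYBOARD_LABELS:
        # Not a different language: the same query typed on the wrong layout,
        # so convert and suggest instead of searching another wiki.
        return ("suggest_converted_query", WRONG_KEYBOARD_LABELS[detected_label])
    if detected_label != wiki_lang:
        return ("search_other_wiki", detected_label)
    return ("no_action", wiki_lang)
```

For example, plan_action("LatinRussian", "ru") yields a conversion suggestion targeting Russian, while plan_action("fr", "ru") falls through to the usual cross-wiki behavior.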

But as we add more options, we definitely need to figure out how to choose among them.

So I assume we'd have to create some textcat models for LatinRussian etc., they don't currently exist?

I do have models for LatinRussian and CyrillicEnglish. They are pretty easy to generate, actually, since you can just get the mapping from one keyboard to the other and apply it to the original language model.
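
A minimal sketch of that generation step, assuming a language model can be treated as a plain mapping from n-gram to count (the real TextCat model files are not reproduced here) and assuming the standard US QWERTY and Russian ЙЦУКЕН layouts. The function name remap_model is hypothetical.

```python
from collections import Counter
from typing import Dict

# Same physical keys in the same order on both layouts (assumed standard
# layouts, letter keys only, unshifted then shifted).
RU_KEYS = "ёйцукенгшщзхъфывапролджэячсмитьбю." + "ЁЙЦУКЕНГШЩЗХЪФЫВАПРОЛДЖЭЯЧСМИТЬБЮ,"
US_KEYS = "`qwertyuiop[]asdfghjkl;'zxcvbnm,./" + '~QWERTYUIOP{}ASDFGHJKL:"ZXCVBNM<>?'
RU_TO_US = str.maketrans(RU_KEYS, US_KEYS)


def remap_model(model: Dict[str, int], table: Dict[int, str]) -> Dict[str, int]:
    """Apply a keyboard mapping to every n-gram of a model.

    Counts are summed when two n-grams collide after mapping, which can
    happen for characters that exist on both layouts in different places.
    """
    remapped: Counter = Counter()
    for ngram, count in model.items():
        remapped[ngram.translate(table)] += count
    return dict(remapped)


# Toy "Russian" model; a LatinRussian model is the same data seen through
# the Russian-to-US key mapping.
russian = {"ня": 5, "_б": 3}
latin_russian = remap_model(russian, RU_TO_US)
```

Here the n-gram ня becomes yz and _б becomes _, (underscore plus comma), since б sits on the comma key.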

The complication is that there are potentially multiple possible keyboard mappings for both languages. I only spent one day working on this, so I haven't tried to figure out what to do about that. Minor differences in keyboard layouts might not affect detection too much, but it would affect the mapping from the wrong character set back into the right one. A few messed up characters could make the transformed query useless.

I don't think we could distinguish USLatinRussian from UKLatinRussian. We could try both (or even more than two) mappings and then choose the mapping with the terms with the higher frequencies in Elastic, for example.
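
The "try several mappings and keep the one with the more frequent terms" idea might look roughly like this. The term_freq callable stands in for a frequency lookup against Elastic and is stubbed with a dictionary in the usage example; the two layouts are toy mappings that differ in one key, and the "#" assignments in them are invented, not real layout data.

```python
from typing import Callable, Dict, Optional, Tuple


def pick_mapping(query: str,
                 mappings: Dict[str, Dict[int, str]],
                 term_freq: Callable[[str], int]) -> Optional[Tuple[str, str, int]]:
    """Convert the query with each candidate mapping and keep the best one.

    The score is the summed frequency of the converted terms. Returns
    (mapping name, converted query, score), or None if there are no mappings.
    """
    best: Optional[Tuple[str, str, int]] = None
    for name, table in mappings.items():
        converted = query.translate(table)
        score = sum(term_freq(term) for term in converted.split())
        if best is None or score > best[2]:
            best = (name, converted, score)
    return best


# Toy layouts differing only in what "#" produces (made up for illustration).
layouts = {
    "layout_a": str.maketrans({"#": "э", "n": "т", "j": "о"}),
    "layout_b": str.maketrans({"#": "ё", "n": "т", "j": "о"}),
}
stub_freq = {"это": 1000}  # stand-in for a term frequency lookup
result = pick_mapping("#nj", layouts, lambda term: stub_freq.get(term, 0))
```

Each candidate mapping costs one frequency lookup per term, which is the expense the following comment refers to.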

It's a question of whether it makes a difference often enough to try to distinguish between keyboards, how expensive it is to do a freq check to distinguish them, etc. That's all stuff that needs a bit of research.

I think it's OK to start with two models, build the infrastructure around them, and then, once it's working, add more models and refine them as we wish. I also think the best way to work with the models is to put them in a separate directory (either in TextCat or Cirrus, not sure) so they don't get confused with the "normal" models. Now that we can do multi-directory configs, it's easy :)

But as we add more options, we definitely need to figure out how to choose among them.

I've put together a straw man proposal for how to deal with suggestions, quotes, "wrong keyboard", and language ID, and have a more co-ordinated conversation about all this. Please direct comments about the bigger picture to the talk page of the link above.

Is there any progress on this task?
I get many requests from users (on social networks) to add this function to Russian Wikipedia. Today it is a very popular service, provided by the leading search engines, but it is not provided by Wikipedia.

Our Russian-speaking colleagues on the Discovery team pointed this out, too, and I realized we could probably do it with the language detection we already had (TextCat), so I tested it and posted the results (links in the task description). This ticket was a placeholder to remind us to try to eventually work on it.

We already have the necessary core technology—the ability to detect the wrong keyboard and map from one keyboard to another. However, the infrastructure to support it is more complicated, and that's currently the blocking factor. Stas's comment above highlights the crux of the problem:

The first question here would be: do we always try to detect wrong kbd or only when the search gets no/few results (aka searchTextSecondTry)?

Assuming it's the latter, we already have two avenues we can pursue when we do the second try: we can either go with the suggestion, or go with the other language from the language detector. Now we will also have a third avenue. So how do we choose between them? Do we use a fixed order, a configurable order, or choose some other way?

Or maybe we could make it part of the suggestion? Maybe elastic could suggest it depending on index configs? Is there some facility in ES to do this?

We've got two related tasks that further complicate this:

We could suggest queries without quotes, or automatically run queries without quotes if the quoted query gets no results. Do we support one, or both? Is it configurable by wiki? How does it interact with wrong keyboard detection, language detection, and "did you mean" spelling suggestions?
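
The two quote-handling behaviors in question can be sketched side by side. This is only an illustration: the function names, the auto_rerun flag (standing in for a per-wiki setting), and the returned dictionary shape are all hypothetical.

```python
from typing import Callable, Dict, List


def strip_quotes(query: str) -> str:
    """Remove double quotes and normalize whitespace."""
    return " ".join(query.replace('"', " ").split())


def search_with_quote_fallback(query: str,
                               search: Callable[[str], List[str]],
                               auto_rerun: bool = True) -> Dict[str, object]:
    """If a quoted query finds nothing, either rerun it unquoted or suggest that.

    auto_rerun=True runs the unquoted query automatically; auto_rerun=False
    keeps the empty result and only offers the unquoted query as a suggestion.
    """
    hits = search(query)
    if hits or '"' not in query:
        return {"query": query, "hits": hits, "suggestion": None}
    unquoted = strip_quotes(query)
    if auto_rerun:
        return {"query": unquoted, "hits": search(unquoted), "suggestion": None}
    return {"query": query, "hits": hits, "suggestion": unquoted}
```

The sketch deliberately leaves out the interaction with wrong-keyboard detection, language detection, and spelling suggestions, which is the open question raised above.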

We started highlighting the issues in T156019, but haven't worked on it in a while because it's an ugly mess and other tasks have taken priority each quarter for the last year. We haven't forgotten, but there have been more pressing tasks and this one requires a lot of infrastructure changes that we haven't had the bandwidth to handle.

Many thanks for the explanation!

As an example of acceptable behavior, I would like to point to this gadget: https://he.wikipedia.org/wiki/MediaWiki:Gadget-Dwim.js
I have now localized it for the Russian language (https://ru.wikipedia.org/wiki/User:Kaganer/Gadget-Dwim.js), and I hope that it can be enabled by default.

But this is only a palliative, since the gadget works only in the "top right" search field, not in the main search field on Special:Search.

I've left some comments on the discussion page for the DWIM Gadget suggesting changes that should improve the performance of the gadget and make it work on the main search field on Special:Search.

More details: Some Russian letters map to punctuation on the US English keyboard, so toLowerCase() doesn't work (it won't map : to ;, for example). I suggested a change to just explicitly map all the characters (upper and lowercase) directly, and not go through a lowercase shortcut. Also, possibly because of a change related to OOUI, neither the Hebrew nor Russian version of DWIM worked on the main search input form on the Special:Search page. Adding another selector to the list of $searchBoxes solves that.
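
The lowercase problem is easy to reproduce outside the gadget. The sketch below is not the gadget's code (which is JavaScript); it is a Python illustration in which str.lower() plays the role of toLowerCase(), assuming the standard US QWERTY and Russian ЙЦУКЕН layouts. The function names are made up.

```python
LOWER_US = "`qwertyuiop[]asdfghjkl;'zxcvbnm,./"
LOWER_RU = "ёйцукенгшщзхъфывапролджэячсмитьбю."
UPPER_US = '~QWERTYUIOP{}ASDFGHJKL:"ZXCVBNM<>?'
UPPER_RU = "ЁЙЦУКЕНГШЩЗХЪФЫВАПРОЛДЖЭЯЧСМИТЬБЮ,"

LOWER_MAP = dict(zip(LOWER_US, LOWER_RU))
FULL_MAP = dict(zip(LOWER_US + UPPER_US, LOWER_RU + UPPER_RU))


def convert_via_lowercase(text: str) -> str:
    """Shortcut: look up the lowercased character, then restore the case.

    Breaks for shifted punctuation, because ':'.lower() is still ':' and
    never becomes ';', so the key for Ж is not found.
    """
    out = []
    for ch in text:
        mapped = LOWER_MAP.get(ch.lower(), ch)
        out.append(mapped.upper() if ch.isupper() else mapped)
    return "".join(out)


def convert_explicit(text: str) -> str:
    """Explicit table with both unshifted and shifted characters."""
    return "".join(FULL_MAP.get(ch, ch) for ch in text)


print(convert_via_lowercase(":bpym"))  # :изнь
print(convert_explicit(":bpym"))       # Жизнь
```

Both functions agree on ordinary capital letters (Vjcrdf becomes Москва either way); they differ only where the shifted key is punctuation on the US layout.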

I don't have permissions to make any changes to the gadget, but I've contacted the user who enabled it and the user who ported it from Hebrew to Russian (who turns out to be @Kaganer, who is already subscribed to this ticket!).

I'm still looking at detecting wrong keyboards after the search has been issued and making suggestions at that time.

The Russian DWIM gadget has been updated! It works better with capital letters, and works on the main search input on the Special:Search page.

Update: We had to revert the change that enabled it on the main search input on the Special:Search page. It didn't replace the existing suggestions; it added another suggestion box on top of them. Something to do with OOUI that I couldn't unravel.

I finished my write up on MediaWiki about optimizing the TextCat params for wrong-keyboard and wrong-encoding detection, and finalizing the design decisions for the implementation.

TJones moved this task from Tech Debt/Misc to Language Stuff on the Discovery-Search board.

Removing this from current work and moving it to the "Language Stuff" backlog. I'm the only one who could work on this this quarter, and I'm a bit out of my depth with the integration. We'll reprioritize this for future work when we can assign a slightly larger team (≥2 people) to work on it.

TJones lowered the priority of this task from High to Medium. Apr 26 2019, 5:28 PM