Detect "wrong keyboard" queries for Russian/American keyboards on EN/RU Wikipedias
Open, Medium, Public
Assigned To

None

Authored By

	TJones
	Jun 29 2016, 4:17 PM

Description

Add support to English and/or Russian Wikipedia for detecting and converting queries typed in one language on the other language's keyboard.

Examples:

пукьфт сгшышту (Russian phonetic transliteration: "puk'ft sgshyshtu") looks like gibberish; converting from Russian to American keyboard gives german cuisine.
'qatktdf ,fiyz looks like gibberish, but converting from American to Russian keyboard gives эйфелева башня, "Eiffel Tower".
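The two conversions above can be reproduced with a plain character-for-character mapping between physical keys. The following is a minimal Python sketch, assuming the standard US QWERTY and Russian ЙЦУКЕН layouts; the function names are made up for illustration, and other layout variants differ on a few keys.

```python
# Characters on the same physical keys, unshifted and shifted.
# Assumption: standard US QWERTY and Russian ЙЦУКЕН layouts.
US_LOWER = "`qwertyuiop[]asdfghjkl;'zxcvbnm,./"
RU_LOWER = "ёйцукенгшщзхъфывапролджэячсмитьбю."
US_UPPER = '~QWERTYUIOP{}ASDFGHJKL:"ZXCVBNM<>?'
RU_UPPER = "ЁЙЦУКЕНГШЩЗХЪФЫВАПРОЛДЖЭЯЧСМИТЬБЮ,"

US_TO_RU = str.maketrans(US_LOWER + US_UPPER, RU_LOWER + RU_UPPER)
RU_TO_US = str.maketrans(RU_LOWER + RU_UPPER, US_LOWER + US_UPPER)


def us_to_ru(text: str) -> str:
    """Reinterpret text typed on a US keyboard as Russian-layout keystrokes."""
    return text.translate(US_TO_RU)


def ru_to_us(text: str) -> str:
    """Reinterpret text typed on a Russian keyboard as US-layout keystrokes."""
    return text.translate(RU_TO_US)


print(ru_to_us("пукьфт сгшышту"))
# Note: э sits on the apostrophe key, so the typed string starts with "'".
print(us_to_ru("'qatktdf ,fiyz"))
```

Characters without an entry in the table (spaces, digits) pass through unchanged.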

More details and examples are here.

We can use TextCat language detection to detect these transliteratable gibberish strings.

Additional requirements for successful implementation include (but are not limited to):

possibly more data analysis limited to poorly performing queries (the analysis above is on all queries, and so overestimates the cost).
more complex interaction with language detection, including paying attention to "second place" language results, filtering results after language detection (see notes with more details and examples above), and having differing behaviors for different languages (i.e., showing cross-wiki results for some languages, doing query rewrites or "did you mean" suggestions for other languages).
coming up with a mechanism for dealing with multiple suggestions (e.g., this plus a spelling correction); possibilities include some sort of confidence score from each suggester, hard-coded ordering, or a nice display of multiple suggestions.
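The first two possibilities in the last item (a confidence score from each suggester versus a hard-coded ordering) can be pictured with a minimal Python sketch. The `Suggestion` type, the suggester names, and the confidence values are all hypothetical; nothing here reflects how CirrusSearch actually combines suggestions.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Suggestion:
    source: str        # hypothetical suggester name, e.g. "wrong_keyboard"
    query: str         # the rewritten query being suggested
    confidence: float  # hypothetical score in [0, 1] reported by the suggester


def pick_by_confidence(suggestions: List[Suggestion]) -> Optional[Suggestion]:
    """Option 1: trust each suggester's own confidence score."""
    return max(suggestions, key=lambda s: s.confidence, default=None)


def pick_by_order(suggestions: List[Suggestion],
                  order: Sequence[str]) -> Optional[Suggestion]:
    """Option 2: hard-coded (or per-wiki configured) priority of suggesters."""
    for source in order:
        for s in suggestions:
            if s.source == source:
                return s
    return None


candidates = [
    Suggestion("spelling", "pukeft", 0.35),
    Suggestion("wrong_keyboard", "german cuisine", 0.90),
]
print(pick_by_confidence(candidates).query)
print(pick_by_order(candidates, ["spelling", "wrong_keyboard"]).query)
```

The two strategies can disagree on the same input, which is the core of the design question: confidence scores need to be comparable across suggesters, while a fixed order is predictable but ignores how good each suggestion is.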

Likely prerequisite: T156019: Develop plan for dealing with numerous second-try searches, aka "So Many Search Options"
Similar task: T155104: Detect "wrong keyboard" queries for Hebrew/American keyboards on EN/HE Wikipedias

Related Objects

Status	Assigned	Task
Open	None	T118278 [EPIC] Improve Language Identification for use in Cirrus Search
Open	None	T138958 Detect "wrong keyboard" queries for Russian/American keyboards on EN/RU Wikipedias
Resolved	TJones	T213931 Update TextCat with wrong-keyboard models
Declined	TJones	T213935 Revert changes to TextCat that add dependency on autoload.php
Resolved	Smalyshev	T213936 Deploy new version of TextCat
Resolved	TJones	T216083 Update required version of TextCat in CirrusSearch

Event Timeline

TJones created this task. Jun 29 2016, 4:17 PM

Restricted Application added subscribers: Zppix, Aklapper. Jun 29 2016, 4:17 PM

debt triaged this task as Medium priority. Jul 1 2016, 4:41 PM

debt moved this task from needs triage to This Quarter on the Discovery-Search board.

debt added a project: Discovery-ARCHIVED.

TJones added a parent task: T118278: [EPIC] Improve Language Identification for use in Cirrus Search. Aug 3 2016, 8:23 PM

Comments based on a conversation with @TJones:

This has a lot of promise. My estimate was that more than 1% of all queries on ruwiki are on the wrong keyboard. That’s a big impact for a single change. I’d like to spend a little more time looking at different keyboard layouts and see if there are any others that are easy to account for, and then figure out the best way to implement this in concert with the more usual method of language ID. Also, should this be a suggestion, or an automatic additional search, or what?

We’d need A/B tests to fully answer these questions once we get it started and in a good place to test.

debt raised the priority of this task from Medium to High. Aug 5 2016, 7:42 PM

The first question here would be: do we always try to detect wrong kbd or only when the search gets no/few results (aka searchTextSecondTry)?

Assuming it's the latter, we already have two avenues we can pursue when we do the second try: we can either go with the suggestion, or go with the other language from the language detector. Now we will also have a third avenue. So how do we choose between them? Do we have a fixed order, a configurable order, or choose in some other way?

Or maybe we could make it part of the suggestion? Maybe elastic could suggest it depending on index configs? Is there some facility in ES to do this?

My experiment with this involved using language identification, so this would involve doing the language ID, seeing that the result is "LatinRussian" (or whatever we call it) and doing something different than searching on another wiki (like making a suggestion).
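A rough Python sketch of that branching, assuming the language detector returns a single label and that wrong-keyboard models get labels such as "LatinRussian" and "CyrillicEnglish". The function name, the action names, and the mapping names are hypothetical, not part of TextCat or CirrusSearch.

```python
from typing import Optional, Tuple

# Hypothetical labels for wrong-keyboard models, paired with the keyboard
# mapping that would undo the mistake. Names are illustrative only.
WRONG_KEYBOARD_LABELS = {
    "LatinRussian": "us_to_ru",     # Russian typed on a US keyboard
    "CyrillicEnglish": "ru_to_us",  # English typed on a Russian keyboard
}


def plan_second_try(detected: Optional[str],
                    wiki_lang: str) -> Tuple[str, Optional[str]]:
    """Decide what to do with a language ID result for a poorly performing query."""
    if detected is None:
        return ("none", None)
    if detected in WRONG_KEYBOARD_LABELS:
        # Not a real language: rewrite the query instead of switching wikis.
        return ("suggest_rewrite", WRONG_KEYBOARD_LABELS[detected])
    if detected != wiki_lang:
        # A real language other than the wiki's own: search that wiki.
        return ("cross_wiki_search", detected)
    return ("none", None)


print(plan_second_try("LatinRussian", "en"))
print(plan_second_try("fr", "en"))
```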

But as we add more options, we definitely need to figure out how to choose among them.

So I assume we'd have to create some TextCat models for LatinRussian etc., since they don't currently exist?

TJones mentioned this in T149307: CirrusSearch: Replace double quotes with spaces in queries. Dec 1 2016, 1:38 AM

I do have models for LatinRussian and CyrillicEnglish. They are pretty easy to generate, actually, since you can just get the mapping from one keyboard to the other and apply it to the original language model.
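As a rough illustration of that generation step, here is a Python sketch that applies a keyboard mapping to every n-gram in a model. It assumes a model is simply a dictionary from n-gram to count with `_` as a word-boundary marker; the toy counts and the function name are invented, and the real TextCat model files and tooling may differ.

```python
from typing import Dict

# Unshifted keys only, for brevity. Assumption: standard US QWERTY and
# Russian ЙЦУКЕН layouts.
US_LOWER = "`qwertyuiop[]asdfghjkl;'zxcvbnm,./"
RU_LOWER = "ёйцукенгшщзхъфывапролджэячсмитьбю."
RU_TO_US = str.maketrans(RU_LOWER, US_LOWER)


def remap_model(model: Dict[str, int], table: Dict[int, int]) -> Dict[str, int]:
    """Build a wrong-keyboard model by remapping each n-gram of a real model."""
    remapped: Dict[str, int] = {}
    for ngram, count in model.items():
        key = ngram.translate(table)
        # Merge counts in case two n-grams collapse onto the same key.
        remapped[key] = remapped.get(key, 0) + count
    return remapped


# Toy Russian model with invented counts.
russian = {"а": 100, "ст": 40, "_пр": 25}
latin_russian = remap_model(russian, RU_TO_US)
print(latin_russian)
```

The resulting "LatinRussian" model describes what Russian text looks like when typed on a US keyboard, so it can be scored against a query alongside the ordinary language models.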

The complication is that there are potentially multiple possible keyboard mappings for both languages. I only spent one day working on this, so I haven't tried to figure out what to do about that. Minor differences in keyboard layouts might not affect detection too much, but they would affect the mapping from the wrong character set back into the right one. A few messed-up characters could make the transformed query useless.

I don't think we could distinguish USLatinRussian from UKLatinRussian. We could try both (or even more than two) mappings and then choose the mapping with the terms with the higher frequencies in Elastic, for example.

It's a question of whether it makes a difference often enough to try to distinguish between keyboards, how expensive it is to do a freq check to distinguish them, etc. That's all stuff that needs a bit of research.
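The "try several mappings and keep the one with the more frequent terms" idea can be sketched in Python as follows. The frequency table is a stand-in for whatever term statistics Elastic would provide, the counts are invented, and the two mapping tables cover only the keys needed for the example (on a US keyboard `@` is Shift+2, which is `"` in the Russian layout; on a UK keyboard `@` is Shift+', which is `Э`).

```python
from typing import Dict, Tuple

# Letter keys shared by both layouts (only the ones used below).
SHARED = {"n": "т", "j": "о"}
MAPPINGS = {
    "us": str.maketrans({**SHARED, "@": '"'}),
    "uk": str.maketrans({**SHARED, "@": "Э"}),
}

# Hypothetical term frequencies; a real system would query the index.
TERM_FREQ = {"это": 1000, "то": 500}


def score(text: str, freq: Dict[str, int]) -> int:
    """Sum the frequencies of the whitespace-separated terms in text."""
    return sum(freq.get(term.lower(), 0) for term in text.split())


def best_mapping(query: str,
                 mappings: Dict[str, Dict[int, str]],
                 freq: Dict[str, int]) -> Tuple[str, str]:
    """Apply every mapping and keep the candidate with the highest score."""
    candidates = {name: query.translate(table) for name, table in mappings.items()}
    best = max(candidates, key=lambda name: score(candidates[name], freq))
    return best, candidates[best]


print(best_mapping("@nj", MAPPINGS, TERM_FREQ))
```

Each extra mapping costs one more frequency lookup per query, which is the expense the comment above says needs to be weighed.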

I think it's OK to start with two models, build the infrastructure around them, and then when it's working we can proceed to add more models and refine these models as we wish. I also think the best way to work with the models is to put them in a separate directory (either in TextCat or Cirrus, not sure) so as not to confuse them with "normal" models. Now that we can do multi-directory configs, it's easy :)

In T138958#2837032, @TJones wrote:

But as we add more options, we definitely need to figure out how to choose among them.

I've put together a straw man proposal for how to deal with suggestions, quotes, "wrong keyboard", and language ID, so that we can have a more coordinated conversation about all this. Please direct comments about the bigger picture to the talk page of the linked proposal.

TJones mentioned this in T155104: Detect "wrong keyboard" queries for Hebrew/American keyboards on EN/HE Wikipedias. Jan 11 2017, 6:23 PM

TJones updated the task description.

Kaganer subscribed. Jul 4 2017, 3:07 PM

Has there been any progress on this task?
I have had many requests from users (on social networks) to include such a function in Russian Wikipedia. Today it is a very popular service, provided by the leading search engines, but it is not provided by Wikipedia.

TJones updated the task description. Jul 5 2017, 1:52 PM

In T138958#3404964, @Kaganer wrote:

Has there been any progress on this task?
I have had many requests from users (on social networks) to include such a function in Russian Wikipedia. Today it is a very popular service, provided by the leading search engines, but it is not provided by Wikipedia.

Our Russian-speaking colleagues on the Discovery team pointed this out, too, and I realized we could probably do it with the language detection we already had (TextCat), so I tested it and posted the results (links in the task description). This ticket was a placeholder to remind us to try to eventually work on it.

We already have the necessary core technology—the ability to detect the wrong keyboard and map from one keyboard to another. However, the infrastructure to support it is more complicated, and that's currently the blocking factor. Stas's comment above highlights the crux of the problem:

In T138958#2836929, @Smalyshev wrote:

The first question here would be: do we always try to detect wrong kbd or only when the search gets no/few results (aka searchTextSecondTry)?

Assuming it's the latter, we already have two avenues we can pursue when we do the second try: we can either go with the suggestion, or go with the other language from the language detector. Now we will also have a third avenue. So how do we choose between them? Do we have a fixed order, a configurable order, or choose in some other way?

Or maybe we could make it part of the suggestion? Maybe elastic could suggest it depending on index configs? Is there some facility in ES to do this?

We've got two related tasks that further complicate this:

We could suggest queries without quotes, or automatically run queries without quotes if the quoted query gets no results. Do we support one, or both? Is it configurable by wiki? How does it interact with wrong keyboard detection, language detection, and "did you mean" spelling suggestions?

We started highlighting the issues in T156019, but haven't worked on it in a while because it's an ugly mess and other tasks have taken priority each quarter for the last year. We haven't forgotten, but there have been more pressing tasks and this one requires a lot of infrastructure changes that we haven't had the bandwidth to handle.

Many thanks for the explanation!

As an example of acceptable behavior, I would like to point to the gadget https://he.wikipedia.org/wiki/MediaWiki:Gadget-Dwim.js
I have now localized it for the Russian language (https://ru.wikipedia.org/wiki/User:Kaganer/Gadget-Dwim.js) and I hope that it can be enabled by default.

But this is only a palliative, since the gadget works only in the "top right" search field, not in the main search field on Special:Search.

IKhitron awarded a token. Aug 28 2017, 9:30 PM

IKhitron subscribed.

santhosh subscribed. Aug 29 2017, 5:34 AM

TJones moved this task from This Quarter to Tech Debt/Misc on the Discovery-Search board. Oct 24 2017, 5:36 PM

TJones claimed this task. Oct 16 2018, 5:20 PM

TJones edited projects, added Discovery-Search (Current work); removed Discovery-Search.

TJones moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board.

TJones moved this task from not in use - please delete to Incoming on the Discovery-Search (Current work) board. Oct 31 2018, 7:28 PM

TJones moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board. Nov 13 2018, 6:21 PM

I've left some comments on the discussion page for the DWIM Gadget suggesting changes that should improve the performance of the gadget and make it work on the main search field on Special:Search.

More details: Some Russian letters map to punctuation on the US English keyboard, so toLowerCase() doesn't work (it won't map : to ;, for example). I suggested a change to just explicitly map all the characters (upper and lowercase) directly, and not go through a lowercase shortcut. Also, possibly because of a change related to OOUI, neither the Hebrew nor Russian version of DWIM worked on the main search input form on the Special:Search page. Adding another selector to the list of $searchBoxes solves that.
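The lowercase problem can be shown with a small Python sketch. This is not the gadget's code (the gadget is JavaScript); the table covers only a handful of keys, and the function names are made up. On the US keyboard `;` is ж and its shifted form `:` is Ж, but lowercasing `:` does not turn it into `;`.

```python
# Lowercase-only table plus a case shortcut: the approach that breaks.
LOWER_ONLY = {"e": "у", "r": "к", "j": "о", "d": "в", ";": "ж"}


def convert_via_lowercase(text: str) -> str:
    out = []
    for ch in text:
        mapped = LOWER_ONLY.get(ch.lower())
        if mapped is None:
            out.append(ch)              # ':' lands here: ':'.lower() is still ':'
        elif ch != ch.lower():
            out.append(mapped.upper())  # restore case for shifted letters
        else:
            out.append(mapped)
    return "".join(out)


# Explicit table: every character, shifted or not, is listed directly.
EXPLICIT = {**LOWER_ONLY, "E": "У", "R": "К", "J": "О", "D": "В", ":": "Ж"}


def convert_explicit(text: str) -> str:
    return "".join(EXPLICIT.get(ch, ch) for ch in text)


print(convert_via_lowercase(":erjd"))  # the leading ':' is left unconverted
print(convert_explicit(":erjd"))
```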

I don't have permissions to make any changes to the gadget, but I've contacted the user who enabled it and the user who ported it from Hebrew to Russian (who turns out to be @Kaganer, who is already subscribed to this ticket!).

I'm still looking at detecting wrong keyboards after the search has been issued and making suggestions at that time.

The Russian DWIM gadget has been updated! It works better with capital letters, and works on the main search input on the Special:Search page.

Update: We had to revert the change that enabled it on the main search input on the Special:Search page. It didn't replace the existing suggestions; it added another suggestion box on top of them. Something to do with OOUI that I couldn't unravel.

debt closed subtask T213931: Update TextCat with wrong-keyboard models as Resolved. Jan 18 2019, 7:06 PM

I finished my write-up on MediaWiki about optimizing the TextCat params for wrong-keyboard and wrong-encoding detection, and finalizing the design decisions for the implementation.

debt moved this task from not in use - please delete to Incoming on the Discovery-Search (Current work) board. Jan 31 2019, 10:54 PM

Liuxinyu970226 added a project: Russian-Sites. Feb 1 2019, 1:44 PM

Base subscribed. Feb 1 2019, 8:45 PM

Removing this from current work and moving it to the "Language Stuff" backlog. I'm the only one who could work on this this quarter, and I'm a bit out of my depth with the integration. We'll reprioritize this for future work when we can assign a slightly larger team (≥2 people) to work on it.

debt closed subtask T213936: Deploy new version of TextCat as Resolved. Feb 15 2019, 6:56 PM

TJones lowered the priority of this task from High to Medium. Apr 26 2019, 5:28 PM

TJones mentioned this in T245677: Reader searches with romanized version of non-Latin script. Feb 26 2020, 8:32 PM

TJones removed TJones as the assignee of this task. Mar 18 2020, 6:30 PM

TJones mentioned this in T262566: Enable DWIM support for Vue.js search. Jan 8 2021, 9:17 PM

Aklapper merged a task: T127003: Inter language script detection in search. May 15 2021, 4:19 PM

Aklapper added subscribers: CKoerner_WMF, StudiesWorld.

TJones mentioned this in T127003: Inter language script detection in search. May 17 2021, 9:51 PM