
Reader searches with romanized version of non-Latin script
Closed, Declined · Public


"As a Reader, I want to search using a Latinized transliteration of my native script, so that I don't have to swap my device's character set to search for pages."

On the Hebrew and Russian wikis, the DWIM gadget detects when a Latin-script query returns results under a given threshold, and if so runs a second search with transliterated characters. The Desktop Web team wants this enabled at the API level instead of in the client, to save an HTTP round trip.
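A rough sketch of the fallback flow described above (the `run_search` and `remap` callables and the threshold value are hypothetical placeholders, not the gadget's actual code):

```python
# Sketch of a DWIM-style second-try search. `run_search` and `remap` are
# hypothetical stand-ins for the real search call and the wrong-keyboard
# character mapping; the threshold of 3 is illustrative.
def search_with_fallback(query, run_search, remap, threshold=3):
    """Search, and retry with remapped characters if results are sparse."""
    results = run_search(query)
    if len(results) < threshold:
        remapped = remap(query)
        if remapped != query:
            second = run_search(remapped)
            if len(second) > len(results):
                # The "Showing results for ..." case.
                return second, remapped
    return results, query
```

Doing this at the API level rather than in the gadget means the second search happens server-side, which is the HTTP round trip the task description wants to save.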

Event Timeline

Feedback from @Anomie

It doesn't seem like it belongs in the API layer. If it's done server-side, I'd say it should be done inside the backend (e.g. CirrusSearch) along with other kinds of query rewriting, much like how search currently does "Showing results for washington. No results found for Washingtxn."

I support what @Anomie said. Did you talk to the Search Platform team about this?

We've (@TJones) talked about this in the past, but it never made it high enough up the priority list. Essentially, the existing language detection code can be repurposed to detect the language of "Hebrew but transliterated to qwerty", after which it can transliterate and run a second-try search (the "Showing results for washington. No results found for Washingtxn") if the first search has poor enough results. There is nothing groundbreaking here, but it would have to be prioritized, as it will take some time to work out properly without simply doubling the query load for certain languages.

Low priority; we'll be disappointed but this isn't 100% necessary to ship.

The approach taken by DWIM is query expansion. This works as long as the transliteration from Latin to the non-Latin script is unambiguous. For cases where the transliteration from Latin script is ambiguous but the transliteration to Latin is clear, putting a transliterated version of the text into the search index, similar to how case folding is done, would be preferable for performance and reliability.
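A minimal sketch of the index-time folding idea: apply the same transliteration normalization to text at index time and to queries at query time, exactly as is done for lowercasing. The tiny Cyrillic-to-Latin fold table here is illustrative only; a real table would be per-language and is not part of this thread.

```python
# Sketch of index-time transliteration folding, analogous to case folding.
# The fold table is a small illustrative Cyrillic -> Latin subset, not a
# real per-language transliteration scheme.
FOLD = str.maketrans({"а": "a", "б": "b", "в": "v", "г": "g", "д": "d",
                      "е": "e", "к": "k", "м": "m", "о": "o", "с": "s",
                      "т": "t"})

def fold(term):
    """Apply the same normalization at index time and at query time."""
    return term.lower().translate(FOLD)

index = {}  # folded term -> original titles

def add_to_index(title):
    index.setdefault(fold(title), []).append(title)

def lookup(query):
    return index.get(fold(query), [])
```

Because the folding happens once at index time, a Latin-script query matches without issuing a second search, which is the performance advantage over query expansion.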

I left comments on the talk page for Core REST API in MediaWiki Epic 1.5 about this almost 2 weeks ago, but I guess no one saw it.

DWIM and transliteration (Latinized or otherwise) are two completely different things.

From the talk page:

DWIM, which is currently installed on Russian and Hebrew Wikipedias (and maybe other projects) catches wrong-keyboard mistakes. If I switch to the Russian keyboard and type "dwim" I get "вцшь", on the Hebrew keyboard it's "ג'ןצ". These are not transliterations, and the output is usually gibberish (but recoverable).
Transliteration is much more difficult: the wrong-keyboard mapping is one-to-one and exact (as long as you commit to a particular pair of keyboards), but transliteration can be much harder, and depends not just on the scripts, but also the languages you are transliterating to and from.
For example, Щедрин is transliterated as Shchedrin in English, Sxedrín in Catalan, Ščedrin in Czech, Sjtjedrin in Danish, Schtschedrin in German, Chtchedrine in French, etc. This can be true for any name with Щ in it. Чайковский, on the other hand, is Tchaikovsky in English, instead of the expected Chaikovsky because we adopted the French spelling for.. uh.. "historical reasons".
Crimean Tatar transliteration is word-specific (and depends in part on what language the word came into the language from) and full of exception cases. This code is based on the same source as the Crimean Tatar transliteration used on
I'm less familiar with the Indic languages (@santhosh knows a ton about them, though), but I believe the transliteration between them is usually/often/sometimes? straightforward. I worry, though, that the transliteration into English or other languages using the Latin alphabet may be variable, as with Cyrillic.
Anyway, it would be good to decide which use case you are supporting (maybe both!)—just don't conflate the two!

Another issue is that DWIM is for the completion suggester (the upper right/left search box on wiki pages) which only matches against titles. Both full-blown support for the wrong keyboard and "using a Latinized transliteration of my native script"—which again are completely different things—are for full-text searching and are much more complicated.

Our original plan (T138958) for full-blown wrong-keyboard support was to use the same technique as we do for language detection when a query returns too few results (less than 3 is our usual threshold). I built (T213931) and deployed (T213936) models but never completed the integration work in CirrusSearch. (I also noticed relatively frequent old win1251 encodings and built a model for that, too.)

I think there are three different use cases that could come under this ticket:

  1. Support for adding DWIM-like wrong-keyboard results to the completion suggester. Obviously, this could be done in the UI, since that's what DWIM does. It would also be possible to build it into the API, passing a threshold and a conversion string (Hebrew DWIM uses "qwertyuiopasdfghjkl;zxcvbnm,./'קראטוןםפשדגכעיחלךףזסבהנמצתץ" and Russian DWIM uses "qwertyuiop[]asdfghjkl;'zxcvbnm,./`QWERTYUIOP{}ASDFGHJKL:\"ZXCVBNM<>?~#^йцукенгшщзхъфывапролджэячсмитьбю.ёЙЦУКЕНГШЩЗХЪФЫВАПРОЛДЖЭЯЧСМИТЬБЮ,Ё№:"). The Russian one is quite long to pass around, so we could have "he" and "ru" options, for example, but that would make it harder for the community to change the mapping if the predominant keyboards change (an unlikely but possible scenario).
  2. Full-blown wrong-keyboard support for our full-text search. We could do something DWIM-like (I called this the "lazy" approach in my write-up because it applies the mapping all the time, and may lead to false positives, especially on short queries). The "aggressive" approach, which I was working on for Russian, uses the language-detection models to only offer suggestions when a wrong-keyboard error is more likely.
  3. Actual transliteration, which is what the main use case here describes. As @daniel said above, this is more like case folding than the others, which are a kind of query expansion. This could be done in some cases in Elasticsearch with filters (we do this already for Serbian—though it is built into the stemmer), but they would have to be configured and tested for each language pair (keep in mind that Latin transliteration is usually not the same into English as it would be into French or into German—just assuming English may not be correct for certain languages/communities). In difficult cases (like Crimean Tatar) building such a transliterator could be a lot of work. I think testing for these things is very important because they can cause odd conflations of words, and can interact with other parts of the language analysis (like stemming) depending on what language support we already have for given languages.
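For the third case, Elasticsearch's analysis-icu plugin provides an `icu_transform` token filter that can apply an ICU transliteration at analysis time. A sketch of what such index settings might look like (the filter and analyzer names here are made up, and ICU's built-in "Cyrillic-Latin" transform follows ICU's own scheme, which, as noted above, may not match the romanization any particular wiki community expects):

```python
# Sketch of Elasticsearch index settings using the analysis-icu plugin's
# icu_transform token filter. Filter/analyzer names are illustrative, and
# ICU's "Cyrillic-Latin" scheme would need per-community evaluation.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "cyrillic_latin_fold": {
                    "type": "icu_transform",
                    "id": "Cyrillic-Latin",
                }
            },
            "analyzer": {
                "translit_folded": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "cyrillic_latin_fold"],
                }
            },
        }
    }
}
```

As the comment above notes, any such filter would have to be configured and tested per language pair, and checked for interactions with existing stemming and analysis chains.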

DWIM and transliteration (Latinized or otherwise) are two completely different things.

Ah! Thanks for pointing this out. In light of that, the task description should be clarified. It's not clear to me which use case is intended.

OK! My understanding was that we were talking about informal mappings of non-Latin languages into ASCII scripts, like the Arabic chat alphabet, Greeklish, or informal romanizations of Cyrillic.

The wrong-keyboard problem is pretty interesting, but feels more like something a UI should handle...

OK! My understanding was that we were talking about informal mappings of non-Latin languages into ASCII scripts, like the Arabic chat alphabet, Greeklish, or informal romanizations of Cyrillic.

The wrong-keyboard problem is pretty interesting, but feels more like something a UI should handle...

I agree: this should either be handled on the client side via query expansion, or in the search index via folding. The API layer seems the wrong place to implement this.

Moving this back into the backlog because the ask is unclear.

I'm kicking this out of the REST API user stories, based on conversations with Desktop and our team. Happy to see it adopted elsewhere.

The description still conflates DWIM and cross-script searching, which are completely different things. We have other tickets for DWIM-like functionality, and cross-script searching is much more complex. I can't triage it with this ambiguity, so I'm closing it.