Enable DWIM support for Vue.js search
Open, High, Public

Description

DWIM is a default gadget on hewiki. When a user performs a search that returns fewer than 10 results, any characters in the query string from the US English and Hebrew keyboards are swapped with their corresponding characters and a new search is run. For example, כרקג ("chrkg", which is probably fairly meaningless) becomes fred and tkpch, (with the comma) becomes אלפבית ("alphabet"). The results of the newly mapped search are appended to any existing results.

The implementation is tiny. See the source code, especially:

Keyboard Mapping
var hes = "qwertyuiopasdfghjkl;zxcvbnm,./'קראטוןםפשדגכעיחלךףזסבהנמצתץ";
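function hebeng( str ) {
  // Swap each character with its counterpart on the other keyboard layout.
  return ( str || '' ).replace( /./g, function( c ) {
    var ic = hes.indexOf( c.toLowerCase() );
    // hes holds 29 US-layout keys followed by their 29 Hebrew-layout counterparts,
    // so a shift of 29 modulo 58 maps either half onto the other.
    // Characters not in the table (ic === -1) pass through unchanged.
    return ic + 1 ? hes.charAt( ( ic + 29 ) % 58 ) : c;
  } );
}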
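
A usage sketch of the mapping above, with the table split into its two 29-character halves so the pairing is easier to see. The names LATIN, HEBREW, TABLE, and swapLayout are illustrative and do not come from the gadget; the characters are the same ones as in hes.

```javascript
// Illustrative names; the gadget itself stores both halves in one string (hes).
const LATIN = "qwertyuiopasdfghjkl;zxcvbnm,.";
const HEBREW = "/'קראטוןםפשדגכעיחלךףזסבהנמצתץ";
const TABLE = LATIN + HEBREW;
const HALF = TABLE.length / 2; // 29

function swapLayout( str ) {
  return Array.from( str || '', ( c ) => {
    const i = TABLE.indexOf( c.toLowerCase() );
    // Characters outside the table are returned unchanged.
    return i === -1 ? c : TABLE.charAt( ( i + HALF ) % TABLE.length );
  } ).join( '' );
}

console.log( swapLayout( 'tkpch,' ) ); // אלפבית
console.log( swapLayout( 'כרקג' ) ); // fred
```

Because the shift is exactly half the table length, applying the function twice returns the original string for any character in the table.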

[Russian Wikipedia has a version of DWIM that can be enabled in user preferences.]

How should we build it?

Here are some possible approaches:

  1. Bake it into the network client (T244287) for MediaWiki-maintained code, initially WVUI. We would then have to handle all languages, or at least Hebrew to start. We already do this for the icons; should we do it for search? This makes the library (or wherever it's implemented) larger and increases the amount of code we formally maintain, but it gives us full control, makes the feature universally available, and doesn't require us to maintain an API for gadgets from the start.
  2. Expose, document, and support an API. A gadget will then have to be written by someone to use said API but not necessarily the Vue.js search team. However, any intentional or unintentional complexity will surely backfire on us as people will hack around it.
  3. Do nothing / wait for backend support. This would be a regression over the current experience for DWIM users so it's not possible. Evan said this was too tricky for the backend to do right now (we asked at All Hands). See T245677 for some exploration notes from the API team.
  4. Something else?

Options 1 and 3 seem mostly straightforward for at least the initial deployment. Option 2 seems more complex, so it gets its own section.

Option 2: expose an API

The API provided by option #2 would need at least the following seams: a) query changed b) fetch complete c) fetch results for arbitrary string and d) append results.

b is the event emitted when any fetch completes. This would let the gadget author know when to issue a request. c is the function that a gadget would call to initiate a subsequent search request. d is an interaction to actually add the results from c to the UI. So, an actual use might look like:

  1. User types "tkpch,".
  2. A search for "tkpch," is requested.
  3. The search is performed, fetch resolves, and the UI is updated with any results.
  4. A "fetch complete" event is dispatched that includes the results.
  5. The gadget is listening for the event and sees the result count for "tkpch," is less than 10. It performs the keyboard mapping and issues a new search request for "אלפבית".
  6. The search for "אלפבית" completes and the gadget appends the result to the UI.
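
The six steps above can be sketched against a stand-in search object. Everything here is hypothetical: createSearchStub, onFetchComplete, fetchResults, append, and userTyped are invented names for seams b, c, and d, and none of them exist in WVUI or Vector. The stub is synchronous for brevity, whereas a real client would be asynchronous.

```javascript
// Hypothetical API: none of these names exist in WVUI or Vector.
function createSearchStub( titles ) {
  const listeners = [];
  const shown = [];
  // (c) fetch results for an arbitrary string; synchronous here for brevity.
  const fetchResults = ( query ) => titles.filter( ( t ) => t.startsWith( query ) );
  return {
    shown,
    fetchResults,
    // (b) subscribe to the "fetch complete" event.
    onFetchComplete: ( fn ) => listeners.push( fn ),
    // (d) append extra results to the UI.
    append: ( results ) => shown.push( ...results ),
    // What the search box does when the user types (steps 1 to 4).
    userTyped( query ) {
      const results = fetchResults( query );
      shown.length = 0;
      shown.push( ...results );
      listeners.forEach( ( fn ) => fn( { query, results } ) );
    }
  };
}

// Same US/Hebrew table as the gadget: 29 keys plus their 29 counterparts.
const TABLE = "qwertyuiopasdfghjkl;zxcvbnm,." + "/'קראטוןםפשדגכעיחלךףזסבהנמצתץ";
const swapLayout = ( str ) => str.replace( /./g, ( c ) => {
  const i = TABLE.indexOf( c.toLowerCase() );
  return i === -1 ? c : TABLE.charAt( ( i + 29 ) % 58 );
} );

// The gadget (steps 5 and 6).
function installDwim( api ) {
  api.onFetchComplete( ( { query, results } ) => {
    const mapped = swapLayout( query );
    if ( results.length < 10 && mapped !== query ) {
      api.append( api.fetchResults( mapped ) );
    }
  } );
}

const api = createSearchStub( [ 'אלפבית', 'אלפבית עברי' ] );
installDwim( api );
api.userTyped( 'tkpch,' );
console.log( api.shown ); // both titles, found via the mapped query
```

Keeping the gadget's own request on fetchResults rather than on the user-facing search path means it does not re-trigger the "fetch complete" event, which avoids a loop.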

Additional notes:

  • Does this require a global Vue.js application instance for authors to hook into? If not, how will they reference the network client, for example?
  • Are search(query: string): void and append(results: RestSearchResponse): void exposed?
  • I assume that the query changed and fetch complete / canceled / failed APIs would dispatch events.

Event Timeline

ovasileva lowered the priority of this task from High to Medium.Sep 10 2020, 5:28 PM

I suggest we decline this.
I think this should remain a gadget and it can continue to be a gadget once the search has landed. This functionality doesn't seem like critical functionality.

If this code belongs anywhere it belongs in the server side. If I search on Google with a typo or a term that returns no search results, Google detects and handles that use case. Likewise I think our backend could detect you are searching in English on a Hebrew wiki and adjust the query terms there if necessary. Pulling in @EBernhardson to see how practical that would be.

Regardless we shouldn't do this in our client. Such code would need to be specific to Hebrew Wikipedia or shipped unnecessarily to other users and is not in our interest to maintain.

Moving to triaged but future. This is currently lower priority when compared to the remainder of the vue.js search work and not considered a blocker for deployment.

@TJones is the most relevant reference, there are probably a few posts in phab from him already detailing this.

Sorry... lots of other things have been happening.. I'll try to give a semi-thoughtful reply tomorrow!

TJones updated the task description.

I've updated the task description to more accurately describe what DWIM does. It is not transliteration or translation, and the previous tomato and matrix examples don't really illustrate it. The mappings are based on keyboards, not alphabets, too, and I believe the Hebrew and Russian DWIM code both assume the US keyboard. (Using the UK or French or German keyboard for the Latin half of the equation would give different results.)

Note that when someone uses the wrong keyboard, the result is usually gibberish. Another example from Russian/US mapping is ,jutvcrfz hfgcjlbz—with the leading comma—which is what you get when you try to type богемская рапсодия ("bohemian rhapsody" in Russian) when your keyboard is set to the US English layout.

I suggest we decline this.
I think this should remain a gadget and it can continue to be a gadget once the search has landed. This functionality doesn't seem like critical functionality.

I wouldn't say that it is critical, but it is very useful for people who have to switch keyboards regularly. Since wikitext and HTML/CSS require the Latin character set, editors on Hebrew, Russian, and Chinese wikis, for example, have to switch keyboards regularly. Looking at the source code for the main page and any random pages for those Wikipedias shows templates and HTML/CSS in the Latin character set.

An analysis that I did a while back showed that a little more than 1% of queries on Russian Wikipedia are wrong-keyboard queries that DWIM could try to fix. 1% of queries is a big deal in search; it's hard to improve that many queries at once in a reasonably mature search system. Since the Russian DWIM is not on by default, these don't get fixed automatically, alas.

(I think it's a good idea to skim Amir's presentation on English as privilege once a year or so—it's easy to forget that things that are easy in English or in the Latin alphabet can be surprisingly more difficult elsewhere.)

Anyway, I would like to suggest that at a minimum, you make sure that Vue.js updates don't make it very difficult or impossible to implement a DWIM gadget. For whatever historical reasons, the search in the upper corner and the search on the Special:Search page use different frameworks. The Special:Search search box uses OOUI and it broke DWIM, so it had to be disabled there (giving users a poorer and different search experience in that search box). A request to fix it and some discussion about how difficult it would be is in T215346. I don't think it ever got fixed.

If this code belongs anywhere it belongs in the server side. If I search on Google with a typo or a term that returns no search results, Google detects and handles that use case. Likewise I think our backend could detect you are searching in English on a Hebrew wiki and adjust the query terms there if necessary. Pulling in @EBernhardson to see how practical that would be.

It's important to note that wrong-keyboard searches are not English or Hebrew or Russian, they are usually gibberish. Consonants and vowels don't line up, and sometimes letters on one keyboard map to punctuation on the other, so it looks like junk. And while Google is by default a reasonable model of decent UX (and what users are likely to expect), our search team has somewhat fewer resources than they do, so we can't always keep up.

That said, I have worked on being able to detect it for Russian, but we've never gotten around to doing the integration. Also, having it as a gadget enables the community to solve the problem, and takes the onus off the search team to know about all such possible mappings (including which keyboards are most likely being used, since UK and US keyboards differ, for example). The Russian DWIM still uses the Hebrew variable names, so you can tell they copied it over, which is no big surprise, since the overlap of Russian and Hebrew speakers is far from rare.

Regardless we shouldn't do this in our client. Such code would need to be specific to Hebrew Wikipedia or shipped unnecessarily to other users and is not in our interest to maintain.

When I originally discussed this with Jan at All Hands—which is how I'm guessing the issue eventually got here—the most important thing I saw was not making DWIM overly difficult to update/re-implement, the way the OOUI changes to Special:Search did. That seemed to morph into T245677, which also conflated DWIM and transliteration. Just not irreparably breaking DWIM would be a good thing for the Hebrew- and Russian-language wiki communities.

Thanks for this information @TJones - it's really useful.

Okay, so now that we're a bit further along here, I can confirm that the DWIM gadget breaks with the new search experience.
I've patched the Hebrew gadget to prepare for this event for now. Similar patches would be needed for the other gadgets Jan identified on https://www.mediawiki.org/wiki/User:JDrewniak_(WMF)/notes/Search_gadget_catalogue#Dwim_gadgets

I'm going to spend some time seeing how this gadget might work with the new code.

Change 649991 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[wvui@master] DWIM support in wvui

https://gerrit.wikimedia.org/r/649991

Currently there is no way to replicate DWIM in the new Vector search interface. Looking closely at what it does, it needs to be able to inspect the return value of the API and to trigger a new query when there are fewer than 10 results.

To make this possible, either:

  1. A hook/global function could be added to allow gadgets to define a function that can correct a query when < 10 results are returned.

OR

  2. The functionality should be baked into wvui.

The code differs depending on the language used. According to https://www.mediawiki.org/wiki/User:JDrewniak_(WMF)/notes/Search_gadget_catalogue#Dwim_gadgets special casing is required for Hebrew, Russian and Arabic. This is around 0.2 kB (see associated patch).
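
A minimal sketch of what the hook in the first option could look like, assuming the search client can be reduced to a function that takes a query and returns a Promise of a result array. The names withQueryCorrection, fakeFetch, and fakeCorrect are hypothetical; this is not the interface that WVUI or Vector actually exposes.

```javascript
// Hypothetical hook: wraps any fetch function with a gadget-supplied corrector.
// fetchFn( query ) is assumed to return a Promise of an array of results.
function withQueryCorrection( fetchFn, correct, minResults = 10 ) {
  return async function ( query ) {
    const results = await fetchFn( query );
    const corrected = correct( query );
    if ( results.length >= minResults || corrected === query ) {
      return results;
    }
    // Append the corrected query's results after the original ones.
    return results.concat( await fetchFn( corrected ) );
  };
}

// Stand-ins for a real client and a real layout mapper.
const titles = [ 'Fred Astaire', 'Frederick the Great' ];
const fakeFetch = async ( q ) => titles.filter( ( t ) => t.toLowerCase().startsWith( q ) );
const fakeCorrect = ( q ) => ( q === 'כרקג' ? 'fred' : q );

const search = withQueryCorrection( fakeFetch, fakeCorrect );
search( 'כרקג' ).then( ( results ) => console.log( results ) );
```

With this shape the language-specific part is only the correct function, so each wiki's gadget could supply its own mapping without touching the search component.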

Change 650560 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[wvui@master] Minimal "Do What I mean" support in wvui

https://gerrit.wikimedia.org/r/650560

Change 650564 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/skins/Vector@master] "Do What I mean" support in wvui

https://gerrit.wikimedia.org/r/650564

Jdlrobson raised the priority of this task from Medium to High.Dec 18 2020, 7:49 PM

Given our deploy to Hebrew I think at least adding support in the core library should be a blocker for deployment. (https://gerrit.wikimedia.org/r/650560)

Change 649991 abandoned by Jdlrobson:
[wvui@master] "Do What I mean" support in wvui

Reason:
Minimal version at https://gerrit.wikimedia.org/r/c/wvui/+/650560

https://gerrit.wikimedia.org/r/649991

Change 651615 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/skins/Vector@master] Gadgets can change the search API

https://gerrit.wikimedia.org/r/651615

Change 650564 abandoned by Jdlrobson:
[mediawiki/skins/Vector@master] "Do What I mean" support in Vector

Reason:
See https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/651615

https://gerrit.wikimedia.org/r/650564

Change 650560 abandoned by Jdlrobson:
[wvui@master] Minimal "Do What I mean" support in WVUI

Reason:
See https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/651615

https://gerrit.wikimedia.org/r/650560

Change 651615 merged by jenkins-bot:
[mediawiki/skins/Vector@master] Gadgets can change the search API

https://gerrit.wikimedia.org/r/651615

@TJones gadgets will be able to replace the API client but tbh that's not ideal and a little fragile - it could easily break with any developments to the search frontend.

In the longer term, what exactly are the blockers for moving this code into the search API ? I audited the gadgets and the only ones in the wild seem to do this for Hebrew, Arabic and Russian with the code in https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/650564/3/resources/skins.vector.search/dwim.js. This doesn't feel like something that belongs in the client TBH

assigning to olga per standup discussion around what we want to do next.

@Jdlrobson & @ovasileva—sorry for the late reply; there was end-of-year holiday-making and then this week has been busy and … distracting.

In the longer term, what exactly are the blockers for moving this code into the search API ? I audited the gadgets and the only ones in the wild seem to do this for Hebrew, Arabic and Russian with the code in https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/650564/3/resources/skins.vector.search/dwim.js. This doesn't feel like something that belongs in the client TBH

We’ve done some work on this kind of thing (T138958 and T155104), but our approach was more involved, less “intrusive”, and with much higher precision. It got put on the back burner for various reasons, including the fact that these gadgets exist. So, while we have looked at addressing the same problem, we have a very different technical approach.

As for whether it should be in the client, I see your point, but that was the only place users could effectively implement it, and it was originally a user-created feature.

Another, much smaller factor is that people could, in theory, modify the gadgets and change the relevant keyboards. Both DWIM gadgets assume American keyboards, if I recall correctly; a user could copy the code and customize it to use a British or French or German keyboard for the Latin half, for example.

My main concern in the short term is that users obviously find this feature worth having, and breaking or removing it feels like a regression. But if there are bigger plans that DWIM is blocking, it will probably have to be sacrificed—though the user communities should be informed, I think. In that case, we can look into re-prioritizing the problem on our side, and possibly building a similar-style solution instead of the one we originally preferred.

https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/650564/3/resources/skins.vector.search/dwim.js.

BTW, I see that this patch was abandoned, but was it implemented anywhere else? I ask because I think there is an error. mapToArabic() should use HES_AR.charAt( ( ic + 33 ) % 66 ) because HES_AR is 66 characters long. In general, that bit of code is (ic + HES_XX.length/2) % HES_XX.length for the relevant value of XX (obviously with either magic numbers as now, or the division outside the loop).
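
The general formula can be sketched as a small factory that derives the shift from the table length, so a 58-character table shifts by 29 and a 66-character table by 33 without magic numbers. makeMapper and the toy table are illustrative only; the real Arabic table is not reproduced here.

```javascript
// Build a swap function from a table whose two halves pair up by position.
function makeMapper( table ) {
  if ( table.length % 2 !== 0 ) {
    throw new Error( 'mapping table must have two equal halves' );
  }
  const half = table.length / 2; // computed once, outside the per-character loop
  return ( str ) => Array.from( str || '', ( c ) => {
    const ic = table.indexOf( c.toLowerCase() );
    return ic === -1 ? c : table.charAt( ( ic + half ) % table.length );
  } ).join( '' );
}

// Toy 6-character table for illustration only (not a real layout):
// a pairs with x, b with y, c with z.
const toy = makeMapper( 'abcxyz' );
console.log( toy( 'cab' ) ); // zxy
console.log( toy( 'zxy' ) ); // cab
```

A shift that is not exactly half the table length pairs characters with the wrong counterparts and is no longer its own inverse, which is the failure mode described for the 66-character table.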

But if there are bigger plans that DWIM is blocking, it will probably have to be sacrificed—

I think the challenge here is that Vue doesn't lend itself very well to the monkey patching approach that was so common in jQuery. Trying to allow gadgets to customize the experience comes at the cost of reducing the stability of the code.

BTW, I see that this patch was abandoned, but was it implemented anywhere else? I ask because I think there is an error. mapToArabic() should use HES_AR.charAt( ( ic + 33 ) % 66 ) because HES_AR is 66 characters long. In general, that bit of code is (ic + HES_XX.length/2) % HES_XX.length for the relevant value of XX (obviously with either magic numbers as now, or the division outside the loop).

This was likely based on the Arabic gadget https://ar.wikipedia.org/wiki/%D9%85%D9%8A%D8%AF%D9%8A%D8%A7%D9%88%D9%8A%D9%83%D9%8A:Gadget-Dwim.js so there's probably an issue there.

This was likely based on the Arabic gadget https://ar.wikipedia.org/wiki/%D9%85%D9%8A%D8%AF%D9%8A%D8%A7%D9%88%D9%8A%D9%83%D9%8A:Gadget-Dwim.js so there's probably an issue there.

I didn't know that existed. Indirectly, it's an argument for getting rid of the gadgets, too—since this will not work very well because it looks to be using the wrong mapping.

Change 641052 had a related patch set uploaded (by Jdlrobson; owner: Catrope):
[mediawiki/core@master] Add wvui (0.0.2-next.2021-01-11-22-44.0)

https://gerrit.wikimedia.org/r/641052

Change 641052 had a related patch set uploaded (by Nray; owner: Catrope):
[mediawiki/core@master] Add wvui (0.0.2-next.2021-01-11-22-44.0)

https://gerrit.wikimedia.org/r/641052

(not very useful comment):

as the original author (and "namer") of DWIM on hewiki, i find this discussion exciting. it will be great to see DWIM behavior integrated into the search, without having to deal with it locally.
i wanted to note in passing, since google searches were mentioned in different posts:
indeed, google does that, and was the inspiration and model for DWIM. the original idea/request came from a user who asked us to emulate google's behavior.

google's dwim is actually much more powerful than that, since it simultaneously looks for results in multiple languages:
type ",jutvcrfz hfgcjlbz", google will offer "богемская рапсодия" (Bohemian Rhapsody in Russian), and if you type "rpxushv cuvnh", it will offer "רפסודיה בוהמית" (Bohemian Rhapsody in Hebrew), without having to tell google "my 2nd preferred language is Russian/Hebrew/Whatever".
indeed, latin alphabet/keyboard is not even part of the deal: typing "תחואהברכז יכעבחךנז" will bring "богемская рапсодия", so it seems that when there are not enough "hits", google tries many "DWIMs" until it hits gold.

this actually _can be done_ server side, and maybe it's the right solution.
one way to do it is to "canonize" the search string (possibly into keycodes, rather than characters), and keep a "canonized" table of article names, so regardless whether the hapless user forgot her keyboard on hebrew, russian, urdu or bulgarian, and regardless which language this wiki uses, we'll get all the hits.

peace

While I also support the view that this should be done on search level instead, maybe this would be useful here:
https://github.com/ai/convert-layout (11 languages, haven’t checked how well it works)

this actually _can be done_ server side, and maybe it's the right solution.
one way to do it is to "canonize" the search string (possibly into keycodes, rather than characters), and keep a "canonized" table of article names, so regardless whether the hapless user forgot her keyboard on hebrew, russian, urdu or bulgarian, and regardless which language this wiki uses, we'll get all the hits.

Interesting idea. For those, like me, who didn't quite follow, there are constant keycodes associated with particular positions on the keyboard. So, on an AZERTY keyboard, the "a" key reports KeyQ, and the "z" key reports KeyW. A hypothetical keyboard with all writing system keys is in this W3C paper as Figure 13. Here's a little webapp that will show you the event.code values. Other keyboards—e.g., Hebrew and Russian—report these values, too. They wouldn't work for other input systems (e.g., some Chinese input systems), but for a lot of alphabets they would.
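
A rough sketch of the canonization idea, using only the three unshifted letter rows of the US and French AZERTY layouts. The code names are KeyboardEvent.code values; CODES, LAYOUTS, and canonize are illustrative names, and a real implementation would need full layouts, shifted keys, and a way to guess which layout the user was on.

```javascript
// Unshifted letter rows only; code names are KeyboardEvent.code values.
const CODES = [
  'KeyQ', 'KeyW', 'KeyE', 'KeyR', 'KeyT', 'KeyY', 'KeyU', 'KeyI', 'KeyO', 'KeyP',
  'KeyA', 'KeyS', 'KeyD', 'KeyF', 'KeyG', 'KeyH', 'KeyJ', 'KeyK', 'KeyL', 'Semicolon',
  'KeyZ', 'KeyX', 'KeyC', 'KeyV', 'KeyB', 'KeyN', 'KeyM', 'Comma', 'Period', 'Slash'
];
// Characters produced by those 30 physical keys, in the same order.
const LAYOUTS = {
  us: 'qwertyuiop' + 'asdfghjkl;' + 'zxcvbnm,./',
  azerty: 'azertyuiop' + 'qsdfghjklm' + 'wxcvbn,;:!'
};

// Convert a string into the sequence of physical keys that produce it.
function canonize( str, layout ) {
  const keys = LAYOUTS[ layout ];
  return Array.from( str.toLowerCase(), ( c ) => {
    const i = keys.indexOf( c );
    return i === -1 ? c : CODES[ i ];
  } );
}

const intended = canonize( 'american', 'us' );
const typed = canonize( 'q,ericqn', 'azerty' );
console.log( intended.join( ' ' ) === typed.join( ' ' ) ); // true
```

Both strings reduce to the same key sequence, so an index keyed on these sequences would match them regardless of which layout was active.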

It would require doubling the index for the completion suggester, which may be a deal-breaker.. but it's an interesting idea.

BTW, @Kipod, do you have a reference for Google using this approach? I tried some test cases, and I'm not convinced that's how they do it. I think it may just be based on user self-corrections—which they may have enough data for. Or it may be a mix of things. There seem to be enough Americans in France typing q,ericqn on an AZERTY keyboard that they suggest q ericqn qirlines.

BTW, @Kipod, do you have a reference for Google using this approach? I tried some test cases, and I'm not convinced that's how they do it. I think it may just be based on user self-corrections—which they may have enough data for. Or it may be a mix of things. There seem to be enough Americans in France typing q,ericqn on an AZERTY keyboard that they suggest q ericqn qirlines.

Some simple end-user-level examples:

I typed הקלהקא ומגקרערםומג while using the Hebrew keyboard layout, which would be "Velvet Underground" if the Latin QWERTY layout was selected. Here's how it looked on Google, even before going to the results page:

I typed ye, gjujlb while using the QWERTY layout, which would be "ну, погоди" if the Russian layout was selected. That's a name of a popular Russian cartoon series:

The results are mostly Russian variants of searches for these cartoons, and even some suggestions for this cartoon with a Hebrew translation. I guess that many people in Israel search for that and Google's algorithms figured it out somehow. (I should also note that if I search for the same thing without the comma, I get much more mixed results.)

Important caveat: I am using an Israeli IP, my chosen language in the Google account is Hebrew, and my browser's Accept-Language is also Hebrew. People with different conditions may get different results.

Something to know: the DWIM gadgets on hewiki and ruwiki use a 1:1 character-to-character mapping, which cannot be applied to all languages. When I typed eogksalsrnr (11 characters) on Google while using the QWERTY layout, which would be "대한민국" (4 characters) if the Korean layout were selected, it showed results for "대한민국". More Korean examples: dkrl → "아기" (2 characters), dkr → "악" (1 character). Suggestions like this are very common in the Korean web ecosystem, and they are not easy to implement, so libraries have been developed for them. Chinese characters are more troublesome still. Typing tosinn (Japanese, actually) on Google shows results for "都心", "トシン" and "東進" at the same time, because these words can all be represented as tosinn.
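
To illustrate why Korean is not a 1:1 mapping, here is a deliberately simplified sketch that turns QWERTY keystrokes into Hangul syllables. It assumes the standard two-set (Dubeolsik) layout and the Unicode syllable composition formula, and it ignores shifted keys, compound vowels, and compound final consonants, so it is not a substitute for the libraries mentioned above. All names are illustrative.

```javascript
// Dubeolsik (two-set) layout, unshifted keys only: QWERTY key -> jamo.
const KEYS = 'qwertyuiopasdfghjklzxcvbnm';
const JAMO = 'ㅂㅈㄷㄱㅅㅛㅕㅑㅐㅔㅁㄴㅇㄹㅎㅗㅓㅏㅣㅋㅌㅊㅍㅠㅜㅡ';
// Jamo in Unicode composition order (index 0 of FINALS means "no final").
const INITIALS = 'ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ';
const MEDIALS = 'ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ';
const FINALS = ' ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ';

function qwertyToHangul( str ) {
  const jamo = Array.from( str, ( c ) => {
    const k = KEYS.indexOf( c );
    return k === -1 ? c : JAMO.charAt( k );
  } );
  const isVowel = ( c ) => c !== undefined && MEDIALS.includes( c );
  let out = '';
  let i = 0;
  while ( i < jamo.length ) {
    const lead = INITIALS.indexOf( jamo[ i ] );
    // Anything that does not start a consonant + vowel pair is copied as-is.
    if ( lead === -1 || !isVowel( jamo[ i + 1 ] ) ) {
      out += jamo[ i ];
      i += 1;
      continue;
    }
    const vowel = MEDIALS.indexOf( jamo[ i + 1 ] );
    i += 2;
    let tail = 0;
    // A consonant closes this syllable only if it does not start the next one.
    const t = i < jamo.length ? FINALS.indexOf( jamo[ i ] ) : -1;
    if ( t > 0 && !isVowel( jamo[ i + 1 ] ) ) {
      tail = t;
      i += 1;
    }
    out += String.fromCharCode( 0xAC00 + ( lead * 21 + vowel ) * 28 + tail );
  }
  return out;
}

console.log( qwertyToHangul( 'eogksalsrnr' ) ); // 대한민국
console.log( qwertyToHangul( 'dkrl' ) ); // 아기
console.log( qwertyToHangul( 'dkr' ) ); // 악
```

The lookahead is what makes this different from the Hebrew and Russian tables: whether r becomes the end of one syllable or the start of the next depends on the key typed after it.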