
Telugu search not transliterating correctly
Open, Needs Triage, Public

Description

Background

First reported in https://www.mediawiki.org/wiki/Talk:Reading/Web/Desktop_Improvements#Tewiki_-_Issue_with_Telugu_typing_in_Search_box
Using Vector 2022 on https://te.wikipedia.org/wiki/%E0%B0%AE%E0%B1%8A%E0%B0%A6%E0%B0%9F%E0%B0%BF_%E0%B0%AA%E0%B1%87%E0%B0%9C%E0%B1%80:

  • typeahead search is not correctly transliterating Roman to Telugu script
  • letters do not appear correctly
Open questions
  • Was this working as expected prior to Desktop Improvements / TypeaheadSearch implementation?

Event Timeline

ldelench_wmf renamed this task from "Telegu seach not transliterating correctly" to "Telugu search not transliterating correctly". Oct 4 2022, 1:43 PM
ldelench_wmf updated the task description.

A few relevant pieces of info from the original description:

  • "The search box does not transliterate the Roman skript to Telugu" -- this suggests the user is using the Windows "Telugu Phonetic" input method specifically, as macOS does not offer Telugu transliteration from Latin script. With this input method you type Latin characters and it provides its own popup with suggested transliterations - you can see this in action here (4:57 in the video)
  • Timing may be implicated somehow: the user mentions: "...as soon as the page is loaded. If I leave the page and come back immediately, then it starts working."
  • "It is not the case in the old Vector (High priority)" - this is a regression from previous search box behavior in older Vector
  • "when I move to to type new word, the first letter is erased, and only the subsequent letters are displayed " - this appears to result from replacing the actual contents of the input field itself for some reason
  • "Some of the typed letters do not appear as they are supposed to. Ex: when I type ree,vee,kee etc., they are supposed to appear as రీ,వీ,కీ. Instead they appear as రి,వి,కి (They are "printed" correctly, but don't appear correctly)." -- by this the user means that ర (ra) + ీ (long i / ī) shows up in the input field as ర (ra) + ి (short i), which is a different vowel.
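
To check which vowel sign actually ends up in the input field, it can help to dump the code units of the field value, since రీ and రి are easy to confuse at small font sizes. A minimal sketch; `codeUnits` is a hypothetical debugging helper and not part of any existing codebase:

```typescript
// Hypothetical debugging helper: list the UTF-16 code units of a string.
// Telugu lives in the Basic Multilingual Plane, so for Telugu text each
// code unit is also exactly one Unicode code point.
function codeUnits(text: string): string[] {
  const units: string[] = [];
  for (let i = 0; i < text.length; i++) {
    const hex = text.charCodeAt(i).toString(16).toUpperCase();
    units.push("U+" + ("0000" + hex).slice(-4));
  }
  return units;
}

const expectedLongI = "\u0C30\u0C40"; // రీ = RA + VOWEL SIGN II (long ī)
const reportedShortI = "\u0C30\u0C3F"; // రి = RA + VOWEL SIGN I (short i)

console.log(codeUnits(expectedLongI).join(" ")); // U+0C30 U+0C40
console.log(codeUnits(reportedShortI).join(" ")); // U+0C30 U+0C3F
```

Running the same helper on the search field's value after typing "ree" would show whether the field really contains U+0C3F, or whether the correct U+0C40 is present and only rendered wrongly.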

A few notes about input methods and key events:

  • Generally speaking with this kind of input, the key thing to listen for here is the compositionend event, described here - this tells you that the transliteration popup is finished transliterating the text from Latin characters to the target input script. Text fields differ on this slightly (Google autocomplete does consult intermediate state), but generally speaking you should not be offering autocomplete suggestions for the intermediate state, as the user has not finished typing yet.

  • keydown is generally disfavored compared to keypress if you are going to steal individual key events, but as noted this is probably itself not a good idea.
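
The compositionend idea can be sketched as a small gate that only releases a query once composition has finished. This is an illustration under the assumption that the input method emits the standard compositionstart / compositionend events reliably, which may not hold for every IME; `SuggestionGate` and `FieldEventKind` are hypothetical names, not existing APIs:

```typescript
// Event names follow the DOM composition and input events.
type FieldEventKind =
  | "compositionstart"
  | "compositionupdate"
  | "compositionend"
  | "input";

// Hypothetical helper: decides when a field value should be looked up.
// Returns the query to fetch suggestions for, or null to do nothing.
class SuggestionGate {
  private composing = false;

  handle(kind: FieldEventKind, value: string): string | null {
    if (kind === "compositionstart") {
      this.composing = true;
      return null;
    }
    if (kind === "compositionend") {
      this.composing = false;
      return value; // composition finished: the value is final
    }
    if (this.composing) {
      return null; // intermediate Latin text, do not search yet
    }
    return kind === "input" ? value : null;
  }
}

// Simulated event sequence for typing "telugu" and accepting a candidate.
const demoGate = new SuggestionGate();
const demoEvents: Array<[FieldEventKind, string]> = [
  ["compositionstart", ""],
  ["input", "t"],
  ["input", "te"],
  ["compositionend", "తెలుగు"],
];
const demoLookups: string[] = [];
demoEvents.forEach(([kind, value]) => {
  const query = demoGate.handle(kind, value);
  if (query !== null) demoLookups.push(query);
});
console.log(demoLookups); // only the committed text is looked up
```

In a browser the `handle` calls would come from listeners on the search input; plain typing without an IME still produces a lookup on every input event.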

Comparison to other input method behavior:

  • Hindi on macOS also uses transliteration in this way, and doesn't seem to have this issue.
  • Chinese input is similar, and also seems to work on macOS - it appears compositionend is respected for these input methods

So - this may be:

  • A Windows-specific issue, and/or
  • Specific to another browser, and/or
  • Perhaps most likely given the description, somehow a timing issue - the user reports that input worked after returning to the page. This is strange, as I'm not sure what would cause the timing behavior described here, aside from potentially stealing keydown or keypress events at the wrong time.

I have not been able to test/verify on a Windows machine myself, but hopefully the above is helpful!

So - some specific thoughts on recommended next steps:

  • Attempt to reproduce on a Windows machine using Telugu Phonetic keyboard layout -- Run through the below test scenarios
  • Stress test with poor network connectivity, high memory/disk usage, and large page content
  • Check with macOS also for comparison's sake, as well as Firefox/Safari
  • Check with both logged out and logged in users
  • Check both legacy Vector and Vector 2022, using "?useskin=vector-2022" in the URL for the latter
  • Also test Hindi and Chinese for reference (see below)
  • If we are able to reproduce, compare event listener code between implementations. Also check other composition-based input methods such as Hindi/Chinese for regressions

Pre-test setup:

  1. (On a Windows machine) Enable the Telugu Phonetic keyboard. Instructions for doing this here (in Telugu but they show the steps).
  2. Open https://en.wikipedia.org/wiki/Main_Page?useskin=vector-2022 in private browser mode and as a logged in user
  3. Also open https://en.wikipedia.org/wiki/Main_Page in private browser mode and as a logged in user who does not have vector-2022 set as the default skin

Test Cases:

| Test | Expected | Notes |
| --- | --- | --- |
| Type "telugu" | A small menu should appear above the input field, with multiple suggestions. "తెలుగు" should be present, along with other options. Hitting "return" will accept this suggestion, and should enter "తెలుగు" into the input field - there should no longer be any Latin text in the input field. Search results should include the result for the page https://en.wikipedia.org/wiki/Telugu_language | N/A |
| Type "r", then shift+i, then hit return to accept suggestion from menu (if you see a menu) | రీ should appear (note the top twirl), not రి | User noted that there is no distinction between long and short "i"s in the new search - that is, that though they typed రీ, they saw "రి" in the input field. These are two separate vowels and are not equivalent. |
| Type "telugu", then select "తెలుగు" from the menu, as in the first test case. Then hit space, and attempt to type "lipi". లిపి should appear in a dropdown, and you should be able to hit return to accept this suggestion. | The input field should now read "తెలుగు లిపి" - the earlier-entered text should not disappear | User notes that the తెలుగు portion disappeared when selecting the "లిపి" portion |

Other things to check:

| Test | Expected | Notes |
| --- | --- | --- |
| Enable Hindi Phonetic input method. Type "hindi". | You should see a selection window appear with हिंदी as an option. Hit return to accept this suggestion. हिंदी should appear in the input field, and search results should show https://en.wikipedia.org/wiki/Hindi as the top suggestion. | Any fix for the Telugu issue will also affect the intermediate state for Hindi |
| Enable Chinese Simplified - Pinyin input method. Type "nihao" | You should see a selection window appear with 你好 as a suggestion. Hitting return should select this option. The field contents should be replaced with 你好, and the first result should be https://en.wikipedia.org/wiki/Ni_Hao | Chinese behavior should be the same as Hindi and Telugu once an option has been selected. Note that you should not see intermediate search results for "n", "ni", or "nih" |

I have not myself double checked any of this on a Windows machine yet, so there may be slight differences in expected results, as the above is based on macOS conventions!

Hey @EUdoh-WMF, we discussed this with the Web team today (notes) and are wondering if you would be able to help us reproduce this issue on a Windows machine?

This is an IME issue, so CCing @dchan as the expert on all things IME.

In T319208#8283191, @NHillard-WMF wrote:
  • Timing may be implicated somehow: the user mentions: "...as soon as the page is loaded. If I leave the page and come back immediately, then it starts working."

I wonder if what's going on here is that the user has already started typing into the plain input, and then Codex loads and replaces the plain input with a TypeaheadSearch input. It restores the value in the input, but this would interrupt the IME. This theory would be consistent with typing being disrupted once, when this load happens, but typing otherwise working normally. But the bug report sounds like there's ongoing disruption/breakage after load too, so maybe this isn't what's happening (or it is happening, but something else is happening too).

  • Generally speaking with this kind of input, the key thing to listen for here is the compositionend event, described here - this tells you that the transliteration popup is finished transliterating the text from Latin characters to the target input script. Text fields differ on this slightly (Google autocomplete does consult intermediate state), but generally speaking you should not be offering autocomplete suggestions for the intermediate state, as the user has not finished typing yet.

Unfortunately, compositionend doesn't work well in practice with a number of IMEs, and there's no generally applicable way to tell when a user has finished typing vs what is an intermediate state. See T295166 for more discussion on this topic.

  • keydown is generally disfavored compared to keypress if you are going to steal individual key events, but as noted this is probably itself not a good idea.

That's not what MDN says: it says that keypress is deprecated, and that it doesn't fire for non-character keys like arrows and Home/End (which we need to capture).

Our code doesn't steal (prevent the default action) on any key events except for Enter, Tab, Home, End, ArrowUp, ArrowDown and Escape (and in theory Space sometimes, it looks like that is never prevented but the logic is confusing, I'll try to improve this code).

I wonder if this particular IME does weird stuff where it moves the cursor to the start/end of the input and generates Home/End key events as it does so. I believe @dchan has seen other IMEs do this. If that's the case, maybe T314728 will fix this bug as a side effect. This is a bit of a guess, but we could test this theory by comparing the behavior of the PatchDemo wiki for that bug to the behavior in production. If the PatchDemo wiki behaves correctly (or at least better), then this is (part of) the problem.

For now, I think the next steps should be:

  • Ask the user which IME and which browser/OS they're using (we've been assuming it's the Telugu Phonetic keyboard layout on Windows, but let's confirm that)
  • Reproduce the bug ourselves, so we can verify that user-specific things like gadgets or ULS or whatever aren't interfering, and this really is a bug in our interaction with that IME. We should try to reproduce in these four environments:
    1. On Telugu Wikipedia in production
    2. On the Codex demo site
    3. On the PatchDemo wiki for the proposed fix for T314728
    4. On the patch preview for the proposed fix for T314728
  • My hope is that #1 and #2 behave the same (both broken the same way as described in the bug report), and #3 and #4 both behave correctly. If that's the case, then this bug will be fixed as a side effect of fixing T314728, and we're done.
  • Otherwise, we'll have to do more investigation. The first step would be to capture a full event log of all the events the IME sends by typing the text we've been testing with into this tool, and then analyzing its output.
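
If the linked tool is not at hand, an event log of this kind can also be produced with a few lines of script. The sketch below only shows the formatting step; `LoggedEvent` and `formatLoggedEvent` are hypothetical names, and the listener wiring in the comment is an assumption about how one might feed it in a browser:

```typescript
// Subset of event fields worth recording when analysing IME behaviour.
interface LoggedEvent {
  type: string; // e.g. "keydown", "input", "compositionend"
  key?: string; // KeyboardEvent.key, when present
  data?: string | null; // CompositionEvent.data / InputEvent.data
  isComposing?: boolean; // KeyboardEvent / InputEvent isComposing flag
  value: string; // value of the input field when the event fired
}

// Hypothetical formatter: one line per event, fields in a fixed order.
function formatLoggedEvent(e: LoggedEvent): string {
  const parts: string[] = [e.type];
  if (e.key !== undefined) parts.push("key=" + JSON.stringify(e.key));
  if (e.data !== undefined) parts.push("data=" + JSON.stringify(e.data));
  if (e.isComposing !== undefined) {
    parts.push("isComposing=" + String(e.isComposing));
  }
  parts.push("value=" + JSON.stringify(e.value));
  return parts.join(" ");
}

// In a browser, one could build LoggedEvent objects inside listeners for
// keydown, keyup, input, compositionstart, compositionupdate and
// compositionend on the search input, and collect the formatted lines.
console.log(
  formatLoggedEvent({ type: "keydown", key: "ArrowDown", isComposing: true, value: "ni" })
);
```

Comparing such logs between production and the PatchDemo wiki would show whether the IME generates Home/End or arrow key events while composing.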

Agreed with all of this - my mistake on keypress (this was off the top of my head / without confirmation, and thus incorrect!), and agreed about general spotty handling of key events in a multilingual context.

I should have led with this -- there is unfortunately a great deal of complication around key event handling, as that other issue indicates. Cross-browser, cross-OS, and cross-IME differences all need to be considered here. Being as conservative as we can / deferring to previous behavior wherever possible is critical here.

I briefly tested this out this morning and I was able to reproduce what may be a similar issue.

Check the attached video of using Chinese input on MacOS:

To note:

  1. As you navigate the candidate window using arrows, both the highlighted candidate in the candidate window and the highlighted result move focus -- compare to Google's autocomplete where only the candidate window moves (this in turn is likely tied to the internal state of the input field)
  2. After hitting "enter" at the end, the search results are only those for the first character, 北, not for the contents of the input field, 北京 - the second character has been dropped

As to what might be contributing to this:

Our code doesn't steal (prevent the default action) on any key events except for Enter, Tab, Home, End, ArrowUp, ArrowDown and Escape (and in theory Space sometimes, it looks like that is never prevented but the logic is confusing, I'll try to improve this code).

The preventDefault() / stopPropagation() on ArrowDown may actually be partly to blame here, or at least result in the bug I'm describing -- to navigate between candidates within the input method window, it's very common to hit the down and up arrows to highlight the desired option before selecting it (users can also select with the mouse or, at least on macOS, with the number keys, but these are typically less common). Currently, hitting the down arrow moves the highlight both within the candidate window and within the autocomplete dropdown. Further, if you perform the search after this, the search proper does not include the character you have chosen.

(Small side note: on MacOS the majority of candidate windows are horizontal (though some have the option to make the candidate window vertical), so we don't see this as often, but on Windows, the default for phonetic-style input (which all Indic languages have, not just Telugu) is a vertical window -- I used the vertical window here to highlight the problem on a mac as I don't currently have a working Windows machine to test this with)

I'm adding this here as it might be related - particularly vis-a-vis handling of up/down and enter, but this may end up being a separate issue, in which case we should spin this out.

As Roan says, we should start by reproducing with Telugu Phonetic on Windows, which is the scenario that the user noted. (Small note: this is likely Windows because macOS does not have Telugu transliteration as an input method, though in theory it could also be Linux. Agreed that it would not hurt to follow up with the user and ask.)

In T319208#8288645, @NHillard-WMF wrote:
  1. After hitting "enter" at the end, the search results are only those for the first character, 北, not for the contents of the input field, 北京 - the second character has been dropped

I think what may actually be happening here is that pressing "Enter" selects the highlighted item. In your video, at the time you pressed enter, the last menu item (the footer that says "Search for pages containing 北") was highlighted, and pressing Enter on that item searches for pages containing 北, which is what happened to you. But the footer item is treated differently from the others: if you arrow to a different item I think pressing enter would not behave incorrectly at all, because in that case we just let the enter keypress happen in the assumption it'll submit the form, which in this case it won't.

Could you confirm this by doing the same test but with the arrow keys landing on a different item that isn't the footer?

If my theory is correct, then we could address this either by not listening to Enter keypresses (and other things like arrows) while composition is active (if that's something we can reliably detect), or by not listening for Enter at all and instead listening for form submission.
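
The first option could look roughly like the sketch below. It assumes the browser sets `isComposing` (or reports keyCode 229) on key events during composition, which is exactly the part that may not be reliable for every IME; `KeyEventLike`, `MENU_KEYS` and `shouldMenuHandleKey` are hypothetical names, not the actual Codex code:

```typescript
// Minimal shape of the KeyboardEvent fields this sketch reads.
interface KeyEventLike {
  key: string;
  isComposing: boolean;
  keyCode?: number;
}

// Keys listed earlier in this thread as having their default prevented.
const MENU_KEYS: string[] = [
  "Enter",
  "Tab",
  "Home",
  "End",
  "ArrowUp",
  "ArrowDown",
  "Escape",
];

// Returns true only when the menu should act on (and possibly prevent)
// the key. While an IME composition is active, the key is left alone so
// the candidate window can use arrows and Enter itself.
function shouldMenuHandleKey(event: KeyEventLike): boolean {
  // keyCode 229 is commonly reported for keydown while an IME is
  // processing input, as a fallback when isComposing is not set.
  if (event.isComposing || event.keyCode === 229) {
    return false;
  }
  return MENU_KEYS.indexOf(event.key) !== -1;
}

console.log(shouldMenuHandleKey({ key: "ArrowDown", isComposing: true })); // false
console.log(shouldMenuHandleKey({ key: "ArrowDown", isComposing: false })); // true
```

The second option (listening for form submission instead of Enter) avoids the detection problem for Enter, but would not help with the arrow keys.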

Could you confirm this by doing the same test but with the arrow keys landing on a different item that isn't the footer?

When I highlight an item that is not the footer, I'm seeing similar behavior, but even the first character is not included -- the search result is just blank:

Screen Shot 2022-10-06 at 9.50.06 AM.png

You can test this locally through the following steps on MacOS:

  1. Go to the Keyboard preference pane, click on "Input Sources"
  2. Click "+", search for "Chinese, Simplified", highlight "Pinyin - Simplified", and hit "Add"
  3. Still in the Input Methods tab, highlight "Pinyin - Simplified" to the left, then move the dropdown that says "Candidate Window: Orientation: Horizontal" to "Vertical"
  4. In the menu in the upper right of the screen, change "US" to "Pinyin - Simplified" to activate the input method
  5. Go to https://en.wikipedia.org/?useskin=vector-2022, and type "ni" on the keyboard. Select any candidate, via mouse or the enter key.
  6. Now type "hao", and then hit the down arrow any number of times. Hit "enter" to select any candidate that is not the first.
  7. Hit "enter" again to submit the form, notice that the result is blank.

Tested with @NHillard-WMF and @EUdoh-WMF in Chrome and Firefox on Windows 10 and was not able to reproduce. For next steps, could we follow up with Chaduvari in the talk page thread and see if they would be open to recording a screenshare for us?

I will be spinning out the Chinese issue cited above into a separate ticket shortly - it is likely a different issue

Is this issue still present? I don't see instructions for how to reproduce in the main task.

We discussed this just now in DST refinement. In short, we are not sure whether this is a TypeaheadSearch issue, and more fundamentally we are not sure whether it is reproducible outside of this particular user's report. For next steps, we need a QTE engineer to attempt to reproduce the specific issue identified with the "Telugu Phonetic" keyboard layout on a real Windows PC.

Ezekiel did attempt to reproduce when this issue was first reported, and was not able to, but he was up against several limitations that we need to keep in mind as we look to reproduce:

  1. BrowserStack does not offer the ability to test input methods; this must be tested on a real physical PC running Windows.
  2. This is the "Telugu Phonetic" keyboard variant (there are several Telugu keyboards; this is the particular version that transliterates Roman characters into the Telugu script).
  3. The QTE engineer should have familiarity with and/or work closely with an expert in input methods in order to accomplish this work and double check assumptions. Familiarity with testing Visual Editor or working with the Editing team is a plus here.
  4. Work through the repro steps outlined here: https://phabricator.wikimedia.org/T319208#8283662
  5. You may need to follow up with the original user to discuss or refine this issue further. If you have tested 1-3 above and not found issues, report back here and the Web Team can work to reach out to this particular user to give additional information.

As a starting point, @ovasileva would you be able to set a priority on this issue? (We're planning to dupe https://phabricator.wikimedia.org/T327070 here, at which point arguably this issue should be on the web board, as the other one is.)

Once we have this, we can assign to JR to send to QTE for triage before distributing to one of the two teams.

One problem remains -
The search box does not transliterate the Roman script to Telugu as soon as the page is loaded. If I move out of the search box and return, then it works. Similar behaviour is noticed when the page is opened in source edit mode (2017 Wikitext editor): transliteration does not work immediately. It works after some pause or after I move out of the tab and return. This behaviour is not noticed in Visual edit mode. Thanks.