Bangla letters are getting broken in the search box
Closed, Resolved | Public | 3 Estimated Story Points

Description

  1. Go to https://bn.wikipedia.org
  2. Type something in the search box in Bangla e.g.: "ইতাল"
  3. Look at the suggested titles in the results. The consonant clusters (conjuncts) in the suggested titles (after the word "ইতাল") are broken (shown with red highlight in the screenshot below).

search bn.PNG (674×551 px, 83 KB)

In the legacy Vector version, we didn't have this issue.

search bn old.PNG (288×361 px, 18 KB)

This bug is very similar to T35242

QA Results - Beta

AC | Status | Details
1 | | T277256#6984798

QA Results - Prod

AC | Status | Details
1 | | T277256#6984803

Event Timeline

From conversation with @TJones on Slack:

As noted in the ticket, it seems similar to T35242. In general, you shouldn't put any markup right before a character that matches \p{Mark}. Since \p{Mark} isn't available in JavaScript you can either import another library that supports all the named regex classes, or define your own. (I defined comboMarks in jquery.highlightText.js last time I looked at this.)
Another option, which I think Google has adopted, is to stop trying to highlight partial matches in some scripts, including most of the Indic scripts.
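
A minimal sketch of the first rule above (no markup directly before a combining mark). It assumes a runtime where the `u` flag and the ES2018 `\p{M}` property escape can be used; older targets would need a hand-built character class, as the comment says. `splitTitle` is a hypothetical helper for illustration, not the wvui or jquery.highlightText.js code, and it only handles prefix matches.

```javascript
// Hypothetical helper; not the wvui or jquery.highlightText.js implementation.
// Assumes ES2018 Unicode property escapes (requires the `u` flag).
const LEADING_MARKS = /^\p{M}+/u;

function splitTitle(title, query) {
  // Only exact prefix matches are handled in this sketch.
  if (query === '' || !title.startsWith(query)) {
    return { match: '', rest: title };
  }
  let end = query.length;
  // If the remainder would start with combining marks, pull them into the
  // matched part so that no markup lands directly before a mark.
  const marks = LEADING_MARKS.exec(title.slice(end));
  if (marks !== null) {
    end += marks[0].length;
  }
  return { match: title.slice(0, end), rest: title.slice(end) };
}

// KA, VIRAMA, YA, VOWEL SIGN AA, LA, written as escapes.
const example = splitTitle('\u0995\u09CD\u09AF\u09BE\u09B2', '\u0995');
console.log(example.match.length, example.rest.length);
```

With the query KA, the boundary moves to after the virama instead of before it. That avoids starting a highlighted span on a mark, but the conjunct itself is still split in two.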

ovasileva added a subscriber: ovasileva.

Change 674315 had a related patch set uploaded (by Phuedx; owner: Phuedx):
[wvui@master] [typeahead-suggestion-title] Preserve graphemes during splitting

https://gerrit.wikimedia.org/r/674315

phuedx added a subscriber: phuedx.
Jdlrobson set the point value for this task to 3. (Mar 23 2021, 5:17 PM)

Change 674315 merged by jenkins-bot:
[wvui@master] [typeahead-suggestion-title] Preserve graphemes during splitting

https://gerrit.wikimedia.org/r/674315

Next steps: We need to do a release of wvui and bundle it into core. Given @cjming also made some modifications to wvui, perhaps the two of you could pair and do this together?

@Edtadros: As discussed, you'll need to test this on your local development environment. You'll need to follow the (excellent) instructions in order to do that.

You can either:

  1. Test these changes out via Storybook, which we did during our most recent 1:1; or
  2. Test these stories on your local development wiki
1. Via Storybook
  1. git clone "https://gerrit.wikimedia.org/r/wvui"
  2. cd wvui
  3. npm install
  4. npm run start
  5. Open the "Components / Typeahead Suggestion - Example list with graphemes" story, e.g. navigate to http://localhost:3003/?path=/story/components-typeaheadsuggestion--example-list-with-graphemes
2. On your local development wiki
  1. git clone "https://gerrit.wikimedia.org/r/wvui"
  2. cd wvui
  3. npm run build -- -dw
  4. cd /path/to/mediawiki
  5. cd resources/lib/wvui
  6. rm *.{css,js}
  7. ln /path/to/wvui/dist/*.{css,js} .
  8. Navigate to your local wiki with Vector V2 enabled, e.g. http://localhost:8080/wiki/Main_Page?useskinversion=2

@Edtadros: I've updated the instructions above, per our most recent 1:1 ☝️

Test Result - Beta

Status: ✅ PASS
Environment: storybook
OS: macOS Big Sur
Browser: Chrome
Device: MBP
Emulated Device: NA

Test Artifact(s):

QA Steps

✅ AC1: Verify the component bolds the text after the searched text.

Screen Shot 2021-04-08 at 9.31.16 AM.png (881×829 px, 142 KB)

Edtadros added a subscriber: Edtadros.

Test Result - Prod

Status: ✅ PASS
Environment: bnwiki
OS: macOS Big Sur
Browser: Chrome
Device: MBP
Emulated Device: NA

Test Artifact(s):

QA Steps

Go to https://bn.wikipedia.org
Type something in the search box in Bangla e.g.: "ইতাল"
✅ AC1: Verify the searched text is not bolded, but any additional text in the title is bolded.

Screen Shot 2021-04-08 at 9.47.39 AM.png (795×849 px, 206 KB)

Beta looks good. Thank you all for fixing this.

Q: When will this be deployed on bnwiki?

Not on production yet - looks like this still needs to be deployed

Just noting that the remaining work involves updating the version of WVUI in core for this change to take effect.

@Jdrewniak That seems fine, I'll wait on the Storybook change for the next release.

From my perspective, https://gerrit.wikimedia.org/r/c/wvui/+/676236 is not a blocker for doing a release. The storybook change doesn't seem to add any functionality, only code refactorings.

https://phabricator.wikimedia.org/T277315 and https://phabricator.wikimedia.org/T278880 should probably be considered blockers, however, since they're currently in TODO and it doesn't make sense to have to release twice.

Change 679501 had a related patch set uploaded (by Jdrewniak; author: Jdrewniak):

[mediawiki/core@master] Update WVUI to wvui-0.1.1-next.2021-04-14-21-38.0

https://gerrit.wikimedia.org/r/679501

Change 682196 had a related patch set uploaded (by VolkerE; author: VolkerE):

[mediawiki/core@master] Update WVUI to v0.1.1

https://gerrit.wikimedia.org/r/682196

Change 682196 merged by jenkins-bot:

[mediawiki/core@master] Update WVUI to v0.1.1

https://gerrit.wikimedia.org/r/682196

Change 679501 abandoned by VolkerE:

[mediawiki/core@master] Update WVUI to wvui-0.1.1-next.2021-04-14-21-38.0

Reason:

Superseded by Icc08eadcd81493b898c02c8a5ca6c3883ab20e2d

https://gerrit.wikimedia.org/r/679501

@Volker_E I'm not sure how to test this in beta, and I'm not sure storybook shows me anything different than the previous QA test. Do I just wait to test this in Prod?

@Edtadros Was hoping that https://en.wikipedia.beta.wmflabs.org/wiki/Main_Page?uselang=bn would be sufficient, but the articles in the typeahead results need to carry an example. Maybe we could fake it by adding one to a certain namespace, like a user namespace article…

test bnwiki beta wmf.PNG (630×496 px, 32 KB)

Everything looks fine except these two (ক্যালিফোর্নিয়া স্টেট রুট ১৮৮ and ক্যাটায়ন). Search for to reproduce.

ক্‌যা should be ক্যা

I just noticed: legacy vector version also has this problem. It would be great if this could be fixed.

TL;DR: The best solution is probably to turn off partial-match highlighting for wikis (like bnwiki) written primarily in scripts that have conjuncts that don't do well when split by highlighting, but leave the generic combining-mark glomming logic in place for other wikis (like enwiki) that have highlighting and titles in such conjunct-using scripts, so the results don't look entirely ridiculous.


If I can jump in here... for those who aren't real familiar with Indic scripts, the conjunct ligatures are special forms for when two characters sort of merge together. There's nothing quite comparable in English, but it would kinda be like splitting and bolding half of æ as ae or w as vv, but worse.
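
To make the conjunct concrete, here is a small sketch that lists the code points behind the cluster discussed in this task (KA, VIRAMA, YA, VOWEL SIGN AA) and flags which of them are combining marks. The string is written with escapes so the sequence is unambiguous; the variable names are only illustrative, and the `\p{M}` escape assumes an ES2018-capable runtime.

```javascript
// KA + VIRAMA + YA + VOWEL SIGN AA, the cluster discussed in this task.
const conjunct = '\u0995\u09CD\u09AF\u09BE';

// True for a single code point in a Unicode Mark category.
const isMark = /^\p{M}$/u;

// One entry per code point: its U+XXXX label and whether it is a mark.
const breakdown = Array.from(conjunct, (ch) => ({
  codePoint: 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'),
  mark: isMark.test(ch),
}));

console.log(breakdown);
```

The second and fourth code points are marks, so a highlight boundary placed right after KA leaves the next span starting on the virama, which is where the dotted circle comes from.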

Conjuncts and highlighting have always been a problem. You can still see it on English Wikipedia—which has plenty of title redirects in Bengali and other Indic scripts. Searching for ক gives one example just like the one @Aftabuzzaman had:

Screen Shot 2021-04-28 at 5.03.47 PM.png (446×418 px, 18 KB)

The problem is a combination of things. Font rendering systems do okayish with Latin and maybe Cyrillic when you have characters and diacritics where one half is bold and the other half isn't. So é with a bold e and normal accent is this: é and vice versa is this: é. Not perfect, but it doesn't look terrible like ক্যা does with its dotted circle. If only Indic font rendering were better we wouldn't even have this problem!

The generic solution implemented here prevents these really terrible cases with the dotted circles. A smarter implementation that knows about specific conjuncts would probably have to be language specific and done for each language with conjuncts and other complex ligatures.
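
As a rough idea of what a "smarter", language-specific variant might look like, the sketch below handles Bengali only: besides combining marks, it also keeps any letter that directly follows a Bengali virama (U+09CD) with the preceding cluster. `extendToClusterEnd` is a hypothetical name, not part of wvui; the approach assumes ES2018 regex features (property escapes, lookbehind, sticky flag), ignores ZWJ/ZWNJ, and will over-extend where a virama is meant to stay visible.

```javascript
// Hypothetical, Bengali-only sketch; not part of wvui.
// Starting at lastIndex, matches a run of combining marks plus any letter
// that directly follows a Bengali virama (U+09CD), i.e. the rest of a conjunct.
const CLUSTER_TAIL = /(?:\p{M}|(?<=\u09CD)\p{L})*/uy;

// Returns the index at which the cluster containing position `end` finishes.
function extendToClusterEnd(title, end) {
  CLUSTER_TAIL.lastIndex = end;
  const tail = CLUSTER_TAIL.exec(title);
  return end + (tail === null ? 0 : tail[0].length);
}

// KA, VIRAMA, YA, VOWEL SIGN AA, LA: a query of just KA (length 1)
// is extended to cover the whole conjunct plus its vowel sign.
const clusterEnd = extendToClusterEnd('\u0995\u09CD\u09AF\u09BE\u09B2', 1);
console.log(clusterEnd);
```

Because the regex is applied to the full title with the sticky flag rather than to a slice, the lookbehind can still see a virama that was the last character of the query.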

Google used to have this problem, too, but their solution, which I also suggested above, is to not do partial-match highlighting for some scripts, like the Indic scripts. Here is a screen shot of a search for ক্:

Screen Shot 2021-04-28 at 5.32.46 PM.png (234×242 px, 2 KB)

I'd like to point out that even with highlighting turned off on bnwiki, the fix here would still be useful on larger wikis with highlighting enabled (like English Wikipedia), because it would give completion matches like ক্‌যা, which is worse than an intact ক্যা, but much better than a ক্যা split by highlighting (with its dotted circle), I think.
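
The combination suggested in this comment (no partial-match highlighting on some wikis, generic mark handling everywhere else) could be sketched roughly as follows. The per-language set, the function name, and the `{ plain, bold }` return shape are all assumptions made for illustration; nothing here reflects actual wvui or MediaWiki configuration.

```javascript
// Hypothetical configuration: content languages whose wikis skip
// partial-match highlighting entirely. Name and contents are assumptions.
const NO_PARTIAL_HIGHLIGHT = new Set(['bn']);

// Combining marks at the start of a string (ES2018 property escape).
const MARKS_AT_START = /^\p{M}+/u;

// Returns the part of the title to render plain and the part to render bold.
function highlightParts(title, query, contentLanguage) {
  const noMatch = query === '' || !title.startsWith(query);
  if (noMatch || NO_PARTIAL_HIGHLIGHT.has(contentLanguage)) {
    // Highlighting off, or nothing to highlight: show the title unbolded.
    return { plain: title, bold: '' };
  }
  let end = query.length;
  const marks = MARKS_AT_START.exec(title.slice(end));
  if (marks !== null) {
    end += marks[0].length; // keep marks attached to their base character
  }
  return { plain: title.slice(0, end), bold: title.slice(end) };
}
```

Called with `'bn'`, the function returns the whole title as plain text; called with `'en'` for the same Bengali title, it still bolds the remainder but never starts the bold span on a combining mark.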

</2¢>

@Volker_E / @Jdlrobson, I verified that I see the same errors pictured in T277256#7043232 and T277256#7043719 on https://bn.wikipedia.beta.wmflabs.org/wiki/Main_Page. If the purpose of this task was specifically to get rid of the underlined character here:

search bn.PNG (674×551 px, 83 KB)

then I can progress this to QA in Prod. But that doesn't fully address the errors above that don't deal with the circle character underlined above. Let me know how you'd like to proceed.

Discussed in standup today - will follow up with a separate task that will display the results as not bolded throughout