
Search suggestion highlighting does not respect grapheme clusters causing wrong rendering for Arabic and Indic scripts
Closed, ResolvedPublic

Description

Certain glyphs don't render properly in a few places, and it occurs randomly (see attachment). Fonts are not the issue: most of us who tested across browsers and operating systems have Tamil fonts installed, and a good number of them. This issue is reproducible only on Wikimedia sites.

Ravi noticed this rendering bug in the search box on Ubuntu 11.10 and FF8
Bala noticed this rendering bug in the search box on Win XP SP3 and FF5,IE7
I notice this rendering bug in the RecentChanges page on Ubuntu 11.10 and Chrome 16


Version: unspecified
Severity: normal
URL: http://ta.wikipedia.org/wiki/Special:RecentChanges
See Also:
T41101: ULS Search auto-complete suggestions over write the input in Indic

Details

Reference
bz33242

Event Timeline

bzimport raised the priority of this task from to Normal.Nov 21 2014, 11:59 PM
bzimport added projects: MediaWiki-Search, I18n.
bzimport set Reference to bz33242.
bzimport added a subscriber: Unknown Object (MLST).

sodabottle wrote:

screenshot of recent changes page


sodabottle wrote:

screenshot of search box dropdown list (FF5, XP SP2)


This appears to be working for me now. Could you give me the key sequence necessary to produce a transliterated sequence that produces the screenshot in attachment 9735?

sodabottle wrote:

I am still getting the broken glyphs. The letters seen in attachment 9735 are ”குவ” (can be reproduced in transliteration typing scheme by keystrokes kuva). Occurs with Narayam input scheme or external input scheme or just copypasting the text

(In reply to comment #3)

this appears to be working for me now. Could you give me the key sequence
necessary to produce a transliterated sequence that produces the screenshot in
attachment 9735 [details] ?

I can confirm this as a bug in search suggestions.
To reproduce: on ta.wiki, select Tamil99 as the input method and type k in the
search box. You will get ம, and suggestions.

Here is the code for the first suggestion.
<div class="suggestions-result suggestions-result-current" rel="0"
title="மீட்டர்">
<span style="white-space: nowrap;"><span
class="highlight">ம</span>ீட்டர்</span>

</div>

To highlight ம, there is a span surrounding it, but the 'ee' vowel sign is
outside the span. In Indic scripts, vowel signs cannot exist independently, and
this causes the search suggestion item to be rendered wrongly.

The solution is not straightforward: if we want to highlight the typed letter,
the span should be applied to the whole glyph cluster and not to the letter alone.

I am not sure if there is any easy way to do this. Browsers are aware of these
rules anyway: try moving the cursor over "மீட்டர்" step by step. You cannot place
your cursor between ம and ீ .

The problem is not limited to Tamil anyway. I am changing the bug summary.

Ah, I identified the issue in attachment 9734: it has something to do with the font not having an italic glyph. Not sure if something can be done here. Ideally it should be reported to the (inactive) upstream, since that font is an ASCII font.

(In reply to comment #0)

I notice this rendering bug in the RecentChanges page on Ubuntu 11.10 and
Chrome 16

I can confirm this in my Chromium 15 on Debian (unstable). As far as I can tell, this is a rendering bug in Chrome/Chromium and needs to be reported upstream (http://code.google.com/p/chromium/issues). It also needs a separate bug report here, since the rendering issue in search suggestions and this one are two different issues.

  • Bug 33548 has been marked as a duplicate of this bug.
  • Bug 40300 has been marked as a duplicate of this bug.
aude added a comment.Sep 20 2012, 3:27 PM

this is not essential for Wikidata but willing to take this bug as a volunteer to fix on the weekend :)

Created attachment 11130
a simple example of different stylings applied to Tamil cluster parts

This is a general rendering problem.

The behavior in Firefox is slightly less broken - in the row with the colored span it colors the whole cluster, whereas Chromium breaks it. But I'm not sure what is actually correct according to the HTML and Unicode standards.


Do we really need this bold font styling in the first place? If we just remove it, it will fix this bug, and nobody will complain. It will probably fix Bug 26665, too.

aude added a comment.Sep 20 2012, 4:05 PM

As a workaround, I suggest a language blacklist for which autosuggestions do not use highlighting. This can include any language that uses the Arabic script, Tamil, and others (which ones?).

(In reply to comment #13)

As a workaround, I suggest a language blacklist for which autosuggestions does
not use highlighting. This can include any language that uses an Arabic
script, Tamil and others. (which ones?)

All RTL languages, to fix Bug 26665.

(In reply to Aude from comment #13)

As a workaround, I suggest a language blacklist for which autosuggestions
does not use highlighting. This can include any language that uses an
Arabic script, Tamil and others. (which ones?)

Any language written in devanagari, and I suspect all other Indic scripts too.
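The language-list workaround proposed above could look roughly like the sketch below. This is only an illustration: the function name `shouldHighlight`, the constant name, and the particular language codes are assumptions, not an agreed list and not code from MediaWiki.

```javascript
// Hypothetical list, assembled from the scripts named in this thread:
// Arabic-script languages, other RTL languages, Tamil, and Devanagari-script
// languages. A real list would need review by speakers of each language.
const NO_HIGHLIGHT_LANGUAGES = new Set([
  'ar', 'fa', 'ur', // Arabic script
  'he',             // RTL
  'ta',             // Tamil
  'hi', 'mr', 'ne'  // Devanagari
]);

// Returns false when suggestion highlighting should be skipped for a wiki
// whose content language is `languageCode`.
function shouldHighlight(languageCode) {
  return !NO_HIGHLIGHT_LANGUAGES.has(languageCode.toLowerCase());
}
```

A caller would check `shouldHighlight( contentLanguage )` once and, if it returns false, render suggestions as plain text without the highlight span.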

Restricted Application added projects: Discovery, Discovery-Search.Dec 22 2016, 12:32 PM
Liuxinyu970226 removed a subscriber: wikibugs-l-list.

Is this still an issue?

I would appreciate if someone could type a query in here that we could use to check whether this is still an issue. Screenshots are helpful, but being able to try it ourselves is even more helpful. :-)

Yes, it still happens.

I hope that this screenshot from the Tamil Wikipedia shows the problem:

Taken on Firefox on a Mac. The same happens in Chrome.

To reproduce it, go to ta.wikipedia.org and paste the single letter ட into the search box.

Notice the difference between what is shown in the search box itself, and in the first auto-completion (with the blue background). This is the same string, but the beginning of the word has a different appearance.

What's displayed in the search box is correct: "டெ". The "right angle"-shaped letter (ட) is supposed to be on the right-hand side, and the "loopy" letter is supposed to be on the left-hand side.

What's displayed in the auto-completion is incorrect. The "loopy" letter is on the right-hand side, and there's a dotted circle after it.

The right-angle letter is the consonant T, and the "loopy" letter is the vowel E that comes after it. Together it's "te". However, because of the way the Tamil alphabet works, the vowel E is displayed before the consonant, even though it's typed and pronounced after it. And since the vowel cannot come alone, a dotted circle is shown after it (ெ) instead of the consonant, to which the vowel is supposed to be attached.

The reason that the vowel is not combined correctly with the consonant is that the character that is typed into the search box is highlighted in the auto-completed results. The HTML looks like this:

<div class="suggestions-result" rel="0">
    <span class="highlight">ட</span>ெங்குக் காய்ச்சல்
</div>

Notice the single ட inside <span class="highlight">. Because ட and ெ are in different DOM elements, they are not combined into the consonant-vowel cluster டெ, but are shown separately.
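A quick way to check whether a highlight boundary would fall inside such a cluster is to test whether the character right after the boundary is a combining mark. The sketch below is an illustration only; `boundaryBreaksCluster` is a made-up name, and it assumes a JavaScript engine that supports ES2018 Unicode property escapes (`\p{M}` with the `u` flag).

```javascript
// Returns true if cutting `text` at UTF-16 index `index` would separate a
// combining mark from the character it attaches to.
function boundaryBreaksCluster(text, index) {
  const next = text.codePointAt(index);
  if (next === undefined) {
    return false; // boundary is at the end of the string
  }
  // \p{M} matches any code point in the Unicode "Mark" categories.
  return /^\p{M}$/u.test(String.fromCodePoint(next));
}

// U+0B9F TAMIL LETTER TTA followed by U+0BC6 TAMIL VOWEL SIGN E, i.e. டெ
const te = '\u0B9F\u0BC6';
console.log(boundaryBreaksCluster(te, 1)); // true: the cut lands before the vowel sign
console.log(boundaryBreaksCluster(te, 0)); // false: the string starts with a full letter
```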

Many other alphabets of India and the Middle East have similar features: characters that are combined in surprising ways. Tamil is just one example.

A simplistic solution for this would be not to use <span class="highlight"> at all. A smarter solution would be to put the whole combined character into it, but that would probably be very complicated. But really, without the different styling it looks just as good in all languages.

Back when this bug was created, browsers were definitely not smart enough about combined characters. I remember that @santhosh mentioned once or twice that this is getting better and at least JavaScript may be more aware about such advanced Unicode features, but I'm not sure that it's indeed like this.

I don't think a complex solution that takes care of ligature boundaries is worth the effort there. As @Amire80 suggested, I would also suggest not showing fragments in bold. Even for Latin this can be problematic with combining marks, I guess.

debt added a subscriber: debt.

It looks like this could be fixed in the JS, by not bolding fragments, but it'll take some time before we can get to this task.

Meno25 removed a subscriber: Meno25.Nov 23 2018, 8:06 AM
Restricted Application added a subscriber: alanajjar.Nov 23 2018, 8:06 AM
debt moved this task from elastic / cirrus to Language Stuff on the Discovery-Search board.
TJones claimed this task.Aug 6 2019, 5:49 PM

Change 530602 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/extensions/CirrusSearch@master] Fix highlighting of grapheme clusters

https://gerrit.wikimedia.org/r/530602

TJones added a comment.EditedAug 16 2019, 5:01 PM

TL;DR

  • Patch up to fix the problem for search snippets, even though that's not the main problem.
  • I have a working fix for upper corner search box, but it may not be up to code standards since it is in the Javascript repo.
  • Main box on Search page is a separate thing; still working on that.
  • Ligatures are more complicated.
    • options.highlightInput can probably be used to turn off highlighting for suggestions (though I'd still like to do the right thing with combining characters in the general case).
    • Cirrus doesn't have a way to disable highlighting, but one could probably be added.

Search Snippets

So, I've looked into this a lot more. I've uploaded a patch to fix this problem in search snippet highlighting—where it is actually pretty rare, since it's hard to get a partial word match except when using a regex.

I used the Unicode-aware regex \p{M} which matches all combining characters (very convenient!).

I've tested it on Hindi, Javanese, Khmer, Myanmar, Tamil, Telugu, and Thai examples. Below are screenshots of Hindi, Javanese, and Myanmar examples. The Hindi example pulls the following vowel mark into the highlight. The Javanese (which is very hard to see), is a weird one because the regex query is a combining mark followed by a full character. The full character from before the highlight is pulled in, and the combining characters from after the highlight are also pulled in. The Myanmar example is similar to the Hindi example, except that several combining characters are pulled in from after the highlight.
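The widening behaviour described above (pull in the preceding full character when the match starts on a combining mark, and pull in any combining marks that follow the match) can be sketched as follows. This is not the code from patch 530602, which is PHP; it is a JavaScript illustration with a made-up function name, `widenToMarks`, and it assumes an engine with ES2018 Unicode property escapes.

```javascript
// Widen the match [start, end) so it never begins or ends between a base
// character and its combining marks. Indices count code points, not UTF-16
// units, so characters outside the BMP are handled as one unit.
function widenToMarks(text, start, end) {
  const chars = Array.from(text);
  const isMark = (ch) => ch !== undefined && /^\p{M}$/u.test(ch);

  // If the match starts on a mark, step back to the character it attaches to.
  while (start > 0 && isMark(chars[start])) {
    start--;
  }
  // Pull in marks that directly follow the match.
  while (end < chars.length && isMark(chars[end])) {
    end++;
  }
  return {
    before: chars.slice(0, start).join(''),
    match: chars.slice(start, end).join(''),
    after: chars.slice(end).join('')
  };
}
```

For a Hindi string such as हिन्दी with only the first letter matched, `widenToMarks( text, 0, 1 )` returns a `match` that also contains the following vowel sign, which is the first case described above.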

The patch above (530602) should fix this.

Latin has it easy (maybe)

Since @santhosh mentioned that Latin could have this kind of problem (and I thought the same thing), I checked into it. It is probably operating system dependent, but on my Mac, non-highlighted diacritics can still glom onto highlighted full characters. I only did a quick test with n̈ and n̞̚, but both work well enough when the n is bold but the diacritics are not.

I found the same situation when bolding characters on my computer in general. I'm curious how others see these two: ခင်ဦးမြိနယ်—but Spin̈al Tap. The font on Phabricator gets the umlaut only half on the n, but the Myanmar is messed up in the usual way.

Suggestion Highlighting

The suggestion highlighting, which was the main issue of this ticket, is harder. It's not actually something done by the suggestion API. It's done in javascript by the UI after fetching the suggestions. It's not my area of expertise and the UI folks may not like the patch that I hope to put up soon. If that's the case, then we can move this to be a UI task rather than a Search task.

I've got working code for the upper corner search box. There's no equivalent of \p{M} in Javascript, so I've hard-coded an equivalent constant regex for that purpose. I am not familiar with the code and it is a lot more complex (big surprise!) than the baby Javascript I normally work with. (Which is why the UI folks may not like my patch.)
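For context, engines that implement ES2018 Unicode property escapes do accept `\p{M}` in a regex with the `u` flag, so where those can be relied on, a hard-coded constant is only needed as a fallback. The sketch below shows one way to feature-detect this. The function name is made up, and the fallback range is a deliberately incomplete placeholder (only the Combining Diacritical Marks block), not the constant used in the patch.

```javascript
// Returns a regex matching a single combining mark.
function getCombiningMarkRegex() {
  try {
    // Built from a string so that older engines throw here at run time
    // instead of failing to parse the whole script.
    return new RegExp('\\p{M}', 'u');
  } catch (e) {
    // Placeholder fallback: U+0300-U+036F only. A real fallback would need
    // the full list of mark ranges.
    return /[\u0300-\u036F]/;
  }
}

const combiningMark = getCombiningMarkRegex();
console.log(combiningMark.test('\u0301')); // true with either branch
```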

Oddly, the main search box on the Search results page does not use the same code to fetch or highlight suggestions, so I'm still trying to track that down. My hope is that I can get something that works and does the Unicode stuff and the UI folks can refactor it into something that is also good code. If I can't find it soonish, I'll probably just post my fix for the upper corner search box and hand it off to the UI folks.

Ligatures

None of this addresses the problem of ligatures, which comes up in (at least) Arabic, Hindi, Khmer, and Telugu. Detecting when characters can form ligatures is a lot more complex than detecting individual combining characters that must attach to a full character.

Below are some examples where highlighting breaks ligatures, even when the combining characters are treated correctly (in both text and image form).

  • بافاريا --> بافاريا
  • कुम्भ मेला --> कुम्भ मेला
  • បេតវត្ថុ --> បេតវត្ថុ
  • జార్జ్ స్టబ్స్ --> జార్జ్ స్టబ్స్

There is still the option of disabling highlighting in the suggestions if ligatures are more important than highlighting. In the Javascript code I see indications that this is possible, but I do not understand how the configuration works and it is really hard (for me) to track things down. So I would leave that as an exercise for the UI folks.

In the search snippets, I don't see anything that looks like a config to disable highlighting. The highlighting comes out of Elasticsearch, so we'd have to strip it rather than convert it. That seems doable. Though it is a much rarer case in search snippets because it's harder to get a partial word to highlight than in the suggestions.

More as it develops.

Change 530602 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Fix highlighting of grapheme clusters in search snippets

https://gerrit.wikimedia.org/r/530602

This comment was removed by TJones.

@debt (or others in Search), please don't close this ticket, just remove our tag. There is still work to be done, I think on the OOUI/OOjs side.

This issue is reproducible only on wikimedia sites.

I was able to reproduce this on Google, but it required "tricking" it into highlighting when it shouldn't. Generally, Google seems to have disabled highlighting for at least some of the languages that can have this problem. For others, I was able to get some highlighting only by searching on the main google page and then clicking in the search box on the results page (e.g., Hindi & Tamil). Others have the problem all the time (Myanmar, Khmer). And then Javanese doesn't have the problem because there don't seem to be any suggestions in the Javanese script (at least not that I could find).

So, disabling highlighting is still a viable solution on some wikis, though it would not help when searching for Hindi on English Wikipedia (e.g., विश, which generates the suggestion विश ्व एक).

Next Steps

I've got the combining diacritic highlighting working on the main search box. I'm going to try to do a little refactoring to make the code less hacky, and then I'll upload a patch or two and hand this off to the OOUI/OOjs folks (or whoever is a better next owner).

Change 530929 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/core@master] Preserve grapheme clusters in upper corner completion suggester highlighting

https://gerrit.wikimedia.org/r/530929

Change 530960 had a related patch set uploaded (by Tjones; owner: Tjones):
[oojs/ui@master] Add option to preserve grapheme clusters in highlightQuery

https://gerrit.wikimedia.org/r/530960

Patches and Projects

I've uploaded a patch for the upper corner search box. I've added the MediaWiki-General project to find someone who might be able to review this.

I've uploaded a patch for changes to OOUI to enable fixing the main search box on Special:Search. mw.widgets.TitleOptionWidget.js still needs to be updated to use the change once it is approved. I've added the OOjs and OOUI projects to find someone who might be able to review this.

I've also added the Javascript project because it's all about the Javascript.

I've removed Discovery and MediaWiki-Search because the search bit is done. I'm leaving "Discovery-Search (Current work)" for now for tracking on our phab workboard.

Recent Changes

One of the early images attached to this ticket shows problems with highlighting in the Recent Changes list. Nothing I've done covers that, and the main description is about search suggestions. Obviously, the general problem of combining characters not being treated correctly by partial highlighting could apply in lots of places. (Also, I checked and the current recent changes page doesn't seem to have any bolding/highlighting so that may have been incidentally fixed since the ticket was opened.)

matmarex added a subscriber: matmarex.

Moving this ticket back to our backlog board. Please let us know if there is anything further the Discovery-Search team can help with.

debt added a comment.Aug 27 2019, 2:08 PM

Thanks for all the work, @TJones !

@TJones For future reference, can you share how you created the regexp you used instead of \p{Mark}? I tried replicating it using this tool: https://mothereff.in/regexpu#input=%2F%5Cp%7BM%7D%2Fu&dotAllFlag=1&unicodePropertyEscape=1 and got a different result. (Perhaps this tool and whatever you used generate the regexps from different Unicode versions.)

Change 530960 merged by jenkins-bot:
[oojs/ui@master] Add option to preserve grapheme clusters in highlightQuery

https://gerrit.wikimedia.org/r/530960

Change 530929 merged by jenkins-bot:
[mediawiki/core@master] Preserve grapheme clusters in upper corner completion suggester highlighting

https://gerrit.wikimedia.org/r/530929

@TJones For future reference, can you share how you created the regexp you used instead of \p{Mark}? I tried replicating it using this tool: https://mothereff.in/regexpu#input=%2F%5Cp%7BM%7D%2Fu&dotAllFlag=1&unicodePropertyEscape=1 and got a different result. (Perhaps this tool and whatever you used generate the regexps from different Unicode versions.)

I used the BMP definition of \p{Mark} from xregexp as a start. I wouldn't be shocked if over the next few years a few minor tweaks are requested from users; testing everything would be very difficult.

I'm happy with any reasonable definition. The one from the transpiler you linked looks fine. And of course one could recreate the practical definition from any programming language you like by testing all the various characters to see what matches.
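Recreating a practical definition by testing every character, as suggested above, can be sketched like this for the BMP. It assumes an engine that supports `\p{M}` with the `u` flag; the function names are made up for illustration. The resulting ranges depend on the Unicode version built into the engine, which may be one reason two tools give different results.

```javascript
// Collect contiguous ranges of BMP code points that match \p{M}.
function bmpMarkRanges() {
  const mark = /^\p{M}$/u;
  const ranges = [];
  let start = -1;
  // Run one step past U+FFFF so an open range is always closed.
  for (let cp = 0; cp <= 0x10000; cp++) {
    const isMark = cp <= 0xFFFF && mark.test(String.fromCharCode(cp));
    if (isMark && start === -1) {
      start = cp;
    } else if (!isMark && start !== -1) {
      ranges.push([start, cp - 1]);
      start = -1;
    }
  }
  return ranges;
}

// Turn the ranges into a character class usable without the "u" flag.
function rangesToCharClass(ranges) {
  const esc = (cp) => '\\u' + cp.toString(16).toUpperCase().padStart(4, '0');
  const parts = ranges.map(([lo, hi]) => (lo === hi ? esc(lo) : esc(lo) + '-' + esc(hi)));
  return '[' + parts.join('') + ']';
}

const markClass = new RegExp(rangesToCharClass(bmpMarkRanges()));
```

Printing `rangesToCharClass( bmpMarkRanges() )` gives a constant that could be pasted into code that has to run on older engines.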

If I understand correctly, https://gerrit.wikimedia.org/r/c/oojs/ui/+/530960 includes the combining marks after the search match to the highlight range, so that we don't add a highlight boundary before combining mark. A consequence of this approach is, search results that are completely different words will be highlighted. To illustrate this:

Search query(Language: ml): കാള
Search results with highlight:
കാള (meaning: bullock,ox)
കാളി (meaning: a goddess)

So, we will highlight fragments or complete words that have a different meaning. Basically, the combining marks can change the meaning of a word in many languages (at least in Hindi, Malayalam, Tamil, Kannada, Telugu, etc.).
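The effect described here can be reproduced with a small sketch. It is not the code from 530960; `highlightedPart` is a made-up name, and the sketch assumes prefix matching and an engine that supports `\p{M}` with the `u` flag.

```javascript
// Returns the part of `suggestion` that would be highlighted for `query`
// when combining marks directly after the match are pulled into the highlight.
function highlightedPart(suggestion, query) {
  if (!suggestion.startsWith(query)) {
    return '';
  }
  // Zero or more marks immediately after the typed prefix.
  const trailing = suggestion.slice(query.length).match(/^\p{M}*/u)[0];
  return query + trailing;
}

const query = '\u0D15\u0D3E\u0D33';            // കാള
const otherWord = '\u0D15\u0D3E\u0D33\u0D3F';  // കാളി
// The highlight grows to cover the whole of the second, different word.
console.log(highlightedPart(otherWord, query) === otherWord); // true
```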

Applying bold to a part of the word is not nice from typography perspective too.

I am more inclined to suggest avoiding the highlight. But if we really want a highlighting feature that neither alters the text nor breaks ligatures, consider approaches like https://codepen.io/santhoshtr/pen/oNvGjow where we create a highlight element with the required width and apply it as an overlay.

TJones added a comment.EditedSep 3 2019, 4:09 PM

Comment deleted. [Ugh.. there is some keyboard sequence that I keep fat-fingering and submitting my comment before it is complete.]

TJones added a comment.Sep 3 2019, 5:12 PM

If I understand correctly, https://gerrit.wikimedia.org/r/c/oojs/ui/+/530960 includes the combining marks after the search match to the highlight range, so that we don't add a highlight boundary before combining mark.

Yes, that is the intent of all three patches. One for search snippets (530602), one for the upper corner completion suggester (530929), and one for the main search box completion suggester on Special:Search (530960, though it needs a follow-on patch to activate it). The last two do essentially the same thing, but use completely different code to do it.

A consequence of this approach is, search results that are completely different words will be highlighted. To illustrate this:
Search query(Language: ml): കാള
Search results with highlight:
കാള (meaning: bullock,ox)
കാളി (meaning: a goddess)
So, we will highlight fragments or complete words that has different meaning. Basically, the combining marks can cause word meaning change in many languages(At least in Hindi, Malayalam, Tamil, Kannada, Telugu etc)

Assuming there's going to be highlighting, it seems that highlighting കാളി for കാള is better than highlighting കാളി. (Image included below in case the formatting comes out different on different OS/browser combinations.)

At the very least it's an upgrade from broken to merely wrong.

For those not familiar with the scripts, it is probably roughly similar in Latin script to highlighting the next vowel if your search query ends in a consonant. So a search for "rat" would highlight "Rate" in "Rate equation", "Rati" in "Rationalism", and "Rata" in "Ratatouille", but only "Rat" in "Ratchet".

Applying bold to a part of the word is not nice from typography perspective too.

Do you mean in general or specifically for Indic languages? It is kind of weird when you think about it, even in English. Searching for "rat" gives a highlight that splits rate, which is pronounced very differently from "rat".

But the partial highlighting does seem to be an established pattern in commercial search—though oddly we do it backwards on-wiki. Google, Bing, and Baidu all do it (for Latin text), though they highlight the rest of the suggestion, not the part you typed.

I am more inclined to suggest avoid the highlight. But if we really want a highlighting feature without altering the text and not breaking ligatures, consider approaches like https://codepen.io/santhoshtr/pen/oNvGjow where we create a highlight element with required width and apply as overlay.

I'm not at all against disabling the highlighting for specific languages, and I'd be happy to work on developing a list of languages to disable it for—but I have no idea how to implement it in OOJS and Mediawiki Core, or whether they can share a single config flag (again, unrelated code doing the same thing, so possibly unrelated config, too).

The example code at CodePen didn't work for me—it only underlined the first two characters:

It could be a font problem on my end, and I'm sure it's fixable to get a consistent width.

But that approach doesn't solve the more general problem, since in some scripts some elements can be added under or even before the main element (which is typed first). There's no easy way to highlight just क or ខ in the examples below:

Devanagari: क + ृ = कृ
Khmer: ខ + ្ម + ែ = ខ្មែ
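To see what these two clusters are made of, the sketch below lists each code point and whether it is a combining mark. `describeCluster` is a made-up helper, and it assumes an engine that supports `\p{M}` with the `u` flag.

```javascript
// List the code points of a cluster and flag the combining marks.
function describeCluster(cluster) {
  return Array.from(cluster, (ch) => ({
    codePoint: 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'),
    isMark: /^\p{M}$/u.test(ch)
  }));
}

const devanagari = describeCluster('\u0915\u0943');        // कृ
const khmer = describeCluster('\u1781\u17D2\u1798\u17C2'); // ខ្មែ

console.log(devanagari.map((c) => c.isMark)); // [ false, true ]
console.log(khmer.map((c) => c.isMark));      // [ false, true, false, true ]
```

In the Khmer example the third code point (U+1798, the subscript letter) is a full letter rather than a mark, so a rule that only pulls in combining marks would stop in front of it.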

(Also, implementing this for search snippets would be ugly, since everything is done server-side in PHP, not in the browser/client.)

I readily support turning off highlighting for the search suggestions for specific languages, because for most it is always a prefix of the suggestion—though some Wikisources can match sub pages. (And maybe we should invert the highlighting where we keep it to match the pattern used by many other search engines).

I think in the search snippets it's still useful to highlight the chunk of text that's at least partly responsible for the result being returned. It's much easier to skim the results with that highlighting. If so, then including combining elements in the highlight is better than not ( കാളി vs കാളി). Since you can, for example, search for Malayalam words on English Wiki (or mixed English/Malayalam strings: "text ക കാ"), and having script-specific highlighting everywhere would be very difficult to implement.

TL;DR:

  • It's probably best to turn off the bold highlighting in the completion suggester for certain languages
    • I'm happy to work on generating the list of languages where it should be turned off, but I can't implement it
    • It has to be implemented in two different places (OOJS/UI for Special:Search and Mediawiki Core for the upper corner search) and may require two separate configs
  • I'd argue that in search snippets, highlighting is useful, even if the highlight isn't the exact searched string, because it makes it easy to find your matching text in the snippet.
  • If highlighting exists anywhere, then we still have a problem with "foreign" queries in highlight-incompatible scripts, like the screenshot below of searching for Malayalam on English wiki. This would also be a problem for multi-lingual wikis like Commons, where some users probably want to keep highlighting.

If this sounds reasonable, then we need to find some Javascript experts to add config to disable highlighting by language or individual wiki, and then come up with the list of languages/projects where it should be disabled. (And we should definitely start a new ticket.)

Change 534517 had a related patch set uploaded (by VolkerE; owner: VolkerE):
[mediawiki/core@master] Update OOUI to v0.34.0

https://gerrit.wikimedia.org/r/534517

Volker_E moved this task from Backlog to OOUI-0.34.0 on the OOUI board.Sep 4 2019, 7:59 PM
Volker_E edited projects, added OOUI (OOUI-0.34.0); removed OOUI.

Change 534517 merged by jenkins-bot:
[mediawiki/core@master] Update OOUI to v0.34.0

https://gerrit.wikimedia.org/r/534517

Change 538970 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/core@master] Enable preservation of grapheme clusters in highlightQuery

https://gerrit.wikimedia.org/r/538970

Change 538970 merged by jenkins-bot:
[mediawiki/core@master] Enable preservation of grapheme clusters in highlightQuery

https://gerrit.wikimedia.org/r/538970

TJones closed this task as Resolved.Wed, Oct 9, 8:54 PM

Following combining marks are included with highlighted text to prevent broken-looking text when highlighting search suggestions and in search result snippets.

Adding an option to disable search suggestion highlighting for specific languages with complex grapheme clusters or ligatures should probably be opened as a new ticket.