VisualEditor: Support Unicode equivalence for client-side text searches
Open, Normal, Public

Description

Planned features, such as searching for an existing reference by content, will require us to implement some form of Unicode equivalence (http://en.wikipedia.org/wiki/Unicode_equivalence).

We will probably want to use NFKD ("Characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order.") to catch cases such as 'ﬀ' (the U+FB00 ligature) === 'ff', and we will probably want to strip combining characters (i.e. all accents), so that 'Amelie' === 'Amélie'.
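A minimal sketch of the folding described above, assuming a runtime that provides the built-in String.prototype.normalize and Unicode property escapes in regular expressions (rather than the unorm library mentioned in this task). The name toSearchKey is hypothetical and not part of VisualEditor or UnicodeJS.

```javascript
// Hypothetical helper: fold a string into a key used only for comparison.
function toSearchKey(text) {
  return text
    .normalize('NFKD')        // compatibility decomposition, e.g. U+FB00 -> "ff"
    .replace(/\p{M}/gu, '');  // strip combining marks (accents)
}

// The ff ligature (U+FB00) folds to two plain letters.
console.log(toSearchKey('\uFB00') === 'ff');
// Accented and unaccented spellings compare equal after folding.
console.log(toSearchKey('Am\u00E9lie') === toSearchKey('Amelie'));
```

The key is only meant for matching; the original string would still be what gets stored and displayed.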

https://github.com/walling/unorm looks like a good library for the job. We may want to fork it into UnicodeJS.


Version: unspecified
Severity: enhancement

bzimport set Reference to bz50167.

We probably shouldn't strip down beyond NFKD. For some languages, 'ä' should be equivalent to 'a'; for others, it shouldn't be equivalent to anything; for still others, it should be equivalent to 'ae'.

Will it be feasible to implement language-specific search on top of this?

I don't see why not. We may want to add rules such as 'ß' => 'ss' in German, or treating final and non-final sigma as equivalent in Greek (https://en.wikipedia.org/wiki/Sigma#Character_Encodings).
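One way such language-specific rules could be layered on top of the generic folding is sketched below. Everything here is an assumption for illustration: the PROFILES table, its language codes, the stripMarks flag and the function name toLocalizedSearchKey are hypothetical, and the rule lists are far from complete.

```javascript
// Hypothetical per-language profiles; rules run before the generic NFKD step.
const PROFILES = {
  // German: expand sharp s and umlauts to their two-letter spellings.
  de: {
    rules: [[/\u00DF/g, 'ss'], [/\u00E4/g, 'ae'], [/\u00F6/g, 'oe'], [/\u00FC/g, 'ue']],
    stripMarks: true,
  },
  // Greek: treat final sigma (U+03C2) like non-final sigma (U+03C3).
  el: { rules: [[/\u03C2/g, '\u03C3']], stripMarks: true },
  // Swedish (as an example of a language where accented letters are distinct):
  // keep combining marks so that U+00E4 does not match "a".
  sv: { rules: [], stripMarks: false },
};
const DEFAULT_PROFILE = { rules: [], stripMarks: true };

function toLocalizedSearchKey(text, lang) {
  const profile = PROFILES[lang] || DEFAULT_PROFILE;
  // Compose first so the rules can match precomposed letters.
  let key = text.normalize('NFC').toLowerCase();
  for (const [pattern, replacement] of profile.rules) {
    key = key.replace(pattern, replacement);
  }
  key = key.normalize('NFKD');
  return profile.stripMarks ? key.replace(/\p{M}/gu, '') : key;
}

console.log(toLocalizedSearchKey('Stra\u00DFe', 'de'));   // German rule applied
console.log(toLocalizedSearchKey('M\u00E4dchen', 'de'));  // umlaut expanded
```

Lower-casing is included here only to keep the example short; whether search should be case-insensitive is a separate decision.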

It's worth noting that in most software, many common grapheme clusters are displayed more correctly when encoded as a single Unicode character than when encoded with combining characters. For example, 'sgrîn' ("sgr\u00EEn") displays correctly in my version of Firefox on Linux, but the equivalent decomposed string 'sgrîn' ("sgri\u0302n") shows up with the dot still on the i and the accent in the wrong place (either uncentered over the i, or over the n, depending on the font).

Therefore, while we may want to search and process text using decomposed forms, we should probably use the composed forms for display.
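As a small illustration of that split, the sketch below recomposes a decomposed string to NFC before display. It assumes the built-in String.prototype.normalize is available; the helper name forDisplay is hypothetical.

```javascript
// Hypothetical helper: recompose text before handing it to the UI.
function forDisplay(text) {
  return text.normalize('NFC');
}

const decomposed = 'sgri\u0302n';          // "i" followed by a combining circumflex
const composed = forDisplay(decomposed);   // single precomposed U+00EE

console.log(composed === 'sgr\u00EEn');
console.log(decomposed.length, composed.length);
```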

Agreed. You're likely to do that naturally when displaying search results, but it would be a consideration if you try to highlight the matching substring in the result (a non-trivial problem when normalisation is involved).
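To make the highlighting problem concrete, here is one possible approach: fold the text one code point at a time and record, for each unit of the folded key, where it came from in the original string. This is a sketch under assumptions, not a proposal for the actual implementation; foldWithOffsets and findOriginalRange are hypothetical names, and the built-in String.prototype.normalize is assumed.

```javascript
// Fold text for searching and remember where each key unit came from.
function foldWithOffsets(text) {
  let key = '';
  const offsets = []; // offsets[i] = index in `text` of the source of key[i]
  let index = 0;
  for (const ch of text) { // iterates by code point
    const folded = ch.normalize('NFKD').replace(/\p{M}/gu, '');
    for (let i = 0; i < folded.length; i++) {
      offsets.push(index);
    }
    key += folded;
    index += ch.length;
  }
  offsets.push(index); // sentinel: end of the original text
  return { key, offsets };
}

// Return the range of `text` (in original indices) matching `query`, or null.
function findOriginalRange(text, query) {
  const needle = foldWithOffsets(query).key;
  if (needle.length === 0) {
    return null;
  }
  const { key, offsets } = foldWithOffsets(text);
  const at = key.indexOf(needle);
  if (at === -1) {
    return null;
  }
  // If the match ends inside an expanded character (e.g. half a ligature),
  // widen it so the range always covers whole original characters.
  let j = at + needle.length;
  while (j < offsets.length - 1 && offsets[j] === offsets[j - 1]) {
    j++;
  }
  return { start: offsets[at], end: offsets[j] };
}

const text = 'sgri\u0302n';
const range = findOriginalRange(text, 'gri');
console.log(range, JSON.stringify(text.slice(range.start, range.end)));
```

Because the end of the range is taken from the next key unit's source position, combining marks that were stripped from the key stay inside the highlighted span instead of being orphaned.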

dchan added a comment. Jul 11 2013, 2:39 PM

So, if I'm understanding correctly, when the user starts a search we want to generate a normalised copy of the entire document in NFKD. (Otherwise we've got a problem keeping two copies in sync). Is this acceptable efficiency-wise?

Could we force the characters in the document model to be in NFC? Could Parsoid provide the article text in NFC? (This is partially off-topic, but we probably want to consider different normalisation issues together).
