VisualEditor: Support unicode equivalence for client side text searches
Open, Public

Description

Planned features, such as searching for an existing reference by content, will require us to implement some form of Unicode equivalence (http://en.wikipedia.org/wiki/Unicode_equivalence).

We will probably want to use NFKD ("Characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order.") to catch cases such as 'ﬀ' (the U+FB00 ligature) === 'ff', and we will probably want to strip combining characters (i.e. all accents), so that 'Amelie' === 'Amélie'.
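A minimal sketch of the folding described above. It assumes an environment with the built-in `String.prototype.normalize` and Unicode property escapes in regular expressions, rather than the unorm library. The helper name `foldForSearch` is hypothetical.

```javascript
// Hypothetical helper: fold a string into a loose search key.
// Step 1: NFKD decomposes compatibility characters (e.g. the U+FB00 ligature)
//         and splits accented letters into base letter + combining mark.
// Step 2: remove every combining mark (Unicode general category M).
function foldForSearch(text) {
  return text
    .normalize('NFKD')
    .replace(/\p{M}/gu, '');
}

console.log(foldForSearch('\uFB00') === 'ff');                          // true
console.log(foldForSearch('Am\u00E9lie') === foldForSearch('Amelie')); // true
```

Stripping all marks is the aggressive option. Dropping the `replace` step leaves plain NFKD matching.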

https://github.com/walling/unorm looks like a good library for the job. We may want to fork it into UnicodeJS.


Version: unspecified
Severity: enhancement

bzimport added a project: VisualEditor-DataModel. Via Conduit, Nov 22 2014, 2:03 AM
bzimport set Reference to bz50167.
Esanders created this task. Via Legacy, Jun 25 2013, 11:02 AM
dchan added a comment. Via Conduit, Jun 25 2013, 12:30 PM

We probably shouldn't strip down beyond NFKD. For some languages, 'ä' should be equivalent to 'a'; for others, it shouldn't be equivalent to anything; for still others, it should be equivalent to 'ae'.

Will it be feasible to implement language-specific search on top of this?

Esanders added a comment. Via Conduit, Jun 25 2013, 2:04 PM

I don't see why not. We may want to add things like 'ß' => 'ss' in German, or treating final and non-final sigma as equivalent in Greek (https://en.wikipedia.org/wiki/Sigma#Character_Encodings).
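One way language-specific rules could sit on top of NFKD, as a rough sketch. The `LANGUAGE_RULES` table, its contents, and `foldForLanguage` are all hypothetical illustrations, not an agreed design. Rules are applied to the NFC form so that 'ä' is matched whether the input arrives composed or decomposed.

```javascript
// Hypothetical per-language substitutions, applied before NFKD.
// Escapes are used so the table does not depend on the file's own normalisation.
const LANGUAGE_RULES = {
  de: [['\u00DF', 'ss'], ['\u00E4', 'ae']], // ß => ss, ä => ae
  el: [['\u03C2', '\u03C3']],               // final sigma => non-final sigma
};

function foldForLanguage(text, lang) {
  // Compose first so each rule only needs to list the precomposed character.
  let folded = text.normalize('NFC');
  for (const [from, to] of LANGUAGE_RULES[lang] || []) {
    folded = folded.split(from).join(to);
  }
  // Languages without rules fall through to plain NFKD.
  return folded.normalize('NFKD');
}

console.log(foldForLanguage('Stra\u00DFe', 'de')); // Strasse
```

Keeping the rules in data rather than code means a language with no entry gets plain NFKD and nothing more.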

dchan added a comment. Via Conduit, Jun 28 2013, 11:47 AM

It's worth noting that in most software, many common grapheme clusters are displayed more correctly when encoded as a single Unicode character than when encoded with combining characters. For example, 'sgrîn' ("sgr\u00EEn") displays correctly in my version of Firefox on Linux, but the equivalent decomposed string 'sgrîn' ("sgri\u0302n") shows up with the dot still on the i and the accent in the wrong place (either uncentered over the i, or over the n, depending on the font).

Therefore, while we may want to search and process text using decomposed forms, we should probably use the composed forms for display.
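The split between a decomposed search form and a composed display form can be illustrated as follows. This is a sketch using the built-in `normalize`. The function names `searchKey` and `displayForm` are hypothetical.

```javascript
// Hypothetical helpers: decomposed form for matching, composed form for display.
function searchKey(text) {
  return text.normalize('NFKD');
}

function displayForm(text) {
  return text.normalize('NFC');
}

console.log(searchKey('sgr\u00EEn') === 'sgri\u0302n');   // true
console.log(displayForm('sgri\u0302n') === 'sgr\u00EEn'); // true
```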

Esanders added a comment. Via Conduit, Jun 28 2013, 12:54 PM

Agreed. You're likely to do that naturally when displaying search results, but it would be a consideration if you try to highlight the matching substring in the result (a non-trivial problem when normalisation is involved).
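To show why highlighting is non-trivial, here is one possible approach, sketched under simplifying assumptions: normalise the text one code point at a time and record, for each normalised code unit, which span of the original it came from. The names `normalizeWithMap` and `findOriginalRange` are hypothetical. Per-code-point normalisation skips the canonical reordering of combining marks that whole-string NFKD performs, so this can miss matches in text with several stacked marks.

```javascript
// Hypothetical: NFKD-normalise per code point, tracking original offsets.
function normalizeWithMap(text) {
  let normalized = '';
  const starts = []; // starts[i]: original offset where normalized[i]'s source begins
  const ends = [];   // ends[i]: original offset just past normalized[i]'s source
  let offset = 0;
  for (const ch of text) { // iterates by code point, not UTF-16 unit
    const folded = ch.normalize('NFKD');
    for (let i = 0; i < folded.length; i++) {
      starts.push(offset);
      ends.push(offset + ch.length);
    }
    normalized += folded;
    offset += ch.length;
  }
  return { normalized, starts, ends };
}

// Returns the { start, end } range in the ORIGINAL text, or null if no match.
function findOriginalRange(text, query) {
  const { normalized, starts, ends } = normalizeWithMap(text);
  const needle = query.normalize('NFKD');
  if (needle.length === 0) return null;
  const at = normalized.indexOf(needle);
  if (at === -1) return null;
  // Widen to whole original characters so a ligature is never cut in half.
  return { start: starts[at], end: ends[at + needle.length - 1] };
}

const text = 'The sgr\u00EEn e\uFB00ect';
const range = findOriginalRange(text, 'sgri\u0302n');
console.log(text.slice(range.start, range.end) === 'sgr\u00EEn'); // true
```

A match that ends partway through a decomposed character is widened to the whole original character, so a query of 'ef' would highlight 'e' plus the entire ligature.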

dchan added a comment. Via Conduit, Jul 11 2013, 2:39 PM

So, if I'm understanding correctly, when the user starts a search we want to generate a normalised copy of the entire document in NFKD. (Otherwise we have the problem of keeping two copies in sync.) Is this acceptable efficiency-wise?

Could we force the characters in the document model to be in NFC? Could Parsoid provide the article text in NFC? (This is partially off-topic, but we probably want to consider different normalisation issues together).
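One possible compromise for the efficiency question is to build the normalised copy lazily and reuse it until the document changes. The sketch below assumes the caller can supply some monotonically changing version number for the document. That parameter and the class name `NormalizedTextCache` are hypothetical, not existing VisualEditor API.

```javascript
// Hypothetical cache: recompute the NFKD copy only when the version changes.
class NormalizedTextCache {
  constructor() {
    this.version = null;
    this.normalized = '';
  }

  // `version` is assumed to change whenever the document text changes.
  get(text, version) {
    if (version !== this.version) {
      this.normalized = text.normalize('NFKD');
      this.version = version;
    }
    return this.normalized;
  }
}

const cache = new NormalizedTextCache();
console.log(cache.get('sgr\u00EEn', 1) === 'sgri\u0302n'); // true
```

This avoids keeping two copies in sync on every edit, at the cost of one full normalisation pass on the first search after a change.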

Jdforrester-WMF moved this task to Backlog on the VisualEditor workboard. Via Web, Nov 24 2014, 4:29 PM
