VisualEditor: Support unicode equivalence for client side text searches
Open, Medium, Public, Feature

Assigned To

None

Authored By

	Esanders
	Jun 25 2013, 11:02 AM

Description

Planned features, such as searching for an existing reference by content, will require us to implement some form of Unicode equivalence (http://en.wikipedia.org/wiki/Unicode_equivalence).

We will probably want to use NFKD ("Characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order.") to catch cases such as 'ﬀ' === 'ff', and we will probably want to strip combining characters (i.e. all accents), so that 'Amelie' === 'Amélie'.
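As a rough illustration of the folding described above, here is a minimal sketch that uses the built-in String.prototype.normalize rather than unorm. The helper name foldForSearch is hypothetical, and the sketch assumes a JavaScript engine that supports Unicode property escapes in regular expressions.

```javascript
// Hypothetical helper (not part of VisualEditor or UnicodeJS):
// fold a string for loose, accent-insensitive comparison.
function foldForSearch(text) {
    return text
        // Compatibility decomposition: U+FB00 becomes 'ff',
        // U+00E9 becomes 'e' followed by U+0301.
        .normalize('NFKD')
        // Drop every combining mark (general category M).
        .replace(/\p{M}/gu, '');
}

// Both comparisons evaluate to true.
console.log(foldForSearch('\uFB00') === 'ff');
console.log(foldForSearch('Am\u00E9lie') === foldForSearch('Amelie'));
```

Stripping all marks is the aggressive option; whether that is appropriate for every language is a separate question.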

https://github.com/walling/unorm looks like a good library for the job. We may want to fork it into UnicodeJS.

Version: unspecified
Severity: enhancement

Details

Reference: bz50167

Related Objects

		Status	Subtype	Assigned	Task
		Invalid		Jdforrester-WMF	T35077 VisualEditor multilingual input / i18n issues (tracking)
		Open	Feature	None	T52167 VisualEditor: Support unicode equivalence for client side text searches

Event Timeline

• bzimport raised the priority of this task to Medium. Nov 22 2014, 2:03 AM

• bzimport added a project: VisualEditor-DataModel.

• bzimport set Reference to bz50167.

Esanders created this task. Jun 25 2013, 11:02 AM

We probably shouldn't strip down beyond NFKD. For some languages, 'ä' should be equivalent to 'a'; for others, it shouldn't be equivalent to anything; for still others, it should be equivalent to 'ae'.

Will it be feasible to implement language-specific search on top of this?

I don't see why not. We may want to add things like 'ß' => 'ss' in German, or final vs. non-final sigma in Greek (https://en.wikipedia.org/wiki/Sigma#Character_Encodings).
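One way language-specific rules could sit on top of the generic fold is a small per-language replacement table applied before NFKD. The sketch below is only that: the names languageRules and foldForLanguage are hypothetical, the table is illustrative rather than complete, and mapping 'ä' to 'ae' under the "de" key is an assumption made for the example.

```javascript
// Hypothetical per-language rules, applied before the generic fold.
// Illustrative only: not a complete or authoritative rule set.
const languageRules = {
    de: [
        [/\u00DF/g, 'ss'],  // ß => ss
        [/\u00E4/g, 'ae']   // ä => ae (assumed for this example)
    ],
    el: [
        [/\u03C2/g, '\u03C3']  // final sigma => non-final sigma
    ]
};

function foldForLanguage(text, lang) {
    // Compose first so that 'a' + U+0308 is seen as U+00E4 by the rules.
    let result = text.normalize('NFC');
    for (const [pattern, replacement] of languageRules[lang] || []) {
        result = result.replace(pattern, replacement);
    }
    // Generic fold: compatibility decomposition, then drop combining marks.
    return result.normalize('NFKD').replace(/\p{M}/gu, '');
}

console.log(foldForLanguage('Stra\u00DFe', 'de'));  // Strasse
console.log(foldForLanguage('M\u00E4dchen', 'de')); // Maedchen
```

Languages in which 'ä' should not be equivalent to anything would need the opposite treatment (protecting the character from the mark-stripping step), which this sketch does not attempt.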

It's worth noting that in most software, many common grapheme clusters are displayed more correctly when encoded as a single unicode character than when encoded with combining characters. For example, 'sgrîn' ("sgr\u00EEn") displays correctly in my version of Firefox on Linux, but the equivalent decomposed string 'sgrîn' ("sgri\u0302n") shows up with the dot still on the i and the accent in the wrong place (either uncentered over the i, or over the n, depending on the font).

Therefore, while we may want to search and process text using decomposed forms, we should probably use the composed forms for display.
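For the display side, recomposing with NFC is a one-liner with the built-in normalize method. A minimal sketch, using the example string from above (toDisplayForm is a hypothetical name):

```javascript
// Hypothetical helper: recompose text before handing it to the renderer.
function toDisplayForm(text) {
    return text.normalize('NFC');
}

const decomposed = 'sgri\u0302n';            // 'i' followed by U+0302
const composed = toDisplayForm(decomposed);  // U+0302 merged into U+00EE

console.log(composed === 'sgr\u00EEn');           // true
console.log(decomposed.length, composed.length);  // 6 5
```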

Agreed. You're likely to do that naturally when displaying search results, but it becomes a consideration if you try to highlight the matching substring in the result (a non-trivial problem when normalisation is involved).
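To make the highlighting problem concrete, here is one possible approach, sketched under the assumption that folding is done one code point at a time so that each unit of the folded string can be traced back to an offset in the original. The names foldWithOffsets and findRange are hypothetical, and the folded copy is built on demand rather than kept in sync with the document.

```javascript
// Hypothetical helper: fold text code point by code point and record, for
// each UTF-16 unit of the folded string, the original offset it came from.
function foldWithOffsets(text) {
    let folded = '';
    const offsets = [];
    let offset = 0;
    for (const ch of text) {  // for...of iterates by code point
        const piece = ch.normalize('NFKD').replace(/\p{M}/gu, '');
        folded += piece;
        for (let i = 0; i < piece.length; i++) {
            offsets.push(offset);
        }
        offset += ch.length;
    }
    offsets.push(offset);  // sentinel marking the end of the text
    return { folded, offsets };
}

// Find the first loose match of query in text and return the range of the
// original text to highlight, or null if there is no match.
function findRange(text, query) {
    const { folded, offsets } = foldWithOffsets(text);
    const needle = foldWithOffsets(query).folded;
    if (needle.length === 0) {
        return null;
    }
    const at = folded.indexOf(needle);
    if (at === -1) {
        return null;
    }
    // If the match ends inside an expansion (e.g. half of a ligature),
    // extend it to the end of that original character.
    let endIndex = at + needle.length;
    while (offsets[endIndex] === offsets[endIndex - 1]) {
        endIndex++;
    }
    return { start: offsets[at], end: offsets[endIndex] };
}

const text = 'Am\u00E9lie Poulain';
const range = findRange(text, 'melie');
console.log(text.slice(range.start, range.end));  // mélie
```

Because the end of the range is taken from the start of the next folded unit, trailing combining marks in decomposed input stay inside the highlight instead of being split from their base character.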

So, if I'm understanding correctly, when the user starts a search we want to generate a normalised copy of the entire document in NFKD. (Otherwise we've got a problem keeping two copies in sync). Is this acceptable efficiency-wise?

Could we force the characters in the document model to be in NFC? Could Parsoid provide the article text in NFC? (This is partially off-topic, but we probably want to consider different normalisation issues together).

Jdforrester-WMF added a project: VisualEditor. Nov 24 2014, 3:56 PM

Jdforrester-WMF moved this task from To Triage to Freezer on the VisualEditor board. Nov 24 2014, 4:29 PM

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 12:24 PM

Aklapper removed a subscriber: • TrevorParscal.
