Use Unicode Character Folding for accents, punctuation chars in search index
Closed, Resolved · Public

Assigned To

Authored By

	• bzimport
	Dec 30 2005, 6:20 PM

Description

Author: wikimedia-bugzilla

Description:
It would be desirable for Search and especially for the "Go" functionality
(resolving a page title to an actual page without an intermediate search) to use
all sensible Unicode Foldings on the searched titles.

Unicode Character Foldings define[1] string transformations for making two
strings search-equivalent (different from Unicode normalizations, which make
strings content-equivalent). The folded title should not be stored instead of
the original title but in addition to it, and when searching, the comparisons
should be made between a folded search string and the folded title.

We already do certain forms of folding, such as case insensitivity, but we could
benefit from the full set of foldings, such as eliminating the difference
between minus and dashes and more.

[1] http://www.unicode.org/unicode/reports/tr30/
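The scheme described above (keep the original title, store a folded copy next to it, and compare folded query against folded title) can be sketched roughly as follows. This is a minimal illustration in Python, not MediaWiki's implementation and not the full TR30 folding set: `fold`, `build_index`, `go`, and the `DASHES` table are hypothetical names, and the dash list is a small hand-picked subset.

```python
import unicodedata

# Hypothetical, deliberately incomplete table: map a few dash-like
# characters and the minus sign to ASCII hyphen-minus.
DASHES = dict.fromkeys(map(ord, "\u2010\u2011\u2012\u2013\u2014\u2212"), "-")


def fold(text: str) -> str:
    """Return a search-equivalent key for text (assumed folding rules)."""
    # Decompose so that accents become separate combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (accent folding).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Fold dashes, then fold case.
    return stripped.translate(DASHES).casefold()


def build_index(titles):
    """Map each folded key to the original titles that produce it."""
    index = {}
    for title in titles:
        index.setdefault(fold(title), []).append(title)
    return index


def go(index, query):
    """Resolve a query to original titles by comparing folded forms."""
    return index.get(fold(query), [])


titles = ["\u00c5ngstr\u00f6m", "Michelson\u2013Morley experiment"]
index = build_index(titles)
print(go(index, "angstrom"))
print(go(index, "michelson-morley experiment"))
```

Because the index stores the original titles as values, the folded form is only ever used as a lookup key. Letters that carry a stroke rather than a combining accent (for example ø or Ⱥ) have no Unicode decomposition, so this sketch leaves them unchanged; handling them would need an explicit mapping table.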

Version: unspecified
Severity: major

Details

Reference: bz4430

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		TJones	T26414 Special character "å" in the search menu
		Resolved		TJones	T6430 Use Unicode Character Folding for accents, punctuation chars in search index

Event Timeline

• bzimport raised the priority of this task to Medium. Nov 21 2014, 9:00 PM

• bzimport added projects: MediaWiki-Search, I18n.

• bzimport set Reference to bz4430.

• bzimport added a subscriber: Unknown Object (MLST).

• bzimport created this task. Dec 30 2005, 6:20 PM

Bug 14180 has been marked as a duplicate of this bug.

De-assigning, as no activity in 3 years. Still a good idea though! :)

K-form (compatibility) normalization would be easy to apply, since the UtfNormal class already implements it; other foldings may require more coding.
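To illustrate why compatibility normalization alone covers only part of the request, here is a small sketch. It uses Python's standard `unicodedata` module as a stand-in for the UtfNormal class mentioned above (an assumption made for illustration; the PHP class is not shown here), and `nfkc` is a hypothetical helper name.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply compatibility composition (NFKC) to text."""
    return unicodedata.normalize("NFKC", text)


samples = [
    "\ufb01le",                  # "fi" ligature followed by "le"
    "\uff37\uff49\uff4b\uff49",  # full-width "Wiki"
    "cafe\u0301",                # "cafe" plus a combining acute accent
    "a\u2013b",                  # "a", en dash, "b"
]
for sample in samples:
    print(repr(sample), "->", repr(nfkc(sample)))
```

In this sketch the ligature and the full-width letters collapse to plain ASCII, while the accent is merely recomposed into "é" and the en dash is left alone. Accent and dash folding would therefore need logic beyond K-form normalization.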

Bug 4379 has been marked as a duplicate of this bug.

Bug 20529 has been marked as a duplicate of this bug.

This is the same bug as in https://translatewiki.net/wiki/Thread:Support/Search_index_should_ignore_punctuation, isn't it?

I stumbled across this one today, and I'm going to close it because this functionality is generally available and specifically activated on most wikis. For example, searching for Ⱥȴƃǝṟʈ Ɛĭñṥẗȅɨǹ in the Go box brings up the right suggestion (Albert Einstein) from the autocomplete, and hitting return (without selecting the suggestion) rolls over to the right article. Tested on English, French, Russian, Chinese, Arabic, Igbo, Bosnian, Hawaiian, German, and Bambara Wikipedias.

Restricted Application added projects: Discovery-ARCHIVED, Discovery-Search. Oct 10 2017, 7:36 PM
