Use Unicode character folding for accents and punctuation characters in the search index


Author: wikimedia-bugzilla

It would be desirable for Search and especially for the "Go" functionality
(resolving a page title to an actual page without an intermediate search) to use
all sensible Unicode Foldings on the searched titles.

Unicode Character Foldings define[1] string transformations for making two
strings search-equivalent (different from Unicode normalizations, which make
strings content-equivalent). The folded title should be stored in addition to
the original title, not in place of it, and searches should compare the folded
search string against the folded title.
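
A minimal sketch of the "store the folded title alongside the original" idea, in Python rather than MediaWiki's own code. The `fold` function and `TitleIndex` class are hypothetical names for illustration, and the folding shown (case folding plus removal of combining marks) is only a small subset of the Unicode foldings, not the full set.

```python
import unicodedata
from typing import Dict, List


def fold(text: str) -> str:
    """Hypothetical folding: case folding plus removal of combining marks."""
    # Decompose so that accents become separate combining characters.
    decomposed = unicodedata.normalize("NFD", text.casefold())
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


class TitleIndex:
    """Toy index mapping a folded key to the original titles that share it."""

    def __init__(self) -> None:
        self._by_folded: Dict[str, List[str]] = {}

    def add(self, title: str) -> None:
        # The original title is kept; the folded form is only the lookup key.
        self._by_folded.setdefault(fold(title), []).append(title)

    def lookup(self, query: str) -> List[str]:
        # Fold the query the same way, then compare folded against folded.
        return list(self._by_folded.get(fold(query), []))


index = TitleIndex()
index.add("Café")
index.add("Zürich")
print(index.lookup("cafe"))    # ['Café']
print(index.lookup("ZURICH"))  # ['Zürich']
```

Because several titles can fold to the same key, the lookup returns a list; a real "Go" implementation would still need a rule for choosing among them, for example preferring an exact match.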

We already apply some forms of folding, such as case folding, but we could
benefit from the full set, for example treating the minus sign and the various
dashes as equivalent.
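
As a rough illustration of the dash case, the sketch below maps a handful of dash-like characters to the ASCII hyphen-minus. The `DASH_LIKE` list and the `fold_dashes` name are assumptions made for this example; the list is deliberately incomplete and is not taken from the Unicode folding data.

```python
# Hypothetical, deliberately incomplete set of dash-like characters:
# hyphen, non-breaking hyphen, figure dash, en dash, em dash,
# horizontal bar, and minus sign.
DASH_LIKE = "\u2010\u2011\u2012\u2013\u2014\u2015\u2212"

# Translation table mapping each of them to ASCII hyphen-minus.
DASH_FOLD = str.maketrans({ch: "-" for ch in DASH_LIKE})


def fold_dashes(text: str) -> str:
    """Replace the dash-like characters listed above with '-'."""
    return text.translate(DASH_FOLD)


# A title typed with an en dash now matches the hyphenated form.
print(fold_dashes("Michelson\u2013Morley experiment"))  # Michelson-Morley experiment
```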


Version: unspecified
Severity: major

bzimport added projects: MediaWiki-Search, I18n. Via Conduit, Nov 21 2014, 9:00 PM
bzimport added a subscriber: Unknown Object (MLST).
bzimport set Reference to bz4430.
bzimport created this task. Via Legacy, Dec 30 2005, 6:20 PM
brion added a comment. Via Conduit, May 19 2008, 5:27 PM
  • Bug 14180 has been marked as a duplicate of this bug.
brion added a comment. Via Conduit, Dec 28 2008, 9:14 PM

De-assigning, as no activity in 3 years. Still a good idea though! :)

K-form (compatibility) normalization would be easy to apply, since the UtfNormal class already implements it; other foldings may require more coding.
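
To make the distinction concrete, here is a small Python sketch using the standard `unicodedata` module as a stand-in for UtfNormal (this is not MediaWiki code, and `k_normalize` is a hypothetical helper name). It shows that NFKC collapses compatibility characters such as ligatures and fullwidth letters, but leaves accents and dashes untouched, which is why the other foldings would need separate handling.

```python
import unicodedata


def k_normalize(text: str) -> str:
    """Apply compatibility (K-form) composition, i.e. NFKC."""
    return unicodedata.normalize("NFKC", text)


# Compatibility characters are folded to their plain equivalents.
print(k_normalize("\ufb01le"))                  # "fi" ligature -> file
print(k_normalize("\uff37\uff49\uff4b\uff49"))  # fullwidth letters -> Wiki

# Accents and dashes are not affected by K-form normalization.
print(k_normalize("caf\u00e9") == "caf\u00e9")  # True
print(k_normalize("\u2013") == "\u2013")        # True (en dash unchanged)
```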

demon added a comment. Via Conduit, Sep 3 2009, 1:53 AM
  • Bug 4379 has been marked as a duplicate of this bug.
demon added a comment. Via Conduit, Sep 7 2009, 3:42 PM
  • Bug 20529 has been marked as a duplicate of this bug.
