
Normalize accented/ligature characters for search terms and indices ("ignore" accents)
Closed, Resolved, Public

Description

Author: giorgian

Description:
Most European languages use letters which are not present in the standard ASCII set.

it would be, IMHO, very useful to allow a more flexible search, that is:

  • ae, oe, ue for ä, ö, ü; for instance, Goedel for Gödel;
  • ss for ß; for instance, Grossmann for Großmann;
  • a, e, i etc. for à, á, â, è, é, ê, ì, í, î etc.; for instance, geologie (and gèologie) for géologie;

  • n for ñ; for instance, Bunuel for Buñuel;
  • o (don't know if it's the best letter) for ø; for instance, Kobenhavn for København;

  • aa for å; for instance, Aarhus for Århus;
  • C (or a better choice) for Č; for instance, Cesky for Česky;
  • maybe others that I don't know.
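
The substitutions listed above could be sketched as a simple folding table applied to both the search term and the indexed title before comparison. This is only an illustration of the request, not how MediaWiki search works: the names FOLD_MAP, fold and matches are hypothetical, the table covers only the letters named in this report, and the choices for ø and Č are the tentative ones given above.

```python
import unicodedata

# Hypothetical folding table built only from the substitutions listed in this report.
# The choices for "ø" and "č" are the tentative ones proposed above.
FOLD_MAP = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "ß": "ss",
    "à": "a", "á": "a", "â": "a",
    "è": "e", "é": "e", "ê": "e",
    "ì": "i", "í": "i", "î": "i",
    "ñ": "n",
    "ø": "o",
    "å": "aa",
    "č": "c",
}


def fold(text: str) -> str:
    """Lowercase the text and replace each mapped letter with its ASCII spelling."""
    # NFC makes sure accented letters arrive as single precomposed characters,
    # which is what the keys of FOLD_MAP assume.
    composed = unicodedata.normalize("NFC", text).lower()
    return "".join(FOLD_MAP.get(ch, ch) for ch in composed)


def matches(query: str, title: str) -> bool:
    """Treat a query and a title as equal when their folded forms are equal."""
    return fold(query) == fold(title)
```

With this sketch, matches("Goedel", "Gödel") and matches("gèologie", "géologie") are both true. One consequence of the ae/oe/ue rule is that a plain "Godel" does not match "Gödel"; supporting both spellings would need more than one folded form per word.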

this would be useful for a number of reasons:

  1. most keyboard layouts lack some letters;
  2. there is a long-standing tradition, among internet users, of avoiding non-ASCII characters for compatibility and, therefore, a habit of using "simplified" versions;
  3. some of the previous substitutions are officially accepted in printing conventions: it's the case of the German ae, oe, ue, ss;
  4. often somebody doesn't know the exact spelling of a word in a foreign language;
  5. Google does it already! :P

Version: unspecified
Severity: enhancement

Details

Reference
bz7002

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 21 2014, 9:21 PM
bzimport added a project: MediaWiki-Search.
bzimport set Reference to bz7002.
bzimport added a subscriber: Unknown Object (MLST).

justforspam wrote:

ñ is sometimes transcribed as "nh"

giorgian wrote:

(In reply to comment #1)

ñ is sometimes transcribed as "nh"

you're right, i didn't remember.

ose-7c9c wrote:

right!

2 examples:

  1. a search for 'emmaus' http://fr.wikipedia.org/wiki/Special:Search?search=emmaus&go=Consulter matches 'emmaüs' at only 13%
  2. a search for 'circe' http://fr.wikipedia.org/wiki/Special:Search?search=circe&go=Consulter matches 'circé' at only 1.4% but circ at 100%

sounds very impractical, at least for French users, who have a strong habit of using unaccented search strings

cecile.robin wrote:

A simpler suggestion would be to ignore accents; this would be very useful for
languages such as French and Greek, and I suppose for many others. This is the
way the Google search engine works. OK, it does not help for special letters such
as the German β, for example (sorry, it's written on a Greek keyboard...); you need
to configure your keyboard to enter such letters. But it would work better as a
standard and would be very helpful when you're not sure which accent goes on
which letter. That's my opinion anyway...
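
A minimal sketch of "ignoring accents", assuming Unicode canonical decomposition is an acceptable way to do it: decompose each character (NFD) and drop the combining marks. The function name strip_accents is hypothetical and this is not taken from any MediaWiki code.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove accents by decomposing characters and dropping combining marks."""
    # NFD splits a letter such as "é" into the base letter "e" plus a combining accent.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

This turns 'emmaüs' into 'emmaus' and 'circé' into 'circe', and it also works for accented Greek letters. As noted above, it does not help with special letters: ß and ø have no canonical decomposition, so they pass through unchanged and would still need an explicit substitution.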

ayg wrote:

*** Bug 9606 has been marked as a duplicate of this bug. ***

wikibugs wrote:

*** This bug has been marked as a duplicate of bug 920 ***