Search should not be case and accent sensitive
Closed, Resolved, Public

Description

The usual way to find an article in a ZIM file is to call findx() or findxByTitle(). These let the developer find all articles whose URL or title starts with a given pattern (string).

Unfortunately, these searches are case- and accent-sensitive. The reason is that both methods run a binary search over alphabetically sorted lists, built from the article URLs and titles respectively.
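To illustrate the problem, here is a minimal Python sketch of a prefix lookup by binary search over a list sorted by raw code points. It is not the zimlib code; `find_prefix` and the sample titles are hypothetical names chosen for the example.

```python
from bisect import bisect_left

def find_prefix(sorted_titles, prefix):
    """Return every title starting with `prefix` (binary search, then a forward scan)."""
    start = bisect_left(sorted_titles, prefix)
    matches = []
    for title in sorted_titles[start:]:
        if not title.startswith(prefix):
            break
        matches.append(title)
    return matches

# Sorting by code point puts uppercase ASCII first, then lowercase, then accented letters.
titles = sorted(["Zurich", "eclair", "Éclair", "Eclair", "apple"])
print(titles)                         # ['Eclair', 'Zurich', 'apple', 'eclair', 'Éclair']
print(find_prefix(titles, "eclair"))  # ['eclair']
```

The variants "Eclair" and "Éclair" end up far apart in the sorted list, so a lookup for "eclair" only returns the exact-case, unaccented entry.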

These searches should be neither case- nor accent-sensitive: results should include matching entries even when their letters are uppercase or accented.

To fix that problem:

  • Both sorted lists (titles and URLs) should be computed on a "normalised" version of the URL/title strings (i.e. lowercased and without accents)
  • findx() and findxByTitle() should both normalise the search pattern before any search
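The two steps above can be sketched as follows. This is a rough Python illustration under stated assumptions, not the actual patch: `normalize`, `build_index` and `find_prefix` are hypothetical names, and the normalisation uses Python's `unicodedata` (NFKD decomposition, dropping combining marks, then case folding) rather than libicu, so it ignores language-specific collation rules.

```python
import unicodedata
from bisect import bisect_left

def normalize(text):
    """Lowercase and strip accents (simplified; no language-specific collation)."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def build_index(titles):
    """Sort (normalised key, original title) pairs by the normalised key."""
    return sorted((normalize(t), t) for t in titles)

def find_prefix(index, pattern):
    """Return the original titles whose normalised form starts with the normalised pattern."""
    key = normalize(pattern)
    # The 1-tuple (key,) sorts before every (key, title) pair with the same key.
    start = bisect_left(index, (key,))
    matches = []
    for norm, title in index[start:]:
        if not norm.startswith(key):
            break
        matches.append(title)
    return matches

index = build_index(["Zurich", "eclair", "Éclair", "Eclair", "apple", "ÉCLAIRAGE"])
print(find_prefix(index, "ECLAIR"))  # ['Eclair', 'eclair', 'Éclair', 'ÉCLAIRAGE']
```

Because the list is sorted on the normalised key and the pattern goes through the same function, all case and accent variants sit next to each other and are found by a single binary search.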

Remark:
Whoever writes the patch needs to be careful with collation, which differs from language to language. Using libicu seems mandatory.

Event Timeline

Kelson raised the priority of this task from to Medium.
Kelson updated the task description.
Kelson added a project: openZIM-zimlib.
Kelson subscribed.

Another solution would be to rely entirely on the full-text index (powered by Xapian), which can now be embedded in the ZIM file.

Kelson claimed this task.

This has finally been fixed by introducing Xapian to manage the list of titles.