
Serbian language search does not allow use of the bald Latin alphabet
Open, Medium, Public

Description

When searching, most Internet users type in the bald Latin alphabet (without the letters č, ć, š, ž and đ). This is similar to how, in German, a search for "Muenchen" returns the results for "München". Serbian Wikipedia should support searching this way, but it doesn't. Example:

  1. Search for "marković": https://sr.wikipedia.org/w/index.php?title=%D0%9F%D0%BE%D1%81%D0%B5%D0%B1%D0%BD%D0%BE:%D0%9F%D1%80%D0%B5%D1%82%D1%80%D0%B0%D0%B6%D0%B8&profile=default&fulltext=Search&search=markovi%C4%87&searchToken=cibdktt9t7eu2hv4o3n1hgg84
    1. Observed: 207 search results.
  2. Search for "markovic": https://sr.wikipedia.org/w/index.php?title=%D0%9F%D0%BE%D1%81%D0%B5%D0%B1%D0%BD%D0%BE:%D0%9F%D1%80%D0%B5%D1%82%D1%80%D0%B0%D0%B6%D0%B8&profile=default&fulltext=Search&search=markovic&searchToken=gf3dawrz4tio3a91fujm144m
    1. Expected: all 207 results from the previous search should appear.
    2. Observed: Only 47 results appear.
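
To illustrate the expected behaviour, here is a minimal Python sketch of bald Latin folding applied to both the query and the indexed text. It is not how CirrusSearch implements anything; the names `BALD_LATIN`, `fold_bald_latin` and `title_matches` are hypothetical, and the choice of folding đ to "dj" is an assumption (folding it to plain "d" is another convention).

```python
# Hypothetical folding table: Serbian Latin letters with diacritics -> bald Latin.
# Assumption: "đ" folds to "dj"; some conventions fold it to "d" instead.
BALD_LATIN = str.maketrans({
    "č": "c", "ć": "c", "š": "s", "ž": "z", "đ": "dj",
    "Č": "C", "Ć": "C", "Š": "S", "Ž": "Z", "Đ": "Dj",
})


def fold_bald_latin(text: str) -> str:
    """Strip Serbian diacritics and lowercase the result."""
    return text.translate(BALD_LATIN).lower()


def title_matches(query: str, title: str) -> bool:
    """Return True if the folded query equals any folded word of the title."""
    folded_words = {fold_bald_latin(word) for word in title.split()}
    return fold_bald_latin(query) in folded_words


# Both spellings of the query find the same title once both sides are folded.
print(title_matches("marković", "Ante Marković"))  # True
print(title_matches("markovic", "Ante Marković"))  # True
```

Because the same folding is applied at index time and at query time, "markovic" and "marković" become the same token, so the second search above would return everything the first one does.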

An overview of the issue is given at https://wiki.apache.org/solr/SerbianLanguageSupport

Event Timeline

Wikimedia sites do not use MediaWiki's default search backend (MediaWiki-Search), hence retagging this task with CirrusSearch.

debt triaged this task as Medium priority. Jul 1 2016, 4:45 PM
debt moved this task from needs triage to This Quarter on the Discovery-Search board.
debt added a subscriber: debt.

We'll take a look and hopefully it'll be fairly 'easy' to fix.

If we want both bald Latin and Cyrillic-to-Latin mapping, it looks to be straightforward. See T138857#3391852 for more details.
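
A rough Python sketch of what combining the two mappings could look like: Cyrillic is first transliterated to Serbian Latin, then the diacritics are folded away. This is an illustration only, not the analysis chain discussed in T138857; the names `CYR_TO_LAT`, `DIACRITICS` and `normalize` are hypothetical, and folding đ to "dj" is an assumption.

```python
# Serbian Cyrillic -> Serbian Latin, lowercase letters only (30-letter alphabet).
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}

# Serbian Latin -> bald Latin. Assumption: "đ" becomes "dj" rather than "d".
DIACRITICS = {"č": "c", "ć": "c", "š": "s", "ž": "z", "đ": "dj"}


def normalize(text: str) -> str:
    """Lowercase, transliterate Cyrillic to Latin, then drop diacritics."""
    lowered = text.lower()
    latin = "".join(CYR_TO_LAT.get(ch, ch) for ch in lowered)
    return "".join(DIACRITICS.get(ch, ch) for ch in latin)


# Cyrillic, Latin and bald Latin spellings all normalize to one form.
print(normalize("Марковић"))  # markovic
print(normalize("Marković"))  # markovic
print(normalize("markovic"))  # markovic
```

Running the Cyrillic step first means only one diacritic table is needed, since every Cyrillic letter lands on a Latin letter that the second step already handles.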