
Serbian language search differentiates between Cyrillic and Latin alphabets
Closed, Resolved, Public

Description

The Serbian language uses both the Cyrillic and Latin alphabets. Therefore, search should return results regardless of the alphabet used in the article or in the search query, and results in either alphabet should have the same weight. Examples:

  1. Search for "хусова улица": https://sr.wikipedia.org/w/index.php?title=Посебно:Претражи&profile=default&fulltext=Search&search=хусова+улица&searchToken=62nrq5r3uq29i4acn16ll6urz&uselang=en
    1. Expected: the page https://sr.wikipedia.org/wiki/Husova_ulica_(Beograd) should be found.
    2. Observed: the page was not found.
  2. Search for "beograd": https://sr.wikipedia.org/w/index.php?title=%D0%9F%D0%BE%D1%81%D0%B5%D0%B1%D0%BD%D0%BE:%D0%9F%D1%80%D0%B5%D1%82%D1%80%D0%B0%D0%B6%D0%B8&profile=default&fulltext=Search&search=beograd&searchToken=99r1kbeyilpi6ju3tnwjq3npj&uselang=en
    1. Expected: the page https://sr.wikipedia.org/wiki/%D0%91%D0%B5%D0%BE%D0%B3%D1%80%D0%B0%D0%B4 should be at the top.
    2. Observed: the page https://sr.wikipedia.org/wiki/Blu_dragonsi_Beograd is at the top.

An overview of Serbian language needs can be seen at https://wiki.apache.org/solr/SerbianLanguageSupport . Of course, Wikimedia doesn't have to use the exact same solution.

Event Timeline

Wikimedia sites do not use MediaWiki's default search backend (MediaWiki-Search), hence setting the project to CirrusSearch.

debt triaged this task as Medium priority. Jul 1 2016, 4:46 PM
debt moved this task from needs triage to This Quarter on the Discovery-Search board.
debt subscribed.

This can (hopefully) be easy to fix and is related to T138858.

It looks like a potential solution to this is Elastic's "serbian_normalization" plugin. It has been in "not-released-yet" status since v2.0 (the current version is v5.4), but it is in the Elastic code and already available in my development installation. I don't know if it is unstable or just improperly documented, but it hasn't been updated in a long time.

It is built on the Lucene analyzer, but it does not have the "haircut" option ("bald" Latin vs Latin with diacritics). The haircut option also seems to have been removed from the Lucene code.

So, if we wanted to implement both a Cyrillic-to-Latin conversion and the bald-Latin conversion (from T138858), we could do that by enabling this plugin for Serbian-language wikis.

As the Lucene link from the Description points out, this does end up conflating some words with diacritics in Latin, like strašan/strasan, and teža/teza. This conflation would map back into the Cyrillic, too, because the Cyrillic would be mapped to the bald Latin for indexing and search.

The mapping is:

a б в г д ђ  е ж з и ј к л љ  м н њ  о п р с т ћ у ф х ц ч џ  ш / č ć đ  š ž
a b v g d dj e z z i j k l lj m n nj o p r s t c u f h c c dz s / c c dj s z

If that doesn't cover everything, let me know and I can try out other characters, too.
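For illustration, the mapping above can be sketched in a few lines of Python. This is only a sketch of the character table as listed here, not the Lucene or Elastic implementation; the names `SERBIAN_TO_BALD` and `fold_serbian` are hypothetical, and lowercasing before folding is an assumption.

```python
# Sketch of the table above: Serbian Cyrillic and Latin-with-diacritics
# are both folded to "bald" Latin. Names here are illustrative only.
CYRILLIC = "а б в г д ђ е ж з и ј к л љ м н њ о п р с т ћ у ф х ц ч џ ш".split()
BALD = "a b v g d dj e z z i j k l lj m n nj o p r s t c u f h c c dz s".split()

SERBIAN_TO_BALD = dict(zip(CYRILLIC, BALD))
# Latin letters with diacritics lose their "haircut".
SERBIAN_TO_BALD.update({"č": "c", "ć": "c", "đ": "dj", "š": "s", "ž": "z"})


def fold_serbian(text: str) -> str:
    """Lowercase the text (an assumption), then fold it character by character.

    Characters outside the table (e.g. non-Serbian Cyrillic) pass through.
    """
    return "".join(SERBIAN_TO_BALD.get(ch, ch) for ch in text.lower())
```

With this sketch, `fold_serbian("тежа")` and `fold_serbian("teža")` both give `teza`, which is the conflation described above.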

Some pros and cons:

  • Pro++: the conversion only affects the text that is indexed, so the article text is unchanged. If "teža" or "тежа" is in an article, it would be indexed as "teza", but unchanged in the article.
  • Pro+: We also search an unchanged version of the text, so a query of "teža" would match "teža" somewhat better than "teza". It would not get any boost matching "тежа", though.
  • Con-: Words with non-Serbian Cyrillic would only be partially converted. So Russian "чёрная дыра" ("black hole") would be indexed as "cёrnaя dыra". I don't think that's a problem, but there could be weird collisions. The Faux Cyrillic band name "LIИKIИ PARK" would get indexed as "liikii park". ("KoЯn" comes out relatively unscathed.)
  • Con-: There is a chance there is some difference between the conversion done by the LanguageConverter (for display) and this plugin (or custom version of it, see below) that could cause confusion.
  • Con--: This is not clearly an official release from Elastic, so it could disappear or change.
    • Pro+: We don't want to wait forever, though.
    • Pro++: On the other hand, if all we need is a very simple mapping as above, then it would be very easy to write this as a custom filter, and we could choose either the bald or not-bald version. (I think supporting both bald and not-bald would be hard and would require a level of customization for Serbian-language wikis that we aren't able to support.)
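As a rough idea of what a custom version could look like, here is a minimal sketch that builds an index-settings fragment around Elasticsearch's `mapping` character filter. Using a character filter instead of a purpose-built token filter is an assumption, as are the names `serbian_fold` and `serbian_text` and the helper functions; none of this is taken from the CirrusSearch configuration.

```python
import json

CYRILLIC = "а б в г д ђ е ж з и ј к л љ м н њ о п р с т ћ у ф х ц ч џ ш".split()
# "Not-bald" Latin targets, keeping diacritics.
LATIN = "a b v g d đ e ž z i j k l lj m n nj o p r s t ć u f h c č dž š".split()
# Diacritic stripping used only for the bald variant.
STRIP = {"č": "c", "ć": "c", "đ": "dj", "š": "s", "ž": "z"}


def strip_diacritics(s: str) -> str:
    return "".join(STRIP.get(ch, ch) for ch in s)


def mapping_rules(bald: bool = True) -> list:
    """Build 'source => target' rules for a `mapping` char filter."""
    table = dict(zip(CYRILLIC, LATIN))
    if bald:
        table = {src: strip_diacritics(dst) for src, dst in table.items()}
        table.update(STRIP)  # also fold Latin letters with diacritics
    rules = []
    for src, dst in table.items():
        rules.append(f"{src} => {dst}")
        # Char filters run before lowercasing, so cover capitals as well.
        rules.append(f"{src.upper()} => {dst}")
    return rules


def analysis_settings(bald: bool = True) -> dict:
    """Hypothetical settings fragment; filter and analyzer names are made up."""
    return {
        "analysis": {
            "char_filter": {
                "serbian_fold": {"type": "mapping", "mappings": mapping_rules(bald)}
            },
            "analyzer": {
                "serbian_text": {
                    "type": "custom",
                    "char_filter": ["serbian_fold"],
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                }
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(analysis_settings(bald=True), ensure_ascii=False, indent=2))
```

The `bald` flag switches between the two variants at configuration time, so one index gets one or the other, not both.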

If the community supports using bald Latin for searching and matching—meaning that some distinctions could be lost—then this is straightforward. Of course I'd do the usual analysis and set up an instance in labs for testing before making it live, so we could be sure it works as intended.

We would still need to deal with the bald Latin search (T138858), but the upcoming Serbian analysis chain (T183015) will take care of the Cyrillic-vs-Latin search, in addition to doing some basic stemming.

There will still be a slight preference for exact matches in the same alphabet, just as there is a slight preference for exact matches that are stemmed. As an example of the stemming match, if you search for dogs then dog is a good match, but dogs is a slightly better match, via the plain field. The plain field also allows you to match exact forms with quotes, so that searching for "dogs" does not match dog.
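The same-alphabet preference can be pictured with a toy scorer. This is not how CirrusSearch or Elasticsearch computes relevance; it is a minimal sketch assuming one point per query word matched after folding, plus a hypothetical `plain_boost` for an exact match in the original alphabet. All names and weights are illustrative.

```python
CYRILLIC = "а б в г д ђ е ж з и ј к л љ м н њ о п р с т ћ у ф х ц ч џ ш".split()
BALD = "a b v g d dj e z z i j k l lj m n nj o p r s t c u f h c c dz s".split()
FOLD = dict(zip(CYRILLIC, BALD))
FOLD.update({"č": "c", "ć": "c", "đ": "dj", "š": "s", "ž": "z"})


def fold(word: str) -> str:
    """Fold a lowercase word to bald Latin."""
    return "".join(FOLD.get(ch, ch) for ch in word)


def score(query: str, title: str, plain_boost: float = 0.5) -> float:
    """Toy score: folded matches count fully, plain matches add a small boost."""
    q_plain = query.lower().split()
    t_plain = title.lower().split()
    t_fold = [fold(w) for w in t_plain]
    folded_hits = sum(1 for w in q_plain if fold(w) in t_fold)
    plain_hits = sum(1 for w in q_plain if w in t_plain)
    return folded_hits + plain_boost * plain_hits


def rank(query: str, titles: list) -> list:
    """Return titles that match at all, best score first."""
    scored = [(score(query, t), t) for t in titles]
    return [t for s, t in sorted(scored, key=lambda p: p[0], reverse=True) if s > 0]


TITLES = ["Београд", "Blu dragonsi Beograd"]
```

In this toy setup, `rank("beograd", TITLES)` and `rank("Београд", TITLES)` return the same two titles, but each query puts the title written in its own alphabet first.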

TJones claimed this task.

This is working now on all Serbian-language wikis. Ranking still prefers exact matches, so Beograd and Београд give the same results but in a different order—and since there are over 40,000 results and many with partial title matches, the order difference can be significant.

More targeted searches like хусова улица and husova ulica with fewer results and only one really good title match give the same results.

(The effect on sister search is cool, and I'm entertained by searching for хот дог.)