Serbian language search differentiates between Cyrillic and Latin alphabets
Closed, Resolved (Public)

Authored By

	Nikola_Smolenski
	Jun 28 2016, 4:00 PM

Description

The Serbian language uses both the Cyrillic and Latin alphabets. Therefore, search should return results regardless of the alphabet used in the article or in the search query, and results in either alphabet should carry the same weight. Examples:

Search for "хусова улица": https://sr.wikipedia.org/w/index.php?title=Посебно:Претражи&profile=default&fulltext=Search&search=хусова+улица&searchToken=62nrq5r3uq29i4acn16ll6urz&uselang=en
1. Expected: the page https://sr.wikipedia.org/wiki/Husova_ulica_(Beograd) should be found.
2. Observed: the page was not found.
Search for "beograd": https://sr.wikipedia.org/w/index.php?title=%D0%9F%D0%BE%D1%81%D0%B5%D0%B1%D0%BD%D0%BE:%D0%9F%D1%80%D0%B5%D1%82%D1%80%D0%B0%D0%B6%D0%B8&profile=default&fulltext=Search&search=beograd&searchToken=99r1kbeyilpi6ju3tnwjq3npj&uselang=en
1. Expected: the page https://sr.wikipedia.org/wiki/%D0%91%D0%B5%D0%BE%D0%B3%D1%80%D0%B0%D0%B4 should be at the top.
2. Observed: the page https://sr.wikipedia.org/wiki/Blu_dragonsi_Beograd is at the top.

An overview of Serbian language needs can be found at https://wiki.apache.org/solr/SerbianLanguageSupport . Of course, Wikimedia doesn't have to use the exact same solution.

Related Objects

Mentioned In: T183015: Create Serbian Elasticsearch Plugin/Analysis Chain Using Serbian Morphological Libraries
T138858: Serbian language search does not allows for use of bald Latin alphabet
T77967: Language converter can't work on the results of Special:Search
T138854: Serbian Wikipedia search offers to create existing articles
Mentioned Here: T183015: Create Serbian Elasticsearch Plugin/Analysis Chain Using Serbian Morphological Libraries
T77967: Language converter can't work on the results of Special:Search
T138858: Serbian language search does not allows for use of bald Latin alphabet

Event Timeline

Nikola_Smolenski created this task. Jun 28 2016, 4:00 PM

Restricted Application added subscribers: Zppix, Aklapper. Jun 28 2016, 4:00 PM

Krenair added projects: MediaWiki-Search, MediaWiki-Internationalization. Jun 28 2016, 4:01 PM

Restricted Application added projects: Discovery-ARCHIVED, Discovery-Search. Jun 28 2016, 4:01 PM

Wikimedia sites do not use MediaWiki's default search backend (MediaWiki-Search), hence setting the project to CirrusSearch.

This should (hopefully) be easy to fix, and it is related to T138858.

debt mentioned this in T138854: Serbian Wikipedia search offers to create existing articles. Jul 1 2016, 4:48 PM

Probably see also: T77967 (for Chinese?)

Liuxinyu970226 mentioned this in T77967: Language converter can't work on the results of Special:Search. Jul 24 2016, 12:07 AM

It looks like a potential solution to this is Elastic's "serbian_normalization" plugin. It has been in "not-released-yet" status since v2.0 (the current version is v5.4), but it is in the Elastic code and already available in my development installation. I don't know if it is unstable or just improperly documented, but it hasn't been updated in a long time.

It is built on the Lucene analyzer, but it does not have the "haircut" option ("bald" Latin vs Latin with diacritics). The haircut option also seems to have been removed from the Lucene code.

So, if we wanted to implement both a Cyrillic-to-Latin conversion and the bald-Latin conversion (from T138858), we could do that by enabling this plugin for Serbian-language wikis.
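As a rough illustration of what enabling this for a Serbian-language index could look like, here is a minimal sketch of index analysis settings built as a Python dict. The names `sr_fold` and `sr_folded` are hypothetical, and the sketch assumes the filter is exposed under the type name `serbian_normalization` mentioned above; it is not the configuration CirrusSearch actually uses.

```python
import json


def build_serbian_settings() -> dict:
    """Return hypothetical index settings that apply Serbian folding."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Hypothetical filter name wrapping the filter type
                    # discussed in this comment.
                    "sr_fold": {"type": "serbian_normalization"}
                },
                "analyzer": {
                    # Hypothetical analyzer name; lowercasing runs first so
                    # the folding filter only sees lowercase input.
                    "sr_folded": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "sr_fold"],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent when creating the index.
    print(json.dumps(build_serbian_settings(), indent=2, ensure_ascii=False))
```

The same analyzer would need to be applied at both index time and query time for Cyrillic and Latin queries to meet on the same folded tokens.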

As the Lucene link from the Description points out, this does end up conflating some words with diacritics in Latin, like strašan/strasan, and teža/teza. This conflation would map back into the Cyrillic, too, because the Cyrillic would be mapped to the bald Latin for indexing and search.

The mapping is:

a б в г д ђ  е ж з и ј к л љ  м н њ  о п р с т ћ у ф х ц ч џ  ш / č ć đ  š ž
a b v g d dj e z z i j k l lj m n nj o p r s t c u f h c c dz s / c c dj s z

If that doesn't cover everything, let me know and I can try out other characters, too.
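For reference, the mapping above can be sketched in a few lines of Python. This is only an illustration of the table, not the plugin's code; the names `FOLD` and `fold_serbian` are made up for this example, and input is lowercased first on the assumption that folding happens after lowercasing.

```python
# Serbian Cyrillic letters and their "bald" Latin equivalents, per the table.
CYRILLIC = "а б в г д ђ е ж з и ј к л љ м н њ о п р с т ћ у ф х ц ч џ ш".split()
CYR_BALD = "a b v g d dj e z z i j k l lj m n nj o p r s t c u f h c c dz s".split()

# Latin letters with diacritics and their bald equivalents.
LATIN = "č ć đ š ž".split()
LAT_BALD = "c c dj s z".split()

FOLD = dict(zip(CYRILLIC + LATIN, CYR_BALD + LAT_BALD))


def fold_serbian(text: str) -> str:
    """Lowercase, then map each character; unmapped characters pass through."""
    return "".join(FOLD.get(ch, ch) for ch in text.lower())


if __name__ == "__main__":
    for sample in ["хусова улица", "Београд", "Beograd", "teža", "тежа"]:
        print(sample, "->", fold_serbian(sample))
```

Because unmapped characters pass through unchanged, non-Serbian Cyrillic letters such as ё, я, and ы survive the folding as-is.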

Some pros and cons:

Pro++: the conversion only affects the text that is indexed, so the article text is unchanged. If "teža" or "тежа" is in an article, it would be indexed as "teza", but unchanged in the article.
Pro+: We also search an unchanged version of the text, so a query of "teža" would match "teža" somewhat better than "teza". It would not get any boost matching "тежа", though.
Con-: Words with non-Serbian Cyrillic would only be partially converted. So Russian "чёрная дыра" ("black hole") would be indexed as "cёrnaя dыra". I don't think that's a problem, but there could be weird collisions. The Faux Cyrillic band name "LIИKIИ PARK" would get indexed as "liikii park". ("KoЯn" comes out relatively unscathed.)
Con-: There is a chance there is some difference between the conversion done by the LanguageConverter (for display) and this plugin (or custom version of it, see below) that could cause confusion.
Con--: This is not clearly an official release from Elastic, so it could disappear or change.
- Pro+: We don't want to wait forever, though.
- Pro++: On the other hand, if all we need is a very simple mapping as above, then it would be very easy to write this as a custom filter, and we could choose either the bald or not-bald version. (I think supporting both bald and not-bald would be hard and would require a level of customization for Serbian-language wikis that we aren't able to support.)

If the community supports using bald Latin for searching and matching—meaning that some distinctions could be lost—then this is straightforward. Of course I'd do the usual analysis and set up an instance in labs for testing before making it live, so we could be sure it works as intended.

TJones mentioned this in T138858: Serbian language search does not allows for use of bald Latin alphabet. Jun 29 2017, 5:30 PM

TJones moved this task from This Quarter to Tech Debt/Misc on the Discovery-Search board. Oct 24 2017, 5:35 PM

We would still need to deal with the bald Latin search (T138858), but the upcoming Serbian analysis chain (T183015) will take care of the Cyrillic-vs-Latin search, in addition to doing some basic stemming.

There will still be a slight preference for exact matches in the same alphabet, just as there is a slight preference for exact matches that are stemmed. As an example of the stemming match, if you search for dogs then dog is a good match, but dogs is a slightly better match, via the plain field. The plain field also allows you to match exact forms with quotes, so that searching for "dogs" does not match dog.
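The dogs/dog example can be illustrated with a toy scoring function. This is a sketch under stated assumptions, not how CirrusSearch actually scores: the stemmer is a crude stand-in, the weights are invented, and `toy_stem` and `score` are hypothetical names.

```python
def toy_stem(token: str) -> str:
    """Crude stand-in for a stemmer: strip one trailing 's' from longer words."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def score(query: str, doc_token: str, quoted: bool = False) -> float:
    """Combine a stemmed-field match with a smaller plain-field bonus."""
    # Plain field: only the exact surface form matches.
    plain = 1.0 if query == doc_token else 0.0
    if quoted:
        # Quoted queries consult only the plain field.
        return plain
    # Stemmed field: both sides are reduced before comparison.
    stemmed = 1.0 if toy_stem(query) == toy_stem(doc_token) else 0.0
    # Hypothetical weights: the exact form earns a slight extra boost.
    return stemmed + 0.5 * plain


if __name__ == "__main__":
    print(score("dogs", "dog"))               # stemmed match only
    print(score("dogs", "dogs"))              # stemmed match plus plain bonus
    print(score("dogs", "dog", quoted=True))  # quoted: exact form required
```

The same shape applies to alphabets: a folded field lets Cyrillic and Latin forms match each other, while the plain field gives a small edge to hits in the alphabet the query was typed in.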

TJones mentioned this in T183015: Create Serbian Elasticsearch Plugin/Analysis Chain Using Serbian Morphological Libraries. Mar 13 2018, 1:43 PM

This is working now on all Serbian-language wikis. Ranking still prefers exact matches, so Beograd and Београд give the same results but in a different order—and since there are over 40,000 results and many with partial title matches, the order difference can be significant.

More targeted searches like хусова улица and husova ulica with fewer results and only one really good title match give the same results.

(The effect on sister search is cool, and I'm entertained by searching for хот дог.)

Liuxinyu970226 unsubscribed. Apr 21 2018, 11:42 AM

*hot dog* :-P
