
Review Serbian Morphological Libraries
Closed, Resolved (Public)

Description

Based on research in T171652, look at the following in more detail as possible candidates for creating Elasticsearch language analyzer plugins.

SCStemmers, which implements four stemming algorithms, seems to include the algorithm used in SerbianStemmer, but it wouldn’t hurt to compare them. If SerbianStemmer were somehow superior, it would likely be possible to port the improvements to SCStemmers.

Event Timeline

My full write-up is on MediaWiki.

TL;DR: Despite some bugs, the stemmers hold promise to improve search on Serbian-language projects, and others, too!

Highlights:

  • The Serbian stemmers convert Cyrillic to Latin (the mapping is, wonderfully, exactly 1:1) before stemming, but in the process they delete any character that Java considers a "letter" but that is not a letter in either the Serbian Latin or Cyrillic alphabet. Words in other alphabets get stemmed to nothing (or to just their diacritics: तथा → ा, for example). This affects Devanagari, Greek, Armenian, Hebrew, Arabic, Japanese, Chinese, Korean, and many others. I've filed a bug with the developer.
  • Some short words (demo, ivan, ivana and many others) get stemmed to nothing, and many words get stemmed to one or two letters (all of these and almost 100 more are stemmed to just "p": paba, paca, paja, paka, past, pega, pekao, pela, pele, pena, peni, pete, п, пава, пади, пака, паст, пата, пајa, паја, пева, пега, пегле, пекао, пела). I've filed another bug with the developer.
  • @zeljkofilipin kindly reviewed the other stemming groups, and everything else looks good, so if those two bugs can get fixed, this is a good candidate for a new stemmer. (Thanks!)
  • Depending on the developer's response, we can consider some combination of submitting pull requests, forking the project, or creating a wrapper in the plugin to prevent problems (e.g., don't stem strings with any "bad" characters and/or replace empty tokens with the original token).
  • Željko also pointed out that the stemmer is likely to work for Croatian, Serbo-Croatian, and Bosnian, too. So if the bugs get fixed, I'll test it against those as well.
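
The wrapper idea mentioned above (skip tokens with "bad" characters, and fall back to the original token when the stem comes back empty) could look roughly like the sketch below. It is written in Python for brevity and is not the SCStemmers code: `safe_stem`, `to_latin`, and the stand-in `buggy_stem` are hypothetical names, and the set of "Serbian letters" is my own assumption about what a guard would allow.

```python
# Serbian Cyrillic -> Latin, one entry per letter of the 30-letter alphabet.
# Three Cyrillic letters map to Latin digraphs (lj, nj, dž).
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}

# Assumption: characters the guard treats as Serbian (no q, w, x, y).
SERBIAN_LETTERS = set(CYR_TO_LAT) | set("abcdefghijklmnoprstuvzčćđšž")


def to_latin(token):
    """Lowercase a token and transliterate Serbian Cyrillic to Latin."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in token.lower())


def safe_stem(token, stem):
    """Call `stem` only on tokens made of Serbian letters; never return ""."""
    lowered = token.lower()
    # Guard 1: any letter outside both Serbian alphabets -> leave token alone.
    if any(ch.isalpha() and ch not in SERBIAN_LETTERS for ch in lowered):
        return token
    result = stem(to_latin(lowered))
    # Guard 2: an empty stem is replaced by the original token.
    return result if result else token


def buggy_stem(word):
    """Hypothetical stand-in that mimics the 'stemmed to nothing' bug."""
    return "" if len(word) <= 4 else word[:-1]
```

With this stand-in, `safe_stem("तथा", buggy_stem)` and `safe_stem("demo", buggy_stem)` both return their input unchanged, while a longer Serbian token is transliterated and passed through to the stemmer.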

Player 4 Has Entered the Game!

The developer of the SCStemmers library, Vuk Batanović, has recommended stemmer #4 as the overall best. I hadn't considered it because it was labeled as "Croatian" and also didn't handle Cyrillic input, which is critical for Serbian. I've since learned more about Serbo-Croatian from Željko and Wikipedia, and Vuk has added a Cyrillic-to-Latin filter on the front end of the Croatian stemmer, so I'm planning to test it next. If it doesn't do well, we'll go back and look at adding a minimum stem length option to the other stemmers.
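
For reference, a minimum stem length option could be as simple as the sketch below. This is Python and purely illustrative: `stem_with_min_length`, the default of 3, and the stand-in `over_stem` are all hypothetical, and a real option inside the stemmers might instead stop stripping suffixes earlier rather than fall back to the whole token.

```python
def stem_with_min_length(token, stem, min_len=3):
    """Keep the original token when the stem is shorter than min_len.

    `stem` is any callable mapping a word to its stem; min_len=3 is an
    arbitrary illustrative default, not a tested value.
    """
    result = stem(token)
    return result if len(result) >= min_len else token


def over_stem(word):
    """Hypothetical stand-in that over-stems four-letter words to one letter."""
    return word[:1] if len(word) == 4 else word[:-1]
```

Here `stem_with_min_length("paba", over_stem)` returns `"paba"` rather than `"p"`, while longer words still get whatever stem the underlying stemmer produces.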

I've completed my analysis of stemmer #4 and it looks good, though it needs speaker review. (It's on a new page since the old page already has enough info and complexity in it.)

Assuming all goes well, the next steps include:

  • Get speaker review of the stemming groups
  • Get verification of the role of accents (it looks like there are pitch accent pronunciation hints using diacritics)
  • Open Phab ticket to build an Elasticsearch plugin using the stemmer and put together an analysis chain for Serbian
    • Use the stemmer, accent folding, possibly other folding
    • Re-test the analysis chain, expecting results similar to what we have here, modulo additional folding
    • Deploy the analysis chain for Serbian!
  • Re-test the analysis chain for Croatian, Serbo-Croatian, and Bosnian wikis and deploy the analysis chain (or an appropriately modified version of it) wherever it will help. (Get 4 for the price of 1!)
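
As a rough picture of what the analysis chain might look like, here is a sketch that builds Elasticsearch index settings for a custom analyzer as a Python dict. The `custom` analyzer type, the `standard` tokenizer, and the `lowercase` and `asciifolding` token filters are standard Elasticsearch components; the stemmer filter name is a placeholder, since the plugin does not exist yet, and both the filter order and the choice of `asciifolding` for the folding step are assumptions that would need the re-testing described above.

```python
import json

# Hypothetical: whatever name the future plugin registers for its token filter.
STEMMER_FILTER = "serbian_stemmer_placeholder"


def build_settings(stemmer_filter=STEMMER_FILTER):
    """Return index settings for a sketch of a Serbian analysis chain."""
    return {
        "analysis": {
            "analyzer": {
                "serbian_text": {  # hypothetical analyzer name
                    "type": "custom",
                    "tokenizer": "standard",
                    # Assumed order: lowercase, then stem, then fold, so the
                    # stemmer still sees diacritics on its input.
                    "filter": ["lowercase", stemmer_filter, "asciifolding"],
                }
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_settings(), indent=2))
```

The same settings could be reused for the Croatian, Serbo-Croatian, and Bosnian wikis by changing only the filter list if testing shows they need a modified chain.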

I've marked this as Done. Thanks to Vuk and Željko for reviewing this for me. There are the usual kinds of issues expected because language is messy: a few ambiguous words, names of people or places, acronyms, foreign words, etc. But it's generally doing the right thing for Serbian, and it doesn't have any unwanted side effects for non-Serbian text.

Next up: T183015: Create Serbian Elasticsearch Plugin/Analysis Chain Using Serbian Morphological Libraries

After that is deployed to Serbian wikis, I'll circle back to Croatian, Serbo-Croatian, and Bosnian wikis.