Review Serbian Morphological Libraries
Closed, Resolved (Public)

Assigned To

Authored By

	TJones
	Oct 24 2017, 4:50 PM

Description

Based on research in T171652, look at the following in more detail as possible candidates for creating Elasticsearch language analyzer plugins.

SerbianStemmer https://github.com/nikolamilosevic86/SerbianStemmer
SCStemmers https://github.com/vukbatanovic/SCStemmers

SCStemmers, which implements four stemming algorithms, seems to include the algorithm used in SerbianStemmer, but it wouldn’t hurt to compare them. If SerbianStemmer were somehow superior, it would likely be possible to port the improvements to SCStemmers.

Related Objects

Status	Assigned	Task
Invalid	None	T174065 [FY 2017-18 Objective] Improve support for searching in multiple languages
Open	None	T154511 [Tracking] Research, test, and deploy new language analyzers
Resolved	TJones	T171652 Language Analysis Morphological Library Research Spike
Resolved	TJones	T178926 Review Serbian Morphological Libraries
Resolved	TJones	T183015 Create Serbian Elasticsearch Plugin/Analysis Chain Using Serbian Morphological Libraries
Resolved	debt	T189239 Deploy initial version of the extra-analysis plugin
Resolved	debt	T189265 Re-index Serbian Wikis
Resolved	TJones	T196404 Re-Re-Index Serbian Wikis after refactored plugins are deployed
Resolved	TJones	T192395 Create Croatian, Serbo-Croatian, and Bosnian Analysis Chains Using Serbian Morphological Libraries
Resolved	TJones	T196658 Re-index Croatian, Serbo-Croatian, and Bosnian Wikis

Event Timeline

TJones created this task.Oct 24 2017, 4:50 PM

TJones mentioned this in T171652: Language Analysis Morphological Library Research Spike.Oct 24 2017, 4:52 PM

TJones moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board.Nov 7 2017, 5:44 PM

zeljkofilipin subscribed.Nov 23 2017, 5:18 PM

zeljkofilipin awarded a token.Nov 24 2017, 9:35 AM

TJones moved this task from not in use - please delete to Needs review on the Discovery-Search (Current work) board.Nov 27 2017, 9:18 PM

My full write up is on MediaWiki.

TL;DR: Despite some bugs, the stemmers hold promise to improve search on Serbian-language projects, and others, too!

Highlights:

The Serbian stemmers convert Cyrillic to Latin before stemming (the mapping is, wonderfully, exactly 1:1), but in the process they delete any character that Java considers a "letter" but that is not a letter in either the Serbian Latin or Cyrillic alphabet. As a result, words in other scripts get stemmed to nothing, or to just their diacritics (तथा → ा, for example). This affects Devanagari, Greek, Armenian, Hebrew, Arabic, Japanese, Chinese, Korean, and many others. I've filed a bug with the developer.
Some short words (demo, ivan, ivana and many others) get stemmed to nothing, and many words get stemmed to one or two letters (all of these and almost 100 more are stemmed to just "p": paba, paca, paja, paka, past, pega, pekao, pela, pele, pena, peni, pete, п, пава, пади, пака, паст, пата, пајa, паја, пева, пега, пегле, пекао, пела). I've filed another bug with the developer.
@zeljkofilipin kindly reviewed the other stemming groups, and everything else looks good, so if those two bugs can get fixed, this is a good candidate for a new stemmer. (Thanks!)
Depending on the developer's response, we can consider some combination of submitting pull requests, forking the project, or creating a wrapper in the plugin to prevent problems (e.g., don't stem strings with any "bad" characters and/or replace empty tokens with the original token).
Željko also pointed out that the stemmer is likely to work for Croatian, Serbo-Croatian, and Bosnian, too. So if the bugs get fixed, I'll test it against those as well.
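
To make the wrapper idea above concrete, here is a minimal Java sketch. Everything in it is hypothetical: the class name GuardedStemmer, its methods, and the UnaryOperator<String> standing in for the real stemmer are illustrations, not part of SCStemmers or SerbianStemmer. It also assumes that "any Latin- or Cyrillic-script letter" is a good enough test for a safe token, which is broader than the actual Serbian alphabets.

```java
import java.util.function.UnaryOperator;

/** Hypothetical guard around a stemmer; not part of any existing library. */
final class GuardedStemmer {
    // Stand-in for the real stemmer call (assumption: token in, stem out).
    private final UnaryOperator<String> stemmer;

    GuardedStemmer(UnaryOperator<String> stemmer) {
        this.stemmer = stemmer;
    }

    /** True if every letter in the token belongs to the Latin or Cyrillic script. */
    static boolean isSafeToStem(String token) {
        return token.codePoints()
                .filter(Character::isLetter)
                .allMatch(cp -> {
                    Character.UnicodeScript script = Character.UnicodeScript.of(cp);
                    return script == Character.UnicodeScript.LATIN
                            || script == Character.UnicodeScript.CYRILLIC;
                });
    }

    /** Stems the token, falling back to the original when stemming is unsafe or yields nothing. */
    String stem(String token) {
        if (!isSafeToStem(token)) {
            return token; // don't stem strings with "bad" characters
        }
        String stem = stemmer.apply(token);
        // Replace empty tokens with the original token.
        return (stem == null || stem.isEmpty()) ? token : stem;
    }
}
```

With a guard like this, a Devanagari token such as तथा would pass through unchanged rather than being reduced to its combining mark, and a word the stemmer reduces to nothing would be indexed in its original form.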

Nemo_bis added a project: I18n.Dec 3 2017, 3:23 PM

Nemo_bis subscribed.

TJones moved this task from Needs review to not in use - please delete on the Discovery-Search (Current work) board.Dec 5 2017, 6:13 PM

Player 4 Has Entered the Game!

The developer of the SCStemmers library, Vuk Batanović, has recommended stemmer #4 as the overall best. I hadn't considered it because it was labeled as "Croatian" and also didn't handle Cyrillic input, which is critical for Serbian. I've since learned more about Serbo-Croatian from Željko and Wikipedia, and Vuk has added a Cyrillic-to-Latin filter on the front end of the Croatian stemmer, so I'm planning to test it next. If it doesn't do well, we'll go back and look at adding a minimum stem length option to the other stemmers.
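
For readers unfamiliar with the conversion step, a Cyrillic-to-Latin front-end filter can be sketched roughly as below. This is not Vuk's implementation: the class name SerbianCyrillicToLatin and its convert method are hypothetical, and the sketch assumes tokens have already been lowercased. Each of the 30 Serbian Cyrillic letters maps to exactly one Latin letter of the alphabet, three of which (lj, nj, dž) are written as digraphs.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical lowercase Serbian Cyrillic-to-Latin converter (illustration only). */
final class SerbianCyrillicToLatin {
    // The 30 lowercase Serbian Cyrillic letters, in alphabetical order.
    private static final String CYRILLIC = "абвгдђежзијклљмнњопрстћуфхцчџш";
    // Their Latin counterparts, position by position.
    private static final String[] LATIN = {
        "a", "b", "v", "g", "d", "đ", "e", "ž", "z", "i",
        "j", "k", "l", "lj", "m", "n", "nj", "o", "p", "r",
        "s", "t", "ć", "u", "f", "h", "c", "č", "dž", "š"
    };
    private static final Map<Character, String> TABLE = new HashMap<>();

    static {
        for (int i = 0; i < CYRILLIC.length(); i++) {
            TABLE.put(CYRILLIC.charAt(i), LATIN[i]);
        }
    }

    /** Converts Serbian Cyrillic letters to Latin; all other characters are kept as-is. */
    static String convert(String token) {
        StringBuilder out = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            String mapped = TABLE.get(c);
            if (mapped != null) {
                out.append(mapped);
            } else {
                out.append(c); // keep non-Cyrillic characters rather than deleting them
            }
        }
        return out.toString();
    }
}
```

Keeping unmapped characters, instead of dropping them, is a deliberate choice in this sketch: it avoids the "stemmed to nothing" problem described earlier for text in other scripts.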

I've completed my analysis of stemmer #4 and it looks good, though it needs speaker review. (It's on a new page since the old page already has enough info and complexity in it.)

Assuming all goes well, the next steps include:

Get speaker review of the stemming groups
Get verification of the role of accents (it looks like there are pitch accent pronunciation hints using diacritics)
Open Phab ticket to build an Elasticsearch plugin using the stemmer and put together an analysis chain for Serbian
- Use the stemmer, accent folding, possibly other folding
- Re-test the analysis chain, expecting results similar to what we have here, modulo additional folding
- Deploy the analysis chain for Serbian!
Re-test the analysis chain for Croatian, Serbo-Croatian, and Bosnian wikis and deploy the analysis chain (or an appropriately modified version of it) wherever it will help. (Get 4 for the price of 1!)

zeljkofilipin added a project: User-zeljkofilipin.Dec 12 2017, 1:01 PM

zeljkofilipin moved this task from Backlog 🪒 to Deep work 🌊 on the User-zeljkofilipin board.

TJones moved this task from not in use - please delete to Needs review on the Discovery-Search (Current work) board.Dec 12 2017, 6:18 PM

Liuxinyu970226 awarded a token.Dec 15 2017, 2:27 PM

Liuxinyu970226 subscribed.

zeljkofilipin moved this task from Deep work 🌊 to Watching 📺 on the User-zeljkofilipin board.Dec 15 2017, 3:39 PM

TJones moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board.Dec 15 2017, 5:48 PM

I've marked this as Done. Thanks to Vuk and Željko for reviewing this for me. There are the usual kind of issues expected because language is messy—a few ambiguous words, names of people or places, acronyms, foreign words, etc. But it's generally doing the right thing for Serbian, and it doesn't have any unwanted side effects for non-Serbian text.

Next up: T183015: Create Serbian Elasticsearch Plugin/Analysis Chain Using Serbian Morphological Libraries

After that is deployed to Serbian wikis, I'll circle back to Croatian, Serbo-Croatian, and Bosnian wikis.

debt closed this task as Resolved.Dec 15 2017, 6:33 PM

Liuxinyu970226 unsubscribed.Dec 16 2017, 3:03 AM

TJones mentioned this in T190816: Add support for external stemmer to Analyzer Analysis tools.Mar 27 2018, 2:43 PM

debt closed subtask T183015: Create Serbian Elasticsearch Plugin/Analysis Chain Using Serbian Morphological Libraries as Resolved.Apr 27 2018, 8:18 PM

TJones mentioned this in T196780: Review Applying Indonesian Analysis Chain for Malay.Jun 8 2018, 8:23 PM

Restricted Application added a subscriber: Petar.petkovic.Jun 8 2018, 8:23 PM