
Language Analysis Morphological Library Research Spike
Closed, Resolved · Public

Description

We've pretty much made it through the list of analyzers recommended by or at least pointed to by Elastic as part of T154511: [Tracking] Research, test, and deploy new language analyzers.

Several of the analyzer plugins are mainly wrappers around some other third-party open-source morphological library. So, maybe it wouldn't be that hard to wrap a plugin around another existing open-source morphological library. Of course, it would depend on details of the library: grammatical completeness, programming language, code maturity, how well maintained it is, etc.

Below I've put together a heuristically sorted list of languages currently without language-specific analyzers. <nerd>I ranked by number of articles in the respective Wikipedias (W) and by volume of search requests (S). The final ranking is (300-W)*(300-S)^1.5.</nerd> This ranking takes into account Wikipedia size and search volume, with a higher weight on search volume. The article-count outliers Cebuano and Waray are still on the list, but much farther down than where article count alone would place them. It's probably a good thing that the two recently abandoned analyzers, Japanese and Vietnamese, are at the top of the list, because it at least hints that our old list and new list mesh reasonably well.
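As a hypothetical illustration (the function name and sample ranks below are invented, not from the actual ranking script), the heuristic can be sketched as:

```python
# Hypothetical sketch of the ranking heuristic described above.
# W = rank by Wikipedia article count, S = rank by search volume
# (rank 1 = largest). Higher score = higher priority; the 1.5
# exponent on the search-volume term gives it the heavier weight.

def priority(w_rank: int, s_rank: int) -> float:
    return (300 - w_rank) * (300 - s_rank) ** 1.5

# With equal rank offsets, a better search rank outweighs a
# better article-count rank, as the description says:
assert priority(5, 1) > priority(1, 5)
```

This is why Cebuano and Waray, whose article-count ranks are very high but whose search-volume ranks are not, drop well down the combined list.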

The goal of the research spike would be to time box an investigation of these (say, two days) in order to try to answer these questions:

  • Are they actually using the default non-language-specific analyzer? (probably, but if not, document!)
  • Do open-source elasticsearch plugins exist for these languages? (probably not, but if so, get excited!)
  • Do other open-source morphological libraries for these languages exist? (if so, document them here!)

An initial research spike would probably not be enough time to evaluate all of them (unless things go very poorly), but would give a sense of what's out there and whether it's worth it to continue with this line of investigation.

Depending on what exists, how mature the code and coverage is, and other factors, it might be worthwhile to spin off separate tasks to more deeply assess particular morphological libraries, to try to wrap them into Elasticsearch plugins, or to encourage volunteers to do so, etc.

A few additional notes:

  • There are certainly other approaches that might make sense on a case-by-case basis. For example, for particularly similar languages it may be possible, and even easier, to adapt an existing morphological library from one language to the other. Indonesian to Malay might be a candidate, for example.
  • There are also possibly varieties listed or not listed here that should be considered together, like maybe Serbian, Croatian, and Serbo-Croatian.

The top 50 languages on my list are:

  • Japanese (ja)
  • Vietnamese (vi)
  • Korean (ko)
  • Serbian (sr) / Croatian (hr) / Serbo-Croatian (sh) / Bosnian (bs)
  • Malay (ms)
  • Estonian (et)
  • Slovak (sk)

That's as far as we got with reviews on the first spike. The rest of the list is below.

  • Tagalog (tl)
  • Tamil (ta)
  • Belarusian (be)
  • Georgian (ka)
  • Azerbaijani (az)
  • Kazakh (kk)
  • Urdu (ur)
  • Latin (la)
  • Esperanto (eo)
  • Malayalam (ml)
  • Telugu (te)
  • Bengali (bn)
  • Cebuano (ceb)
  • Uzbek (uz)
  • Albanian (sq)
  • Marathi (mr)
  • Macedonian (mk)
  • Cantonese (zh-yue)
  • Afrikaans (af)
  • Welsh (cy)
  • Gujarati (gu)
  • Burmese (my)
  • Kannada (kn)
  • Breton (br)
  • Icelandic (is)
  • Sinhalese (si)
  • Swahili (sw)
  • Tatar (tt)
  • Tajik (tg)
  • Kurdish (Kurmanji) (ku)
  • Mongolian (mn)
  • Luxembourgish (lb)
  • Scots (sco)
  • Eastern Punjabi (pa)
  • Nepali (ne)
  • Egyptian Arabic (arz)
  • Sicilian (scn)
  • Occitan (oc)
  • Waray (war)
  • Asturian (ast)


Event Timeline

TJones renamed this task from Language Analysis Library Research Spike to Language Analysis Morphological Library Research Spike. Jul 25 2017, 8:41 PM
debt triaged this task as Medium priority. Jul 26 2017, 2:21 PM
debt moved this task from needs triage to Up Next on the Discovery-Search board.
debt added a project: Discovery-ARCHIVED.

@TJones: great writeup! But do you think time-boxing this to ~2 days will be enough to get through nearly 50 languages (since a few have already been done)?

@debt, no, there's likely no way to get through all 50 in two days. I was originally only going to list 20, but I thought that might be possible in two days, especially if things go really poorly. My thought was to get as much variety as possible in 2 days, and then re-assess. We could always do another day or three afterwards depending on how that first pass goes. If there's lots of good info to sort through, two days might only cover 8 or 10 languages.

The real value comes after we identify a likely target and get someone (us, the language engineering team, community volunteers, the author(s) of the morphology library) to build a usable plugin!

Sounds good, thanks for the clarification! :)

TJones updated the task description.
TJones moved this task from Up Next to Current work on the Discovery-Search board.

Perhaps worth noting that I'm pretty sure http://discovery.wmflabs.org/metrics/#langproj_breakdown isn't a true breakdown of search volume, although I should double check with @mpopov. I think that's a proportion of events in the TestSearchSatisfaction schema. The sampling on low-volume wikis is all the same, but the top 20 or so have custom sampling rates, which means we can't directly compare the numbers.

The volume % shares are based on which metric is selected. When TSS2-based metrics such as PaulScore and CTR are selected, the %s are calculated from that data. When ZRR is selected, the %s are calculated from ZRR lang-proj breakdown data which comes from Cirrus logs.

We should probably change it so the volume %s are calculated from ZRR no matter which metric is selected, so the volume % is always the "true" proportion of total search volume. What do you think, @chelsyx?

@mpopov Agree. That would be less confusing as well.

the volume %s are calculated from ZRR no matter which metric is selected, so the volume % is always the "true" proportion of total search volume

Sounds like a great upgrade! :)

Recurring themes:

  • Not everything is usefully licensed.
  • Code gets abandoned.
  • Useful code may exist that is not in English.
  • Java is easiest, but not everything is in Java.
  • Sometimes all that exists are research papers.

Selection Criteria:

  • Code that doesn’t seem abandoned.
  • Code that’s in a reasonable programming language.
  • Code that isn’t in a huge library and doesn’t have massive dependencies.
  • Code that looks to be reasonably mature.

Other important criteria for actual development and deployment (which would be assessed in a follow-up task) include:

  • Accuracy of analysis.
  • Ability to be integrated.
  • Run-time performance.

Based on my review of these seven languages, I suggest testing some of the software packages. Fortunately, we don’t need to commit to full Elasticsearch integration to perform our standard testing. As long as we can run the analysis and map analyzed tokens back to their original text, we can do most of the analysis needed to determine whether an analyzer is worth pursuing for integration.

  • For Japanese, I want to look at MeCab, tinysegmenter, and possibly CaboCha. T178923
  • For Vietnamese, I want to look at vnTokenizer. T178924
  • For Korean, I want to look at the newer module named mecab-ko-lucene-analyzer—there are two! T178925
  • For Serbian, I want to test both available stemmers: SerbianStemmer and SCStemmers. T178926
  • For Malay, I was only able to find research papers—nothing implemented or implementable that I could find.
  • For Estonian, I want to look at Vabamorf. T178928
  • For Slovak, I want to try both of the available stemmers: stemm-sk and Stemmer-sk. T178929
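The testing plan above depends on mapping analyzed tokens back to their original text. One hypothetical way to sketch that step (the function name and sample data here are invented for illustration) is to group surface forms by their analyzed output, so we can inspect which forms an analyzer conflates:

```python
from collections import defaultdict

def stem_groups(pairs):
    """Group original tokens by their analyzed form.

    pairs: iterable of (original_token, analyzed_token) tuples,
    as produced by whatever analyzer is under test.
    """
    groups = defaultdict(set)
    for original, analyzed in pairs:
        groups[analyzed].add(original)
    # Groups with more than one surface form show what the
    # analyzer conflates — the interesting part for review.
    return {analyzed: sorted(forms) for analyzed, forms in groups.items()}

# Invented sample output from a hypothetical stemmer:
sample = [("running", "run"), ("runs", "run"), ("ran", "ran"), ("cats", "cat")]
print(stem_groups(sample)["run"])  # ['running', 'runs']
```

Reviewing these groups by hand (or with frequency data) is what lets us judge an analyzer's quality without first wiring it into Elasticsearch.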

I expect some failures. Two of the language analyzers maintained or suggested by Elasticsearch (Japanese and Vietnamese) did not perform as well as we needed them to. However, several others did: those for Polish, Hebrew, Ukrainian, and Chinese (which involved two plugins being melded together). Right now, six of the seven languages I investigated yielded something worth following up on. We’ll see how many of those turn into something usable—if it’s two or three, this is definitely a process worth repeating. If it is zero, then maybe we need to let the language analyzers mature on their own and come to us when they are ready.

More details and raw notes on everything I looked at are on my notes page.

@TJones - I'll keep this in the done column for a bit but not close it out, in case we want to refer back to the notes here while the multiple subtasks are being worked on. Thanks for the detailed explanation of your work! :)

I updated the description and moved Croatian (hr) / Serbo-Croatian (sh) / Bosnian (bs) up with Serbian since we will be deploying the same stemmer for all of them (see T192395).