
Language Analysis Morphological Library Research Spike
Closed, Resolved · Public

Description

We've pretty much made it through the list of analyzers recommended by or at least pointed to by Elastic as part of T154511: [Tracking] Research, test, and deploy new language analyzers.

Several of the analyzer plugins are mainly wrappers around some other third-party open-source morphological library. So, maybe it wouldn't be that hard to wrap a plugin around another existing open-source morphological library. Of course, it would depend on details of the library: grammatical completeness, programming language, code maturity, how well maintained it is, etc.

Below I've put together a heuristically sorted list of languages currently without language-specific analyzers. <nerd>I ranked by number of articles in the respective Wikipedias (W) and by volume of search requests (S). The final ranking is (300-W)*(300-S)^1.5.</nerd> This ranking takes into account Wikipedia size and search volume, with a higher weight on search volume. The article-count outliers Cebuano and Waray are still on the list, but much farther down than where article count alone would place them. It's probably a good thing that the two recently abandoned analyzers, Japanese and Vietnamese, are at the top of the list, because it at least hints that our old list and new list mesh reasonably well.
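As a hypothetical illustration (the function name and sample ranks below are invented, not from the actual ranking script), the heuristic can be sketched as:

```python
# Hypothetical sketch of the ranking heuristic described above.
# W = rank by Wikipedia article count, S = rank by search volume
# (rank 1 = largest). Higher score = higher priority; the 1.5
# exponent on the search-volume term gives it the heavier weight.

def priority(w_rank: int, s_rank: int) -> float:
    return (300 - w_rank) * (300 - s_rank) ** 1.5

# With equal rank offsets, a better search rank outweighs a
# better article-count rank, as the description says:
assert priority(5, 1) > priority(1, 5)
```

This is why Cebuano and Waray, whose article-count ranks are very high but whose search-volume ranks are not, drop well down the combined list.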

The goal of the research spike would be to time box an investigation of these (say, two days) in order to try to answer these questions:

  • Are they actually using the default non-language-specific analyzer? (probably, but if not, document!)
  • Do open-source elasticsearch plugins exist for these languages? (probably not, but if so, get excited!)
  • Do other open-source morphological libraries for these languages exist? (if so, document them here!)

An initial research spike would probably not be enough time to evaluate all of them (unless things go very poorly), but would give a sense of what's out there and whether it's worth it to continue with this line of investigation.

Depending on what exists, how mature the code and coverage is, and other factors, it might be worthwhile to spin off separate tasks to more deeply assess particular morphological libraries, to try to wrap them into Elasticsearch plugins, or to encourage volunteers to do so, etc.

A few additional notes:

  • There are certainly other approaches that might make sense on a case-by-case basis. For example, for particularly similar languages it may be possible, and even easier, to adapt an existing morphological library from one language to the other. Indonesian to Malay might be a candidate, for example.
  • There are also possibly varieties listed or not listed here that should be considered together, like maybe Serbian, Croatian, and Serbo-Croatian.

The top 50 languages on my list are:

  • Japanese (ja)
  • Vietnamese (vi)
  • Korean (ko)
  • Serbian (sr) / Croatian (hr) / Serbo-Croatian (sh) / Bosnian (bs)
  • Malay (ms)
  • Estonian (et)
  • Slovak (sk)

That's as far as we got with reviews on the first spike. The rest of the list is below.

  • Tagalog (tl)
  • Tamil (ta)
  • Belarusian (be)
  • Georgian (ka)
  • Azerbaijani (az)
  • Kazakh (kk)
  • Urdu (ur)
  • Latin (la)
  • Esperanto (eo)
  • Malayalam (ml)
  • Telugu (te)
  • Bengali (bn)
  • Cebuano (ceb)
  • Uzbek (uz)
  • Albanian (sq)
  • Marathi (mr)
  • Macedonian (mk)
  • Cantonese (zh-yue)
  • Afrikaans (af)
  • Welsh (cy)
  • Gujarati (gu)
  • Burmese (my)
  • Kannada (kn)
  • Breton (br)
  • Icelandic (is)
  • Sinhalese (si)
  • Swahili (sw)
  • Tatar (tt)
  • Tajik (tg)
  • Kurdish (Kurmanji) (ku)
  • Mongolian (mn)
  • Luxembourgish (lb)
  • Scots (sco)
  • Eastern Punjabi (pa)
  • Nepali (ne)
  • Egyptian Arabic (arz)
  • Sicilian (scn)
  • Occitan (oc)
  • Waray (war)
  • Asturian (ast)


Event Timeline

TJones renamed this task from Language Analysis Library Research Spike to Language Analysis Morphological Library Research Spike. Jul 25 2017, 8:41 PM
debt triaged this task as Medium priority. Jul 26 2017, 2:21 PM
debt moved this task from needs triage to Up Next on the Discovery-Search board.
debt added a project: Discovery-ARCHIVED.

@TJones: great writeup! But do you think time-boxing this to ~2 days will be enough to get through nearly 50 languages (since a few have already been done)?

@debt, no, there's likely no way to get through all 50 in two days. I was originally only going to list 20, but I thought that might be possible in two days, especially if things go really poorly. My thought was to get as much variety as possible in 2 days, and then re-assess. We could always do another day or three afterwards depending on how that first pass goes. If there's lots of good info to sort through, two days might only cover 8 or 10 languages.

The real value comes after we identify a likely target and get someone (us, the language engineering team, community volunteers, the author(s) of the morphology library) to build a usable plugin!

Sounds good, thanks for the clarification! :)

TJones updated the task description.
TJones moved this task from Up Next to Current work on the Discovery-Search board.

Perhaps worth noting that I'm pretty sure http://discovery.wmflabs.org/metrics/#langproj_breakdown isn't a true breakdown of search volume, although I should double check with @mpopov. I think that's a proportion of events in the TestSearchSatisfaction schema. The sampling on low-volume wikis is all the same, but the top 20 or so have custom sampling rates, which means we can't directly compare the numbers.

The volume % shares are based on which metric is selected. When TSS2-based metrics such as PaulScore and CTR are selected, the %s are calculated from that data. When ZRR is selected, the %s are calculated from ZRR lang-proj breakdown data which comes from Cirrus logs.

We should probably change it so the volume %s are calculated from ZRR no matter which metric is selected, so the volume % is always the "true" proportion of total search volume. What do you think, @chelsyx?

@mpopov Agree. That would be less confusing as well.

the volume %s are calculated from ZRR no matter which metric is selected, so the volume % is always the "true" proportion of total search volume

Sounds like a great upgrade! :)

Recurring themes:

  • Not everything is usefully licensed.
  • Code gets abandoned.
  • Useful code may exist that is not in English.
  • Java is easiest, but not everything is in Java.
  • Sometimes all that exists are research papers.

Selection Criteria:

  • Code that doesn’t seem abandoned.
  • Code that’s in a reasonable programming language.
  • Code that isn’t in a huge library and doesn’t have massive dependencies.
  • Code that looks to be reasonably mature.

Other important criteria for actual development and deployment (which would be assessed in a follow-up task) include:

  • Accuracy of analysis.
  • Ability to be integrated.
  • Run-time performance.

Based on my review of these seven languages, I suggest testing some of the software packages. Fortunately, we don’t need to commit to full Elasticsearch integration to perform our standard testing. As long as we can run the analysis and map analyzed tokens back to their original text, we can do most of the analysis needed to determine whether an analyzer is worth pursuing for integration.

  • For Japanese, I want to look at MeCab, tinysegmenter, and possibly CaboCha. T178923
  • For Vietnamese, I want to look at vnTokenizer. T178924
  • For Korean, I want to look at the newer module named mecab-ko-lucene-analyzer—there are two! T178925
  • For Serbian, I want to test both available stemmers: SerbianStemmer and SCStemmers. T178926
  • For Malay, I was only able to find research papers—nothing implemented or implementable that I could find.
  • For Estonian, I want to look at Vabamorf. T178928
  • For Slovak, I want to try both of the available stemmers: stemm-sk and Stemmer-sk. T178929
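The testing plan above depends on mapping analyzed tokens back to their original text. One hypothetical way to sketch that step (the function name and sample data here are invented for illustration) is to group surface forms by their analyzed output, so we can inspect which forms an analyzer conflates:

```python
from collections import defaultdict

def stem_groups(pairs):
    """Group original tokens by their analyzed form.

    pairs: iterable of (original_token, analyzed_token) tuples,
    as produced by whatever analyzer is under test.
    """
    groups = defaultdict(set)
    for original, analyzed in pairs:
        groups[analyzed].add(original)
    # Groups with more than one surface form show what the
    # analyzer conflates — the interesting part for review.
    return {analyzed: sorted(forms) for analyzed, forms in groups.items()}

# Invented sample output from a hypothetical stemmer:
sample = [("running", "run"), ("runs", "run"), ("ran", "ran"), ("cats", "cat")]
print(stem_groups(sample)["run"])  # ['running', 'runs']
```

Reviewing these groups by hand (or with frequency data) is what lets us judge an analyzer's quality without first wiring it into Elasticsearch.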

I expect some failures. Two of the language analyzers maintained or suggested by Elasticsearch (Japanese and Vietnamese) did not perform as well as we needed them to. However, several others did: those for Polish, Hebrew, Ukrainian, and Chinese (which involved two plugins being melded together). Right now, six of the seven languages I investigated yielded something worth following up on. We’ll see how many of those turn into something usable—if it’s two or three, this is definitely a process worth repeating. If it is zero, then maybe we need to let the language analyzers mature on their own and come to us when they are ready.

More details and raw notes on everything I looked at are on my notes page.

@TJones - I'll keep this in the done column for a bit but not close it out, in case we want to refer back to the notes here while the multiple subtasks are being worked on. Thanks for the detailed explanation of your work! :)

I updated the description and moved Croatian (hr) / Serbo-Croatian (sh) / Bosnian (bs) up with Serbian since we will be deploying the same stemmer for all of them (see T192395).