
Review Korean Morphological Libraries
Closed, Resolved (Public)

Description

Based on research in T171652, look at the following in more detail as possible candidates for creating Elasticsearch language analyzer plugins.

Event Timeline

First draft of my analysis of Nori is on MediaWiki.

Summary: Tokenization, part-of-speech tagging and filtering, Korean-specific concerns (dealing with Hanja), and lots of ambiguity leave us with a lot of complexity and a whole lot for speakers to review. I have concerns about some of it, so we really need a good speaker review before we can implement this. Setting up a demo on RelForge is also a possibility.

I'll start looking for reviewers next week when I get back from PTO.

I've asked for help reviewing on the Korean Wikipedia Village Pump and the Korean Wiktionary discussion forum. I contacted five Wikipedians who have volunteered to help English speakers on Korean Wikipedia, and may contact a few more if I don't hear back from many of them. I also found contact info for the Elasticsearch engineer who wrote the blog post on the new Korean analyzer, so I've emailed him.

Next up, upstream tickets for CJK and Nori issues that I found, and responding to those above who have questions, comments, etc.

I know some Korean and I'd be happy to help with this task if you don't hear from native Korean speakers.

@bmansurov, if you have the time, I'd love for you to take a look! No one else has agreed to look yet, but even if they do, having multiple sets of eyes on it is a good thing. Thanks!

@TJones, OK, I'll take a look. I'll leave a comment here when I'm done.

I've opened upstream tickets for Nori and CJK analyzers based on my analysis. (The Nori ticket got pushed back to Lucene.)

I've left some notes on the talk page. I'll do the remaining bits as I find some spare time.

Thanks, Baha! I'm working on a reply right now, though it's taking a bit longer than expected. Your help is very much appreciated! (And I think you undersold your Korean skills on the language skills page.)

@TJones, OK, I'll wait for your reply and see what I should do differently while doing the rest. (Thanks for the compliment.)

@bmansurov, nothing to do differently on your end; that was a great review/analysis! I looked into the discrepancies to see what I could find and documented them, both for myself and to inform anything you or anyone else looks at later, in case any patterns of fixable or reportable problems emerge.

EDIT: And I see you've already replied on the talk page. Thanks!

Speaker review is generally positive, but there are a couple of parts of speech that keep coming up as not really helpful, so I'm going to try filtering them and see what kind of diff that generates. This may or may not require another round of speaker review, depending on the impact.

\o/ I see you got some input from a native speaker for the remaining sections, @TJones.

Indeed, @bmansurov, but your help is very much appreciated, too!

Filtering out the problem parts of speech looks good, so this is ready to be built out in Analysis Config Builder, but we first need to upgrade to ES 6.4.2 (or at least be able to build a test environment with all the usual-suspect plugins).
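For anyone following along, here is a minimal sketch of what part-of-speech filtering with Nori could look like as Elasticsearch analysis settings, built as a Python dict. It is not the Analysis Config Builder output. The names `nori_tok`, `nori_pos`, and `korean_text` are made up for illustration, and the stop tags listed are placeholders rather than the parts of speech flagged in the speaker review. The `nori_tokenizer`, `nori_part_of_speech`, and `nori_readingform` (Hanja to Hangul) components come from the analysis-nori plugin.

```python
import json


def build_korean_analysis(stoptags):
    """Build an index 'analysis' settings dict that uses the Nori plugin.

    `stoptags` is a list of Nori part-of-speech tags whose tokens
    should be dropped from the token stream.
    """
    return {
        "analysis": {
            "tokenizer": {
                # Hypothetical name; "mixed" keeps compounds and their parts.
                "nori_tok": {
                    "type": "nori_tokenizer",
                    "decompound_mode": "mixed",
                }
            },
            "filter": {
                # Hypothetical name; removes tokens tagged with any stop tag.
                "nori_pos": {
                    "type": "nori_part_of_speech",
                    "stoptags": list(stoptags),
                }
            },
            "analyzer": {
                # Hypothetical analyzer name.
                "korean_text": {
                    "type": "custom",
                    "tokenizer": "nori_tok",
                    # nori_readingform converts Hanja to Hangul readings.
                    "filter": ["nori_pos", "nori_readingform", "lowercase"],
                }
            },
        }
    }


# Placeholder tags only (verbal endings, postpositions, interjections);
# NOT the list chosen after speaker review.
EXAMPLE_STOPTAGS = ["E", "J", "IC"]

settings = build_korean_analysis(EXAMPLE_STOPTAGS)
print(json.dumps(settings, indent=2))
```

The printed JSON could be used as the `settings` portion of an index-creation request on a test cluster that has the analysis-nori plugin installed. Swapping in a different tag list and diffing the analyzer output is one way to gauge the impact of each filtered part of speech.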

I'll open a child task for the next portion of the config and testing (and later we'll need another for deployment and re-indexing, as per usual).