Review Korean Morphological Libraries
Closed, Resolved · Public

Authored By

	TJones
	Oct 24 2017, 4:49 PM

Description

Based on research in T171652, look at the following in more detail as possible candidates for creating Elasticsearch language analyzer plugins.

mecab-ko-lucene-analyzer https://github.com/jaepil/mecab-ko-lucene-analyzer

Related Objects

Status	Assigned	Task
Invalid	None	T174065 [FY 2017-18 Objective] Improve support for searching in multiple languages
Open	None	T154511 [Tracking] Research, test, and deploy new language analyzers
Resolved	TJones	T171652 Language Analysis Morphological Library Research Spike
Resolved	TJones	T178925 Review Korean Morphological Libraries
Resolved	TJones	T206874 Add Nori (Korean) configuration to AnalysisConfigBuilder
Resolved	TJones	T216738 Reindex Korean-language wikis to enable Nori analyzer
Open	None	T219534 Test MLR models for zhwiki, jawiki and kowiki

Event Timeline

TJones created this task. Oct 24 2017, 4:49 PM

Restricted Application added a subscriber: revi. Oct 24 2017, 4:49 PM

TJones mentioned this in T171652: Language Analysis Morphological Library Research Spike. Oct 24 2017, 4:52 PM

TJones edited projects, added Discovery-Search; removed Discovery-Search (Current work). Oct 24 2017, 5:14 PM

TJones moved this task from needs triage to Up Next on the Discovery-Search board.

TJones moved this task from Up Next to Current work on the Discovery-Search board. Sep 4 2018, 7:39 PM

TJones edited projects, added Discovery-Search (Current work); removed Discovery-Search.

TJones moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board.

First draft of my analysis of Nori is on MediaWiki.

Summary: Tokenization, part-of-speech tagging and filtering, Korean-specific concerns (dealing with Hanja), and lots of ambiguity leave us with a lot of complexity and a whole lot for speakers to review. I have concerns about some of it, so we really need a good review before we can implement this. Setting up a demo on RelForge is also a possibility.

I'll start looking for reviewers next week when I get back from PTO.

TJones moved this task from not in use - please delete to Needs review on the Discovery-Search (Current work) board. Sep 27 2018, 6:51 PM

revi awarded a token. Sep 27 2018, 7:03 PM

I've asked for help reviewing on the Korean Wikipedia Village Pump and the Korean Wiktionary discussion forum. I contacted five Wikipedians who have volunteered to help English speakers on Korean Wikipedia, and may contact a few more if I don't hear back from many of them. I also found contact info for the Elasticsearch engineer who wrote the blog post on the new Korean analyzer, so I've emailed him.

Next up, upstream tickets for CJK and Nori issues that I found, and responding to those above who have questions, comments, etc.

I know some Korean and I'd be happy to help with this task if you don't hear from native Korean speakers.

@bmansurov, if you have the time, I'd love for you to take a look! No one else has agreed to look yet, but even if they do, having multiple sets of eyes on it is a good thing. Thanks!

@TJones, OK, I'll take a look. I'll leave a comment here when I'm done.

I've opened upstream tickets for Nori and CJK analyzers based on my analysis. (The Nori ticket got pushed back to Lucene.)

I've left some notes on the talk page. I'll do the remaining bits as I find some spare time.

Thanks, Baha! I'm working on a reply right now, though it's taking a bit longer than expected. Your help is very much appreciated! (And I think you undersold your Korean skills on the language skills page.)

@TJones, OK, I'll wait for your reply and see what I should do differently while doing the rest. (Thanks for the compliment.)

@bmansurov, nothing to do differently on your end; that was a great review/analysis! I looked into the discrepancies to see what I could find and documented them, both for myself and to inform whatever you or anyone else looks at later, in case any patterns of fixable or reportable problems emerge.

EDIT: And I see you've already replied on the talk page. Thanks!

Speaker review is generally positive, but there are a couple of parts of speech that keep coming up as not really helpful, so I'm going to try filtering them and see what kind of diff that generates. This may or may not require another round of speaker review, depending on the impact.

\o/ I see you got some input from a native speaker for the remaining sections, @TJones.

Indeed, @bmansurov, but your help is very much appreciated, too!

TJones moved this task from not in use - please delete to Needs Reporting on the Discovery-Search (Current work) board. Oct 12 2018, 6:44 PM

Filtering out the problem parts of speech looks good, so this is ready to be built out in Analysis Config Builder, but we need to upgrade to ES 6.4.2 (or at least be able to build a test environment with all the usual ~~suspects~~ plugins).

I'll open a child task for the next portion of the config and testing (and later we'll need another for deployment and re-indexing, as per usual).
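
For readers unfamiliar with how part-of-speech filtering plugs into a Nori analysis chain, here is a minimal sketch in Python that builds an Elasticsearch settings fragment. The component types `nori_tokenizer`, `nori_part_of_speech`, and `nori_readingform` come from the Elasticsearch analysis-nori plugin. Everything else is an assumption made for illustration: the names `nori_tok`, `nori_pos`, and `text`, the `decompound_mode` value, and especially the stoptags list, since this task does not say which parts of speech were filtered. It is not the configuration that Analysis Config Builder generates.

```python
import json

# Placeholder tags for illustration only; the task does not name
# the parts of speech that were actually filtered.
EXAMPLE_STOPTAGS = ["IC", "SP"]


def build_korean_analysis(stoptags):
    """Return a hypothetical Elasticsearch analysis settings fragment for Nori."""
    return {
        "analysis": {
            "tokenizer": {
                # "mixed" keeps compounds and also emits their parts (assumed choice).
                "nori_tok": {"type": "nori_tokenizer", "decompound_mode": "mixed"}
            },
            "filter": {
                # Drops tokens whose part-of-speech tag is in stoptags.
                "nori_pos": {"type": "nori_part_of_speech", "stoptags": list(stoptags)}
            },
            "analyzer": {
                "text": {
                    "type": "custom",
                    "tokenizer": "nori_tok",
                    # nori_readingform rewrites Hanja tokens to their Hangul reading.
                    "filter": ["nori_pos", "nori_readingform", "lowercase"],
                }
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_korean_analysis(EXAMPLE_STOPTAGS), indent=2))
```

Changing the stoptags list and diffing the resulting tokens against a sample of text is one way to gauge the impact of filtering a given part of speech before asking for another round of speaker review.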

debt closed this task as Resolved. Oct 19 2018, 2:44 PM

debt closed subtask T206874: Add Nori (Korean) configuration to AnalysisConfigBuilder as Resolved. Feb 22 2019, 8:34 PM

TJones mentioned this in T317476: Filter and sort search results of Japanese kana search queries in accordance with how much of the query appears as a consecutive substring. Sep 12 2022, 6:38 PM
