
Test and analyze Kuromoji & Sudachi Japanese language analyzers
Closed, ResolvedPublic13 Estimated Story Points

Description

User Story: As a user of a Japanese-language wiki, I'd like better language processing than overlapping bigrams. The Kuromoji or Sudachi analyzers might well be up to the task.

Japanese is a major language (13th most speakers) with a large Wikipedia (also 13th by article count), a robust on-wiki community (5th by active users), and high search volume (6th by unique queries). The language and its writing system are complex, and word segmentation is particularly challenging, but overall it is well-supported by modern NLP libraries, including ones available for Lucene (and thus Elasticsearch / OpenSearch), such as Kuromoji.

Nonetheless, we currently use a very simplistic approach to parsing Japanese, namely overlapping bigrams. In English, this would be not quite as bad as parsing statesman into st, ta, at, te, es, sm, ma, and an, searching on those bigrams, and trying not to be surprised when "International Politics and the Establishment of Presbyterianism in the Channel Islands" is returned as the top result.
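The statesman analogy above can be sketched in a few lines (illustrative Python, not the actual CJK bigram tokenizer code):

```python
def overlapping_bigrams(text):
    """Split text into overlapping character bigrams, roughly what the
    CJK bigram approach does to Japanese text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(overlapping_bigrams("statesman"))
# ['st', 'ta', 'at', 'te', 'es', 'sm', 'ma', 'an']
```

Any document containing all of those bigrams matches, whether or not it contains the word statesman.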

The current scattershot bigram approach is much better than nothing (i.e., requiring exact string matches), but it is not very precise—which is why we have previously moved away from it for Chinese and Korean.

It's been a bit more than five years since we last looked at Kuromoji (T166731). In that time, it has probably gotten better, and I expect my ability to deal with shortcomings in analyzers has also gotten better.*

────────
   * Experience is something you don't get until right after you need it.
 

Acceptance Criteria:

  • A write up of findings on the Kuromoji + Sudachi analyzers
  • Either...
    • ...include reasons why Kuromoji / Sudachi are unacceptable in the write up, or
    • ...a patch implementing the Kuromoji analyzer, the Sudachi analyzer, or both

Note: Updated the task description to include Sudachi and to end with the analysis changes, dropping the measurement criteria because we are going to delay deployment during the OpenSearch migration.

Event Timeline

TJones set the point value for this task to 13.Sep 26 2022, 4:01 PM

I'm on the fence between 8 & 13 story points (can I say 10?), so I'm going with the bigger number until we talk about it at a later meeting.

Moving this back to the backlog to focus on more straightforward unpacking. CJK analyzer unpacking for Japanese (T326822) is still underway.

TJones triaged this task as High priority.Jan 13 2023, 9:28 PM
TJones lowered the priority of this task from High to Medium.Mar 6 2023, 6:26 PM
TJones raised the priority of this task from Medium to High.Sep 19 2024, 2:47 PM
Gehel updated the task description. (Show Details)

Have there been evaluations for other tokenizers? Would love to see an evaluation of sudachi if there’s bandwidth

> Have there been evaluations for other tokenizers? Would love to see an evaluation of sudachi if there’s bandwidth

@tchin, be the bandwidth you want to see in the world!

The samples I gave you to look at are for Kuromoji and the ICU tokenizer—which uses a different (newer? bigger?) dictionary. I added the ICU tokenizer to the mix because we've had good results with it elsewhere, and if we use it, we get ok to good tokenization in several other spaceless Asian languages at once, which is nice. So that's one other tokenizer. Sudachi seems like it would be another good one to look at.
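For anyone who wants to compare tokenizer output directly, both tokenizers can be exercised against a cluster with the _analyze API (this assumes the analysis-icu and analysis-kuromoji plugins are installed; the sample text is illustrative):

```json
POST /_analyze
{
  "tokenizer": "icu_tokenizer",
  "text": "日本語の文章"
}
```

Swapping in "kuromoji_tokenizer" for "icu_tokenizer" in the same request shows how the two dictionaries segment the same sentence differently.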

It looks like Sudachi has a plugin for our current version of Elasticsearch, and it can at least be built for OpenSearch 2.6–2.17. I'm not sure what our initial target version of OpenSearch 2 is, but if Sudachi is awesome it might be worth it to aim for 2.6+. (Alternatively, it might be easy enough to compile Sudachi for a slightly lower version of OpenSearch if that's necessary for some reason.)

So! I will try to install and investigate the Sudachi plugin and look for the issues I usually look for. (It's surprising how many analyzers will just eat foreign scripts, or go nuts on a particular rare punctuation character, etc.) If I can get it running next week, I'll re-tokenize the same sample sentences and put them where you can review them, if you are up for it.

My notes on Kuromoji are now on MediaWiki. I still need to add my Sudachi notes and the results of the speaker review of the tokenization, but I'm trying to share reasonably coherent chunks as I finish writing them up.

My notes on Sudachi have been added to my ongoing documentation, along with some basic info on load speed.

Sudachi has a lot of quirks, and it is slow.... but maybe it's worth it? (Foreshadowing!)

If you have any questions, I could try to ask them on the Sudachi Slack?

I've added notes on the speaker review—thanks @tchin & @jeena!! It includes lots of numbers and a lovely graph showing that Sudachi really is better at processing Japanese text, despite its quirks.

Up next I want to commit my custom config for Kuromoji and Sudachi, then I will close this ticket and open a new one for reviewing dictionaries, and try to figure out how to use a custom dictionary for either Kuromoji or Sudachi, to see if we can either upgrade Kuromoji's parsing or tame some of Sudachi's quirks for our use cases. We might fall back to Kuromoji if the Sudachi dictionary can be made to work well with it, since Kuromoji appears to be better supported for OpenSearch 1.x. More details in the notes.

> If you have any questions, I could try to ask them on the Sudachi Slack?

I don't think I have any questions of the "how do I do this?" sort. I have a few "why does it work like this?" questions, but they don't need to be answered. I have some suggestions for improvements, which could go through Slack, but I was planning to just open a ticket on GitHub.

I opened a MegaTicket™ on the Sudachi GitHub repo describing a lot of the issues I found, in case they want to address any of them. I don't expect them to move fast enough to solve our problems before we want to deploy stuff—and they may not agree that all my issues need fixing on their end—but any incremental progress is still good progress.

TJones tried to set Final Story Points to "21? A million?" but it didn't work. Phab has no sense of humor.

New plan: rather than introduce the complexity of a custom dictionary, I found decent ways to hack the analysis config for Sudachi to get good output with the default dictionary. word_delimiter_graph is your friend. See more on MediaWiki.
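As a rough sketch of the kind of config hack meant here—the filter name and option values are illustrative, not the actual merged settings—word_delimiter_graph can be layered on top of the Sudachi tokenizer to re-split and clean up tokens:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "ja_word_delim": {
          "type": "word_delimiter_graph",
          "preserve_original": true
        }
      },
      "analyzer": {
        "text": {
          "type": "custom",
          "tokenizer": "sudachi_tokenizer",
          "filter": ["ja_word_delim", "lowercase"]
        }
      }
    }
  }
}
```

The point of the hack is that problematic Sudachi tokens get post-processed by a standard, well-understood filter instead of requiring a custom dictionary build.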

In addition to being easier to implement and maintain, having a good config for both Kuromoji and Sudachi (with "better" Sudachi overriding "good enough" Kuromoji when both are available) will allow us to fall back gracefully to Kuromoji if porting Sudachi to OpenSearch 1.x turns out not to be possible or worth the effort, and to upgrade back to Sudachi when we reach OpenSearch 2.x.

We could stick with Kuromoji until we reach OpenSearch 2.x, but I suggest we go with Sudachi now and try to port Sudachi to OpenSearch 1.x.

A patch with wiki-optimized configs for Sudachi and Kuromoji is forthcoming.

Change #1120964 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Custom Japanese Config with Kuromoji and Sudachi

https://gerrit.wikimedia.org/r/1120964

TJones renamed this task from Test and analyze Kuromoji Japanese language analyzer to Test and analyze Kuromoji & Sudachi Japanese language analyzers.Feb 19 2025, 7:46 PM
TJones updated the task description. (Show Details)

After discussion in today's Wednesday Meeting, I've changed the scope of this ticket to end with merging the updated config above.

We're going to hold off on enabling Sudachi until after the OpenSearch 1.x migration. We can then look into backporting the Sudachi plugin to OpenSearch 1.x (and if that doesn't work out, perhaps enabling Kuromoji in the meantime).

Work on an MLR model for Japanese will be postponed until after the OpenSearch migration.

I'll open a couple more tickets related to Sudachi to cover the remaining work.

Change #1124183 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Explicitly Declare icu_normalizer Char Filter

https://gerrit.wikimedia.org/r/1124183

Change #1120964 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Custom Japanese Config with Kuromoji and Sudachi

https://gerrit.wikimedia.org/r/1120964

Change #1124183 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Explicitly Declare icu_normalizer Char Filter

https://gerrit.wikimedia.org/r/1124183