
Test and analyze new Kuromoji Japanese language analyzer
Closed, Resolved · Public

Description

I'm going to skip the research spike since there's a strong recommendation from Elastic for the Kuromoji Japanese language analyzer, and unless I find a problem with it, that's the one we'd want to go with.

So, the plan is to test it and analyze the results to see whether it is better. If it is, we will file a task to deploy it. If not, we'll go back and do that research spike.

Event Timeline

The initial write-up of the language analyzer analysis is here: https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Kuromoji_Analyzer_Analysis

I still need to get some Japanese speaker review, and to test the tokenization more generally.

Quick summary: there are lots of configuration options; the unpacked version works much better for non-Japanese, non-Latin text; full-width numbers are handled oddly, but that can be fixed with a char filter; and while it still needs speaker review, there aren't many implausible groups of words analyzed together.
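For anyone curious what the char filter fix looks like, here is a minimal sketch, assuming a local Elasticsearch with the analysis-kuromoji plugin installed. The index, filter, and analyzer names ("ja_test", "fullwidth_numbers", "ja_text") are made up for illustration and are not the actual CirrusSearch configuration.

```
import requests

# Sketch: a mapping char filter that folds fullwidth digits to ASCII before
# the kuromoji tokenizer sees them, wrapped in a custom analyzer. All names
# here are illustrative, not the CirrusSearch config.
settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "fullwidth_numbers": {
                    "type": "mapping",
                    "mappings": [
                        "０=>0", "１=>1", "２=>2", "３=>3", "４=>4",
                        "５=>5", "６=>6", "７=>7", "８=>8", "９=>9",
                    ],
                }
            },
            "analyzer": {
                "ja_text": {
                    "type": "custom",
                    "char_filter": ["fullwidth_numbers"],
                    "tokenizer": "kuromoji_tokenizer",
                    "filter": ["kuromoji_baseform", "cjk_width", "lowercase"],
                }
            },
        }
    }
}

# Create a throwaway index with the custom analyzer (assumes ES on localhost:9200).
print(requests.put("http://localhost:9200/ja_test", json=settings).json())
```

With something like this in place, the _analyze API should return the same tokens for "１２３" and "123".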

I'm happy to help with speaker review. I take it that I'm supposed to check whether the frequencies of the different forms seem right (e.g., whether some unlikely forms appear more frequently than others). Is that correct?

Also, if one or two sample sentences were available for each token, that would be helpful (especially for short hiragana-only tokens). It's sometimes not immediately obvious what a standalone token means, because I (and, I believe, most Japanese speakers) tend to think in longer segments than morphemes.

@Rxy and @whym—thanks for offering to help!

The groups to review are in my notes on mediawiki.

The sample of random groups is probably the most important since it is the most representative. The largest groups and the groups with no common prefix/suffix are more likely to have problems, which is why I gathered them for inspection. If they have only minor problems, I will be much more confident in the analyzer overall.

The primary goal of the group review is to make sure that the words/tokens that are grouped together are reasonable. In most cases, if you searched for one of them and found the others, would that be reasonable? Sometimes reasonable groups overlap—because language is always messy—and that's acceptable. For example, in English, can means either "to be able to" or "a metal cylinder for holding food". The form cans definitely refers to metal cylinders, and the form can't is definitely the negative of the verb—but they might show up grouped together because they are both related to can. (This doesn't actually happen with the English analyzer, and many ambiguous forms in English overlap completely—fly and flies can each be either noun or verb—so examples that only partially overlap are hard to find.)

Frequency is less important but still valuable. I usually use it to decide how important an error is. If two very rare words are linked incorrectly, it's a small problem. If two very common words are linked incorrectly, it's a bigger problem. Since tokenization is much harder in Japanese, checking for reasonable frequencies is a great idea, too—thanks!

I haven't tried to pull out example sentences before. It's more complicated for Japanese because I can't just search for the word in the text. I will try to add an option to my analysis code to pull out examples for particular tokens. If you can, please take a look at what's there and let me know if you think you need examples for everything, or just for some tokens.

I recognize that the context can help a lot for less concrete parts of speech like particles, but I'm worried that there could be confusion when there is a segmenting error when the rest of the analysis is correct. A forced example, in English, would be to segment important as import + ant. That's a mistake, but grouping import with imports, imported, importing and grouping ant with ants is correct. Right now I'm focused on the grouping, and will be testing the segmenting separately.

So, keep that in mind, and I'll work on finding example sentences in the text.

Thanks for your help!

Additional analysis on tokenization is available.

Highlights: There are some systematic differences between the tokenization in the test corpus and the tokenization done by Kuromoji. I manually corrected the bulk of the punctuation differences, since punctuation isn't indexed. Comparative accuracy is still not super high, but there are a lot of systematic differences—the 59 most common token types account for about 40% of all tokenization differences. There are plenty of obvious errors (e.g., the longest tokens found). Some of the tokenization differences may be based on how the different tokenizers treat grammatical particles, but I'm not sure about that because so many single-character words are so polysemous.
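(For anyone who wants to reproduce that kind of tally, here is a rough sketch of how the differing token types could be counted. The file names and the one-tokenized-sentence-per-line format are assumptions for illustration, not my actual analysis setup.)

```
from collections import Counter

# Sketch: count token types that appear in one tokenization of a sentence but
# not the other, then see how much of the total difference the most common
# token types account for. Assumes two aligned files, one whitespace-separated
# tokenized sentence per line (hypothetical file names).
def load_tokens(path):
    with open(path, encoding="utf-8") as f:
        return [line.split() for line in f]

reference = load_tokens("reference_tokens.txt")
kuromoji = load_tokens("kuromoji_tokens.txt")

diffs = Counter()
for ref, kur in zip(reference, kuromoji):
    ref_counts, kur_counts = Counter(ref), Counter(kur)
    diffs.update(ref_counts - kur_counts)  # tokens only in the reference
    diffs.update(kur_counts - ref_counts)  # tokens only in Kuromoji's output

total = sum(diffs.values())
top = sum(count for _, count in diffs.most_common(59))
print(f"top 59 differing token types cover {top / total:.1%} of all differences")
```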

Overall, it's not great, but good enough to move on to setting up an instance in labs and asking for feedback on the Village Pump.

Sounds good, @TJones, let's go ahead and set up the instance, and we can start crafting a message for posting on the Village Pump, with a little help from @CKoerner_WMF if he has time. :)

I haven't tried to pull out example sentences before. It's more complicated for Japanese because I can't just search for the word in the text. I will try to add an option to my analysis code to pull out examples for particular tokens.

I'm still working on this. It's turning out to be more complicated (and more necessary!) because a number of tokens are normalized differently in different contexts. It's not just that one token is generating multiple normalized forms. It seems to be context dependent. So, I need to be more careful about pulling out the right example sentences, and not just any sentence with that token.

I've pulled out up to three example sentences for each token in each group for review. Some tokens only occur once or twice, so they have fewer examples. There may be cases where multiple examples are the same; I haven't tried to filter them out.

It's also possible that some of the example sentences are cut off at the beginning or end if they are really long.

Since the list of example sentences is very large, I've put them on their own pages. They are linked in my notes, but here are the direct links:

  • Groups with no common prefix or suffix. This one also includes a more detailed explanation of the sample output format.
  • Largest groups, i.e., the ones with the most distinct tokens.
  • 50 random groups. These are the ones that are most representative of the tokenization and stemming/grouping as a whole.

Thanks for providing additional samples! Here are some quick observations I had. Hopefully some of them are relevant.

Grouping

As noted, the polysemy of some of the short tokens is an issue, and I wonder how it is disambiguated (or not disambiguated at all?) in indexing. For example, Group of [68 た] - this appears to be a group for たい ("want to"). If the keyword "たい" gives you search results containing "た", that would make sense only if the "た" is analyzed and indexed as the "た" belonging to that particular "たい" group. If the results are mixed with documents containing other "た" that should belong to other groups, that would be highly confusing. "た" can be a past particle (which should form its own one-element group because it doesn't conjugate), for example. Again, people are not used to thinking in morphemes and conjugated morpheme forms - ambiguity like this would be less obvious than the ambiguity of "can", and could surprise the user more.

The rest of the groups (longer hiragana sequences, sequences containing a kanji) seem reasonable to me. From the random groups:

[98 携わっ][3 携わら][23 携わり][48 携わる][1 携われ]

If I enter 携わる and get results for 携わり, that would seem helpful and would be something I'd expect from modern search systems.

Personally, I fear that the groups with no common prefix/suffix (something close to irregular inflection) might bring more confusion than help to users when they are treated as the same thing, because many 1- or 2-character hiragana sequences are rather highly ambiguous (more so than "can" the noun vs. "can" the auxiliary verb, I'd say). I seem to be more familiar with search systems that stem more conservatively (basically, those that only stem tokens containing at least one concrete word). That said, I might change my opinion after actually experiencing kuromoji-based search over Japanese wiki pages.

By the way, is this planned only for Japanese Wikipedia? I suspect Japanese Wiktionary is more multilingual and harder to get right.

Segmentation

I agree that Kuromoji's segmentation appears to be roughly as good as KNCB's. Something similar to the difference between short and long units might be responsible for many of the discrepancies (i.e., more about a different granularity of analysis than about accuracy).

About tokens in context: I just briefly scanned it, and an obvious observation is that tokens in hiragana sequences in parentheses in https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Kuromoji_Analyzer_Analysis/sample/no_common (such as "く" in "彦六(ひころく)" and "ろ" in "(とうあんろえき)") are almost always wrongly segmented. I suspect Kuromoji is not trained to deal with them, because these parenthesized hiragana expressions for showing readings are unnatural in normal text - they are almost exclusively found in dictionaries and encyclopedias. https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Kuromoji_Analyzer_Analysis/sample/largest looks mostly good as far as segmentation of the highlighted tokens goes. https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Kuromoji_Analyzer_Analysis/sample/random is not bad but a bit mixed - "こうじ" in "加太こうじ、評論家" is not a form of the verb こうじる, but a personal name, and all four samples for "てんじる" are wrongly segmented. (These errors, again, tend to be found in parenthesized hiragana readings.)

because a number of tokens are normalized differently in different contexts. It's not just that one token is generating multiple normalized forms. It seems to be context dependent.

Does this mean one token (surface form) can belong to more than one group - i.e. do groups disambiguate tokens? That sounds linguistically more correct, but I don't know if Kuromoji might simplify it into mutually exclusive groups for efficiency purposes.

To be fair, the "wrongly" segmented tokens might matter less, because most of them look less likely to be used (at least as a standalone search keyword). I wouldn't normally search for "こうじる" (an uncommon spelling of "講じる") unless perhaps I'm looking for misspellings to correct.

@whym, thanks for all the great information!

Some random thoughts and information as I read through:

On た: I checked, and in my sample of 10,000 Wikipedia articles, it is only indexed 68 times, always as たい. The character occurs 72,059 times in the corpus, though, so it is mostly ignored. I’ve noticed that the Kuromoji analyzer drops a lot of characters and just doesn’t index them. (This isn’t a disaster—we have the “text” field with the analyzed text, but also the “plain” field, which is generally unchanged, so exact matches are always possible.)

As an example of text being dropped, I analyzed this sentence fragment, which has た in it, though た doesn’t get indexed. The characters in [square brackets] are not indexed. Running the text through Google translate, there don’t seem to be any egregious errors.

  • グレート [・]アトラクター [が] 数億光年 [に] 渡る宇宙 [の] 領域内 [にある] 銀河 [とそれが] 属する銀河団 [の] 運動 [に] 及ぼす影響 [の] 観測 [から] 推定 [されたものである。]
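If anyone wants to poke at this themselves, here's a minimal sketch using Elasticsearch's _analyze API, assuming a local instance with the analysis-kuromoji plugin. It uses the packaged kuromoji analyzer; the actual wiki configuration uses a customized analysis chain, so results may differ a bit.

```
import requests

# Sketch: see which tokens the packaged kuromoji analyzer emits for a piece of
# text. The start/end offsets show which characters of the input survived into
# the token stream; anything not covered by a token was dropped.
text = "グレート・アトラクターが数億光年に渡る宇宙の領域内にある銀河"
resp = requests.post(
    "http://localhost:9200/_analyze",
    json={"analyzer": "kuromoji", "text": text},
)
for token in resp.json()["tokens"]:
    print(token["token"], token["start_offset"], token["end_offset"])
```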

Personally, I fear that the groups with no common prefix/suffix (something close to irregular inflection) might bring more confusion than help to users when they are treated as the same thing, because many 1- or 2-character hiragana sequences are rather highly ambiguous (more so than "can" the noun vs. "can" the auxiliary verb, I'd say).

I agree that these Japanese tokens can be so much more ambiguous! (I just like to give examples in English for those who are reading along with us.) I think a lot of the 1- and 2-character tokens are being dropped by Kuromoji in most instances, so there won’t be as much chance for confusion as you fear. I’ve done a quick additional analysis of all the 1- and 2-character tokens in the no common prefix/suffix section.

The short version is that the 2-character tokens aren’t super common as strings in the corpus, though many are indexed. Some of the 1-character tokens are extremely common—many with tens of thousands and a few with hundreds of thousands of occurrences. However, most are very rarely indexed (less than 95% of the time), and the ones that are more commonly indexed are somewhat rarer (occurring between ~1,000 and ~10,000 times). For more details check out the new section in my notes.

I might change my opinion after actually experiencing kuromoji-based search over Japanese wiki pages.

You can try it here! It’s a recent copy of the Japanese Wikipedia with only the index—so you can see results and snippets, but none of the links work.

By the way, is this planned only for Japanese Wikipedia? I suspect Japanese Wiktionary is more multilingual and harder to get right.

The plan is to change the configuration so that wikis that have their primary language marked as Japanese get the new Japanese language analyzer, and that would include Japanese Wiktionary. Wiktionary is always a tough one because of the mix of languages and character sets.

The default Kuromoji analyzer would be a disaster for Wiktionary, because it just deletes non-CJK, non-Latin characters. However, when unpacked into its constituent parts, it behaves better. For the most part, non-CJK text should be left alone, except for a few differences in tokenization, like www.website.com being broken up or 4G being split into 4 and G. Usually the “plain” field and proximity handle these kinds of things reasonably well.
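To make “unpacked” a bit more concrete, here is a rough sketch of the packaged kuromoji analyzer rebuilt as a custom analyzer from the components Elastic documents for it, so individual pieces can be swapped or tuned. The analyzer and index names are made up, and this is not the exact CirrusSearch configuration.

```
import requests

# Sketch: the packaged kuromoji analyzer, unpacked into its documented parts.
# Once unpacked, individual filters (e.g., part-of-speech removal or stop
# words) can be adjusted instead of taking the whole bundle as-is.
unpacked = {
    "settings": {
        "analysis": {
            "analyzer": {
                "ja_unpacked": {
                    "type": "custom",
                    "tokenizer": "kuromoji_tokenizer",
                    "filter": [
                        "kuromoji_baseform",
                        "kuromoji_part_of_speech",
                        "cjk_width",
                        "ja_stop",
                        "kuromoji_stemmer",
                        "lowercase",
                    ],
                }
            }
        }
    }
}

print(requests.put("http://localhost:9200/ja_unpacked_test", json=unpacked).json())
```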

If you are particularly worried, I could set up a copy of the Japanese Wiktionary index in labs, too.

• The link on short and long units makes a lot of sense. That’s pretty much what I was trying to get at. Thanks for the link!

• Interesting note on parentheses. I didn’t know about that. Cool!

Does this mean one token (surface form) can belong to more than one group - i.e. do groups disambiguate tokens?

Depends on exactly what you mean, but I think the answer is yes. Some analyzers can generate multiple tokens to index for one instance of a token. There’s an old joke in English about how unionized can either be un+ion+ized or union+ized, so in theory you could index it as both union and ion. The Hebrew analyzer does this because of the wild ambiguity in Hebrew. Kuromoji does not—every token got indexed once.

However, it does seem to take the same token and index it differently depending on context. Something that looks like a verb suffix right after something that looks like a verb might get omitted from the index, while in a different context it would get indexed.

That sounds linguistically more correct, but I don't know if Kuromoji might simplify it into mutually exclusive groups for efficiency purposes.

Not sure what happens inside the analyzer. The new section I’ve added on 1- and 2-character tokens shows some of the ways those tokens get indexed. I could dig up example sentences for any of them if you are interested.

Thanks so much for the detailed response, @whym. I’m happy to try to answer any more questions you have, too.

TL;DR: I think the general result here is that Kuromoji isn’t perfect—which is only to be expected—and it isn’t a disaster, so it’s worth posting the link to the labs index on the Village Pump and getting more feedback.

Just FYI, we will post a message about the Japanese language analyzer and have just now requested translations for it: https://meta.wikimedia.org/wiki/User:CKoerner_(WMF)/New_Japanese_language_analyzer.

Just FYI, we will post a message about the Japanese language analyzer and have just now requested translations for it: https://meta.wikimedia.org/wiki/User:CKoerner_(WMF)/New_Japanese_language_analyzer.

We're not yet getting a lot of action on the translation. If anyone here could help, that would be great. I want to encourage feedback via the Village Pump!

Feedback from the Village Pump was generally negative. There were some problems with scoring and configuration in Labs, but even with those settled, the results were often not as good, and often included lots of extraneous results. (Extra results probably would have been okay if the better results had ended up at the top of the list, but that didn't happen.)

It's possible that better scoring and weighting would give better results, but there's no simple, obvious fix to try, and careful tuning would require significant time and significant help from a fluent speaker. Since we weren't specifically trying to fix a problem with Japanese, just offering a potential improvement, it's okay to abandon this change.

We can come back to Kuromoji or another analyzer in the future if it offers better accuracy, or if we think it would fix a problem for the Japanese language wikis.

Since I didn't do the full research spike for Japanese, I'm going to research both Vietnamese and Japanese alternative analyzers as part of T170423. If anything useful comes up, I'll mention it here, and create a new ticket to investigate. I'm not expecting anything—Elastic seems to know where all the good community-contributed analyzers are and there's usually only one real contender—but I'll try to be thorough.

Thanks for the notes and insights, @TJones :)

Change 365251 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/extensions/CirrusSearch@master] Configure Japanese Language Analysis with Kuromoji

https://gerrit.wikimedia.org/r/365251

I moved this from Done to Needs Review because I decided to commit my updated config for Kuromoji. If it is not installed, nothing happens—as is the current state. But if it is installed, this non-default config fixes two problems: inconsistent treatment of fullwidth numbers, and the removal of many[1] non-Japanese, non-Latin words.

[1] Many, as in Arabic, Armenian, Bengali, Devanagari, Georgian, Hangul, Hebrew, IPA, Mongolian, Myanmar, Thaana, Thai, and Tibetan. Cyrillic is mostly okay, but Greek is still weird. Still, that's better than before, when all of these were being dropped.

Change 365251 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Configure Japanese Language Analysis with Kuromoji

https://gerrit.wikimedia.org/r/365251

Thanks for your work on this, @TJones, I know it was difficult. :-/

Thanks @debt—it was a learning experience!

Plus, I feel this further justifies the work that goes into testing the analyzers before we deploy them. In this case, it seemed mostly okay to me, with some reservations, but the speaker feedback was that it was pretty awful. Similarly, the Vietnamese analyzer is not going to get deployed (update to T170423 coming soon), even without speaker review. If we just slammed these into prod without the testing and analysis, it'd be okay sometimes, but other times it would make search much worse. On balance, it's worth it!