User Story: As a user of a Japanese-language wiki, I'd like better language processing than overlapping bigrams. The Kuromoji or Sudachi analyzers might well be up to the task.
Japanese is a major language (13th most speakers) with a large Wikipedia (also 13th by article count), a robust on-wiki community (5th by active users), and high search volume (6th by unique queries). The language and its writing system are complex, and word segmentation is particularly challenging, but overall it is well-supported by modern NLP libraries, including ones available for Lucene (and thus Elasticsearch / OpenSearch), such as Kuromoji.
Nonetheless, we currently use a very simplistic approach to parsing Japanese, namely overlapping character bigrams. A rough English analogy, only slightly exaggerated since individual Japanese characters carry more information than English letters, would be parsing statesman into st, ta, at, te, es, sm, ma, and an, searching on those bigrams, and trying not to be surprised when "International Politics and the Establishment of Presbyterianism in the Channel Islands" comes back as the top result.
The current scattershot bigram approach is much better than nothing (i.e., requiring exact string matches), but it is not very precise—which is why we have previously moved away from it for Chinese and Korean.
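To make the current behavior concrete, here is a minimal sketch of CJK-style overlapping bigram segmentation (the function name is illustrative, not our actual analysis code):

```python
def overlapping_bigrams(text: str) -> list[str]:
    """Split a string into overlapping character bigrams.

    This mimics the fallback segmentation used for Japanese today:
    every adjacent pair of characters becomes a token, whether or
    not the pair corresponds to a real word.
    """
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


# The English analogy from the description above:
print(overlapping_bigrams("statesman"))
# ['st', 'ta', 'at', 'te', 'es', 'sm', 'ma', 'an']
```

Every one of those bigrams can match inside unrelated words, which is exactly why recall is decent but precision suffers.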
It's been a bit more than five years since we last looked at Kuromoji (T166731). In that time, it has probably gotten better, and I expect my ability to deal with shortcomings in analyzers has also gotten better.*
────────
* Experience is something you don't get until right after you need it.
Acceptance Criteria:
- A write-up of findings on the Kuromoji + Sudachi analyzers
- Either...
- ...reasons in the write-up why Kuromoji / Sudachi are unacceptable, or
- ...a patch implementing the Kuromoji analyzer, the Sudachi analyzer, or both
Note: Updated the task description to include Sudachi and to end with the analysis changes, dropping the measurement criteria because we are going to delay deployment during the OpenSearch migration.
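For reference, a sketch of what a custom Kuromoji-based analyzer might look like in Elasticsearch index settings. The analyzer name (ja_text) and the exact filter order are illustrative assumptions; the tokenizer and filters are the standard ones shipped with the analysis-kuromoji plugin, and any real patch would go through our analysis-config tooling rather than raw settings:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ja_text": {
          "type": "custom",
          "tokenizer": "kuromoji_tokenizer",
          "filter": [
            "kuromoji_baseform",
            "kuromoji_part_of_speech",
            "cjk_width",
            "ja_stop",
            "kuromoji_stemmer",
            "lowercase"
          ]
        }
      }
    }
  }
}
```

A Sudachi-based configuration would look similar but depends on the third-party elasticsearch-sudachi plugin and its dictionary distribution, which is part of what the write-up should evaluate.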