Sudachi analysis chain fails on long emoji sequence
Closed, Resolved · Public · 3 Estimated Story Points

Description

Attempting to reindex jawiki_general fails with:

{
  "index": "jawiki_general_1755293360",
  "type": "_doc",
  "id": "5112606",
  "cause": {
    "type": "execution_exception",
    "reason": "execution_exception: java.lang.IndexOutOfBoundsException: Index -1024 out of bounds for length 87503",
    "caused_by": {
      "type": "index_out_of_bounds_exception",
      "reason": "index_out_of_bounds_exception: Index -1024 out of bounds for length 87503"
    }
  },
  "status": 500
}

The referenced page contains essentially the same emoji repeated many times:

https://ja.wikipedia.org/wiki/%E5%88%A9%E7%94%A8%E8%80%85:%E4%B8%87%E6%AD%B3%E5%B8%9D%E5%9B%BD

We need to resolve this problem so that jawiki_general can pick up the updated analysis chain. It also means that content changes on wikis that use the Japanese analysis chain may fail in the indexing pipeline, leaving those pages un-updatable.

Event Timeline

TJones changed the task status from Open to In Progress.Aug 18 2025, 10:38 PM

Change #1180192 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Break long sequences of characters for Sudachi

https://gerrit.wikimedia.org/r/1180192

I can generate the error with the right sequences of katakana, hiragana, emoji, Gothic, and Thai characters; long sequences of multibyte characters seem to be a reliable source of errors. I never saw errors under 8000 characters. Sometimes I got errors at 16000 characters, sometimes at 32000 (depending on the character type). The threshold for problems often looks to be around 11,000 characters, but I didn't dig too deeply.

Oddly, 64000-character tokens reliably get broken up into blocks of 4096 characters (but 16000- and 32000-character tokens don't). I guess there's a threshold of "clearly too long" that triggers the chunking.

I suspect there's a mismatch between counting characters and counting bytes, and/or problems with the OpenSearch-internal representation of multibyte characters (🤓 → \uD83E\uDD13, 𐌳 → \uD800\uDF33, etc.).
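For illustration, the gap between the possible "length" measures for a run of emoji can be seen in a quick Python sketch (Python is just for demonstration here; OpenSearch itself is Java, where String.length() counts UTF-16 code units):

```python
# Characters outside the Basic Multilingual Plane, like 🤓 (U+1F913) and
# 𐌳 (U+10333), take two UTF-16 code units (a surrogate pair), so the
# "length" of a string depends on what you count.
def utf16_units(s: str) -> int:
    """Number of UTF-16 code units, i.e. what Java's String.length() returns."""
    return len(s.encode("utf-16-be")) // 2

emoji_run = "🤓" * 8000
print(len(emoji_run))                  # 8000 code points
print(utf16_units(emoji_run))          # 16000 UTF-16 code units
print(len(emoji_run.encode("utf-8")))  # 32000 UTF-8 bytes
```

Any code that mixes these three counts when indexing into a buffer could easily end up with an out-of-bounds offset like the one in the stack trace.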

I've added a (hacky) char filter that looks for sequences of more than 8000 characters without a reliable Sudachi word boundary (\s or certain CJK punctuation characters) and adds a space after those 8000 characters.
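The idea can be sketched roughly as follows (a minimal illustration, not the actual CirrusSearch patch; the exact break-character class and the helper name are assumptions):

```python
import re

# Hypothetical sketch of the char-filter logic: after any run of 8000
# characters containing no break character (whitespace or some common CJK
# punctuation), insert a space so the tokenizer sees a boundary before the
# run grows long enough to trigger the Sudachi failure.
BREAKS = r"\s。、・「」『』"  # assumed boundary class, not the one in the patch
CHUNK = 8000

def break_long_runs(text: str) -> str:
    # re.sub matches non-overlapping runs left to right, so a space lands
    # after every CHUNK consecutive non-break characters.
    return re.sub(rf"([^{BREAKS}]{{{CHUNK}}})", r"\1 ", text)

long_run = "🤓" * 20000
chunked = break_long_runs(long_run)
# Every piece between inserted spaces is now at most CHUNK characters.
assert max(len(p) for p in chunked.split(" ")) <= CHUNK
```

Since natural text almost never reaches 8000 characters without whitespace or punctuation, the filter is effectively a no-op for real content and only kicks in on pathological pages like the one above.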

The longest sequences I could find in the wild (on jawiki and jawikisource) were in the ~800 character range (and those do all get broken up into normal, mostly 1–4 character tokens by the tokenizer), so 8000 should let almost all natural text go through unharmed.

A patch is up for review, I'll add some more details to my Sudachi notes on MediaWiki, and I'll open a ticket upstream.

Change #1180192 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Break long sequences of characters for Sudachi

https://gerrit.wikimedia.org/r/1180192

TJones triaged this task as High priority.Aug 21 2025, 5:24 PM
TJones set the point value for this task to 3.

I've started the reindex for this on all three clusters; wikidata and commonswiki will take a few days.