
Port Sudachi to OpenSearch 1.x
Closed, Resolved | Public | 5 Estimated Story Points

Description

After completing the analysis and configuration for T318269: Test and analyze Kuromoji & Sudachi Japanese language analyzers, we decided that we are too far along in the OpenSearch migration to enable Sudachi on Elasticsearch; doing so would cause too many disruptions and complications.

Instead, we will try to enable Sudachi after we migrate to OpenSearch. Since our first stop will be OpenSearch 1.x, we will need to port Sudachi to OpenSearch 1.x ourselves. The original project currently supports Elasticsearch and OpenSearch 2.6+, so hopefully the port won't be too difficult.

Event Timeline

Gehel set the point value for this task to 5. (Mar 3 2025, 4:39 PM)

Imported the elasticsearch-sudachi repository to GitLab. A new v3.3.0-os1.3 branch has been added there; it carries one extra patch on top of the v3.3.0 release with the adjustments needed to build against OpenSearch 1.3. To build, I used java-1.17.0-openjdk with the command ./gradlew -PengineVersion=os:1.3.20 build. The resulting package has been uploaded to people.wikimedia.org.

Change #1125533 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/software/opensearch/plugins@master] Add sudachi analyzer for japanese

https://gerrit.wikimedia.org/r/1125533

Since there wasn't an MR, it wasn't linked, but I imported the elasticsearch-sudachi repo to GitLab and have the appropriate code there:

new branch: v3.3.0-os1.3
implementation: patch

Change #1125533 merged by Tjones:

[operations/software/opensearch/plugins@master] Add sudachi analyzer for japanese

https://gerrit.wikimedia.org/r/1125533

Change #1126663 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/software/opensearch/plugins@master] Bump changelog version for sudachi analyzer

https://gerrit.wikimedia.org/r/1126663

RKemper subscribed.

We'll probably need a subtask now for doing a rolling restart to get this live.

FYI, as part of T386870: Regression Test OpenSearch Language Analysis, I checked on the OS 1.3 port of Sudachi, and there were no changes in language analysis results. (I reran my Elastic baseline with the same version of Sudachi and the same dictionary, so everything was indeed identical.)

Thanks for doing the port, @EBernhardson!
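
The regression check described above boils down to comparing the token streams produced by two builds of the analyzer for the same input. A minimal sketch of that comparison follows, assuming each side's output has been saved as a JSON response from the _analyze API (a "tokens" list whose entries carry "token", "start_offset", "end_offset", and "position"). The function names and the sample data are hypothetical and are not taken from the actual regression tooling used in T386870.

```python
from typing import Any, Optional

# A comparable view of one token: (text, start offset, end offset, position).
Signature = tuple[str, int, int, int]


def token_signature(analyze_response: dict[str, Any]) -> list[Signature]:
    """Reduce an _analyze-style response to a list of comparable tuples."""
    return [
        (t["token"], t["start_offset"], t["end_offset"], t["position"])
        for t in analyze_response.get("tokens", [])
    ]


def diff_analysis(
    baseline: dict[str, Any], candidate: dict[str, Any]
) -> list[tuple[int, Optional[Signature], Optional[Signature]]]:
    """Return (index, baseline_token, candidate_token) for every mismatch.

    An empty list means the two token streams are identical. A token that
    exists on only one side is reported with None on the other side.
    """
    base = token_signature(baseline)
    cand = token_signature(candidate)
    diffs = []
    for i in range(max(len(base), len(cand))):
        b = base[i] if i < len(base) else None
        c = cand[i] if i < len(cand) else None
        if b != c:
            diffs.append((i, b, c))
    return diffs


# Hypothetical sample data, for illustration only.
baseline = {
    "tokens": [
        {"token": "東京", "start_offset": 0, "end_offset": 2, "position": 0},
        {"token": "都", "start_offset": 2, "end_offset": 3, "position": 1},
    ]
}
candidate = {"tokens": [dict(t) for t in baseline["tokens"]]}

if not diff_analysis(baseline, candidate):
    print("no changes in analysis output")
```

The comparison is positional, so a single inserted or dropped token reports every later token as changed; that is enough to answer "is anything different?", which is the question the regression check asks.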

Change #1126663 merged by Ebernhardson:

[operations/software/opensearch/plugins@master] Bump changelog version for sudachi analyzer

https://gerrit.wikimedia.org/r/1126663

The project has been ported. I verified that the 1.3.20-2 package on apt.wikimedia.org contains the sudachi plugin and its dependencies, along with the dictionary. A subtask has been opened for the rolling restart on cloudelastic and relforge; the other clusters will pick up the new package when they are reimaged.

Change #1128880 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] opensearch: use full paths for binaries

https://gerrit.wikimedia.org/r/1128880

Change #1128880 merged by Bking:

[operations/puppet@production] opensearch: use full paths for binaries

https://gerrit.wikimedia.org/r/1128880

Mentioned in SAL (#wikimedia-operations) [2025-03-18T14:56:35Z] <inflatador> bking@logstash1033 running puppet agent to confirm that CR 1128880 didn't cause problems T386868

Change #1128884 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] opensearch: symlink sudachi dir instead of dic file

https://gerrit.wikimedia.org/r/1128884

Change #1128884 merged by Bking:

[operations/puppet@production] opensearch: symlink sudachi dir instead of dic file

https://gerrit.wikimedia.org/r/1128884

Change #1129284 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] opensearch: fix logic for creating sudachi symlink

https://gerrit.wikimedia.org/r/1129284

Change #1129284 merged by Bking:

[operations/puppet@production] opensearch: fix logic for creating sudachi symlink

https://gerrit.wikimedia.org/r/1129284

This is currently deployed to cloudelastic. It will reach the other prod clusters when they migrate to OpenSearch. Calling this done.