Test Elastic 7.10 language analyzers
Closed, Resolved | Public | 5 Estimated Story Points

Description

User Story: As a search engineer I don't want there to be any big language analysis surprises when we upgrade from Elasticsearch 6.8 to 7.10.

We can do a relatively quick check on 500–1000 random documents from a selection of Wikipedias and Wiktionaries, to test language-specific analysis and a variety of scripts. I will also test some additional "rare" characters (such as ♙, ☥, 〃, and 〆—see T211824: Investigate a “rare-character” index and 6cac1cbae6c4).
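A minimal sketch of the per-document comparison this kind of check relies on, assuming the token lists for each document have already been collected from the old and new analyzers (for example via Elasticsearch's `_analyze` API). The function name `token_diff` and the sample tokens are hypothetical, for illustration only.

```python
from collections import Counter


def token_diff(old_tokens, new_tokens):
    """Compare two token lists as multisets.

    Returns (lost, gained): tokens that appear fewer times in the new
    output, and tokens that appear more times in the new output.
    """
    old_counts = Counter(old_tokens)
    new_counts = Counter(new_tokens)
    # Counter subtraction keeps only positive counts.
    lost = dict(old_counts - new_counts)
    gained = dict(new_counts - old_counts)
    return lost, gained


# Illustrative data only, not real analyzer output.
old = ["foo", "bar", "bar"]
new = ["foo", "bar", "baz"]
lost, gained = token_diff(old, new)
print(lost, gained)
```

Documents where both `lost` and `gained` are empty show no analysis change; the remainder can then be grouped by language and script for manual review.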

If there are no big issues, it should be relatively quick. If there are any big issues, well then it'll be a good thing we found them. (We really want to detect problems like the Chinese punctuation problem in T172653—though admittedly that was not caused by an upgrade.)

We are planning a similar analysis from ES 6.5 to ES 6.8 (see T300302). Review the notes from the 6.5 to 6.8 upgrade for potential tokenizer issues, especially the "Next Steps" / "Start fixing stuff" list.

Acceptance Criteria:

  • Report on language analysis diffs between ES 6.5 & 6.8 and ES 7.10
  • New phab tickets for any big issues that need to be addressed

Event Timeline

TJones set the point value for this task to 3. (Feb 7 2022, 5:01 PM)

Updated story points and task description based on experience with T300302.

Summary:

  • There are no changes to most analyzers between 6.8 and 7.10.
  • The most impactful (and most debatable) changes to the Nori (Korean) tokenizer made between 6.5 and 6.8 have been reverted (keeping the smaller, better changes).
  • The Thai tokenizer now allows some less commonly used Unicode characters through, where before it would delete/ignore them.
  • The narrow no-break space (NNBSP, U+202F) problem, which already existed in the 6.5 ICU tokenizer and was introduced into the standard tokenizer in 6.8, persists in 7.10, so I'm going to patch it.
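One plausible shape for an NNBSP fix is a character filter that maps U+202F to a plain space before tokenization. The sketch below is an assumption, not the contents of the actual patch: it builds an Elasticsearch `mapping` char filter definition as a Python dict and emulates its effect locally. The function names are hypothetical.

```python
NNBSP = "\u202f"  # NARROW NO-BREAK SPACE


def build_nnbsp_char_filter():
    """Return a `mapping` char filter definition (assumed form of the fix).

    The backslash-u escapes are sent as literal text so that Elasticsearch
    interprets them; a literal space on the right-hand side could be
    trimmed away when the rule is parsed.
    """
    return {
        "type": "mapping",
        "mappings": ["\\u202F=>\\u0020"],
    }


def apply_nnbsp_filter(text):
    """Local emulation of the filter: replace each NNBSP with a space."""
    return text.replace(NNBSP, " ")


sample = "10" + NNBSP + "000"
print(repr(apply_nnbsp_filter(sample)))
```

Applying the replacement as a char filter rather than a token filter means the tokenizer never sees the NNBSP, so the same fix applies to whichever tokenizer follows it.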

My full write-up is on MediaWiki.

I should have a patch with the NNBSP fix tomorrow.

Change 825926 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Enabled Global Filters for Narrow No-Break Space

https://gerrit.wikimedia.org/r/825926

Thanks for the write-up! Leaving the ticket in Needs review for a couple days so that others have a chance to review it as well.

Change 825926 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Enabled Global Filters for Narrow No-Break Space

https://gerrit.wikimedia.org/r/825926