Test Elastic 7.10 language analyzers
Closed, Resolved | Public | 5 Estimated Story Points

Assigned To

Authored By

	TJones
	Feb 7 2022, 2:42 PM

Description

User Story: As a search engineer, I don't want any big language analysis surprises when we upgrade from Elasticsearch 6.8 to 7.10.

We can do a relatively quick check on 500–1000 random documents from a selection of Wikipedias and Wiktionaries, to test language-specific analysis and a variety of scripts. I will also test some additional "rare" characters (such as ♙, ☥, 〃, and 〆—see T211824: Investigate a “rare-character” index and 6cac1cbae6c4).
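The spot check described above could be scripted roughly as follows. This is a minimal sketch, not the tooling actually used for this task. It assumes the token lists come from Elasticsearch's `_analyze` API, whose response contains a `tokens` list with a `token` string per entry. The helper names `analyze_request`, `token_strings`, and `diff_analysis`, and the sample responses, are hypothetical and exist only for illustration.

```python
from typing import Dict, List


def analyze_request(analyzer: str, text: str) -> Dict[str, str]:
    """Build a request body for POST /<index>/_analyze (hypothetical helper)."""
    return {"analyzer": analyzer, "text": text}


def token_strings(response: dict) -> List[str]:
    """Pull the token strings out of an _analyze-style response."""
    return [entry["token"] for entry in response.get("tokens", [])]


def diff_analysis(old_response: dict, new_response: dict) -> dict:
    """Summarize how the token output differs between two versions."""
    old = token_strings(old_response)
    new = token_strings(new_response)
    return {
        "lost": sorted(set(old) - set(new)),      # tokens only the old version emits
        "gained": sorted(set(new) - set(old)),    # tokens only the new version emits
        "count_change": len(new) - len(old),      # net change in token count
    }


# Made-up sample responses, NOT real output from either Elasticsearch version.
old_resp = {"tokens": [{"token": "foo", "position": 0},
                       {"token": "bar", "position": 1}]}
new_resp = {"tokens": [{"token": "foobar", "position": 0}]}

body = analyze_request("text", "foo bar")  # "text" is an assumed analyzer name
diff = diff_analysis(old_resp, new_resp)
print(body)
print(diff)
```

Running the same documents through both versions and keeping only the non-empty diffs would give a short list of candidates to review by hand.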

If there are no big issues, it should be relatively quick. If there are any big issues, well then it'll be a good thing we found them. (We really want to detect problems like the Chinese punctuation problem in T172653—though admittedly that was not caused by an upgrade.)

We are planning a similar analysis from ES 6.5 to ES 6.8 (see T300302). Review the notes from the 6.5 to 6.8 upgrade for potential tokenizer issues, especially the "Next Steps" / "Start fixing stuff" list.

Acceptance Criteria:

Report on language analysis diffs between ES 6.5 & 6.8 and ES 7.10
New phab tickets for any big issues that need to be addressed

Details

	Subject	Repo	Branch	Lines +/-
	Enabled Global Filters for Narrow No-Break Space	mediawiki/extensions/CirrusSearch	master	+1 K -75


Related Objects

Status	Assigned	Task
Resolved	None	T248925 Make MediaWiki release tarball compatible with PHP 8.0
Resolved	Jdforrester-WMF	T300463 Make PHP 8.0 voting on MW master
Resolved	None	T283275 Make MW master tests pass on PHP 8.0
Resolved	Reedy	T268861 CirrusSearch uses Elastica's Match class
Resolved	Reedy	T268863 Translate uses Elastica's Match class
Resolved	matthiasmullie	T268866 WikibaseMediaInfo uses Elastica's Match class
Invalid	None	T268864 WikibaseCirrusSearch uses Elastica's Match class
Resolved	Reedy	T268865 WikibaseLexemeCirrusSearch uses Elastica's Match class
Resolved	EBernhardson	T271777 Bump rufin/elastica (and related libraries) to versions that support PHP 8.0
Resolved	Gehel	T263142 [EPIC] Upgrade Elasticsearch to version 7.10
Resolved	TJones	T301131 Test Elastic 7.10 language analyzers
Resolved	EBernhardson	T317200 Reindex all wikis to fix nnbsp regression

Event Timeline

TJones created this task. Feb 7 2022, 2:42 PM

Restricted Application added a subscriber: Aklapper. Feb 7 2022, 2:42 PM

MPhamWMF moved this task from needs triage to Current work on the Discovery-Search board. Feb 7 2022, 4:35 PM

MPhamWMF edited projects, added Discovery-Search (Current work); removed Discovery-Search.

TJones set the point value for this task to 3. Feb 7 2022, 5:01 PM

TJones moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

TJones added a parent task: T263142: [EPIC] Upgrade Elasticsearch to version 7.10. Feb 16 2022, 8:16 PM

TJones changed the point value for this task from 3 to 5.

TJones mentioned this in T300302: Test Elastic 6.8 language analyzers.

Updated story points and task description based on experience with T300302.

dcausse mentioned this in T308676: Elasticsearch 7.10.2 rollout plan. May 18 2022, 2:27 PM

TJones claimed this task. Aug 16 2022, 4:32 PM

TJones moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.

Summary:

There are no changes to most analyzers between 6.8 and 7.10.
The most impactful (and most debatable) changes to the Nori (Korean) tokenizer made between 6.5 and 6.8 have been reverted (keeping the smaller, better changes).
The Thai tokenizer now allows some less commonly used Unicode characters through, where before it would delete/ignore them.
The narrow no-break space (NNBSP) problem, which already existed in the 6.5 ICU tokenizer and was introduced into the standard tokenizer in 6.8, persists in 7.10, so I'm going to patch it.

My full write-up is on MediaWiki.

I should have a patch with the NNBSP fix tomorrow.
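For readers unfamiliar with this kind of fix, here is a minimal sketch of one way to neutralize NNBSP before tokenization. It is not the content of the CirrusSearch patch, which is not shown in this task. It assumes the problem is that the tokenizer does not treat U+202F as a word boundary, and it uses Elasticsearch's `mapping` character filter to rewrite the character to a plain space. The filter name `nnbsp_to_space` and the Python helper `normalize_nnbsp` are hypothetical.

```python
import json

NNBSP = "\u202f"  # NARROW NO-BREAK SPACE

# Hypothetical index-settings fragment: a "mapping" char filter that turns
# U+202F into an ordinary space before the tokenizer sees the text.
# The filter name is made up; the real patch may be structured differently.
nnbsp_settings = {
    "analysis": {
        "char_filter": {
            "nnbsp_to_space": {
                "type": "mapping",
                "mappings": ["\\u202F=>\\u0020"],
            }
        }
    }
}


def normalize_nnbsp(text: str) -> str:
    """Pure-Python stand-in for the char filter's effect (illustration only)."""
    return text.replace(NNBSP, " ")


sample = "10" + NNBSP + "km"

# Splitting on a plain space stands in for a tokenizer that ignores NNBSP.
before = sample.split(" ")
after = normalize_nnbsp(sample).split(" ")

print(json.dumps(nnbsp_settings, indent=2))
print(len(before), len(after))
```

With the replacement applied first, the sample splits into two pieces instead of staying glued together as one.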

Change 825926 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Enabled Global Filters for Narrow No-Break Space

https://gerrit.wikimedia.org/r/825926

gerritbot added a project: Patch-For-Review. Aug 23 2022, 11:41 PM

TJones moved this task from In Progress to Needs review on the Discovery-Search (Current work) board. Aug 24 2022, 7:14 PM

Thanks for the write-up! Leaving the ticket in Needs review for a couple days so that others have a chance to review it as well.

dcausse moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board. Sep 5 2022, 3:07 PM

Change 825926 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Enabled Global Filters for Narrow No-Break Space

https://gerrit.wikimedia.org/r/825926

Maintenance_bot removed a project: Patch-For-Review. Sep 6 2022, 8:30 PM

TJones mentioned this in T317200: Reindex all wikis to fix nnbsp regression. Sep 7 2022, 2:31 PM

Gehel closed this task as Resolved. Sep 26 2022, 10:24 AM

Gehel closed subtask T317200: Reindex all wikis to fix nnbsp regression as Resolved. Nov 7 2022, 3:44 PM
