Maniphest T325089

Unpack Armenian, Latvian, Hungarian Elasticsearch Analyzers
Closed, Resolved · Public · 5 Estimated Story Points

Assigned To

	TJones

Authored By

	TJones
	Dec 13 2022, 6:53 PM

Description

See parent task for details.

(These were chosen next more or less at random.)

Details

	Subject	Repo	Branch	Lines +/-
	Unpack Armenian, Latvian, Hungarian Analyzers	mediawiki/extensions/CirrusSearch	master	+451 -42

Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T219550 [EPIC] Harmonize language analysis across languages
		Resolved		Gehel	T272606 [EPIC] Unpack all Elasticsearch analyzers
		Resolved		TJones	T325089 Unpack Armenian, Latvian, Hungarian Elasticsearch Analyzers
		Resolved		TJones	T327801 Reindex Armenian, Latvian, Hungarian wikis to enable unpacked analyzers

Event Timeline

TJones created this task. · Dec 13 2022, 6:53 PM

Restricted Application added a subscriber: Aklapper. · Dec 13 2022, 6:53 PM

TJones triaged this task as High priority. · Dec 13 2022, 6:54 PM

TJones added a parent task: T272606: [EPIC] Unpack all Elasticsearch analyzers.

TJones set the point value for this task to 5.

TJones moved this task from needs triage to Language Stuff on the Discovery-Search board.

TJones mentioned this in T272606: [EPIC] Unpack all Elasticsearch analyzers. · Dec 13 2022, 7:02 PM

TJones edited projects, added Discovery-Search (Current work); removed Discovery-Search. · Dec 13 2022, 7:22 PM

TJones moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board. · Jan 13 2023, 9:28 PM

TJones claimed this task. · Jan 17 2023, 6:38 PM

TJones moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.

Change 882262 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Unpack Armenian, Latvian, Hungarian Analyzers

https://gerrit.wikimedia.org/r/882262

gerritbot added a project: Patch-For-Review. · Jan 23 2023, 9:53 PM

Full notes are on MediaWiki.

Highlights:

The Armenian Wikipedia sample has quite a few Latin and Cyrillic tokens, which is not shocking.
ICU Normalization converts Armenian և to եւ, which seems reasonable, but it interfered with a couple of stop words. A new stop word filter solved that.
Latvian had more than the usual number of Latin/Cyrillic homoglyph tokens, so yay for homoglyph_norm!
Hungarian and Latvian didn't need any customization and I just had to add a fallthrough case to the switch statement in the config code! Yay for refactoring code!
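
To make the Armenian stop word issue above concrete, here is a minimal sketch in Python. It is not the CirrusSearch code. It assumes that NFKC plus lowercasing from Python's `unicodedata` is a fair stand-in for the ICU normalization step, and the stop list, the sample tokens, and the names `icu_like_normalize`, `remove_stop_words`, `STOP_WORDS`, and `EXTRA_STOP_WORDS` are all hypothetical, chosen only for illustration.

```python
import unicodedata

# Hypothetical stop list for illustration only; not the real Armenian list.
STOP_WORDS = {"\u0587"}  # "և" (U+0587, Armenian small ligature ech yiwn)


def icu_like_normalize(token: str) -> str:
    """Approximate ICU normalization with NFKC plus lowercasing (an assumption)."""
    return unicodedata.normalize("NFKC", token).lower()


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in stop_words."""
    return [t for t in tokens if t not in stop_words]


tokens = ["\u0587", "\u0570\u0561\u0575"]  # "և", "հայ"
normalized = [icu_like_normalize(t) for t in tokens]

# Normalization has expanded the ligature into two code points ("եւ"),
# so the original stop word no longer matches and the token leaks through.
leaked = remove_stop_words(normalized, STOP_WORDS)

# A second stop list holding the normalized forms catches those tokens.
EXTRA_STOP_WORDS = {icu_like_normalize(w) for w in STOP_WORDS}
cleaned = remove_stop_words(leaked, EXTRA_STOP_WORDS)
```

In this sketch `leaked` still contains the expanded form of the ligature, while `cleaned` keeps only the content word, which is the effect the extra stop word filter is meant to have.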

TJones moved this task from In Progress to Needs review on the Discovery-Search (Current work) board. · Jan 23 2023, 10:52 PM

Change 882262 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Unpack Armenian, Latvian, Hungarian Analyzers

https://gerrit.wikimedia.org/r/882262

ReleaseTaggerBot added a project: MW-1.40-notes (1.40.0-wmf.21; 2023-01-30). · Jan 24 2023, 11:00 AM

Maintenance_bot removed a project: Patch-For-Review. · Jan 24 2023, 11:31 AM

TJones moved this task from Needs review to To Be Deployed on the Discovery-Search (Current work) board. · Jan 24 2023, 3:57 PM

TJones mentioned this in T327801: Reindex Armenian, Latvian, Hungarian wikis to enable unpacked analyzers. · Jan 24 2023, 4:39 PM

TJones moved this task from To Be Deployed to Needs Reporting on the Discovery-Search (Current work) board. · Feb 6 2023, 4:27 PM

Gehel closed this task as Resolved. · Feb 10 2023, 4:14 PM

Gehel closed subtask T327801: Reindex Armenian, Latvian, Hungarian wikis to enable unpacked analyzers as Resolved. · Feb 17 2023, 2:55 PM