
Unpack Armenian, Latvian, Hungarian Elasticsearch Analyzers
Closed, Resolved · Public · 5 Estimated Story Points

Description

See parent task for details.

(These were chosen next more or less at random.)

Event Timeline

TJones triaged this task as High priority. Dec 13 2022, 6:54 PM
TJones set the point value for this task to 5.
TJones moved this task from needs triage to Language Stuff on the Discovery-Search board.

Change 882262 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Unpack Armenian, Latvian, Hungarian Analyzers

https://gerrit.wikimedia.org/r/882262

Full notes are on MediaWiki.

Highlights:

  • Armenian Wikipedia sample has quite a few Latin and Cyrillic tokens—not shocking.
  • ICU Normalization converts Armenian և to եւ, which seems reasonable, but it interfered with a couple of stop words. A new stop word filter solved that.
  • Latvian had more than the usual number of Latin/Cyrillic homoglyph tokens, so yay for homoglyph_norm!
  • Hungarian and Latvian didn't need any customization and I just had to add a fallthrough case to the switch statement in the config code! Yay for refactoring code!
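
To illustrate the Armenian stop word issue, here is a minimal Python sketch, not the actual CirrusSearch config. It assumes that NFKC plus case folding from the standard library is a fair stand-in for ICU normalization on this one character, and that the stop list contains the ligature և; the real stop list and filter order may differ. `STOP_WORDS`, `icu_like_normalize`, and `filter_stop` are hypothetical names made up for the example.

```python
import unicodedata

# Hypothetical stop list holding the ligature form (U+0587); an assumption,
# not the actual Armenian stop word list.
STOP_WORDS = {"\u0587"}  # և


def icu_like_normalize(token: str) -> str:
    """Rough stand-in for ICU normalization: NFKC followed by case folding."""
    return unicodedata.normalize("NFKC", token).casefold()


def filter_stop(tokens, stop_words):
    """Drop every token that appears in stop_words."""
    return [t for t in tokens if t not in stop_words]


tokens = ["\u0587", "տուն"]  # և, plus an ordinary word

# Normalization runs first, so և is already եւ (U+0565 U+0582)
# by the time the stop word filter sees it.
normalized = [icu_like_normalize(t) for t in tokens]

# The original stop list no longer matches the normalized token.
missed = filter_stop(normalized, STOP_WORDS)

# An extra stop word filter holding the normalized forms catches it again.
extra_stop_words = {icu_like_normalize(w) for w in STOP_WORDS}
fixed = filter_stop(normalized, STOP_WORDS | extra_stop_words)
```

In this sketch `missed` still contains the two-letter form եւ, while `fixed` keeps only the ordinary word, which is the effect the new stop word filter is meant to have.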

Change 882262 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Unpack Armenian, Latvian, Hungarian Analyzers

https://gerrit.wikimedia.org/r/882262