Install and unpack Estonian analyzer
Closed, Resolved | Public | 8 Estimated Story Points

Description

User Story: As a user of Estonian-language wikis, I want to have better Estonian language analysis so I see better search results (particularly, better recall).

Elasticsearch provides an Estonian language analyzer, but we don't currently use it for Estonian-language projects. We should enable it, have its performance verified by Estonian speakers, and then unpack it.

Acceptance Criteria:

  • Estonian speakers verify reasonable performance of the stemmer
  • Unpacked analyzer performs the same as the monolithic version (without general upgrades).
  • Upgraded analyzer either has no unexpected impact (we know what to expect from ICU norm and homoglyph norm, for example), or the impact is reviewed by a speaker of the language.
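For readers unfamiliar with "unpacking": the built-in `estonian` analyzer can be rebuilt as a custom analyzer made of a standard tokenizer plus lowercase, stop word, and stemmer filters, which is what makes it possible to add further filters later. The following is only a sketch of such index settings, built as a Python dict. The analyzer name `unpacked_estonian` and the function name are hypothetical, the optional keyword-marker filter is left out, and this is not the configuration generated by the CirrusSearch patch.

```python
import json


def unpacked_estonian_settings():
    """Build index settings for a custom analyzer that mirrors the
    built-in 'estonian' analyzer (sketch; names are illustrative)."""
    return {
        "analysis": {
            "filter": {
                # Built-in Estonian stop word list.
                "estonian_stop": {"type": "stop", "stopwords": "_estonian_"},
                # Built-in Estonian stemmer.
                "estonian_stemmer": {"type": "stemmer", "language": "estonian"},
            },
            "analyzer": {
                # Hypothetical analyzer name, chosen for this example.
                "unpacked_estonian": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "estonian_stop", "estonian_stemmer"],
                }
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(unpacked_estonian_settings(), indent=2))
```

Once the analyzer is expressed this way, upgrades such as extra normalization filters can be added to the `filter` list without changing the tokenizer or stemmer.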

Event Timeline

TJones set the point value for this task to 8. (Mar 16 2023, 3:33 PM)

Change 912389 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Enable and Unpack Estonian Analyzer

https://gerrit.wikimedia.org/r/912389

The new Estonian analyzer looks good, and enabling it has a big impact:

  • 4.372% of Wiktionary tokens (532 distinct, including case variants) and 13.146% of Wikipedia tokens (987 distinct) were filtered as stop words.
  • The merges from stemming were quite significant (even more so for Wikipedia):
    • Estonian Wiktionary: 11,044 tokens (6.912% of tokens) were merged into 2,487 groups (2.895% of groups)
    • Estonian Wikipedia: 497,501 tokens (22.675% of tokens) were merged into 23,075 groups (7.373% of groups)
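As a rough illustration of how merge figures like these could be computed, the sketch below compares an old and a new analyzer over a token sample. It is not the tool used for the numbers above, and the definitions are assumptions made for this example: a "group" is the set of tokens sharing one analyzed form, a group counts as merged when the new analyzer joins tokens that the old analyzer kept apart, and percentages are taken over all token occurrences and all new groups. The toy analyzers and the function name `merge_stats` are hypothetical.

```python
from collections import defaultdict


def merge_stats(tokens, old_analyze, new_analyze):
    """Count tokens and groups affected by merges when switching analyzers.

    tokens: list of token occurrences (strings).
    old_analyze / new_analyze: callables mapping a token to its analyzed form.
    """
    if not tokens:
        return {"merged_tokens": 0, "merged_groups": 0,
                "pct_tokens": 0.0, "pct_groups": 0.0}

    old_keys_per_new = defaultdict(set)  # new form -> old forms it absorbs
    count_per_new = defaultdict(int)     # new form -> token occurrences
    for tok in tokens:
        new_key = new_analyze(tok)
        old_keys_per_new[new_key].add(old_analyze(tok))
        count_per_new[new_key] += 1

    # A new group is "merged" if it combines two or more old groups.
    merged = [k for k, olds in old_keys_per_new.items() if len(olds) > 1]
    merged_tokens = sum(count_per_new[k] for k in merged)
    return {
        "merged_tokens": merged_tokens,
        "merged_groups": len(merged),
        "pct_tokens": 100.0 * merged_tokens / len(tokens),
        "pct_groups": 100.0 * len(merged) / len(old_keys_per_new),
    }


# Toy stand-ins for real analyzers: the old one only lowercases, the new one
# also maps two plural forms to their singular via a tiny lookup table.
TOY_STEMS = {"raamatud": "raamat", "majad": "maja"}


def toy_old(token):
    return token.lower()


def toy_new(token):
    lowered = token.lower()
    return TOY_STEMS.get(lowered, lowered)


sample = ["raamat", "raamatud", "Raamat", "maja", "majad", "koer"]
stats = merge_stats(sample, toy_old, toy_new)
```

In this toy sample, five of the six tokens end up in two merged groups, while "koer" stays in a group of its own.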

Unpacking and upgrading the analyzer after enabling it gave the usual results. My samples had a lot of German ß's and a fair number of Latin/Cyrillic homoglyphs. Note that there shouldn't be any significant change in production for the unpacking and upgrading, since the upgrades are currently enabled in the default analyzer, which Estonian currently uses. The new analyzer should have a big impact, though!
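To make the two kinds of change mentioned here concrete, the sketch below folds ß to "ss" and maps a few Cyrillic look-alike letters to Latin in mixed-script tokens. It is only a stand-in: Python's `str.casefold` is used in place of ICU normalization, the six-letter homoglyph table is a hypothetical sample, and the real filters used in production are more thorough and may behave differently.

```python
# Small, illustrative table of Cyrillic letters that look like Latin ones.
HOMOGLYPHS = str.maketrans({
    "\u0430": "a",  # Cyrillic a
    "\u0435": "e",  # Cyrillic ie
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0441": "c",  # Cyrillic es
    "\u0445": "x",  # Cyrillic ha
})


def fold_token(token):
    """Case-fold a token (which turns ß into 'ss') and, if the token also
    contains basic Latin letters, replace Cyrillic look-alikes with Latin."""
    folded = token.casefold()
    has_latin = any("a" <= ch <= "z" for ch in folded)
    if has_latin:
        folded = folded.translate(HOMOGLYPHS)
    return folded
```

For example, `fold_token("Straße")` gives `"strasse"`, and a token spelled with a Cyrillic "с" followed by Latin "hocolate" comes out as plain Latin "chocolate", while a fully Cyrillic word is left in Cyrillic.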

Change 912389 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Enable and Unpack Estonian Analyzer

https://gerrit.wikimedia.org/r/912389