
Greek language analysis generates unexpected empty tokens
Closed, Resolved · Public


While looking into T192502 (which looks at empty tokens created by ICU folding), I discovered that the monolithic Greek analyzer generates some empty tokens, too, particularly for these words: εστάτο, εστερ, εστέρ, έστερ, έστέρ, εστέρα, εστέρας, εστέρες, εστέρησε, εστερία, εστερικό, εστερικού, εστερικών, εστέρο, εστέρος, εστέρων, ήσανε, ότερ, οτέρι, ότερι, οτερό, οτέρο.

As a result, searching for any of them finds the others. Some are related, but as far as I can tell, searching for εστάτο (estáto) should not return articles with Εστέρες (estéres) and Οτερό (oteró) in the title as top hits—yet that's what happens!
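The conflation can be illustrated with a toy inverted index. This is only a sketch: `toy_analyze` is a hypothetical stand-in that returns an empty string for three of the words listed above (mimicking the reported behavior), not the real Greek analysis chain, and the three titles are made-up sample data.

```python
import unicodedata
from collections import defaultdict

def nfc(text):
    """Normalize to NFC so accented letters compare reliably."""
    return unicodedata.normalize("NFC", text)

# A subset of the words this task reports as analyzing to an empty token.
EMPTY_TOKEN_WORDS = {nfc(w) for w in ("εστάτο", "εστέρες", "οτερό")}

def toy_analyze(word):
    """Hypothetical analyzer: lowercase, then emit "" for the listed words."""
    lowered = nfc(word).lower()
    return "" if lowered in EMPTY_TOKEN_WORDS else lowered

def build_index(titles):
    """Map each analyzed token to the set of titles that produced it."""
    index = defaultdict(set)
    for title in titles:
        index[toy_analyze(title)].add(title)
    return index

def search(index, query):
    """Return titles whose analyzed token equals the analyzed query."""
    return sorted(index.get(toy_analyze(query), set()))

titles = ["Εστέρες", "Οτερό", "Αθήνα"]
index = build_index(titles)

# All three problem words share the index term "", so they match each other.
hits = search(index, "εστάτο")
```

Because every affected word is indexed under the same empty term, a query for one of them matches documents containing any of the others, even though the words are unrelated.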

A straightforward solution would be to unpack the Greek analyzer and add a filter for empty tokens. These words would no longer be conflated, and exact matches would still be available through the plain index.
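A minimal sketch of what such an unpacked configuration could look like, written as Elasticsearch index settings in a Python dict. It follows the filter chain the Elasticsearch documentation gives for rebuilding the built-in `greek` analyzer (minus the optional keyword marker) and appends a `length` filter with `min: 1` to drop empty tokens. The names `remove_empty` and `greek_unpacked` are hypothetical, and this is not necessarily the configuration CirrusSearch ends up with.

```python
def greek_analysis_settings():
    """Build index settings for an unpacked Greek analyzer (sketch only)."""
    return {
        "analysis": {
            "filter": {
                # Greek-aware lowercasing, as used by the built-in analyzer.
                "greek_lowercase": {"type": "lowercase", "language": "greek"},
                "greek_stop": {"type": "stop", "stopwords": "_greek_"},
                "greek_stemmer": {"type": "stemmer", "language": "greek"},
                # Hypothetical name: drops any token shorter than 1 character.
                "remove_empty": {"type": "length", "min": 1},
            },
            "analyzer": {
                # Hypothetical analyzer name for illustration.
                "greek_unpacked": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "greek_lowercase",
                        "greek_stop",
                        "greek_stemmer",
                        # Must run after the stemmer, which is assumed to be
                        # where the empty tokens come from.
                        "remove_empty",
                    ],
                }
            },
        }
    }

settings = greek_analysis_settings()
chain = settings["analysis"]["analyzer"]["greek_unpacked"]["filter"]
```

Placing the length filter last means it catches empty tokens regardless of which earlier filter produced them.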


Related Gerrit Patches:
mediawiki/extensions/CirrusSearch : master · Add Greek empty-token filter and keep lang-specific lowercasing

Event Timeline

TJones created this task. · Aug 29 2018, 9:03 PM
Restricted Application added a subscriber: Aklapper. · Aug 29 2018, 9:03 PM
EBjune triaged this task as Medium priority. · Aug 30 2018, 5:25 PM
EBjune moved this task from needs triage to Up Next on the Discovery-Search board.
TJones moved this task from Up Next to later on... on the Discovery-Search board. · Nov 13 2018, 6:47 PM
TJones claimed this task. · Feb 26 2019, 4:48 PM
TJones moved this task from Language Stuff to Current work on the Discovery-Search board.

Change 494846 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/extensions/CirrusSearch@master] Add Greek empty-token filter and keep lang-specific lowercasing

Unpacking the Greek analyzer exposes its lowercase filter, which then gets upgraded to icu_normalizer, and the Greek-specific processing in that filter is lost! So we need to keep the Greek lowercasing even when we do ICU normalization. After that, everything is copacetic. Full write-up on MediaWiki.
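One way to picture the fix is a small helper that upgrades a filter chain without discarding language-specific lowercasing. This is a hypothetical Python sketch, not the code from the patch: the function name, the rule of detecting language-specific filters by their `language` option, and the choice to place `icu_normalizer` directly after the kept filter are all assumptions made for illustration.

```python
def upgrade_lowercase(chain, filters):
    """Add icu_normalizer to a filter chain (hypothetical helper).

    chain   -- ordered list of filter names used by an analyzer
    filters -- dict of custom filter definitions, keyed by name

    A generic lowercase filter is replaced by icu_normalizer. A lowercase
    filter with a "language" option is kept, and icu_normalizer is added
    right after it, so the language-specific processing is not lost.
    """
    upgraded = []
    for name in chain:
        # Names without a custom definition are treated as built-in filters.
        config = filters.get(name, {"type": name})
        if config.get("type") != "lowercase":
            upgraded.append(name)
        elif "language" in config:
            upgraded.extend([name, "icu_normalizer"])
        else:
            upgraded.append("icu_normalizer")
    return upgraded

greek_filters = {"greek_lowercase": {"type": "lowercase", "language": "greek"}}
greek_chain = upgrade_lowercase(["greek_lowercase", "greek_stop"], greek_filters)
generic_chain = upgrade_lowercase(["lowercase", "stop"], {})
```

Under these assumptions the Greek chain keeps `greek_lowercase` and gains `icu_normalizer`, while a chain with the plain `lowercase` filter has it swapped out as before.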

Change 494846 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Add Greek empty-token filter and keep lang-specific lowercasing

debt closed this task as Resolved. · Mar 14 2019, 9:21 PM