
Unpack Turkish Analyzer and improve apostrophe handling
Closed, Resolved · Public · 5 Estimated Story Points

Description

Unpack Turkish Elastic analyzer and implement apostrophe improvements in a plugin for simplicity, performance, and reusability by others.

The new plugin will need to be built and deployed, the config changes deployed, and the Turkish-language wikis reindexed.
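For orientation, "unpacking" here means replacing the monolithic built-in `turkish` analyzer with an equivalent custom analyzer whose filters can be swapped individually. The sketch below follows the rebuilt `turkish` analyzer described in the Elasticsearch documentation (with the optional keyword-marker filter omitted) and then replaces the stock `apostrophe` filter. The helper name `with_better_apostrophe` is hypothetical, and the assumption that the plugin registers its filter under exactly the name `better_apostrophe` comes only from the patch titles; the actual CirrusSearch configuration may differ.

```python
import copy

# Unpacked equivalent of the built-in "turkish" analyzer, per the
# Elasticsearch docs (optional keyword_marker filter omitted).
REBUILT_TURKISH = {
    "analysis": {
        "filter": {
            "turkish_stop": {"type": "stop", "stopwords": "_turkish_"},
            "turkish_lowercase": {"type": "lowercase", "language": "turkish"},
            "turkish_stemmer": {"type": "stemmer", "language": "turkish"},
        },
        "analyzer": {
            "rebuilt_turkish": {
                "tokenizer": "standard",
                "filter": [
                    "apostrophe",
                    "turkish_lowercase",
                    "turkish_stop",
                    "turkish_stemmer",
                ],
            }
        },
    }
}


def with_better_apostrophe(settings):
    """Return a copy of the settings with the stock apostrophe filter
    replaced by the plugin filter (name assumed: "better_apostrophe")."""
    new = copy.deepcopy(settings)
    analyzer = new["analysis"]["analyzer"]["rebuilt_turkish"]
    analyzer["filter"] = [
        "better_apostrophe" if name == "apostrophe" else name
        for name in analyzer["filter"]
    ]
    return new


if __name__ == "__main__":
    updated = with_better_apostrophe(REBUILT_TURKISH)
    print(updated["analysis"]["analyzer"]["rebuilt_turkish"]["filter"])
```

Once unpacked, only the first entry in the filter chain changes; the tokenizer, lowercasing, stop words, and stemmer are left as they were.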

Notes from https://phabricator.wikimedia.org/T325091#8619132

Found a problem with the apostrophe filter for Turkish: it is very aggressive and does bad things to French and Italian text, which is common in names, sources, etc. For example, d'Onofrio'nun, d'administration, d'administration'dan, and d'Arthur'unda all get indexed as plain d. Not optimal.
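The behavior is easy to reproduce outside the search stack. The stock Lucene apostrophe filter drops the first apostrophe (straight or curly) and everything after it. The function below is a minimal Python re-creation of that rule for illustration, not the Lucene code itself, and the name `stock_apostrophe` is made up for this sketch.

```python
def stock_apostrophe(token: str) -> str:
    """Mimic the stock rule: cut the token at its first apostrophe."""
    for i, ch in enumerate(token):
        if ch in ("'", "\u2019"):  # straight or right-curly apostrophe
            return token[:i]
    return token


if __name__ == "__main__":
    for word in ["Türkiye'den", "d'Onofrio'nun", "d'administration",
                 "d'administration'dan", "d'Arthur'unda"]:
        print(word, "->", stock_apostrophe(word))
```

The rule works for Türkiye'den, which becomes Türkiye, but every one of the d'... examples collapses to d.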

I've come up with a bunch of heuristics that improve the apostrophe processing. Implementing them as a collection of existing filters is a mess, so making a plugin seems like a good approach—it also makes the logic more easily reusable by others.

I'm going to spin off Turkish as its own ticket and finish up the other two first.

Event Timeline

TJones set the point value for this task to 5.

Change 898087 had a related patch set uploaded (by Tjones; author: Tjones):

[search/extra@master] Create better_apostrophe in extra-analysis-turkish

https://gerrit.wikimedia.org/r/898087

Change 898087 merged by jenkins-bot:

[search/extra@master] Create better_apostrophe in extra-analysis-turkish

https://gerrit.wikimedia.org/r/898087

Change 901645 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Unpack Turkish Analyzer, enable better_apostrophe

https://gerrit.wikimedia.org/r/901645

Full write-up with lots more details on MediaWiki.

Highlights:

  • Turkish uses apostrophes when inflecting proper names (e.g., Türkiye'den). The Elastic/Lucene token filter for this is not smart and does bad things to French words (d'immortalite), Irish names (O'Connell), and others. The new plugin/filter (change 898087 above) is much better.
  • Everything else was pretty much by the numbers.
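To make the first highlight concrete, here is one conceivable heuristic of the kind a smarter filter could apply. It is purely illustrative and is not taken from change 898087: the prefix list, the function name `sketch_apostrophe`, and the decision to keep the elided prefix attached are all assumptions made for this example.

```python
# Hypothetical list of one-letter elision-style prefixes (assumption for
# illustration only; the real plugin's rules are not reproduced here).
ELISION_PREFIXES = {"d", "l", "o"}


def sketch_apostrophe(token: str) -> str:
    """Strip a Turkish suffix after an apostrophe, but treat a short
    elision prefix (d', l', O') as part of the word itself."""
    parts = token.replace("\u2019", "'").split("'")
    if len(parts) == 1:
        return token  # no apostrophe, nothing to do
    if parts[0].lower() in ELISION_PREFIXES:
        # Keep prefix + word; drop anything after a second apostrophe.
        return "'".join(parts[:2])
    # Ordinary case: everything after the first apostrophe is a suffix.
    return parts[0]


if __name__ == "__main__":
    for word in ["Türkiye'den", "O'Connell", "d'Onofrio'nun",
                 "d'administration'dan"]:
        print(word, "->", sketch_apostrophe(word))
```

Under these assumptions Türkiye'den still reduces to Türkiye, while O'Connell is left intact and d'Onofrio'nun becomes d'Onofrio instead of d.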

Change 901645 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Unpack Turkish Analyzer, enable better_apostrophe

https://gerrit.wikimedia.org/r/901645