
Unpack Romanian, Sorani Elasticsearch Analyzers
Closed, Resolved | Public | 5 Estimated Story Points

Description

See parent task for details.

(These were chosen next more or less at random.)

[Spun off Turkish because it became more complicated.]

Event Timeline

TJones triaged this task as High priority. Dec 13 2022, 6:58 PM
TJones moved this task from needs triage to Language Stuff on the Discovery-Search board.
TJones set the point value for this task to 5.

Found a problem with the apostrophe filter for Turkish, which is very aggressive and does bad things to French and Italian (which are common in names, sources, etc.). For example, d'Onofrio'nun, d'administration, d'administration'dan, and d'Arthur'unda all get indexed as plain d. Not optimal.

I've come up with a bunch of heuristics that improve the apostrophe processing. Implementing them as a collection of existing filters is a mess, so making a plugin seems like a good approach; it also makes the logic more easily reusable by others.

I'm going to spin off Turkish as its own ticket and finish up the other two first.
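To make the failure mode concrete, here is a minimal sketch of a "drop everything from the first apostrophe onward" rule, which is how the examples above end up as plain d. The function name and the set of apostrophe characters are assumptions for illustration only; this is not the actual filter code and it does not include any of the improved heuristics.

```python
# Illustrative only: a stand-in for an aggressive apostrophe rule that keeps
# just the part of a token before its first apostrophe.
# APOSTROPHES and strip_after_apostrophe are hypothetical names.
APOSTROPHES = ("'", "\u2019")  # straight and curly apostrophe


def strip_after_apostrophe(token: str) -> str:
    """Return the token truncated at its first apostrophe, if it has one."""
    cut = len(token)
    for ch in APOSTROPHES:
        idx = token.find(ch)
        if idx != -1:
            cut = min(cut, idx)
    return token[:cut]


if __name__ == "__main__":
    examples = [
        "d'Onofrio'nun",
        "d'administration",
        "d'administration'dan",
        "d'Arthur'unda",
    ]
    for tok in examples:
        print(tok, "->", strip_after_apostrophe(tok))
```

Every example collapses to the same one-letter token because the French/Italian elision apostrophe comes before the Turkish suffix apostrophe.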

TJones renamed this task from Unpack Romanian, Turkish, Sorani Elasticsearch Analyzers to Unpack Romanian, Sorani Elasticsearch Analyzers. Feb 15 2023, 4:32 PM
TJones updated the task description.

Full notes on MediaWiki.

  • Sorani was pretty straightforward.
    • ICU normalization cleaned up a lot of initial/medial/final form Arabic letters.
    • ICU folding had more impact on Latin than Arabic-script tokens, but plenty of both.
  • Romanian got interesting!
    • There's often confusion between s/t with cedilla (ş & ţ) and the more correct forms with comma (ș & ț), for historical encoding/support reasons. Added a filter to correct for it, and it had a nice effect.
    • However, it decreased the number of stop words filtered, because the stop words all use the incorrect/older cedilla forms! After enabling additional stop words with the comma forms, ~3.4% of terms in the Wikipedia sample turn out to be the Romanian word for "and" (și), which should be filtered as a stop word!
    • I'll contact the stop word list creator and open a ticket and/or pull request upstream for Lucene to include both forms.
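The Romanian cedilla/comma interaction described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the CirrusSearch or Lucene implementation: the function names, the one-word stop list, and the sample tokens are all hypothetical.

```python
# Map the older cedilla letters to the preferred comma-below letters.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015F": "\u0219",  # ş -> ș
    "\u015E": "\u0218",  # Ş -> Ș
    "\u0163": "\u021B",  # ţ -> ț
    "\u0162": "\u021A",  # Ţ -> Ț
})


def fix_cedillas(text: str) -> str:
    """Replace s/t with cedilla by s/t with comma below."""
    return text.translate(CEDILLA_TO_COMMA)


def filter_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop word set."""
    return [t for t in tokens if t not in stop_words]


# Hypothetical one-word stop list, spelled with the older cedilla form.
CEDILLA_STOP_WORDS = {"\u015Fi"}  # "şi" (cedilla)

# Hypothetical input containing both spellings of "and".
tokens = [fix_cedillas(t) for t in ["\u015Fi", "\u0219i", "carte"]]

# With only the cedilla stop list, the corrected tokens no longer match.
unfiltered = filter_stop_words(tokens, CEDILLA_STOP_WORDS)

# Adding the comma-form spellings restores stop word filtering.
both_forms = CEDILLA_STOP_WORDS | {fix_cedillas(w) for w in CEDILLA_STOP_WORDS}
filtered = filter_stop_words(tokens, both_forms)

if __name__ == "__main__":
    print(unfiltered)
    print(filtered)
```

The point of the sketch is the ordering problem: once tokens are corrected to comma forms, a stop list that only has cedilla spellings stops matching, so the list needs both forms.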

Patch forthcoming.

Change 891867 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Unpack Romanian and Sorani Analyzers

https://gerrit.wikimedia.org/r/891867

Change 891867 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Unpack Romanian and Sorani Analyzers

https://gerrit.wikimedia.org/r/891867