See parent task for details.
(These were chosen next more or less at random.)
[Spun off Turkish because it became more complicated.]
Subject | Repo | Branch | Lines +/-
---|---|---|---
Unpack Romanian and Sorani Analyzers | mediawiki/extensions/CirrusSearch | master | +442 -21
Status | Assigned | Task
---|---|---
Resolved | TJones | T219550 [EPIC] Harmonize language analysis across languages
Resolved | Gehel | T272606 [EPIC] Unpack all Elasticsearch analyzers
Resolved | TJones | T325091 Unpack Romanian, Sorani Elasticsearch Analyzers
Resolved | TJones | T330893 Map Romanian s&t with comma to cedilla internally
Resolved | TJones | T330783 Reindex Romanian, Sorani wikis to enable unpacked analyzers
Found a problem with the apostrophe filter for Turkish: it is very aggressive and does bad things to French and Italian words (which are common in names, sources, etc.). For example, d'Onofrio'nun, d'administration, d'administration'dan, and d'Arthur'unda all get indexed as plain d. Not optimal.
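To make the failure mode concrete, here is a minimal Python sketch of the behavior described above. It assumes the filter simply drops everything from the first apostrophe onward (which is how Lucene's ApostropheFilter is documented to behave); the function name `strip_after_apostrophe` is hypothetical and this is not the actual Java implementation. The last token, Türkiye'nin, is an illustrative addition showing the case the filter is meant for.

```python
# Rough emulation of an "aggressive" apostrophe filter: keep only the text
# before the first apostrophe. Assumption: both the ASCII apostrophe and
# U+2019 (right single quotation mark) are treated as apostrophes.
APOSTROPHES = ("'", "\u2019")


def strip_after_apostrophe(token: str) -> str:
    """Return the token truncated at its first apostrophe, if any."""
    for i, ch in enumerate(token):
        if ch in APOSTROPHES:
            return token[:i]
    return token


if __name__ == "__main__":
    tokens = [
        "d'Onofrio'nun",         # Italian name + Turkish suffix
        "d'administration",      # plain French
        "d'administration'dan",  # French + Turkish suffix
        "d'Arthur'unda",         # French/English name + Turkish suffix
        "Türkiye'nin",           # ordinary Turkish proper noun + suffix
    ]
    for token in tokens:
        print(f"{token} -> {strip_after_apostrophe(token)}")
```

Under that assumption, every token starting with the elided d' collapses to the same one-letter term, while the ordinary Turkish case is reduced to its stem as intended.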
I've come up with a bunch of heuristics that improve the apostrophe processing. Implementing them as a collection of existing filters is a mess, so making a plugin seems like a good approach; it also makes the logic more easily reusable by others.
I'm going to spin off Turkish as its own ticket and finish up the other two first.
Full notes on MediaWiki.
Patch forthcoming.
Change 891867 had a related patch set uploaded (by Tjones; author: Tjones):
[mediawiki/extensions/CirrusSearch@master] Unpack Romanian and Sorani Analyzers
Change 891867 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Unpack Romanian and Sorani Analyzers