
Map modifier letter apostrophes to straight or curly quotes in the French Elasticsearch analysis chain
Closed, Resolved, Public

Description

While looking into enabling ICU folding for French in T146402, I discovered that, in rare cases, a "modifier letter apostrophe" is used instead of a straight quote or curly right quote in French Wikipedia articles. Before ICU folding these were indexed incorrectly; with ICU folding they will still be indexed incorrectly, but in a different way.

This is particularly relevant for French when elision occurs (d'un, d'en, l'idée), as the articles or other bits that glom onto the word through elision can't be stripped properly.

We should add a char filter to map modifier letter apostrophes (ʼ U+02BC) to straight or curly right single quotes (' U+0027 or ’ U+2019), and test appropriately.
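
As a rough illustration of the proposal, the sketch below shows what such a char filter could look like and why it matters for elision. It is not the merged patch: the filter name `apostrophe_norm` and the helper functions are hypothetical, and `strip_elision` is only a simplified stand-in for the real French elision filter. The `mapping` char filter type and its `source=>target` rule syntax are standard Elasticsearch.

```python
import re

MODIFIER_APOSTROPHE = "\u02BC"  # ʼ modifier letter apostrophe
STRAIGHT_QUOTE = "\u0027"       # ' straight quote

# Analysis settings fragment for an Elasticsearch "mapping" char filter.
# The key "apostrophe_norm" is a made-up name used for illustration only.
APOSTROPHE_CHAR_FILTER = {
    "apostrophe_norm": {
        "type": "mapping",
        "mappings": [f"{MODIFIER_APOSTROPHE}=>{STRAIGHT_QUOTE}"],
    }
}

# Simplified stand-in for French elision stripping (assumption: the real
# filter uses its own article list). It only recognizes straight and
# curly right quotes, which is why U+02BC slips through unmapped.
ELISION = re.compile(r"^(?:l|d|j|m|t|s|n|c|qu)['\u2019]", re.IGNORECASE)


def normalize_apostrophes(text: str) -> str:
    """Local simulation of the char filter: map U+02BC to U+0027."""
    return text.replace(MODIFIER_APOSTROPHE, STRAIGHT_QUOTE)


def strip_elision(token: str) -> str:
    """Remove a leading elided article such as l' or d' from a token."""
    return ELISION.sub("", token)
```

Under these assumptions, `strip_elision("l\u02BCidée")` leaves the token unchanged because the apostrophe is not recognized, while `strip_elision(normalize_apostrophes("l\u02BCidée"))` reduces it to `idée`.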

Out of ~10K French Wikipedia articles, this affects 5 tokens (1 each of 5 types)—so it’s not a big problem, but it would be nice to deal with it automatically.

Event Timeline

debt triaged this task as Medium priority. Sep 30 2016, 7:46 PM
debt moved this task from needs triage to Up Next on the Discovery-Search board.

Change 314569 had a related patch set uploaded (by Tjones):
Map modifier letter apostrophes to straight quotes for French

https://gerrit.wikimedia.org/r/314569

Change 314569 merged by jenkins-bot:
Map modifier letter apostrophes to straight quotes for French

https://gerrit.wikimedia.org/r/314569