
Map modifier letter apostrophes to straight or curly quotes in the French Elasticsearch analysis chain
Closed, Resolved · Public

Description

While looking into enabling ICU folding for French in T146402, I discovered that, in rare cases, a "modifier letter apostrophe" is used instead of a straight quote or curly right quote in French Wikipedia articles. Before ICU folding they were being indexed incorrectly; with ICU folding they will still be indexed incorrectly, but in a different way.

This is particularly relevant for French when elision occurs (d'un, d'en, l'idée), as the articles or other bits that glom onto the word through elision can't be stripped properly.

We should add a char filter to map modifier letter apostrophes (ʼ U+02BC) to straight or curly right single quotes (' U+0027 or ’ U+2019), and test appropriately.
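As a rough illustration of the proposal, the sketch below shows what such a mapping could look like and why it matters for elision. It is not the patch itself: the filter name `apostrophe_norm`, the settings fragment, and the toy `strip_elision` function are assumptions made for this example, and the regex is only a simplified stand-in for a real French elision filter. Elasticsearch's `mapping` char filter type and its `source=>target` rule syntax are real; everything else is hypothetical.

```python
import re

MODIFIER_LETTER_APOSTROPHE = "\u02bc"  # ʼ
STRAIGHT_APOSTROPHE = "\u0027"         # '

# Hypothetical analysis settings fragment. The filter name is illustrative;
# only the "mapping" type and the "a=>b" rule syntax come from Elasticsearch.
FRENCH_CHAR_FILTER = {
    "char_filter": {
        "apostrophe_norm": {
            "type": "mapping",
            "mappings": [f"{MODIFIER_LETTER_APOSTROPHE}=>{STRAIGHT_APOSTROPHE}"],
        }
    }
}


def map_apostrophes(text: str) -> str:
    """Emulate the char filter: replace U+02BC with U+0027."""
    return text.replace(MODIFIER_LETTER_APOSTROPHE, STRAIGHT_APOSTROPHE)


# Toy stand-in for an elision filter: it only recognizes the straight
# apostrophe and the curly right single quote after a short list of articles.
ELISION = re.compile(r"^(?:l|d|j|m|t|s|n|c|qu)['\u2019]", re.IGNORECASE)


def strip_elision(token: str) -> str:
    """Remove a leading elided article such as l' or d' from a token."""
    return ELISION.sub("", token)


if __name__ == "__main__":
    token = "l\u02bcid\u00e9e"  # "lʼidée" written with U+02BC
    print(strip_elision(token))                   # elision is not stripped
    print(strip_elision(map_apostrophes(token)))  # elision is stripped
```

Mapping before tokenization means any later step that keys on `'` sees a character it already handles, so nothing downstream needs to know about U+02BC.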

Out of ~10K French Wikipedia articles, this affects 5 tokens (1 each of 5 types)—so it’s not a big problem, but it would be nice to deal with it automatically.

Event Timeline

TJones created this task. Sep 27 2016, 7:06 PM
Restricted Application added a subscriber: Aklapper. Sep 27 2016, 7:06 PM
debt triaged this task as Normal priority. Sep 30 2016, 7:46 PM
debt moved this task from needs triage to Up Next on the Discovery-Search board.
debt moved this task from Up Next to Current work on the Discovery-Search board. Oct 4 2016, 5:22 PM
debt edited projects, added Discovery-Search (Current work); removed Discovery-Search.
TJones claimed this task. Oct 4 2016, 5:26 PM

Change 314569 had a related patch set uploaded (by Tjones):
Map modifier letter apostrophes to straight quotes for French

https://gerrit.wikimedia.org/r/314569

Change 314569 merged by jenkins-bot:
Map modifier letter apostrophes to straight quotes for French

https://gerrit.wikimedia.org/r/314569

debt closed this task as Resolved. Oct 21 2016, 7:26 PM