User Story: As an on-wiki searcher, I want to be able to search for words that have apostrophes in them without having to know or worry about what apostrophe-like character is actually used. For example, at least seven different characters are used on various projects in the name of the city in Yemen: Ma'rib, Maʿrib, Maʾrib, Maʼrib, Ma`rib, Ma’rib, Ma‘rib.
Notes: We have a new character filter, apostrophe_norm, which converts the other six characters to the straight apostrophe (U+0027). It is currently configured for use only on Nias Wikipedia.
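As a rough illustration of what such a filter does, here is a minimal Python sketch. It is not the actual apostrophe_norm implementation; the names APOSTROPHE_LIKE and normalize_apostrophes are hypothetical, and the mapping simply encodes the six non-straight characters listed in this task.

```python
# Hypothetical stand-in for the apostrophe_norm character filter:
# map each apostrophe-like character from this task to U+0027.
APOSTROPHE_LIKE = {
    "\u02BF": "'",  # MODIFIER LETTER LEFT HALF RING
    "\u02BE": "'",  # MODIFIER LETTER RIGHT HALF RING
    "\u02BC": "'",  # MODIFIER LETTER APOSTROPHE
    "\u0060": "'",  # GRAVE ACCENT (backtick)
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
}
_TABLE = str.maketrans(APOSTROPHE_LIKE)


def normalize_apostrophes(text: str) -> str:
    """Replace every apostrophe-like character with the straight apostrophe."""
    return text.translate(_TABLE)


variants = [
    "Ma'rib", "Ma\u02BFrib", "Ma\u02BErib", "Ma\u02BCrib",
    "Ma\u0060rib", "Ma\u2019rib", "Ma\u2018rib",
]
normalized = {normalize_apostrophes(v) for v in variants}
print(normalized)  # {"Ma'rib"}
```

Because the replacement happens before tokenization, the backtick variant would no longer be split by the tokenizer on its own; what happens to the straight apostrophe afterwards still depends on each wiki's later filters.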
There is also a lot of cross-wiki inconsistency in how these characters are treated. The table below shows how each variant is analyzed on English, Japanese, and French wikis. The standard tokenizer splits on the backtick (`, U+0060), so Ma`rib is always split into two words (ma is a stop word in French, so it gets dropped there).
English has the aggressive_splitting filter enabled, which splits on three of the other characters (left and right curly apostrophes and the straight apostrophe). icu_folding removes the left and right half rings in English and French, though French has the "preserve" variant, which keeps the original, too. icu_folding also straightens the curly apostrophes in French, but aggressive_splitting has already split on them in English.
char | U+0027 | U+02BF | U+02BE | U+02BC | U+0060 | U+2019 | U+2018 |
input | Ma'rib | Maʿrib | Maʾrib | Maʼrib | Ma`rib | Ma’rib | Ma‘rib |
en | ma, rib | marib | marib | marib | ma, rib | ma, rib | ma, rib |
ja | ma'rib | maʿrib | maʾrib | maʼrib | ma, rib | ma’rib | ma‘rib |
fr | ma'rib | marib/maʿrib | marib/maʾrib | ma'rib | (ma,) rib | ma'rib/ma’rib | ma'rib/ma‘rib |
If we work on T219108, we should also consider removing apostrophes from aggressive_splitting.
Acceptance Criteria:
- apostrophe_norm is enabled everywhere (or at least by default, possibly with exceptions or customization for some languages for reasons as yet unknown)
- All of Ma'rib, Maʿrib, Maʾrib, Maʼrib, Ma`rib, Ma’rib, Ma‘rib index to the same form on all wikis, apart from any intentional exceptions.
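
The second criterion could be checked along these lines. This is a sketch under assumptions: it writes the rules in the "source=>target" style used by Elasticsearch's mapping character filter, but the real apostrophe_norm configuration may differ, and the helper names (build_char_filter, apply_mappings, distinct_forms) are hypothetical. It models only the character filter plus lowercasing, not the full per-wiki analysis chain.

```python
import json

STRAIGHT = "'"
# The six non-straight characters from the table in this task.
VARIANT_CODEPOINTS = [0x02BF, 0x02BE, 0x02BC, 0x0060, 0x2019, 0x2018]


def build_char_filter() -> dict:
    """Build an assumed mapping-style char filter definition."""
    return {
        "type": "mapping",
        "mappings": [f"{chr(cp)}=>{STRAIGHT}" for cp in VARIANT_CODEPOINTS],
    }


def apply_mappings(char_filter: dict, text: str) -> str:
    """Tiny local interpreter for the 'source=>target' rules above."""
    for rule in char_filter["mappings"]:
        source, target = rule.split("=>")
        text = text.replace(source, target)
    return text


def distinct_forms(inputs) -> set:
    """Return the set of forms left after filtering and lowercasing."""
    char_filter = build_char_filter()
    return {apply_mappings(char_filter, s).lower() for s in inputs}


INPUTS = [
    "Ma'rib", "Ma\u02BFrib", "Ma\u02BErib", "Ma\u02BCrib",
    "Ma\u0060rib", "Ma\u2019rib", "Ma\u2018rib",
]

forms = distinct_forms(INPUTS)
assert len(forms) == 1, f"variants did not converge: {forms}"

# Show the assumed definition as JSON (non-ASCII characters are escaped).
print(json.dumps({"apostrophe_norm_sketch": build_char_filter()}, indent=2))
```

A real acceptance test would need to run each wiki's full analyzer, since later filters such as aggressive_splitting on English still act on the straight apostrophe; the criterion is only that all seven inputs end up with the same output.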
Note: this is a follow-up to T311654, which looked at this issue for just one language (Nias).