Handle variation in apostrophe-like characters better
Closed, Resolved | Public | 3 Estimated Story Points

Description

User Story: As an on-wiki searcher, I want to be able to search for words that have apostrophes in them without having to know or worry about what apostrophe-like character is actually used. For example, at least seven different characters are used on various projects in the name of the city in Yemen: Ma'rib, Maʿrib, Maʾrib, Maʼrib, Ma`rib, Ma’rib, Ma‘rib.

Notes: We have a new character filter, apostrophe_norm, currently configured for use only on Nias Wikipedia, which converts the other six options to the straight apostrophe.
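As a rough illustration of what such a character filter does, here is a minimal Python sketch that folds the six variants listed above to the straight apostrophe. It is not the CirrusSearch implementation: the names `APOSTROPHE_LIKE` and `normalize_apostrophes` are hypothetical, and only the seven code points named in this task are covered.

```python
# Hypothetical sketch: fold apostrophe-like characters to U+0027.
# Covers only the six variants named in this task, not any production set.
APOSTROPHE_LIKE = "\u02BF\u02BE\u02BC\u0060\u2019\u2018"

# Map every variant to the straight apostrophe, one-to-one by position.
_TABLE = str.maketrans(APOSTROPHE_LIKE, "'" * len(APOSTROPHE_LIKE))


def normalize_apostrophes(text: str) -> str:
    """Replace each apostrophe-like character in text with U+0027."""
    return text.translate(_TABLE)


# The seven spellings of the city name from the user story.
variants = [
    "Ma'rib",
    "Ma\u02BFrib",
    "Ma\u02BErib",
    "Ma\u02BCrib",
    "Ma\u0060rib",
    "Ma\u2019rib",
    "Ma\u2018rib",
]

# Collect the distinct normalized forms; all seven should collapse to one.
normalized = {normalize_apostrophes(v) for v in variants}
print(normalized)
```

In a real analysis chain this step would run before tokenization, so the tokenizer and later filters only ever see the straight apostrophe.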

These characters are also treated inconsistently across wikis. The table below shows how they are analyzed in English, Japanese, and French wikis. The standard tokenizer splits on backticks (` U+0060), so the backtick form is always split into two words (ma is a stop word in French, so it is dropped there).

English has the aggressive_splitting filter enabled, which splits on three of the other characters (left and right curly apostrophes and the straight apostrophe). icu_folding removes the left and right half rings in English and French, though French has the "preserve" variant, which keeps the original, too. icu_folding also straightens the curly apostrophes in French, but aggressive_splitting has already split on them in English.

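| char  | U+0027  | U+02BF       | U+02BE       | U+02BC | U+0060    | U+2019        | U+2018        |
| ----- | ------- | ------------ | ------------ | ------ | --------- | ------------- | ------------- |
| input | Ma'rib  | Maʿrib       | Maʾrib       | Maʼrib | Ma`rib    | Ma’rib        | Ma‘rib        |
| en    | ma, rib | marib        | marib        | marib  | ma, rib   | ma, rib       | ma, rib       |
| ja    | ma'rib  | maʿrib       | maʾrib       | maʼrib | ma, rib   | ma’rib        | ma‘rib        |
| fr    | ma'rib  | marib/maʿrib | marib/maʾrib | ma'rib | (ma,) rib | ma'rib/ma’rib | ma'rib/ma‘rib |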

If we work on T219108, we should also consider removing apostrophes from aggressive_splitting.

Acceptance Criteria:

  • apostrophe_norm is enabled everywhere (or at least by default, possibly with exceptions or customization for some languages for reasons as yet unknown)
  • All of Ma'rib, Maʿrib, Maʾrib, Maʼrib, Ma`rib, Ma’rib, Ma‘rib index to the same form in all or almost all wikis (i.e., with intentional exceptions).
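
To picture the first criterion: Elasticsearch provides a `mapping` character filter type that rewrites listed strings before tokenization. The sketch below builds such a definition in Python for the six variants from the description. Whether CirrusSearch defines apostrophe_norm exactly this way is an assumption, and `build_apostrophe_norm` is a hypothetical helper, not part of any library.

```python
import json

# Code points of the six non-straight variants named in this task.
VARIANT_CODE_POINTS = [0x02BF, 0x02BE, 0x02BC, 0x0060, 0x2019, 0x2018]


def build_apostrophe_norm(code_points):
    """Return illustrative analysis settings with a mapping-type char filter.

    Each rule uses the "key => value" syntax of the Elasticsearch
    mapping character filter, rewriting one variant to U+0027.
    """
    return {
        "analysis": {
            "char_filter": {
                "apostrophe_norm": {
                    "type": "mapping",
                    "mappings": [f"{chr(cp)} => '" for cp in code_points],
                }
            }
        }
    }


settings = build_apostrophe_norm(VARIANT_CODE_POINTS)
# json.dumps escapes non-ASCII characters as \uXXXX by default,
# which keeps the visually similar characters distinguishable.
print(json.dumps(settings, indent=2))
```

Generating the rules from code points rather than typing the characters directly avoids confusing look-alike glyphs, which is the very problem this task describes.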

Note: this is a follow-up to T311654, which looked at this issue for just one language (Nias).

Event Timeline

@TJones: Would this be about CirrusSearch code, or where would this be located?

MPhamWMF triaged this task as Medium priority. (Aug 15 2022, 3:26 PM)
MPhamWMF moved this task from needs triage to Language Stuff on the Discovery-Search board.

Change 927785 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Merge Apostrophe-Like Characters for All Languages

https://gerrit.wikimedia.org/r/927785

Full write up on MediaWiki.

Highlights:

  • The final set of 19 apostrophe-like characters to be normalized to apostrophes is [`´ʹʻʼʽʾʿˋ՚׳‘’‛′‵ꞌ'`].
  • Enabling the new apostrophe_norm makes new matches on lots of names and English, French, & Italian words.
  • Lots of matches in the local language, too, for some languages.
  • Uzbek searchers really like to mix it up with their apostrophe-like options. The apostrophe form o'sha will now match o`sha, oʻsha, o‘sha, o’sha, o`sha, oʻsha, o‘sha, and o’sha—all of which exist in my samples!

After the patch is merged, we still need to reindex to see these benefits. Since we need to reindex everything, it's best to wait a while and pick up more than one harmonization update when reindexing.

Change 927785 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Merge Apostrophe-Like Characters for All Languages

https://gerrit.wikimedia.org/r/927785