Handle variation in apostrophe-like characters better
Closed, Resolved · Public · 3 Estimated Story Points
Assigned To

Authored By

	TJones
	Aug 12 2022, 9:11 PM

Description

User Story: As an on-wiki searcher, I want to be able to search for words that have apostrophes in them without having to know or worry about what apostrophe-like character is actually used. For example, at least seven different characters are used on various projects in the name of the city in Yemen: Ma'rib, Maʿrib, Maʾrib, Maʼrib, Ma`rib, Ma’rib, Ma‘rib.

Notes: We have a new character filter, apostrophe_norm, currently configured for use only on Nias Wikipedia, which converts the other six options to the straight apostrophe.
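As a rough illustration of what such a filter could look like, here is a minimal sketch that builds an Elasticsearch `mapping` character filter definition for the six non-straight characters listed above. Whether apostrophe_norm is actually defined this way in CirrusSearch is an assumption; the helper name `build_apostrophe_norm` and the surrounding settings layout are hypothetical.

```python
import json

# The six apostrophe-like characters from the description (everything but U+0027):
# U+02BF, U+02BE, U+02BC, U+0060, U+2019, U+2018.
APOSTROPHE_LIKE = ["\u02bf", "\u02be", "\u02bc", "\u0060", "\u2019", "\u2018"]


def build_apostrophe_norm():
    """Return a char filter definition mapping each variant to U+0027.

    Assumption: a plain "mapping" char filter is enough; the real
    CirrusSearch definition may differ.
    """
    return {
        "type": "mapping",
        "mappings": [f"{ch}=>'" for ch in APOSTROPHE_LIKE],
    }


if __name__ == "__main__":
    # Hypothetical index settings fragment, printed for inspection only.
    settings = {
        "analysis": {"char_filter": {"apostrophe_norm": build_apostrophe_norm()}}
    }
    print(json.dumps(settings, ensure_ascii=False, indent=2))
```

Because a character filter runs before the tokenizer, a mapping like this would also turn the backtick into a straight apostrophe before the standard tokenizer gets a chance to split on it.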

There is also a lot of cross-wiki inconsistency in how these characters are treated. The table below shows how the characters are analyzed on English, Japanese, and French wikis. The standard tokenizer splits on backticks (` U+0060), so Ma`rib is always split into two words (ma is a stop word in French, so it gets dropped there).

English has the aggressive_splitting filter enabled, which splits on three of the other characters (left and right curly apostrophes and the straight apostrophe). icu_folding removes the left and right half rings in English and French, though French has the "preserve" variant, which keeps the original, too. icu_folding also straightens the curly apostrophes in French, but aggressive_splitting has already split on them in English.

char	U+0027	U+02BF	U+02BE	U+02BC	U+0060	U+2019	U+2018
input	Ma'rib	Maʿrib	Maʾrib	Maʼrib	Ma`rib	Ma’rib	Ma‘rib
en	ma, rib	marib	marib	marib	ma, rib	ma, rib	ma, rib
ja	ma'rib	maʿrib	maʾrib	maʼrib	ma, rib	ma’rib	ma‘rib
fr	ma'rib	marib/maʿrib	marib/maʾrib	ma'rib	(ma,) rib	ma'rib/ma’rib	ma'rib/ma‘rib

If we work on T219108, we should also consider removing apostrophes from aggressive_splitting.

Acceptance Criteria:

apostrophe_norm is enabled everywhere (or at least by default, possibly with exceptions or customization for some languages for reasons as yet unknown)
All of Ma'rib, Maʿrib, Maʾrib, Maʼrib, Ma`rib, Ma’rib, Ma‘rib index to the same form in all or almost all wikis (i.e., with intentional exceptions).
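The second criterion can be expressed as a small check. The sketch below is only a stand-in for a real analysis chain: `index_form` is a hypothetical function that normalizes the six variants and lowercases, which is far less than what an actual wiki analyzer does (for example, English aggressive_splitting would still split on the straight apostrophe).

```python
# The seven spellings from the description, written with explicit code points.
VARIANTS = [
    "Ma'rib",        # U+0027
    "Ma\u02bfrib",   # U+02BF
    "Ma\u02berib",   # U+02BE
    "Ma\u02bcrib",   # U+02BC
    "Ma\u0060rib",   # U+0060
    "Ma\u2019rib",   # U+2019
    "Ma\u2018rib",   # U+2018
]

# Translation table sending each non-straight variant to U+0027.
TO_STRAIGHT = str.maketrans(
    {ch: "'" for ch in "\u02bf\u02be\u02bc\u0060\u2019\u2018"}
)


def index_form(text):
    """Hypothetical simplified analysis: normalize apostrophes, then lowercase."""
    return text.translate(TO_STRAIGHT).lower()


def distinct_forms(words):
    """Return the set of distinct indexed forms for the given words."""
    return {index_form(w) for w in words}


if __name__ == "__main__":
    forms = distinct_forms(VARIANTS)
    # The criterion is met when all seven spellings collapse to one form.
    print(len(forms) == 1, forms)
```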

Note: this is a follow up to T311654, which looked at this issue for just one language (Nias).

Details

	Subject	Repo	Branch	Lines +/-
	Merge Apostrophe-Like Characters for All Languages	mediawiki/extensions/CirrusSearch	master	+2 K -390

Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T219550 [EPIC] Harmonize language analysis across languages
		Resolved		TJones	T315118 Handle variation in apostrophe-like characters better

Event Timeline

TJones created this task. Aug 12 2022, 9:11 PM

Restricted Application added a subscriber: Aklapper. Aug 12 2022, 9:11 PM

@TJones: Would this be about CirrusSearch code, or where would this be located?

MPhamWMF added a project: CirrusSearch. Aug 15 2022, 2:55 PM

MPhamWMF triaged this task as Medium priority. Aug 15 2022, 3:26 PM

MPhamWMF moved this task from needs triage to Language Stuff on the Discovery-Search board.

TJones mentioned this in T311654: Apostrophes do not work well in search on nia.wikipedia. Aug 29 2022, 2:38 PM

TJones updated the task description.

RhinosF1 subscribed. Aug 29 2022, 3:30 PM

TJones added a parent task: T219550: [EPIC] Harmonize language analysis across languages. Mar 6 2023, 6:34 PM

TJones mentioned this in T219550: [EPIC] Harmonize language analysis across languages. Mar 6 2023, 7:11 PM

TJones edited projects, added Discovery-Search (Current work); removed Discovery-Search. Mar 16 2023, 6:19 PM

MPhamWMF set the point value for this task to 3. Apr 10 2023, 3:50 PM

MPhamWMF moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

TJones claimed this task. May 16 2023, 8:28 PM

TJones moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.

Change 927785 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Merge Apostrophe-Like Characters for All Languages

https://gerrit.wikimedia.org/r/927785

gerritbot added a project: Patch-For-Review. Jun 9 2023, 7:24 PM

Full write up on MediaWiki.

Highlights:

The final set of 19 apostrophe-like characters to be normalized to apostrophes is [`´ʹʻʼʽʾʿˋ՚׳‘’‛′‵ꞌ＇｀].
Enabling the new apostrophe_norm makes new matches on lots of names and English, French, & Italian words.
Lots of matches in the local language, too, for some languages.
Uzbek searchers really like to mix it up with their apostrophe-like options. The apostrophe form o'sha will now match o`sha, oʻsha, o‘sha, o’sha, o`sha, oʻsha, o‘sha, and o’sha—all of which exist in my samples!
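To make the 19-character set concrete, here is a minimal sketch that mirrors the bracketed set above as a regular-expression character class. The code points are my reading of the characters in that list, so treat them as an assumption, and `normalize_apostrophes` is a hypothetical helper rather than anything in CirrusSearch.

```python
import re

# Assumed code points for the 19 characters in the set, in the same order:
# U+0060 U+00B4 U+02B9 U+02BB U+02BC U+02BD U+02BE U+02BF U+02CB U+055A
# U+05F3 U+2018 U+2019 U+201B U+2032 U+2035 U+A78C U+FF07 U+FF40
APOSTROPHE_LIKE = (
    "\u0060\u00b4\u02b9\u02bb\u02bc\u02bd\u02be\u02bf\u02cb\u055a"
    "\u05f3\u2018\u2019\u201b\u2032\u2035\ua78c\uff07\uff40"
)

# None of these characters is special inside a character class,
# so they can be placed between the brackets as they are.
APOSTROPHE_RE = re.compile(f"[{APOSTROPHE_LIKE}]")


def normalize_apostrophes(text):
    """Replace every apostrophe-like character with a straight apostrophe."""
    return APOSTROPHE_RE.sub("'", text)


if __name__ == "__main__":
    # Uzbek examples written with U+0060, U+02BB, U+2018, and U+2019.
    for word in ["o\u0060sha", "o\u02bbsha", "o\u2018sha", "o\u2019sha"]:
        print(word, "->", normalize_apostrophes(word))
```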

After the patch is merged, we still need to reindex to see these benefits. Since we need to reindex everything, it's best to wait a while and pick up more than one harmonization update when reindexing.

TJones moved this task from In Progress to Needs review on the Discovery-Search (Current work) board. Jun 12 2023, 3:18 PM

Change 927785 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Merge Apostrophe-Like Characters for All Languages

https://gerrit.wikimedia.org/r/927785

Maintenance_bot removed a project: Patch-For-Review. Jun 16 2023, 8:30 AM

TJones moved this task from Needs review to To Be Deployed on the Discovery-Search (Current work) board. Jun 26 2023, 3:13 PM

TJones moved this task from To Be Deployed to Needs Reporting on the Discovery-Search (Current work) board. Jul 10 2023, 3:08 PM

Gehel closed this task as Resolved. Jul 21 2023, 9:41 AM

TJones mentioned this in T342444: Reindex all wikis to enable apostrophe normalization, camelCase handling, acronym handling, word_break_helper, and icu_tokenizer/_repair. Jul 21 2023, 3:15 PM