
At Vietnamese wikis, Special:Search should not redirect based on case-folding
Closed, Resolved (Public)

Description

At the Vietnamese Wiktionary, searching for “trường hộp” redirects to “trường hợp”, which is incorrect and potentially confusing to readers (because they might not notice the circumflex being replaced by a horn). At Vietnamese wikis, the search engine should perform case folding only for search suggestions, results, and Did You Mean; it should never redirect the user to a page that only matches due to case-folding. (There is one case where this behavior is useful: things like “xóa” and “xoá” are interchangeable. But we already have redirect pages for all these cases.)

The impact on Vietnamese wikis is high because most words have completely unrelated lookalikes when ignoring diacritics.

Event Timeline

mxn renamed this task from Special:Search should not conflate diacritics at Vietnamese wikis to At Vietnamese wikis, Special:Search should not redirect based on case-folding.
mxn raised the priority of this task from to Needs Triage.
mxn updated the task description.
mxn added a project: Wikimedia-Site-requests.
mxn changed Security from none to None.
mxn subscribed.

Just some examples to illustrate the severity of this issue:

  • Searching for “bác bỏ” (abandonment) takes you to “bắc bộ” (northern region), which redirects to “Bắc Bộ Việt Nam” (Northern Vietnam). If you’re a reader unfamiliar with MediaWiki, this may look like a political statement to you.
  • Searching for “khóa học” (academic course) takes you to “khoa học” (science). If you’re a reader unfamiliar with ElasticSearch’s ~ operator, it seems impossible to use the search bar to find information on academic offerings at universities.
  • Searching for “truyền thống” (tradition) takes you to “truyền thông” (communication). If you’re the same reader as above, it seems impossible to find information on traditions, and it’s kind of insulting that the site takes you to something random instead.

Of course, searching for “bác bỏ” wouldn’t take you to “Bắc Bộ Việt Nam” if the Vietnamese Wikipedia had an article on “bác bỏ”, but there are so many potential cases for confusion that the 40-some active editors cannot possibly write away the problem.

I’m considering working around this issue at the Vietnamese Wikipedia with a gadget that prepends ~ to any search from the search box that contains Vietnamese diacritics. But it’s a sledgehammer, and I’d much prefer to get proper language support into ElasticSearch or to turn diacritic folding off entirely.
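To make the workaround concrete, here is a minimal sketch of the logic such a gadget might apply. A real gadget would be written in JavaScript; Python is used here only to illustrate the idea. The function name `force_fulltext` and the set of marks are assumptions for illustration, not existing gadget code.

```python
import unicodedata

# Combining marks used in Vietnamese orthography (assumed list for this sketch):
# grave, acute, tilde, hook above, dot below, circumflex, breve, horn.
VIETNAMESE_MARKS = {
    "\u0300", "\u0301", "\u0303", "\u0309", "\u0323",
    "\u0302", "\u0306", "\u031b",
}


def force_fulltext(query):
    """Prepend "~" to a query that contains Vietnamese diacritics (hypothetical helper)."""
    if query.startswith("~"):
        return query  # already forced to fulltext search
    decomposed = unicodedata.normalize("NFD", query)
    has_mark = any(ch in VIETNAMESE_MARKS for ch in decomposed)
    has_d_stroke = "đ" in query.lower()  # đ has no canonical decomposition
    return "~" + query if (has_mark or has_d_stroke) else query
```

This also shows why the approach is a sledgehammer: any query with, say, an acute accent gets the prefix, whether or not a misleading redirect would actually have occurred.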

TJones raised the priority of this task from Low to Needs Triage. May 2 2023, 3:39 PM
TJones added a project: Discovery-Search.
Gehel triaged this task as High priority. May 8 2023, 3:31 PM
Gehel moved this task from needs triage to Language Stuff on the Discovery-Search board.
TJones claimed this task.
TJones subscribed.

I think this is resolved, as a side effect of specifying what is and is not ICU foldable in Vietnamese as part of T332342: Standardize ASCII-folding/ICU-folding across analyzers.

The current state is not exactly what was requested in the description—which is never redirecting to a page that only matches due to diacritic-folding—but rather what is generally considered best practice across wikis/languages: not to fold and match diacritics that are relevant/native to the language of the wiki.

So, trường hộp no longer redirects to trường hợp (because ộ and ợ are meaningfully distinct in Vietnamese).

However, queries with non-Vietnamese diacritics can have those diacritics ignored for redirect matching (when things are not ambiguous). përe and pēre redirect to pere; uder and ūder redirect to üder (because there is no competition, i.e., no entry for uder or ūder). Non-Vietnamese diacritics on Vietnamese words can also be ignored: trườñg hợp redirects to trường hợp because the tilde on that n came out of nowhere!

Ambiguous folded matches don't redirect and still go to fulltext results. For example, bäs doesn't have an exact match, but matches bas, baš, baş, bås, bäş, ba̮š, ɓas, and baʂ (all with non-Vietnamese diacritics)—so you just get the fulltext results.
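The behavior described above can be sketched roughly as follows. This is a toy model, not the actual CirrusSearch or ICU folding configuration: the names `fold` and `resolve`, the per-letter mark tables, and the sample title set are all assumptions made for illustration. Characters that do not decompose under Unicode normalization (such as ɓ or ʂ) are not handled here.

```python
import unicodedata

# Tone marks allowed on any Vietnamese vowel:
# grave, acute, tilde, hook above, dot below.
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}
# Marks that form distinct Vietnamese letters, keyed by base letter.
SHAPE_MARKS = {
    "a": {"\u0302", "\u0306"},  # â, ă
    "e": {"\u0302"},            # ê
    "o": {"\u0302", "\u031b"},  # ô, ơ
    "u": {"\u031b"},            # ư
}
VOWELS = set("aeiouy")


def fold(text):
    """Strip combining marks that are not native to Vietnamese (toy model)."""
    out = []
    base = ""
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) == 0:
            base = ch.lower()
            out.append(ch)
        elif base in VOWELS and (ch in TONE_MARKS or ch in SHAPE_MARKS.get(base, ())):
            out.append(ch)  # Vietnamese-relevant mark: keep it
        # any other combining mark is dropped
    return unicodedata.normalize("NFC", "".join(out))


def resolve(query, titles):
    """Return a title to go to, or None to fall back to fulltext results."""
    if query in titles:
        return query  # exact match always wins
    key = fold(query)
    matches = [t for t in titles if fold(t) == key]
    return matches[0] if len(matches) == 1 else None  # ambiguous or no match


# Hypothetical page titles for the examples in this comment.
titles = {"trường hợp", "pere", "üder", "bas", "baš", "baş", "bås"}
print(resolve("trường hộp", titles))  # None: ộ and ợ are kept distinct
print(resolve("ūder", titles))        # üder: the only folded match
print(resolve("bäs", titles))         # None: several folded matches
```

The key design point is that folding is decided per base letter: a tilde on a vowel is a Vietnamese tone mark and is preserved, while a tilde on n is foreign and is dropped.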

Agreed, this is an ideal outcome with respect to wikis that have Vietnamese as the content language. Thank you! Case folding is probably unavoidably more aggressive at a non-Vietnamese Wiktionary, but Wiktionaries tend to include hatnotes or other navigation aids to homoglyphic titles anyways.