
Include phonetic search option to advanced / power user search
Open, Low, Public

Description

(This probably belongs a bit deeper in the task hierarchy under more general advanced / power user search, but I can't find a ticket for that.)

While phonetic or "sounds like" search is typically very coarse-grained, and can often be (expensively) emulated with a regex, it could still be a useful feature for power users. The prototypical use is casting a wide net for matching names.
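To illustrate how wide that net is, here is a minimal sketch of American Soundex, one of the classic English-oriented phonetic codes. It is shown only as an example of the technique; nothing in this task commits to Soundex specifically, and the function name is mine.

```python
# Minimal American Soundex sketch (illustrative only, ASCII letters only).
_CODES = {}
for _digit, _letters in {
    "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code for name, or "" if it has no a-z letters."""
    letters = [c for c in name.lower() if "a" <= c <= "z"]  # drops anything non-ASCII
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")  # the first letter's code also suppresses repeats
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w are skipped and do not separate equal codes
        code = _CODES.get(ch, "")  # vowels and y have no code
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated consonant code is kept
    return (first.upper() + "".join(digits) + "000")[:4]


for name in ("Charlie", "Charly", "Carl", "Carol"):
    print(name, soundex(name))
```

All four names above collapse to the same code, which is the "wide net" behaviour: useful for spelling variants of one name, but it also pulls in names a searcher would consider different.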

Elasticsearch already supports major phonetic searching algorithms, and their enthusiasm-crushing description on that page is actually quite fair. Phonetic search algorithms can be very language-specific (esp. since many are developed for English and English spelling is a travesty and a tragedy).
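As a rough sketch of what using that support could look like, the snippet below builds an index body that wires the `phonetic` token filter from Elasticsearch's analysis-phonetic plugin into a title subfield. The filter name `name_phonetic`, the analyzer name `phonetic_title`, the `title.phonetic` subfield, and the choice of encoder are all assumptions for illustration, not the actual search configuration; the typeless `mappings` shape assumes Elasticsearch 7 or later.

```python
import json


def phonetic_index_body(encoder: str = "double_metaphone") -> dict:
    """Build a hypothetical index body with a phonetic subfield on `title`."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Hypothetical filter name; "phonetic" is the plugin's filter type.
                    "name_phonetic": {
                        "type": "phonetic",
                        "encoder": encoder,
                        # True: index only the phonetic code, not the original token.
                        "replace": True,
                    }
                },
                "analyzer": {
                    # Hypothetical analyzer name.
                    "phonetic_title": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "name_phonetic"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "title": {
                    "type": "text",
                    "fields": {
                        # Hypothetical subfield holding the phonetic codes.
                        "phonetic": {"type": "text", "analyzer": "phonetic_title"}
                    },
                }
            }
        },
    }


print(json.dumps(phonetic_index_body(), indent=2))
```

Supporting more than one algorithm would presumably mean one such filter and subfield per encoder, which is where the index-size cost comes from.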

We could also support more than one type of phonetic searching, especially for transliteration. @santhosh pointed me to an interesting library, "Indic Soundex" (description, GitHub) for cross-script/cross-language searching in Indic languages.

We can talk about use cases (searching names, searching transliterations, lack of keyboard availability, etc.), implementation details (new indexes, slow regexes, a custom plugin), which algorithms to support, and how to make them available to searchers.

Event Timeline

debt subscribed.

This sounds interesting, but it might also create more problems than it solves. I think more investigation is needed (use cases, etc.) before deciding to do this. It might help with things like 'charlie' vs 'charly' ...maybe! :)

Charlie and Charly would be required to match at a minimum!

This might also make sense for hiragana/katakana mappings. See T176197.

One implementation that I discussed with David would be to only do phonetic indexing of titles. So phonetic:kluni would really mean phonetic+intitle:kluni.
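A minimal sketch of that keyword rewrite, assuming a hypothetical `title.phonetic` field for the phonetic title index and a hypothetical `text` field for everything else. The function name and the query shape are illustrative and are not how the existing search keywords are parsed.

```python
import re

# Matches a `phonetic:term` keyword anywhere in the query string.
PHONETIC_KEYWORD = re.compile(r"\bphonetic:(\S+)")


def build_query(user_query: str) -> dict:
    """Route phonetic: terms to the title-only phonetic field (hypothetical names)."""
    phonetic_terms = PHONETIC_KEYWORD.findall(user_query)
    remaining = PHONETIC_KEYWORD.sub("", user_query).split()

    # Each phonetic: term is matched against titles only, i.e. phonetic+intitle.
    must = [{"match": {"title.phonetic": term}} for term in phonetic_terms]
    if remaining:
        # The rest of the query goes to the ordinary full-text field.
        must.append({"match": {"text": " ".join(remaining)}})
    return {"query": {"bool": {"must": must}}}


print(build_query("phonetic:kluni actor"))
```

Restricting the keyword to titles keeps the extra index small, since only title tokens need phonetic codes.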

A clever implementation could map different character sets differently. So hiragana matches katakana, and Latin names match each other, and Indic scripts match each other. A very clever implementation could try to map them all to some single representation, but that might be an implementation that is too clever for its own good. Not sure.
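The hiragana/katakana case is the simplest of these per-script mappings, because the two blocks are laid out in parallel in Unicode. A minimal sketch, assuming we fold hiragana to katakana on both the indexed title and the query (the function name is hypothetical, and folding in the other direction would work equally well):

```python
# Hiragana letters U+3041..U+3096 sit exactly 0x60 below their katakana
# counterparts U+30A1..U+30F6.
HIRAGANA_START, HIRAGANA_END = 0x3041, 0x3096
KANA_OFFSET = 0x60


def hiragana_to_katakana(text: str) -> str:
    """Fold hiragana letters to katakana; leave every other character unchanged."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET)
        if HIRAGANA_START <= ord(ch) <= HIRAGANA_END
        else ch
        for ch in text
    )


print(hiragana_to_katakana("ひらがな"))
```

Latin and Indic scripts would each need their own, much less mechanical, mapping; this sketch deliberately leaves them untouched rather than guessing at one.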

TJones lowered the priority of this task from Medium to Low. Aug 27 2020, 7:52 PM
MPhamWMF subscribed.

Closing out low/est priority tasks over 6 months old with no activity within last 6 months in order to clean out the backlog of tickets we will not be addressing in the near term. Please feel free to reopen if you think a ticket is important, but bear in mind that given current priorities and resourcing, it is unlikely for the Search team to pick up these tasks for the indefinite future. We hope that the requested changes have either been addressed by or made irrelevant by work the team has done or is doing -- e.g. upgrading Elasticsearch to a newer version will solve various ES-related problems -- or will be subsumed by future work in a more generalized way.

RhinosF1 removed a project: Discovery-Search.
RhinosF1 subscribed.

Re-opening tasks and removing from team workboard per IRC feedback given yesterday and discussion with MPham.