Investigate effect of phonetic search on Wikipedia title words
Closed, Resolved (Public)

Description

The plan is to get a big sample of Wikipedia and Wiktionary titles and see what the various phonetic matching options available in Elasticsearch do to them, and see if any are clearly better or worse than others.

Event Timeline

TJones triaged this task as Medium priority. Dec 12 2017, 7:29 PM
TJones created this task.

Apologies for the long delay in getting this analysis done. My full write-up is on MediaWiki.

Summary:

  • The different phonetic algorithms vary widely in how aggressively they group more-or-less similar words together, in part because they have different goals and were created at different times (the oldest going back ~100 years!).
  • They generally have difficulty with words with numbers, short words, and words with diacritics. Most of these problems can be ameliorated with additional filters in the analysis chain.
  • Double Metaphone seems like the best of the bunch for the next phase of testing.
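
To make the grouping behaviour concrete, here is a minimal sketch of classic American Soundex in plain Python. It is only a stand-in: it is not Double Metaphone, it is not the code Elasticsearch runs, and the helper names `soundex` and `group_by_code` are made up for this illustration. Non-ASCII letters are simply dropped here, which is an assumption of the sketch and just one way the diacritics problem can show up.

```python
from collections import defaultdict

# Soundex digit for each consonant class; vowels, H, W and Y have no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character American Soundex code, or "" if no A-Z letters."""
    # Assumption of this sketch: anything outside A-Z (digits, accented
    # letters, punctuation) is silently dropped before encoding.
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    encoded = [letters[0]]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate consonants with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept after it
    return ("".join(encoded) + "000")[:4]


def group_by_code(words):
    """Group words that share a Soundex code (hypothetical helper)."""
    groups = defaultdict(list)
    for word in words:
        groups[soundex(word)].append(word)
    return dict(groups)


if __name__ == "__main__":
    print(group_by_code(["Robert", "Rupert", "Rubin"]))
    # {'R163': ['Robert', 'Rupert'], 'R150': ['Rubin']}
    print(soundex("B52"), soundex("B"))       # B000 B000  (digits ignored)
    print(soundex("to"), soundex("tea"))      # T000 T000  (short words collapse)
    print(soundex("Émile"), soundex("Emile")) # M400 E540  (accented letter lost)
```

Even this simple encoder shows the three trouble spots from the summary: tokens with numbers, very short words, and words with diacritics all end up in groups that have little to do with how they sound.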

Next Steps (see T184771):

  • Implement and test an analysis chain with the obvious precision mitigation steps: character folding, stop word filtering, filtering words with numbers, and word length filtering.
  • Set up a RelForge test bed for English Wikipedia and Wiktionary with a keyword or other implementation method for phonetic title/redirect search, and invite feedback via the Village Pump, mailing lists, etc.
    • Depending on feedback, possibly adjust encoding, code length, etc., and test again.
  • If all goes well, make a deployment plan, for example using a keyword or making it a component of standard search (which then has to interact with Learn-to-Rank).
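
The precision mitigation steps in the first bullet above can be sketched in plain Python as a pre-filter that runs before any phonetic encoder sees a token. This is an illustration only, not the Elasticsearch analysis chain itself: the stop word list, the minimum length of 3, and the function names are all assumptions chosen for the example.

```python
import re
import unicodedata

# Illustrative values only; real stop word lists and thresholds would differ.
STOP_WORDS = frozenset({"a", "an", "and", "in", "of", "the", "to"})
MIN_LENGTH = 3  # hypothetical cutoff for "too short to encode usefully"


def fold(text):
    """Character folding: strip combining marks and lowercase the text."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def tokens_for_phonetic_encoding(title):
    """Return the tokens of a title that are worth passing to an encoder."""
    kept = []
    # Fold first so accented letters survive tokenization as plain letters.
    for token in re.findall(r"\w+", fold(title)):
        if token in STOP_WORDS:
            continue  # stop word filtering
        if any(c.isdigit() for c in token):
            continue  # drop words containing numbers
        if len(token) < MIN_LENGTH:
            continue  # word length filtering
        kept.append(token)
    return kept


if __name__ == "__main__":
    print(tokens_for_phonetic_encoding("The Émile Zola B52 of Ur"))
    # ['emile', 'zola']
```

Filtering before encoding keeps the noisiest tokens out of the phonetic index entirely, so they cannot produce spurious matches. The trade-off is that titles made up only of such tokens get no phonetic matching at all.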