
Review indic_normalization for other Indic languages/scripts
Open, High, Public

Description

The filter indic_normalization is used in the Hindi and Bengali analysis chains, but the code specifically mentions other scripts, and we should evaluate adding it to languages that use those scripts.

I looked at the code for indic_normalization, and it is fairly complex. It explicitly calls out several scripts: Bengali, Devanagari, Gujarati, Gurmukhi, Kannada, Malayalam, Odia, Tamil, and Telugu. It would make sense to test whether it is generally useful for languages using those scripts. (It is already used as part of the Bengali and Hindi analyzers. In Bengali it didn't seem to cause any problems, but we didn't necessarily look at a significant amount of text in Gujarati, Kannada, or Tamil scripts when reviewing the Bengali analyzer.)
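One way to run such a test is to compare token output from an analysis chain with and without the filter. The sketch below builds two Elasticsearch-style index settings bodies for that comparison. The analyzer name `indic_test`, the helper name, and the choice of the `standard` tokenizer plus `lowercase` are assumptions for illustration only; they are not the actual Hindi or Bengali analysis chains.

```python
import json


def build_test_settings(use_indic_normalization: bool) -> dict:
    """Return an index settings body with a minimal custom analyzer.

    The analyzer name "indic_test" and the surrounding chain are
    hypothetical; only the "indic_normalization" filter name comes
    from the task description.
    """
    filters = ["lowercase"]
    if use_indic_normalization:
        # The filter under review, appended after lowercasing.
        filters.append("indic_normalization")
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "indic_test": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": filters,
                    }
                }
            }
        }
    }


# Two configurations to index the same sample text against and compare.
baseline = build_test_settings(use_indic_normalization=False)
with_filter = build_test_settings(use_indic_normalization=True)

if __name__ == "__main__":
    print(json.dumps(with_filter, indent=2, ensure_ascii=False))
```

Running sample text in each script through both configurations and diffing the resulting tokens would show where the filter changes anything, and whether those changes look like improvements.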

Event Timeline

TJones triaged this task as High priority. Sep 24 2024, 9:57 PM
TJones moved this task from needs triage to Language Stuff on the Discovery-Search board.