Review indic_normalization for other Indic languages/scripts
Open, High, Public

Assigned To

None

Authored By

	TJones
	Sep 24 2024, 9:57 PM

Description

The filter indic_normalization is used in the Hindi and Bengali analysis chains, but the code specifically mentions other scripts, and we should evaluate adding it to languages that use those scripts.

I ended up looking at the code for indic_normalization, and it is fairly complex. It explicitly calls out several scripts: Bengali, Devanagari, Gujarati, Gurmukhi, Kannada, Malayalam, Odia, Tamil, and Telugu. It would make sense to test whether it is generally useful for languages that use those scripts. (It is already part of the Bengali and Hindi analyzers. In Bengali it didn't seem to cause any problems, but when reviewing the Bengali analyzer we didn't necessarily look at a significant amount of text in Gujarati, Kannada, or Tamil scripts.)
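As a starting point for that testing, here is a minimal sketch of how paired test analyzers (with and without indic_normalization) could be generated for each candidate language. The language codes, the analyzer names, and the stripped-down filter chain (standard tokenizer, lowercase, decimal_digit) are assumptions for illustration only; the real analysis chains have more steps. Only the filter name indic_normalization comes from the task itself.

```python
import json

# Assumption: example language codes for each script named in the filter's
# code. This mapping is illustrative and not taken from the task.
SCRIPT_LANGUAGES = {
    "Devanagari": ["mr", "ne"],
    "Gujarati": ["gu"],
    "Gurmukhi": ["pa"],
    "Kannada": ["kn"],
    "Malayalam": ["ml"],
    "Odia": ["or"],
    "Tamil": ["ta"],
    "Telugu": ["te"],
}


def build_test_settings(lang_code, with_indic=True):
    """Return index settings holding one custom analyzer for lang_code.

    The analyzer names ("<code>_indic_test", "<code>_baseline") are
    hypothetical; they only exist to make the two variants easy to compare.
    """
    filters = ["lowercase", "decimal_digit"]
    if with_indic:
        # The filter under review, appended after the basic filters.
        filters.append("indic_normalization")
    suffix = "indic_test" if with_indic else "baseline"
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    f"{lang_code}_{suffix}": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": filters,
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # Print both variants for every candidate language.
    for script, codes in SCRIPT_LANGUAGES.items():
        for code in codes:
            for with_indic in (False, True):
                print(json.dumps(build_test_settings(code, with_indic)))
```

Running the same sample text through both variants for a language and comparing the resulting tokens would show which words the filter changes or merges.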

Event Timeline

TJones created this task. Sep 24 2024, 9:57 PM

Restricted Application added a subscriber: Aklapper. Sep 24 2024, 9:57 PM

TJones triaged this task as High priority. Sep 24 2024, 9:57 PM

TJones moved this task from needs triage to Language Stuff on the Discovery-Search board.
