Apply ICU folding to more languages
Open, Medium, Public

Assigned To

None

Authored By

	TJones
	Sep 24 2024, 9:19 PM

Description

As a follow-up to T332342: Standardize ASCII-folding/ICU-folding across analyzers, apply ICU folding appropriately to more languages.

Likely next candidates include those that remain in the top 90 languages in my list (by unique query volume), grouped here by script:

(Latin script) Afrikaans/af, Icelandic/is, Latin/la, Welsh/cy, Asturian/ast, Scots/sco, Luxembourgish/lb, Alemannic/als, Breton/br
(Cyrillic) Mongolian/mn, Macedonian/mk, Kyrgyz/ky, Belarusian/be, Belarusian-Taraškievica/be-tarask, Tajik/tg (Cyrillic and Latin)
(Arabic script) Urdu/ur, Kurdish/ku (Arabic and Latin)
(CJK) Cantonese/zh-yue

To finish off languages with Wikipedias with 100,000 or more articles, we'd need to cover these, too:

(Latin script) Cebuano/ceb, Waray/war, Min Nan/zh-min-nan, Ladin/lld, Minangkabau/min
(Cyrillic) Chechen/ce
(Arabic script) South Azerbaijani/azb
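
For readers unfamiliar with what applying ICU folding "appropriately" to a language can involve, here is a minimal sketch in Python that builds Elasticsearch-style index settings with an icu_folding filter that leaves some letters unfolded. It assumes the stock analysis-icu plugin and its unicode_set_filter option; it is not necessarily how T332342 or CirrusSearch implements per-language exceptions. The function name, filter name, analyzer name, and the Icelandic letter list are hypothetical placeholders, not a vetted proposal.

```python
import json


def icu_folding_settings(preserved_letters: str) -> dict:
    """Build analysis settings whose icu_folding filter skips the given letters.

    `preserved_letters` is a plain string of characters that should NOT be
    folded (hypothetical per-language exception list).
    """
    folding = {"type": "icu_folding"}
    if preserved_letters:
        # unicode_set_filter takes an ICU UnicodeSet; "[^...]" means
        # "apply folding to everything except these characters".
        folding["unicode_set_filter"] = f"[^{preserved_letters}]"
    return {
        "analysis": {
            "filter": {"folding_with_exceptions": folding},
            "analyzer": {
                "text": {
                    "type": "custom",
                    "tokenizer": "icu_tokenizer",
                    # Excluded letters bypass icu_folding entirely, so a
                    # lowercase filter is still needed to case-fold them.
                    "filter": ["folding_with_exceptions", "lowercase"],
                }
            },
        }
    }


# Illustrative only: these Icelandic letters are an assumption, not a
# reviewed exception list for is.wikipedia.
settings = icu_folding_settings("ðÐþÞæÆöÖ")
print(json.dumps(settings, ensure_ascii=False, indent=2))
```

Choosing which letters to exempt is the language-specific part of the work: folding everything maximizes recall, while exempting letters that speakers treat as distinct avoids conflating unrelated words.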

Mentioned Here: T332342: Standardize ASCII-folding/ICU-folding across analyzers

TJones triaged this task as Medium priority. Sep 24 2024, 9:20 PM

Nemoralis subscribed.