Currently, AntiSpoof contains character mappings that do not make equal sense for all languages. For instance, there is a mapping from ڠ to E. This makes sense in Latin-based languages like English or German, but in Persian or Arabic, a mapping from ڠ to غ makes much more sense.
As another example, a mapping from ڪ to ک (as proposed in T173697) makes sense in Persian, but the exact opposite makes more sense in Kurdish. Just because Kurdish Wikipedia is smaller, has fewer edits, and has a smaller community, does not mean we should only allow for one of these mappings to exist.
Therefore, the following changes should be made:
- AntiSpoof should have different mappings for different content languages
- AntiSpoof::normalizeString() should have a second, optional parameter for a language code. If it is not provided, the wiki's content language should be used. The function should then pick the appropriate mapping based on this parameter.
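The per-language lookup proposed above could work roughly as sketched below. This is an illustrative Python model, not AntiSpoof's actual PHP code; the table contents are the examples from this task, and the function name mirrors normalizeString() only loosely:

```python
# Hypothetical sketch of per-language equivalence maps, mirroring the
# proposed AntiSpoof::normalizeString( $str, $langCode ) behaviour.
# The mapping entries below are only the examples from this task.

EQUIV_MAPS = {
    # Latin-centric map (today's single mapping, kept under 'en')
    "en": {"\u06A0": "E"},       # ڠ → E
    # Perso-Arabic map: fold to the base Arabic letter instead
    "fa": {"\u06A0": "\u063A",   # ڠ → غ
           "\u06AA": "\u06A9"},  # ڪ → ک (per T173697)
    # Kurdish prefers the opposite direction for ڪ/ک
    "ku": {"\u06A9": "\u06AA"},  # ک → ڪ
}

def normalize_string(s: str, lang: str = "en") -> str:
    """Fold each character through the map for `lang`, falling back to 'en'."""
    table = EQUIV_MAPS.get(lang, EQUIV_MAPS["en"])
    return "".join(table.get(ch, ch) for ch in s)
```

Falling back to 'en' for unknown language codes keeps the current behaviour for every wiki whose language has no dedicated map yet, which also matches the phased rollout suggested below.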
Once this is implemented, the AbuseFilter function ccnorm() should ideally also gain a second, optional language-code parameter, so that a wiki can check several normalizations of the same piece of text if it wants to (e.g. English Wikipedia or Meta-Wiki might want to check ccnorm(...) as well as ccnorm(..., 'zh'), etc.).
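With such a parameter, a filter on a multilingual wiki could compare several foldings of the same input. The rule below is only illustrative syntax, since ccnorm() currently takes a single argument and the second parameter does not exist yet:

```
ccnorm( user_name ) contains "wikipedia" |
ccnorm( user_name, 'fa' ) contains "ویکی‌پدیا"
```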
As for the implementation, in the first phase we can simply move everything into an 'en' group, and later move or adjust the mappings for other languages. I can contribute the Persian and Arabic mappings myself.
Lastly, we need to decide whether the username blacklisting feature, which works off AntiSpoof's mapping, should also use the content language. Since usernames are often in Latin characters (even on wikis with a non-Latin content language), it might make sense to always filter usernames using the 'en' mapping (and perhaps *additionally* using the mapping for the content language).
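The "always 'en', plus the content language" option could look roughly like this. Again a hypothetical Python sketch with illustrative tables, not AntiSpoof's real data or API:

```python
# Hypothetical sketch: normalize a username with the Latin ('en') mapping in
# all cases, and additionally with the wiki's content-language mapping.
# The tables are illustrative, taken from the examples in this task.

EQUIV_MAPS = {
    "en": {"\u06A0": "E"},       # ڠ → E (Latin-centric folding)
    "fa": {"\u06A0": "\u063A"},  # ڠ → غ (Perso-Arabic folding)
}

def normalize_string(s: str, lang: str = "en") -> str:
    table = EQUIV_MAPS.get(lang, EQUIV_MAPS["en"])
    return "".join(table.get(ch, ch) for ch in s)

def username_forms(name: str, content_lang: str) -> set:
    """All normalized forms a blacklist entry would be checked against."""
    # The 'en' folding is always applied; the content-language folding is
    # added on top whenever it produces a different result.
    return {normalize_string(name, "en"),
            normalize_string(name, content_lang)}
```

A Latin-only username then yields a single form (both foldings agree), while a name containing, say, ڠ is checked under both its Latin and its Perso-Arabic folding on a Persian wiki.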