
AntiSpoof should use language-specific mappings
Open, MediumPublic

Description

Currently, AntiSpoof contains mappings between characters that may not make sense for all languages alike. For instance, there exists a mapping from ڠ to E. This makes sense in Latin-based languages like English or German, but in Persian or Arabic, a mapping from ڠ to غ makes much more sense.

As another example, a mapping from ڪ to ک (as proposed in T173697) makes sense in Persian, but the exact opposite makes more sense in Kurdish. Just because Kurdish Wikipedia is smaller, has fewer edits, and has a smaller community, does not mean we should only allow for one of these mappings to exist.

Therefore, the following changes should be made:

  1. AntiSpoof should have different mappings for different content languages
  2. AntiSpoof::normalizeString() should have a second parameter for a language code. If not provided, it should use the wiki's content language. The function should use the appropriate mapping based on this parameter.
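A minimal sketch of how the proposed signature could behave (written in Python for illustration only; the actual implementation would be PHP, and the mapping tables, function name, and default-language constant here are all hypothetical):

```python
# Hypothetical per-language equivalence maps: keys are characters,
# values are their normalized forms under that language's mapping.
MAPPINGS = {
    "en": {"\u06a0": "E"},       # ڠ → E (current Latin-oriented behavior)
    "ar": {"\u06a0": "\u063a"},  # ڠ → غ (visually closer in Arabic script)
}

DEFAULT_LANG = "en"  # stand-in for the wiki's content language


def normalize_string(text, lang=None):
    """Normalize text using the mapping for `lang`.

    If no language code is given, fall back to the wiki's
    content language, mirroring the proposed second parameter.
    """
    mapping = MAPPINGS.get(lang or DEFAULT_LANG, MAPPINGS["en"])
    return "".join(mapping.get(ch, ch) for ch in text)
```

With this shape, `normalize_string("USڠR")` yields "USER" under the default 'en' mapping, while `normalize_string("USڠR", "ar")` keeps the text in Arabic script.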

Once implemented, ideally the AbuseFilter function ccnorm() should also have a second optional parameter for a language code, so that if a particular wiki wants to check different normalizations of a piece of text, it can (e.g. English Wikipedia or Meta Wiki might want to check ccnorm(...) as well as ccnorm(..., 'zh'), etc.).

As for the implementation, in the first phase we can just move everything into an 'en' group, and later move/edit the mappings for other languages. I can contribute the mappings for Persian and Arabic myself.

Lastly, we need to decide whether the username blacklisting feature, which works off of AntiSpoof's mapping, should also use the content language. Considering that usernames are often in Latin characters (even on wikis with a non-Latin content language), it might make sense to always do username filtering based on the 'en' mapping (and perhaps *additionally* use the mapping for the content language as well).

Event Timeline

Huji triaged this task as Medium priority.
Huji added a subscriber: Legoktm.

@Legoktm, as you are the sole member of the AntiSpoof project (at least until I joined just now), I added you here to provide insight before I start coding.

For Latin-based languages, I collected some groups of similar characters:
[ÁáAaÀàÂâÄäÃãǍǎĀāǢǣĂ㥹œÆæÅåΑαΔΛλАаДдЛлПп]
[BbßÞþΒβбВвЗзЉљЊњЪъЫыЬь]
[ĆćCcĈĉÇçČčĊċςСс]
[DdĐđĎďḌḍÐðŒδ]
[ÉéEeÈèÊêËëẼẽĚěĒēĔĕĖėĘęƏəΕεΞξΣБЕеЁёЄєЗзЭэ]
[ĜĝGgĢģĞğĠġ]
[HhĤĥḤḥĦħΗηЂђНнЊњЋћ]
[ÍíIiÌìÎîÏïĨĩǏǐĪīĬĭİıĮįΙιȊІіЇї]
[JjĴĵЈј]
[ĶķKkΚκКкЌќ]
[ĹĺLlĻļĽľḶḷḸḹŁłĿŀ]
[ṂṃMmΜМм]
[ŃńNnÑñηΝЙйПпŅņNnŇňṆṇΠπИиЛл]
[OoÓóÒòÔôÖöÕõǑǒŌō0ŎŏǪǫŐőðØøδΘθΟοσΦφΩОоФфЮю]
[pPΡρРр]
[RrŔŕŖŗȒŘřṚṛṜṝЯя]
[SsŚśŜŝŞşŠšṢṣЅѕ]
[ŢTtţŤťṬṭΤτТт]
[UuÚúÛûŨũǓǔǖǘǚǜŰűυUuÙùÜüŮůŪūŬŭŲųμЦцЧчЏџ]
[νVv]
[WwŴŵΨψωШшЩщ]
[XxΧχЖжХх]
[YyÝýŶŷŸÿỸỹȲȳγΥУуЎўЏџ]
[ZzŹźŽžŻżΖζ]
[ΓГ㥴Ѓѓ]
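Groups like the ones above can be turned into a lookup table mechanically. A sketch in Python (illustrative only: it takes the first character of each group as the canonical form, which is an assumption; the real equivset format and canonicalization rules may differ):

```python
# Each group lists visually similar characters; normalize every
# member of a group to that group's first (canonical) character.
GROUPS = [
    "AaÀàÂâÄä",  # abbreviated example groups, not the full sets
    "EeÈèÊêËë",
    "OoÓóÒò0",
]


def build_equivset(groups):
    """Build a flat char -> canonical-char table from the groups."""
    table = {}
    for group in groups:
        canonical = group[0]
        for ch in group:
            table[ch] = canonical
    return table


TABLE = build_equivset(GROUPS)


def fold(text):
    """Replace each character with its canonical equivalent."""
    return "".join(TABLE.get(ch, ch) for ch in text)
```

Under this table, `fold("Ä0")` gives "AO", so two strings differing only by confusable characters compare equal after folding.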

Huji renamed this task from AntiSpoof should be content-language-specific to AntiSpoof should use language-specific mappings.Aug 21 2017, 3:27 PM

I think this would be a good time to split normalizeString and equivset generation out of AntiSpoof and into a separate library. It would be cleaner to use normalizeString outside the scope of AntiSpoof, and it would leave AntiSpoof with only one responsibility: username spoof detection.

Maybe, but that is a totally separate task, and one that I don't want to do myself. That and this can co-occur.

@Huji you are right. I'll be happy to take on the new task after this is done.

@Huji: The more I try to figure out how to actually implement this, the more I think it won't be practical. The main problem is that in real life wikis don't stick to one particular language + character set. Take a look at the Recent Changes feed on Japanese Wikipedia, for example: https://ja.wikipedia.org/wiki/%E7%89%B9%E5%88%A5:%E6%9C%80%E8%BF%91%E3%81%AE%E6%9B%B4%E6%96%B0. The majority of the user names are in Roman/English characters. On Hebrew Wikipedia and Arabic Wikipedia, it seems to be about half. If we only did normalizations in the wiki language, it wouldn't help for half of the users. In most cases, what needs to be done is that both parts of the comparison need to be normalized. That should allow normalization to work in pretty much any circumstance. What is the actual use case that you are trying to fix here?

@kaldari Let me try to explain it better using some examples.

Use case 1: A user is trying to spoof an existing account called USER by creating an account called USڠR. The current mapping from ڠ to E can help us prevent this. (In my proposed solution, this mapping would be part of the 'en' mapping.)

Use case 2: On English Wikipedia, a user is trying to bypass a filter that does not allow writing the word "USER" (let's assume it is a curse word) in articles. He tries to evade the filter by writing USڠR. We can capture it easily using ccnorm(new_wikitext) rlike '.*USER.*' without having to include all different letters a user might use in place of E. Again, the current mapping works fine for English Wikipedia.

Use case 3: On Arabic Wikipedia, a user is trying to bypass a filter that does not allow writing the word "لغو" (let's assume it is a curse word) in articles. He tries to evade the filter by replacing غ with the similar-looking ڠ (i.e. "لڠو"). Arabic Wikipedia admins currently have to write complex rules like new_wikitext rlike '.*ل[غڠ]و.*' to cover all the letters that are visually similar. Ideally, they should be able to just use ccnorm() for this purpose, but they cannot, because ccnorm is biased towards English in this case.

Proposed Solution:

Let's create multiple mappings. The 'en' mapping will map ڠ to E, while the 'ar' mapping will map ڠ to غ.

  • For username checks (the main purpose of AntiSpoof), we will always use 'en' mapping in addition to the local language mapping.
  • For AbuseFilter's ccnorm function, by default we will use the mapping based on content language (so in Arabic Wikipedia, we will use the 'ar' mapping by default, while in English Wikipedia use the 'en' mapping).
    • If no mapping is found for a language (e.g. if we don't have a 'de' mapping), we will use a fallback method similar to what we currently use for languages (e.g. the fallback for 'de' is 'en').
    • We will also allow each wiki to use mappings other than its content language's if it wishes, e.g. we will allow English Wikipedia to use ccnorm(new_wikitext) ... for the default 'en' mapping, and ccnorm(new_wikitext, 'ar') ... for the 'ar' mapping whenever appropriate.
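The fallback step could work like MediaWiki's existing language fallback chains. A rough sketch in Python (the fallback table and the set of available mappings below are illustrative assumptions, not actual MediaWiki data):

```python
# Illustrative fallback chains, modeled on MediaWiki's language
# fallbacks; every chain ultimately ends at 'en'.
FALLBACKS = {
    "ckb": ["ckb", "fa", "en"],  # e.g. Sorani Kurdish → Persian → English
    "de": ["de", "en"],
    "ar": ["ar", "en"],
}

# Hypothetical set of languages for which a mapping actually exists.
AVAILABLE_MAPPINGS = {"en", "ar", "fa"}


def resolve_mapping(lang):
    """Return the first language in the fallback chain that has a mapping."""
    for candidate in FALLBACKS.get(lang, [lang, "en"]):
        if candidate in AVAILABLE_MAPPINGS:
            return candidate
    return "en"
```

So a wiki whose content language has no mapping of its own (here, 'de') would transparently fall back to 'en', while 'ckb' would fall back to 'fa' before 'en'.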

So now, going back to your example: you said "If we only did normalizations in the wiki language", and that is the part where you are wrong. For username checks, we would use both the 'en' and the local mappings. For AbuseFilter, we would use the local language, but allow the use of any other mapping as well.

@Huji: Typically the way to deal with case 3 is ccnorm( added_lines ) rlike ccnorm( "لغو" ). If ccnorm is used on both arguments, it doesn't matter which direction the mapping is in. And when checks are made against new usernames, it compares the normalized version of the new username with the normalized version of the existing usernames (so the Kurdish/Persian issue shouldn't be a problem as long the characters exist in the mapping). I agree we need to add more non-English characters to the mapping, but I don't see a use case that actually requires separate language specific mappings yet. I originally supported the idea of having separate mappings, but currently I think having a single comprehensive mapping is still the best solution. If there are other use cases I'm missing, let me know.

@kaldari fair, the ccnorm(..) rlike ccnorm(..) approach does work. But I am thinking of a day when T170504 is implemented and we can actually see the output of these ccnorm(...) calls for debugging purposes. In that setting, your solution for case 3 is not optimal anymore.

Huji removed Huji as the assignee of this task.Sep 12 2017, 4:14 AM
Huji added a project: User-Huji.

@kaldari in https://gerrit.wikimedia.org/r/c/mediawiki/libs/Equivset/+/534969 a related discussion took place. Here is a summary:

In Persian, Arabic-Indic numerals (like ۱, ۲, ۳, ...) are used, as opposed to the Western Arabic numerals 1, 2, 3 used in many other languages, including English. We have scripts that identify text in which a Persian Wikipedia user incorrectly used Western Arabic numerals and replace them with the corresponding Arabic-Indic numerals. This works in plain text, but when the digits are part of a template parameter, it can break things (imagine the number is a year, and the template is supposed to use it to calculate an age). In many cases, we update the templates to use {{formatnum:...|R}} to convert the numbers back to Western Arabic digits, which is something the MediaWiki parser can understand. But that is beside the point.

Let's say we want a filter that can identify whether a user changed a number (e.g. changed a year). If they change "2012" to "2013", we want the filter to be triggered. If they change "2012" to "۲۰۱۲" (the same number, in Arabic-Indic digits), we do not want it to be triggered. Lastly, if they change "2012" to "۲۰۱۳" (which is 2013 in Arabic-Indic digits), we do want it to be triggered. This would require a mapping from ۱ to 1, ۲ to 2, ۳ to 3, and so on through ۷ to 7 and beyond. However, as @Legoktm points out in a comment on that patch, mapping ۷ to the letter V may make more sense in other contexts.

You can argue that ccnorm is not the best choice here, and that we should use str_replace to normalize the digits in both strings (wikitext_old and wikitext_new); that is actually what I am going to do here. But I think this example shows how what a Persian user considers the "logical normalization" may differ from what someone who reads and writes in English or a similar language would expect.
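For the digit case specifically, the str_replace approach amounts to applying a fixed digit translation to both revisions before comparing them. A sketch in Python (illustrative only; `str.translate` plays the role of PHP's str_replace here, and the function names are hypothetical):

```python
# Map the Extended Arabic-Indic digits used in Persian (U+06F0–U+06F9)
# to Western Arabic digits before comparison.
DIGIT_MAP = str.maketrans("۰۱۲۳۴۵۶۷۸۹", "0123456789")


def normalize_digits(text):
    """Rewrite Persian digits as Western Arabic digits."""
    return text.translate(DIGIT_MAP)


def year_changed(old, new):
    """True if the numeric content differs, ignoring the digit script."""
    return normalize_digits(old) != normalize_digits(new)
```

This gives exactly the behavior described above: "2012" vs "۲۰۱۲" is not flagged, while "2012" vs "۲۰۱۳" is.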