A current scenario is that if we want to block a certain page, i.e.: .*Example.*, the blacklist does not prevent the title to be created even if we set it to <antispoof> and we have to create large regexes with spoofed and non-spoofed characters to avoid it be created, such as .*[eèéëê] which is a pain. TitleBlacklist antispoof features should take a word such as Example and prevent it to be created with and without spoofed characters. There are ongoing cases of abuse and harassment that ain't easy to manage due to this. Thanks.
|Resolved||TBolliger||T166816 Epic ⚡️ : Accuracy improvements to anti-spoof tools across multiple pertinent tools|
|Resolved||dmaza||T153021 Add some more characters to AntiSpoof mappings for usage in TitleBlacklists|
Can we make this task public please? The Title blacklist on meta is public, everyone can see that you're already working around this. That said, are you sure those variations of "e" are not normalized by antispoof?
Please also blacklist:
- ᛗ for M
- ₷ for S
- ꝛ for R
- ₸ for T
- ƭ for T
- ₷ for S
- Ⲙ for M
- ǐ for I
- ł for L
- ў for Y
- ı for I
Apologies if there are duplicates in the list or some of them are already in the Antispoof.
For reference, you can determine whether characters are already in AntiSpoof by using the testing interface (e.g., https://en.wikipedia.org/wiki/Special:AbuseFilter/tools). Type in ccnorm("eèéëê") and then you'll see EEEEê, indicating only the last one isn't currently handled.
@kaldari <shrug> The background was that I saw the regex added and queried Trinjstel about antispoof, and said that if the antispoof wasn't suitably working then it would need a phabricator ticket to fix it. With regard to usability, one can never tell how effective they are as there is no visible log of title blacklist (there is another ticket about that around here) and the only way to know that it is ineffective is when a usage occurs.
Plus as a non-programmer I went looking for the antispoof regex in the code and after unsuccessfully flailing around in numbers of places I just gave up.
@Legoktm I concur about opening this ticket; though as it is a steward's ticket I would prefer to see one of the stewards make it public.
from the noted regex
ccnorm("[Ⲙ𐌼][γўȳẙΎῨῪўӯӱӳ].*[ŚŜŞŠ][ĹĻĽĿℓ][ŢŤ][ɵ€][ŔŖŘ]") -> [Ⲙ𐌼][γўȳẙΎῨῪўӯӱӳ].*[ŚŜŞŠ][ĹĻĽĿℓ][ŢŤ][O€][ŔŖŘ]
and apologies for any duplicates to prior box, this monitor doesn't quite show the minutiæ.
From some large title blacklist entries:
- ccnorm('AǼAÀⒶÁÂÃÄÅĀĂĄǍǞǠǺȀȂȦḀẠẢẤẦẨẪẬẮẰẲẴẶÆ4@') --> AǼAÀⒶÁÂÃÄÅĀĂĄǍǞǠǺȀȂȦḀẠẢẤẦẨẪẬẮẰẲẴẶÆA@
- ccnorm('RŔŖŘȐȒṘṚṜṞ®Ρ₧ÞþΡρРрƤṔṖǷ') --> RŔŖŘȐȒṘṚṜṞRPPÞPPPPPPṔṖP
In fact I think none of us can alter policies of tasks. I initially opened this as private because I didn't knew if I'd be in need to disclose some details of what brings me here. Fortunately this has not been necessary so we can make this task public. If the need appears, I can escalate the task.
So if I want to avoid the creation of "Example" with an abusefilter in all forms possible shall I instead of using ccnorm('Example') use ccnorm('weirdfancychars')? :/
You would use ccnorm('Example').
The equivalence sets are fetches from https://www.mediawiki.org/wiki/Extension:AntiSpoof/Equivalence_sets where anyone can edit the list. Then a mere make rebuilds everything.
Direct edits to equivset.in desynchronize it, though.
A => Ǽ A À Ⓐ Á Â Ã Ä Å Ā Ă Ą Ǎ Ǟ Ǡ Ǻ Ȁ Ȃ Ȧ Ḁ Ạ Ả Ấ Ầ Ẩ Ẫ Ậ Ắ Ằ Ẳ Ẵ Ặ Æ 4 @ E => ê I => ǐ ı L => ĹĻĽĿ M => M̪ m̪ ᛗ Ⲙ P => ₧ Þ þ Ρ ρ Р р Ƥ Ṕ Ṗ Ƿ R => Ŕ Ŗ Ř Ȑ Ȓ Ṙ Ṛ Ṝ Ṟ ® ꭆ Ř S => ʂ ₷ Ś Ŝ Ş Š T => ṭ́ ₸ ƭ Ţ Ť Y => y̐ ў γ ȳ ẙ Ύ Ῠ Ὺ ў ӯ ӱ ӳ
These I don't agree with:
ꝛ <> R
ł, ℓ <> L
Do we want to add all of these? Does someone needs to vet this change?
ccnorm("ǼAÀⒶÁÂÃÄÅĀĂĄǍǞǠǺȀȂȦḀẠẢẤẪẮẰẲẴẶÆ4@êǐıĹĻĽĿ₧ÞþΡρРрƤṔṖǷŔŖŘȐȒṘṚṜṞ®ꭆŘʂ₷ŚŜŞŠ₸ƭŢŤўγȳẙΎῨῪўӯӱӳłℓᛗⲘM̪m̪y̐ṭ́ ") becomes AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEIILLLLPPPPPPPPPPPRRRRRRRRRRRRSSSSSSTTTTYYYYYYYYYYYLLMMM̪M̪Y̐T́
The four characters left (T́, M̪,m̪, Y̐) are a combination of basic latin (T, M, m, Y) with diacritical marks. I'm not sure how those work, if someone know, please explain.
I'm pretty sure AntiSpoof is not going to be able to handle combined characters in it's current implementation. Right now we can only map a single character to a single character. T́ is technically 2 unicode characters/codepoints combined into 1 visual character. (We also can't handle Æ -> AE.)