Currently, if we want to block a certain page title, e.g. .*Example.*, the blacklist does not prevent the title from being created even if we flag the entry with <antispoof>. We have to write large regexes covering both spoofed and non-spoofed characters, such as .*[eèéëê], which is a pain. The TitleBlacklist antispoof feature should take a word such as Example and prevent it from being created with or without spoofed characters. There are ongoing cases of abuse and harassment that are not easy to manage because of this. Thanks.
Description
Details
Status | Assigned | Task
---|---|---
Resolved | TBolliger | T166816 Epic ⚡️ : Accuracy improvements to anti-spoof tools across multiple pertinent tools
Resolved | dmaza | T153021 Add some more characters to AntiSpoof mappings for usage in TitleBlacklists
Event Timeline
Adding @kaldari as he did some work on T29987
Dealing with a regex like title blacklist edit
@MarcoAurelio, @Billinghurst: I'm wondering if this is mostly fixed now that a lot of improvements have been made to AntiSpoof. In theory a lot of those regexes should no longer be needed.
I added a patch for Ø -> O since that seems to be a common one used at the meta blacklist. https://gerrit.wikimedia.org/r/#/c/327030/
Can we make this task public please? The Title blacklist on meta is public, everyone can see that you're already working around this. That said, are you sure those variations of "e" are not normalized by antispoof?
I just checked. All of those variations of "e" are covered by antispoof except for "ê".
Please also blacklist:
- ᛗ for M
- ₷ for S
- ꝛ for R
- ₸ for T
- ƭ for T
- ₷ for S
- Ⲙ for M
- ǐ for I
- ł for L
- ў for Y
- ı for I
Apologies if there are duplicates in the list or some of them are already in the Antispoof.
For reference, you can determine whether characters are already in AntiSpoof by using the testing interface (e.g., https://en.wikipedia.org/wiki/Special:AbuseFilter/tools). Type in ccnorm("eèéëê") and then you'll see EEEEê, indicating only the last one isn't currently handled.
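The same kind of check can be sketched offline. This is only an illustration: TOY_EQUIV is a hypothetical hand-written table containing the four "e" variants reported as handled above, not the real AntiSpoof equivalence sets, and toy_ccnorm is a stand-in, not the AbuseFilter ccnorm function.

```python
# Toy stand-in for the AntiSpoof equivalence table (NOT the real data):
# only the "e" variants that the thread reports as already handled.
TOY_EQUIV = {
    "e": "E",
    "\u00e8": "E",  # è
    "\u00e9": "E",  # é
    "\u00eb": "E",  # ë
}


def toy_ccnorm(text):
    """Map each character through TOY_EQUIV; unmapped characters pass through."""
    return "".join(TOY_EQUIV.get(ch, ch) for ch in text)


def unmapped(text):
    """Return the characters that the toy table does not cover."""
    return [ch for ch in text if ch not in TOY_EQUIV]


sample = "e\u00e8\u00e9\u00eb\u00ea"  # eèéëê
print(toy_ccnorm(sample))  # mapped characters become E, the rest are unchanged
print(unmapped(sample))    # characters that would still need a mapping
```

Any character that comes back unchanged is a candidate for a new mapping.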
@kaldari <shrug> The background was that I saw the regex added and queried Trinjstel about antispoof, and said that if the antispoof wasn't suitably working then it would need a phabricator ticket to fix it. With regard to usability, one can never tell how effective they are as there is no visible log of title blacklist (there is another ticket about that around here) and the only way to know that it is ineffective is when a usage occurs.
Plus as a non-programmer I went looking for the antispoof regex in the code and after unsuccessfully flailing around in numbers of places I just gave up.
@Legoktm I concur about opening this ticket; though as it is a steward's ticket I would prefer to see one of the stewards make it public.
ccnorm("y̐ṭ́ꭆM̪m̪ʂᛗ₷ꝛ₸ƭ₷Ⲙǐłўı") -> Y̐T́ꭆM̪M̪Sᛗ₷ꝛ₸T₷ⲘILўI which looks to me as six ticks, and 11 crosses.
from the noted regex
ccnorm("[Ⲙ𐌼][γўȳẙΎῨῪўӯӱӳ].*[ŚŜŞŠ][ĹĻĽĿℓ][ŢŤ][ɵ€][ŔŖŘ]") -> [Ⲙ𐌼][γўȳẙΎῨῪўӯӱӳ].*[ŚŜŞŠ][ĹĻĽĿℓ][ŢŤ][O€][ŔŖŘ]
and apologies for any duplicates to prior box, this monitor doesn't quite show the minutiæ.
So if I want to avoid the creation of "Example" with an abusefilter in all
forms possible shall I instead of using ccnorm('Example') use
ccnorm('weirdfancychars')? :/
yes, it now follows the KISS principle
oops simple is the usage ... ccnorm("Example")
From some large title blacklist entries:
- ccnorm('AǼAÀⒶÁÂÃÄÅĀĂĄǍǞǠǺȀȂȦḀẠẢẤẦẨẪẬẮẰẲẴẶÆ4@') --> AǼAÀⒶÁÂÃÄÅĀĂĄǍǞǠǺȀȂȦḀẠẢẤẦẨẪẬẮẰẲẴẶÆA@
- ccnorm('RŔŖŘȐȒṘṚṜṞ®Ρ₧ÞþΡρРрƤṔṖǷ') --> RŔŖŘȐȒṘṚṜṞRPPÞPPPPPPṔṖP
In fact I think none of us can alter the policies of tasks. I initially opened this as private because I didn't know whether I would need to disclose some details of what brings me here. Fortunately this has not been necessary, so we can make this task public. If the need arises, I can escalate the task.
So if I want to avoid the creation of "Example" with an abusefilter in all forms possible shall I instead of using ccnorm('Example') use ccnorm('weirdfancychars')? :/
You would use ccnorm('Example').
The equivalence sets are fetched from https://www.mediawiki.org/wiki/Extension:AntiSpoof/Equivalence_sets where anyone can edit the list. Then a mere make rebuilds everything.
Direct edits to equivset.in desynchronize it, though.
Another one slips through the net: https://en.wikipedia.org/w/index.php?title=L%D0%BEvifm.com_(Musical_base)&action=edit not blocked by .*lovifm.* <antispoof>
@MER-C this is abusing the confusion between the Latin o and "о", U+043E CYRILLIC SMALL LETTER O, but that equivalence has been there for years...
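To see which character in such a title is the look-alike, one option is to list the Unicode names of its letters. The sketch below uses only Python's standard unicodedata module; the helper names are made up for illustration, and checking for names that start with "LATIN" is a rough heuristic, not how AntiSpoof decides anything.

```python
import unicodedata


def describe(title):
    """Return (character, code point, Unicode name) for every character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in title
    ]


def non_latin_letters(title):
    """Return the letters whose Unicode name does not start with LATIN."""
    return [
        ch for ch in title
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN")
    ]


title = "L\u043evifm"  # the second letter is Cyrillic, not Latin
for row in describe(title):
    print(row)
print(non_latin_letters(title))
```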
@TBolliger @kaldari I'm confused as to what needs to be done here.
I've compiled this list of characters based on the discussion so far. I'm yet to check against what we currently have.
A => Ǽ A À Ⓐ Á Â Ã Ä Å Ā Ă Ą Ǎ Ǟ Ǡ Ǻ Ȁ Ȃ Ȧ Ḁ Ạ Ả Ấ Ầ Ẩ Ẫ Ậ Ắ Ằ Ẳ Ẵ Ặ Æ 4 @
E => ê
I => ǐ ı
L => ĹĻĽĿ
M => M̪ m̪ ᛗ Ⲙ
P => ₧ Þ þ Ρ ρ Р р Ƥ Ṕ Ṗ Ƿ
R => Ŕ Ŗ Ř Ȑ Ȓ Ṙ Ṛ Ṝ Ṟ ® ꭆ Ř
S => ʂ ₷ Ś Ŝ Ş Š
T => ṭ́ ₸ ƭ Ţ Ť
Y => y̐ ў γ ȳ ẙ Ύ Ῠ Ὺ ў ӯ ӱ ӳ
These I don't agree with:
ꝛ <> R
ł, ℓ <> L
Do we want to add all of these? Does someone need to vet this change?
@dmaza: Yes, we can add all of those (that aren't already in the equivset). No one needs to vet it other than whoever reviews the patch.
Change 377365 had a related patch set uploaded (by Dmaza; owner: Dmaza):
[mediawiki/extensions/AntiSpoof@master] Add more characters to AntiSpoof mappings
ccnorm("ǼAÀⒶÁÂÃÄÅĀĂĄǍǞǠǺȀȂȦḀẠẢẤẪẮẰẲẴẶÆ4@êǐıĹĻĽĿ₧ÞþΡρРрƤṔṖǷŔŖŘȐȒṘṚṜṞ®ꭆŘʂ₷ŚŜŞŠ₸ƭŢŤўγȳẙΎῨῪўӯӱӳłℓᛗⲘM̪m̪y̐ṭ́ ") becomes AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEIILLLLPPPPPPPPPPPRRRRRRRRRRRRSSSSSSTTTTYYYYYYYYYYYLLMMM̪M̪Y̐T́
The four characters left (T́, M̪, m̪, Y̐) are a combination of basic Latin (T, M, m, Y) with diacritical marks. I'm not sure how those work; if someone knows, please explain.
I'm pretty sure AntiSpoof is not going to be able to handle combined characters in its current implementation. Right now we can only map a single character to a single character. T́ is technically 2 Unicode characters/codepoints combined into 1 visual character. (We also can't handle Æ -> AE.)
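The two-codepoint point can be checked with Python's standard unicodedata module. This is only an illustration of the Unicode side; it says nothing about AntiSpoof's internals. It also shows why NFC normalization alone does not help here: ê has a precomposed form, but T with a combining acute does not.

```python
import unicodedata

t_acute = "T\u0301"  # LATIN CAPITAL LETTER T + COMBINING ACUTE ACCENT

# One visual character, two code points.
print(len(t_acute))
for ch in t_acute:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))

# NFC composes a base letter and a mark only when a precomposed form exists.
e_circumflex = unicodedata.normalize("NFC", "e\u0302")
print(len(e_circumflex))                            # composes to a single code point
print(len(unicodedata.normalize("NFC", t_acute)))   # stays at two code points
```

A non-zero value from unicodedata.combining marks the second code point as a combining mark, which a strict one-character-to-one-character table cannot fold into the base letter.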
True, but I think AntiSpoof should actually *remove* certain characters, including diacritical characters, the Arabic/Persian vowel characters (example is Arabic kasra, U+0650), and all invisible characters (examples include ZWNJ, ZWJ, etc).
This requires its own task though.
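A rough sketch of what such a removal step could look like, written in Python purely as an illustration of the proposal above; it is not how AntiSpoof works today. It assumes that dropping every nonspacing mark (category Mn) and every format character (category Cf) is acceptable, which is broader than the examples given and would need review per script. The function name is hypothetical.

```python
import unicodedata

# Unicode general categories to drop (an assumption for this sketch):
#   Mn = nonspacing marks (combining diacritics, Arabic kasra U+0650, ...)
#   Cf = format characters (ZWNJ U+200C, ZWJ U+200D, ...)
DROPPED_CATEGORIES = ("Mn", "Cf")


def strip_marks_and_invisibles(text):
    """Decompose to NFD, then drop combining marks and format characters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if unicodedata.category(ch) not in DROPPED_CATEGORIES
    )


print(strip_marks_and_invisibles("T\u0301"))               # base letter only
print(strip_marks_and_invisibles("Ex\u200cam\u200dple"))   # invisibles removed
```

The output of a step like this could then be fed to the existing single-character mapping.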
Change 377365 merged by jenkins-bot:
[mediawiki/extensions/AntiSpoof@master] Add more characters to AntiSpoof mappings
@Legoktm this task still has Security: software security bug; you may wish to remove that one too. Regards.