
Add some more characters to AntiSpoof mappings for usage in TitleBlacklists
Closed, Resolved | Public | 2 Estimated Story Points

Description

Currently, if we want to block a certain page title, e.g. .*Example.*, the blacklist does not prevent spoofed variants of the title from being created, even if we set the entry to <antispoof>. We have to write large regexes covering both spoofed and non-spoofed characters, such as .*[eèéëê], which is a pain. The TitleBlacklist antispoof feature should take a word such as Example and prevent it from being created both with and without spoofed characters. There are ongoing cases of abuse and harassment that are not easy to manage because of this. Thanks.

Event Timeline

Adding @kaldari as he did some work on T29987

Dealing with a regex like title blacklist edit

@MarcoAurelio, @Billinghurst: I'm wondering if this is mostly fixed now that a lot of improvements have been made to AntiSpoof. In theory a lot of those regexes should no longer be needed.

I added a patch for Ø -> O since that seems to be a common one used at the meta blacklist. https://gerrit.wikimedia.org/r/#/c/327030/

Can we make this task public please? The Title blacklist on meta is public, everyone can see that you're already working around this. That said, are you sure those variations of "e" are not normalized by antispoof?

I just checked. All of those variations of "e" are covered by antispoof except for "ê".

Please also blacklist:

  • ᛗ for M
  • ₷ for S
  • ꝛ for R
  • ₸ for T
  • ƭ for T
  • ₷ for S
  • Ⲙ for M
  • ǐ for I
  • ł for L
  • ў for Y
  • ı for I

Apologies if there are duplicates in the list or some of them are already in the Antispoof.

Also:

  • ṭ́
  • ʂ

For reference, you can determine whether characters are already in AntiSpoof by using the testing interface (e.g., https://en.wikipedia.org/wiki/Special:AbuseFilter/tools). Type in ccnorm("eèéëê") and then you'll see EEEEê, indicating only the last one isn't currently handled.
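
As a rough illustration of reading that output, the sketch below compares an input string with the normalized string reported by the tool and lists the characters that came back unchanged. The helper name unhandled_chars is hypothetical (it is not part of AbuseFilter or AntiSpoof), and it assumes both strings have the same number of codepoints, so it is not suitable for combining sequences.

```python
def unhandled_chars(original: str, normalized: str) -> list[str]:
    """Return non-ASCII characters that came back unchanged after normalization."""
    if len(original) != len(normalized):
        raise ValueError("strings differ in length; compare them by hand")
    return [
        before
        for before, after in zip(original, normalized)
        # A non-ASCII character that is identical before and after was not mapped.
        if before == after and not before.isascii()
    ]


# Input and output as reported in the comment above; escapes are used to avoid
# ambiguity between precomposed and decomposed forms.
print(unhandled_chars("e\u00e8\u00e9\u00eb\u00ea", "EEEE\u00ea"))
```

Plain ASCII characters are ignored on purpose, since a letter such as "A" legitimately normalizes to itself.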

Legoktm renamed this task from TitleBlacklist: better prevention of characters spoofing to Add some more characters to AntiSpoof mappings for usage in TitleBlacklists. Dec 14 2016, 6:52 AM
Legoktm edited projects, added AntiSpoof; removed TitleBlacklist.

@MarcoAurelio, @Billinghurst: I'm wondering if this is mostly fixed now that a lot of improvements have been made to AntiSpoof. In theory a lot of those regexes should no longer be needed.

@kaldari <shrug> The background was that I saw the regex added and queried Trinjstel about antispoof, and said that if the antispoof wasn't suitably working then it would need a phabricator ticket to fix it. With regard to usability, one can never tell how effective they are as there is no visible log of title blacklist (there is another ticket about that around here) and the only way to know that it is ineffective is when a usage occurs.

Plus as a non-programmer I went looking for the antispoof regex in the code and after unsuccessfully flailing around in numbers of places I just gave up.

@Legoktm I concur about opening this ticket; though as it is a steward's ticket I would prefer to see one of the stewards make it public.

ccnorm("y̐ṭ́ꭆM̪m̪ʂᛗ₷ꝛ₸ƭ₷Ⲙǐłўı") -> Y̐T́ꭆM̪M̪Sᛗ₷ꝛ₸T₷ⲘILўI which looks to me as six ticks, and 11 crosses.

from the noted regex

ccnorm("[Ⲙ𐌼][γўȳẙΎῨῪўӯӱӳ].*[ŚŜŞŠ][ĹĻĽĿℓ][ŢŤ][ɵ€][ŔŖŘ]") -> [Ⲙ𐌼][γўȳẙΎῨῪўӯӱӳ].*[ŚŜŞŠ][ĹĻĽĿℓ][ŢŤ][O€][ŔŖŘ]

and apologies for any duplicates to prior box, this monitor doesn't quite show the minutiæ.

So if I want to avoid the creation of "Example" with an abusefilter in all
forms possible shall I instead of using ccnorm('Example') use
ccnorm('weirdfancychars')? :/

yes, it now follows the KISS principle

oops simple is the usage ... ccnorm("Example")

From some large title blacklist entries:

  • ccnorm('AǼAÀⒶÁÂÃÄÅĀĂĄǍǞǠǺȀȂȦḀẠẢẤẦẨẪẬẮẰẲẴẶÆ4@') --> AǼAÀⒶÁÂÃÄÅĀĂĄǍǞǠǺȀȂȦḀẠẢẤẦẨẪẬẮẰẲẴẶÆA@
  • ccnorm('RŔŖŘȐȒṘṚṜṞ®Ρ₧ÞþΡρРрƤṔṖǷ') --> RŔŖŘȐȒṘṚṜṞRPPÞPPPPPPṔṖP

@Legoktm I concur about opening this ticket; though as it is a steward's ticket I would prefer to see one of the stewards make it public.

I don't think there are many stewards who are able to do that...

In fact I think none of us can alter the policies of tasks. I initially opened this as private because I didn't know whether I'd need to disclose some details of what brings me here. Fortunately that has not been necessary, so we can make this task public. If the need arises, I can escalate the task.

So if I want to avoid the creation of "Example" with an abusefilter in all forms possible shall I instead of using ccnorm('Example') use ccnorm('weirdfancychars')? :/

You would use ccnorm('Example').
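
In other words, the pattern stays plain and the normalization does the work. A minimal Python sketch of that idea follows. The mapping table is a tiny hypothetical stand-in for AntiSpoof's equivalence sets, and normalize and title_is_blocked are illustrative names, not part of AbuseFilter or AntiSpoof.

```python
# Hypothetical, deliberately tiny equivalence table (not the real equivset).
EQUIV = {
    "\u00e8": "E",  # è
    "\u00e9": "E",  # é
    "\u00eb": "E",  # ë
    "\u0435": "E",  # Cyrillic small letter ie
    "\u0430": "A",  # Cyrillic small letter a
}


def normalize(text: str) -> str:
    """Map each character to its canonical letter; unmapped ones are uppercased."""
    return "".join(EQUIV.get(ch, ch.upper()) for ch in text)


def title_is_blocked(title: str, word: str) -> bool:
    """Check whether the normalized title contains the normalized word."""
    return normalize(word) in normalize(title)


# Spoofed with é, Cyrillic а and ë: caught by the plain word "Example".
print(title_is_blocked("My \u00e9x\u0430mpl\u00eb page", "Example"))
# Spoofed with ê, which this toy table does not cover: slips through.
print(title_is_blocked("My \u00eaxample page", "Example"))
```

The second call shows why a missing mapping matters: the rule author writes only "Example", so any character absent from the table becomes a bypass.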

Legoktm changed the visibility from "Custom Policy" to "Public (No Login Required)".

The equivalence sets are fetched from https://www.mediawiki.org/wiki/Extension:AntiSpoof/Equivalence_sets where anyone can edit the list. Then a simple make rebuilds everything.
Direct edits to equivset.in desynchronize it, though.

@MER-C this is abusing confusion between the latin o with "о" the U+043E CYRILLIC SMALL LETTER O, but that equivalence has been there for years...

kaldari set the point value for this task to 2. Aug 22 2017, 5:20 PM

@TBolliger @kaldari I'm confused as to what needs to be done here.
I've compiled this list of characters based on the discussion so far. I'm yet to check against what we currently have.

A => Ǽ A À Ⓐ Á Â Ã Ä Å Ā Ă Ą Ǎ Ǟ Ǡ Ǻ Ȁ Ȃ Ȧ Ḁ Ạ Ả Ấ Ầ Ẩ Ẫ Ậ Ắ Ằ Ẳ Ẵ Ặ Æ 4 @
E => ê
I => ǐ ı
L => Ĺ Ļ Ľ Ŀ
M => M̪ m̪ ᛗ Ⲙ
P => ₧ Þ þ Ρ ρ Р р Ƥ Ṕ Ṗ Ƿ
R => Ŕ Ŗ Ř Ȑ Ȓ Ṙ Ṛ Ṝ Ṟ ® ꭆ Ř
S => ʂ ₷ Ś Ŝ Ş Š
T => ṭ́ ₸ ƭ Ţ Ť
Y => y̐ ў γ ȳ ẙ Ύ Ῠ Ὺ ў ӯ ӱ ӳ

These I don't agree with:
ꝛ <> R
ł, ℓ <> L

Do we want to add all of these? Does someone need to vet this change?

@dmaza: Yes, we can add all of those (that aren't already in the equivset). No one needs to vet it other than whoever reviews the patch.

I would agree with omitting ꝛ but the other two do look very much like ls to me.

Change 377365 had a related patch set uploaded (by Dmaza; owner: Dmaza):
[mediawiki/extensions/AntiSpoof@master] Add more characters to AntiSpoof mappings

https://gerrit.wikimedia.org/r/377365

ccnorm("ǼAÀⒶÁÂÃÄÅĀĂĄǍǞǠǺȀȂȦḀẠẢẤẪẮẰẲẴẶÆ4@êǐıĹĻĽĿ₧ÞþΡρРрƤṔṖǷŔŖŘȐȒṘṚṜṞ®ꭆŘʂ₷ŚŜŞŠ₸ƭŢŤўγȳẙΎῨῪўӯӱӳłℓᛗⲘM̪m̪y̐ṭ́ ") becomes AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEIILLLLPPPPPPPPPPPRRRRRRRRRRRRSSSSSSTTTTYYYYYYYYYYYLLMMM̪M̪Y̐T́

The four characters left (ṭ́, M̪, m̪, y̐) are combinations of basic Latin letters (T, M, m, Y) with diacritical marks. I'm not sure how those work; if someone knows, please explain.

I'm pretty sure AntiSpoof is not going to be able to handle combined characters in its current implementation. Right now we can only map a single character to a single character. M̪, for example, is technically 2 Unicode characters/codepoints combined into 1 visual character. (We also can't handle Æ -> AE.)
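
To make the codepoint point concrete, here is a small Python check using the standard unicodedata module. It assumes the M-with-bridge character from this thread is a plain M followed by U+032A; the codepoints helper is just an illustrative name.

```python
import unicodedata


def codepoints(text: str) -> list[str]:
    """List each codepoint in text with its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


# Assumed composition of the M-with-bridge character discussed above.
m_bridge = "M\u032a"

print(len(m_bridge))         # number of codepoints, not visual characters
print(codepoints(m_bridge))  # base letter followed by a combining mark
# No precomposed form exists, so NFC normalization cannot merge the pair.
print(len(unicodedata.normalize("NFC", m_bridge)))
```

Because the sequence stays two codepoints even after NFC normalization, a table that maps one codepoint to one codepoint has no single key to match it.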

True, but I think AntiSpoof should actually *remove* certain characters, including diacritical characters, the Arabic/Persian vowel characters (example is Arabic kasra, U+0650), and all invisible characters (examples include ZWNJ, ZWJ, etc).

This requires its own task though.
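
For whoever picks up that follow-up task, the removal idea could look roughly like the Python sketch below. This is not how AntiSpoof currently behaves, and the function name is hypothetical. It decomposes the text, then drops nonspacing marks (Unicode category Mn, which includes combining diacritics and Arabic kasra U+0650) and format characters (category Cf, which includes ZWNJ and ZWJ).

```python
import unicodedata

# Categories to drop: Mn = nonspacing marks, Cf = invisible format characters.
DROPPED_CATEGORIES = {"Mn", "Cf"}


def strip_marks_and_invisibles(text: str) -> str:
    """Decompose text, then remove combining marks and format characters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed if unicodedata.category(ch) not in DROPPED_CATEGORIES
    )


print(strip_marks_and_invisibles("M\u032a"))               # base letter only
print(strip_marks_and_invisibles("ex\u200cam\u200dple"))   # ZWNJ and ZWJ removed
print(strip_marks_and_invisibles("\u00ea"))                # ê decomposes, mark dropped
print(strip_marks_and_invisibles("\u0142"))                # ł has no decomposition
```

Letters such as ł that have no canonical decomposition pass through untouched, so they would still need an explicit equivalence mapping.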

Change 377365 merged by jenkins-bot:
[mediawiki/extensions/AntiSpoof@master] Add more characters to AntiSpoof mappings

https://gerrit.wikimedia.org/r/377365

@Legoktm this task still has Security: software security bug; you may wish to remove that one too. Regards.