Add some more characters to AntiSpoof mappings for usage in TitleBlacklists
Closed, Resolved · Public · 2 Estimated Story Points

Assigned To

Authored By

	MarcoAurelio
	Dec 12 2016, 11:31 PM

Description

A current scenario is that if we want to block a certain page, e.g. .*Example.*, the blacklist does not prevent the title from being created even if we set it to <antispoof>, so we have to write large regexes covering both spoofed and non-spoofed characters, such as .*[eèéëê], which is a pain. The TitleBlacklist antispoof feature should take a word such as Example and prevent it from being created with or without spoofed characters. There are ongoing cases of abuse and harassment that are not easy to manage because of this. Thanks.

Details

	Subject	Repo	Branch	Lines +/-
	Add more characters to AntiSpoof mappings	mediawiki/extensions/AntiSpoof	master	+158 -16
	Adding Ø -> O to equivset for AntiSpoof	mediawiki/extensions/AntiSpoof	master	+4 -2

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		• TBolliger	T166816 Epic ⚡️ : Accuracy improvements to anti-spoof tools across multiple pertinent tools
		Resolved		dmaza	T153021 Add some more characters to AntiSpoof mappings for usage in TitleBlacklists

Event Timeline

MarcoAurelio created this task. Dec 12 2016, 11:31 PM

Restricted Application added a subscriber: Aklapper. Dec 12 2016, 11:31 PM

MarcoAurelio triaged this task as High priority. Dec 12 2016, 11:32 PM

MarcoAurelio added projects: Security-Extensions, TitleBlacklist.

MarcoAurelio updated the task description.

MarcoAurelio added a subscriber: Trijnstel. Dec 12 2016, 11:35 PM

Trijnstel added a subscriber: Billinghurst. Dec 13 2016, 12:09 PM

Adding @kaldari as he did some work on T29987

Dealing with a regex like title blacklist edit

AntiSpoof too perhaps?

@MarcoAurelio, @Billinghurst: I'm wondering if this is mostly fixed now that a lot of improvements have been made to AntiSpoof. In theory a lot of those regexes should no longer be needed.

I added a patch for Ø -> O since that seems to be a common one used at the meta blacklist. https://gerrit.wikimedia.org/r/#/c/327030/

Can we make this task public please? The Title blacklist on meta is public, everyone can see that you're already working around this. That said, are you sure those variations of "e" are not normalized by antispoof?

I just checked. All of those variations of "e" are covered by antispoof except for "ê".

Please also blacklist:

ᛗ for M
₷ for S
ꝛ for R
₸ for T
ƭ for T
₷ for S
Ⲙ for M
ǐ for I
ł for L
ў for Y
ı for I

Apologies if there are duplicates in the list or some of them are already in the Antispoof.

Also:

y̐
ṭ́
ꭆ
M̪
m̪
ʂ

For reference, you can determine whether characters are already in AntiSpoof by using the testing interface (e.g., https://en.wikipedia.org/wiki/Special:AbuseFilter/tools). Type in ccnorm("eèéëê") and then you'll see EEEEê, indicating only the last one isn't currently handled.
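As a rough illustration of why ccnorm("eèéëê") comes back as EEEEê, here is a minimal Python sketch of a one-character-to-one-character equivalence lookup. The EQUIV table and the normalize and unhandled helpers are hypothetical and only model the behaviour described above; they are not AntiSpoof's actual equivset or code.

```python
# Hypothetical miniature equivalence table: each key is a single character
# that maps to a single canonical character. "\u00ea" (e with circumflex)
# is deliberately missing, mirroring the gap reported in this task.
EQUIV = {
    "e": "E",
    "\u00e8": "E",  # e with grave
    "\u00e9": "E",  # e with acute
    "\u00eb": "E",  # e with diaeresis
}


def normalize(text):
    """Replace every mapped character; unmapped characters pass through."""
    return "".join(EQUIV.get(ch, ch) for ch in text)


def unhandled(text):
    """List characters that are still non-ASCII after normalization."""
    return [ch for ch in normalize(text) if not ch.isascii()]


sample = "e\u00e8\u00e9\u00eb\u00ea"
print(normalize(sample))
print(unhandled(sample))
```

Anything that unhandled returns is a candidate for a new mapping, which is the same check being done by eye with the testing interface.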

Legoktm renamed this task from TitleBlacklist: better prevention of characters spoofing to Add some more characters to AntiSpoof mappings for usage in TitleBlacklists. Dec 14 2016, 6:52 AM

Legoktm edited projects, added AntiSpoof; removed TitleBlacklist.

In T153021#2870076, @kaldari wrote:

@MarcoAurelio, @Billinghurst: I'm wondering if this is mostly fixed now that a lot of improvements have been made to AntiSpoof. In theory a lot of those regexes should no longer be needed.

@kaldari <shrug> The background was that I saw the regex added and queried Trijnstel about antispoof, and said that if antispoof wasn't working suitably then it would need a Phabricator ticket to fix it. With regard to usability, one can never tell how effective the entries are, as there is no visible log of title blacklist hits (there is another ticket about that around here), and the only way to know that an entry is ineffective is when a usage occurs.

Plus as a non-programmer I went looking for the antispoof regex in the code and after unsuccessfully flailing around in numbers of places I just gave up.

@Legoktm I concur about opening this ticket; though as it is a steward's ticket I would prefer to see one of the stewards make it public.

ccnorm("y̐ṭ́ꭆM̪m̪ʂᛗ₷ꝛ₸ƭ₷Ⲙǐłўı") -> Y̐T́ꭆM̪M̪Sᛗ₷ꝛ₸T₷ⲘILўI which looks to me as six ticks, and 11 crosses.

from the noted regex

ccnorm("[Ⲙ𐌼][γўȳẙΎῨῪўӯӱӳ].*[ŚŜŞŠ][ĹĻĽĿℓ][ŢŤ][ɵ€][ŔŖŘ]") -> [Ⲙ𐌼][γўȳẙΎῨῪўӯӱӳ].*[ŚŜŞŠ][ĹĻĽĿℓ][ŢŤ][O€][ŔŖŘ]

and apologies for any duplicates to prior box, this monitor doesn't quite show the minutiæ.

So if I want to avoid the creation of "Example" with an abusefilter in all forms possible shall I instead of using ccnorm('Example') use ccnorm('weirdfancychars')? :/

yes, it now follows the KISS principle

oops simple is the usage ... ccnorm("Example")

From some large title blacklist entries:

ccnorm('AǼAÀⒶÁÂÃÄÅĀĂĄǍǞǠǺȀȂȦḀẠẢẤẦẨẪẬẮẰẲẴẶÆ4@') --> AǼAÀⒶÁÂÃÄÅĀĂĄǍǞǠǺȀȂȦḀẠẢẤẦẨẪẬẮẰẲẴẶÆA@

ccnorm('RŔŖŘȐȒṘṚṜṞ®Ρ₧ÞþΡρРрƤṔṖǷ') --> RŔŖŘȐȒṘṚṜṞRPPÞPPPPPPṔṖP

In T153021#2872114, @Billinghurst wrote:

@Legoktm I concur about opening this ticket; though as it is a steward's ticket I would prefer to see one of the stewards make it public.

I don't think there are many stewards who are able to do that...

In fact I think none of us can alter the policies of tasks. I initially opened this as private because I didn't know whether I would need to disclose some details of what brought me here. Fortunately that has not been necessary, so we can make this task public. If the need arises, I can escalate the task.

So if I want to avoid the creation of "Example" with an abusefilter in all forms possible shall I instead of using ccnorm('Example') use ccnorm('weirdfancychars')? :/

You would use ccnorm('Example').
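To make the point concrete: the plain word goes into the filter, and the normalization is what absorbs the fancy characters. The Python sketch below models that idea under stated assumptions. EQUIV, normalize and is_blocked are hypothetical names, and the table holds only a few illustrative entries; the real ccnorm and TitleBlacklist behaviour may differ in detail.

```python
import re

# Hypothetical equivalence entries, for illustration only.
EQUIV = {
    "\u00e8": "E",  # e with grave
    "\u00e9": "E",  # e with acute
    "\u00eb": "E",  # e with diaeresis
    "\u00ea": "E",  # e with circumflex
    "\u0430": "A",  # Cyrillic small a
}


def normalize(text):
    """Map known look-alikes to a canonical letter, upper-case the rest."""
    return "".join(EQUIV.get(ch, ch.upper()) for ch in text)


def is_blocked(title, word):
    """True if the normalized title contains the normalized plain word."""
    pattern = re.escape(normalize(word))
    return re.search(pattern, normalize(title)) is not None


print(is_blocked("my \u00e9xample page", "Example"))
print(is_blocked("Ex\u0430mple", "Example"))
print(is_blocked("Sample", "Example"))
```

Because both sides pass through the same normalization, the rule author writes only "Example" and never has to list the variants by hand.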

Legoktm removed projects: Security-Extensions, acl*security. Dec 14 2016, 7:20 PM

Legoktm changed the visibility from "Custom Policy" to "Public (No Login Required)".

Restricted Application added a project: acl*security. Dec 14 2016, 7:20 PM

The equivalence sets are fetched from https://www.mediawiki.org/wiki/Extension:AntiSpoof/Equivalence_sets where anyone can edit the list. Then a simple make rebuilds everything.
Direct edits to equivset.in desynchronize it from that page, though.

Billinghurst merged a task: T149785: Adding "ı" as an i alternative in antispoof. Dec 25 2016, 10:26 PM

Billinghurst added subscribers: DatGuy, He7d3r, Stryn.

MarcoAurelio added a project: Security-Extensions. Jan 22 2017, 5:01 PM

• TBolliger subscribed. Mar 9 2017, 1:11 AM

• TBolliger added a project: Anti-Harassment. Mar 9 2017, 7:09 PM

• TBolliger moved this task from Untriaged to Snackbox on the Anti-Harassment board. Jun 1 2017, 4:55 PM

Another one slips through the net: https://en.wikipedia.org/w/index.php?title=L%D0%BEvifm.com_(Musical_base)&action=edit not blocked by .*lovifm.* <antispoof>

• TBolliger moved this task from Snackbox to Triage/To be Estimated on the Anti-Harassment board. Aug 16 2017, 9:35 PM

Restricted Application added a project: User-MarcoAurelio. Aug 16 2017, 9:35 PM

MarcoAurelio moved this task from unsorted/backlog to radar on the User-MarcoAurelio board. Aug 17 2017, 3:04 PM

@MER-C this is abusing confusion between the latin o with "о" the U+043E CYRILLIC SMALL LETTER O, but that equivalence has been there for years...
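For anyone who wants to confirm which code point is hiding in a title like this, a short Python check using only the standard library is enough. The URL fragment is the one quoted above; the non_ascii helper name is made up for this sketch.

```python
import unicodedata
from urllib.parse import unquote

# Percent-encoded title fragment copied from the URL reported above.
title = unquote("L%D0%BEvifm.com_(Musical_base)")


def non_ascii(text):
    """Return (character, code point, Unicode name) for each non-ASCII char."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]


for entry in non_ascii(title):
    print(entry)
```

The second character decodes to U+043E CYRILLIC SMALL LETTER O, which matches the diagnosis in the comment above.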

kaldari set the point value for this task to 2. Aug 22 2017, 5:20 PM

• TBolliger moved this task from Triage/To be Estimated to Cards ready for development on the Anti-Harassment board. Aug 22 2017, 5:21 PM

• TBolliger added a parent task: T166816: Epic ⚡️ : Accuracy improvements to anti-spoof tools across multiple pertinent tools. Aug 25 2017, 7:00 PM

dbarratt moved this task from Cards ready for development to AHT Sprint 4 on the Anti-Harassment board. Aug 31 2017, 4:39 PM

dbarratt edited projects, added Anti-Harassment (AHT Sprint 4); removed Anti-Harassment.

dmaza claimed this task. Sep 1 2017, 1:58 PM

dmaza moved this task from Ready to In progress on the Anti-Harassment (AHT Sprint 4) board. Sep 8 2017, 9:36 PM

@TBolliger @kaldari I'm confused as to what needs to be done here.
I've compiled this list of characters based on the discussion so far. I'm yet to check against what we currently have.

A => Ǽ A À Ⓐ Á Â Ã Ä Å Ā Ă Ą Ǎ Ǟ Ǡ Ǻ Ȁ Ȃ Ȧ Ḁ Ạ Ả Ấ  Ầ Ẩ Ẫ Ậ Ắ Ằ Ẳ Ẵ Ặ Æ 4 @
E => ê
I => ǐ ı
L => ĹĻĽĿ
M => M̪ m̪ ᛗ Ⲙ
P => ₧ Þ þ Ρ ρ Р р Ƥ Ṕ Ṗ Ƿ 
R => Ŕ Ŗ Ř Ȑ Ȓ Ṙ Ṛ Ṝ  Ṟ ® ꭆ Ř
S => ʂ ₷ Ś Ŝ Ş Š
T => ṭ́ ₸ ƭ Ţ Ť
Y => y̐ ў γ ȳ ẙ Ύ Ῠ Ὺ ў ӯ ӱ ӳ

These I don't agree with:
ꝛ <> R
ł, ℓ <> L

Do we want to add all of these? Does someone need to vet this change?

@dmaza: Yes, we can add all of those (that aren't already in the equivset). No one needs to vet it other than whoever reviews the patch.

I would agree with omitting ꝛ but the other two do look very much like ls to me.

Change 377365 had a related patch set uploaded (by Dmaza; owner: Dmaza):
[mediawiki/extensions/AntiSpoof@master] Add more characters to AntiSpoof mappings

https://gerrit.wikimedia.org/r/377365

gerritbot added a project: Patch-For-Review. Sep 11 2017, 10:46 PM

ccnorm("ǼAÀⒶÁÂÃÄÅĀĂĄǍǞǠǺȀȂȦḀẠẢẤẪẮẰẲẴẶÆ4@êǐıĹĻĽĿ₧ÞþΡρРрƤṔṖǷŔŖŘȐȒṘṚṜṞ®ꭆŘʂ₷ŚŜŞŠ₸ƭŢŤўγȳẙΎῨῪўӯӱӳłℓᛗⲘM̪m̪y̐ṭ́ ") becomes AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEIILLLLPPPPPPPPPPPRRRRRRRRRRRRSSSSSSTTTTYYYYYYYYYYYLLMMM̪M̪Y̐T́

The four characters left (T́, M̪, m̪, Y̐) are combinations of a basic Latin letter (T, M, m, Y) with diacritical marks. I'm not sure how those work; if someone knows, please explain.

dmaza moved this task from In progress to Code Review on the Anti-Harassment (AHT Sprint 4) board. Sep 11 2017, 10:49 PM

I'm pretty sure AntiSpoof is not going to be able to handle combined characters in its current implementation. Right now we can only map a single character to a single character. T́ is technically 2 Unicode characters/codepoints combined into 1 visual character. (We also can't handle Æ -> AE.)
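The "two code points, one visual character" point can be checked directly. The Python sketch below assumes the leftover T́ is U+0054 followed by U+0301 (COMBINING ACUTE ACCENT); the string is written with an explicit escape so the assumption is visible, and describe is just a helper name chosen for this example.

```python
import unicodedata

# Assumption: the leftover "T with acute" is a base letter plus a combining mark.
t_acute = "T\u0301"


def describe(text):
    """Return (code point, Unicode name, combining class) per code point."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), unicodedata.combining(ch))
        for ch in text
    ]


print(len(t_acute))  # number of code points, not visual characters
for entry in describe(t_acute):
    print(entry)
```

A non-zero combining class marks the second code point as a combining mark, so a lookup keyed on single code points sees the T and the accent as two separate inputs.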

True, but I think AntiSpoof should actually *remove* certain characters, including diacritical characters, the Arabic/Persian vowel characters (example is Arabic kasra, U+0650), and all invisible characters (examples include ZWNJ, ZWJ, etc).

This requires its own task though.
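A rough Python sketch of what such a removal step might look like, purely as an illustration of the proposal and not something AntiSpoof is known to do: decompose the text, then drop nonspacing and enclosing marks (Unicode categories Mn, Me) and format characters (Cf), which covers combining diacritics, the Arabic kasra, ZWNJ and ZWJ. The function name and the choice of categories are assumptions.

```python
import unicodedata

# Assumed set of categories to drop: nonspacing marks, enclosing marks,
# and invisible format characters such as ZWNJ (U+200C) and ZWJ (U+200D).
DROPPED_CATEGORIES = {"Mn", "Me", "Cf"}


def strip_marks_and_invisibles(text):
    """Decompose to NFD, then remove marks and format characters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if unicodedata.category(ch) not in DROPPED_CATEGORIES
    )


print(strip_marks_and_invisibles("T\u0301"))        # T + combining acute
print(strip_marks_and_invisibles("\u1e6d\u0301"))   # t with dot below + acute
print(strip_marks_and_invisibles("Ex\u200cample"))  # ZWNJ hidden in a word
```

Letters such as ł (U+0142) have no canonical decomposition, so a step like this would complement the equivalence mappings rather than replace them. It is also aggressive: some scripts rely on these marks in legitimate titles, which is one more reason it would need its own task.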

• TBolliger moved this task from AHT Sprint 4 to AHT Sprint 5 on the Anti-Harassment board. Sep 12 2017, 6:37 PM

• TBolliger edited projects, added Anti-Harassment (AHT Sprint 5); removed Anti-Harassment (AHT Sprint 4).

dmaza moved this task from Ready to Code Review on the Anti-Harassment (AHT Sprint 5) board. Sep 12 2017, 6:41 PM

Change 377365 merged by jenkins-bot:
[mediawiki/extensions/AntiSpoof@master] Add more characters to AntiSpoof mappings

https://gerrit.wikimedia.org/r/377365

dmaza moved this task from Code Review to Done on the Anti-Harassment (AHT Sprint 5) board. Sep 19 2017, 12:39 AM

• TBolliger closed this task as Resolved. Sep 19 2017, 6:06 PM

@Legoktm this task still has Security: software security bug; you may wish to remove that one too. Regards.

• chasemp added a project: Security. Feb 10 2020, 10:55 PM

• chasemp removed a project: acl*security. Feb 20 2020, 8:15 PM
