AbuseFilter (and AntiSpoof?) not catching certain Unicode equivalencies
Closed, Resolved · Public

Description

JonKolbert pointed me to the following diff:

https://en.wikiquote.org/w/index.php?title=Talk:Polish_proverbs&diff=2701236&oldid=2701109

A global AbuseFilter that does something like norm(edit_diff) irlike "encyclopediasupreme" should have caught this, but failed to.

The characters being used there are U+1D41E MATHEMATICAL BOLD SMALL E et al. Those were added to equivset.in (which feeds AntiSpoof and AbuseFilter, according to the docs) in rMLEQ4464b4454b48fe2d79b6b84f0810394d6db6b776.

However, this doesn't seem to have ever been deployed, or else something is choking. I visited Special:AbuseFilter/test and tested the following filter against a dummy edit:

ccnorm("𝐞") irlike "e"

Which did not match. I then tested the same filter with U+FF25 FULLWIDTH LATIN CAPITAL LETTER E:

ccnorm("Ｅ") irlike "e"

Which matched.

So I suspect that either the changes in the linked diff above were never deployed, or that something in AntiSpoof or AbuseFilter is failing to handle character equivalencies outside of the basic multilingual plane.
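To make the difference between the two test characters concrete, here is a minimal Python sketch. It only inspects the code points with the standard `unicodedata` module; it does not use equivset, AntiSpoof, or AbuseFilter. NFKC normalization is used purely as a stand-in to show that Unicode itself treats both characters as compatibility variants of a plain Latin letter; it is an assumption for illustration, not how `ccnorm` works. The `samples` and `describe` names are hypothetical.

```python
import unicodedata

# The two characters from the filter tests above.
samples = {
    "bold_e": "\U0001D41E",   # MATHEMATICAL BOLD SMALL E (did not match)
    "fullwidth_e": "\uFF25",  # FULLWIDTH LATIN CAPITAL LETTER E (matched)
}


def describe(ch):
    """Collect basic facts about a single character."""
    return {
        "codepoint": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        # The basic multilingual plane ends at U+FFFF.
        "in_bmp": ord(ch) <= 0xFFFF,
        "utf8_bytes": len(ch.encode("utf-8")),
        # Two UTF-16 code units means a surrogate pair.
        "utf16_units": len(ch.encode("utf-16-be")) // 2,
        # Stand-in only: NFKC is not what equivset does.
        "nfkc": unicodedata.normalize("NFKC", ch),
    }


report = {label: describe(ch) for label, ch in samples.items()}
for label, info in report.items():
    print(label, info)
```

The bold character sits outside the basic multilingual plane and needs a surrogate pair in UTF-16, while the fullwidth one does not, which is why code that mishandles non-BMP input is one of the two plausible explanations here.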

Event Timeline

ST47 created this task. Nov 23 2019, 7:37 PM
Restricted Application added a subscriber: Aklapper. Nov 23 2019, 7:37 PM
ST47 added a subscriber: kolbert. Nov 23 2019, 7:38 PM
kolbert triaged this task as High priority. Nov 23 2019, 7:40 PM
kolbert added projects: AbuseFilter, AntiSpoof.
ST47 added a comment. Nov 23 2019, 7:45 PM

Some more digging: WMF is on version 1.3.0 of equivset (https://packagist.org/packages/wikimedia/equivset#1.3.0). So I guess this is a request to deploy equivset 1.4.0.

Daimona claimed this task. Nov 23 2019, 8:34 PM
Daimona added a subscriber: Daimona.

Yes, you're right that they were added in equivset 1.4.0. There's been a curious increase in the use of characters from those ranges -- something which I don't know how to explain. Either way, I had sent https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/AbuseFilter/+/550115/ a few days ago, so I'm gonna tag this task now.

I am not sure if these characters have already been accounted for, but this seems to be getting by the filter as well. https://nl.wikipedia.org/w/index.php?title=Overleg:Sylvester_Stallone&diff=prev&oldid=55148739

MusikAnimal added a subscriber: MusikAnimal. Edited Nov 26 2019, 5:39 PM

I am not sure if these characters have already been accounted for, but this seems to be getting by the filter as well. https://nl.wikipedia.org/w/index.php?title=Overleg:Sylvester_Stallone&diff=prev&oldid=55148739

https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/libs/Equivset/+/master/data/equivset.in#5484 and other lines cover it.

Daimona closed this task as Resolved. Nov 26 2019, 6:40 PM

I am not sure if these characters have already been accounted for, but this seems to be getting by the filter as well. https://nl.wikipedia.org/w/index.php?title=Overleg:Sylvester_Stallone&diff=prev&oldid=55148739

https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/libs/Equivset/+/master/data/equivset.in#5484 and other lines cover it.

Correct.

Calling this resolved; it'd be interesting to understand why these characters are popping out so often lately, but that's probably out of scope. To have an idea of the frequency, see e.g. here (requires abusefilter-view-private).

kolbert reopened this task as Open. Nov 26 2019, 11:03 PM

"ɪ" doesn't seem to be covered in the equivset - used to get around filter in https://fr.wikipedia.org/w/index.php?title=Discussion:Boxe&diff=prev&oldid=164879040

That's because it's a completely different character: https://www.fileformat.info/info/unicode/char/026a/index.htm

We could add it as well, but then we'd need to release a new version of equivset. Also, I forgot that we need to patch mediawiki/vendor as well: https://gerrit.wikimedia.org/r/#/c/mediawiki/vendor/+/553325/
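As a rough illustration of why "ɪ" slips through until someone adds it, here is a small Python sketch. The `TOY_EQUIV` table and the `toy_ccnorm` function are hypothetical and hand-written for this example; they are not the real equivset data or the real `ccnorm` implementation, and the choice of lowercase ASCII as the target form is arbitrary. The point is only that a table-driven normalizer leaves any character without an explicit entry untouched, and that standard Unicode normalization does not help for U+026A either.

```python
import unicodedata

# Hypothetical, hand-written table; NOT the real equivset data.
TOY_EQUIV = {
    "\U0001D41E": "e",  # MATHEMATICAL BOLD SMALL E
    "\uFF25": "e",      # FULLWIDTH LATIN CAPITAL LETTER E
}


def toy_ccnorm(text, table):
    """Map each character through the table; unmapped characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


small_cap_i = "\u026A"  # LATIN LETTER SMALL CAPITAL I
spoofed = "w" + small_cap_i + "k" + small_cap_i

# Without an entry for U+026A the text is returned unchanged.
before = toy_ccnorm(spoofed, TOY_EQUIV)

# Adding the entry is the moral equivalent of a new equivset release.
patched = {**TOY_EQUIV, small_cap_i: "i"}
after = toy_ccnorm(spoofed, patched)

# U+026A has no compatibility decomposition, so NFKC leaves it alone too.
nfkc_result = unicodedata.normalize("NFKC", small_cap_i)

print(repr(before), repr(after), nfkc_result == small_cap_i)
```

This is also why each newly spotted look-alike means a data change plus a release and a deploy, rather than something a filter author can fix alone.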

Anything else to be done here?

sbassett moved this task from Backlog / Other to Done on the acl*security board. Edited Dec 16 2019, 10:33 PM
sbassett added a subscriber: sbassett.

@Urbanecm -

Calling this resolved; it'd be interesting to understand why these characters are popping out so often lately, but that's probably out of scope.

Sounds resolved? 550115 and 553325 are both merged and should be in production. I think a separate bug could be filed for the character frequency issue, if desired. Unless anyone objects, I think we can resolve this task and make it public.

Daimona closed this task as Resolved. Dec 17 2019, 2:24 PM

Yes, resolved. I don't think we even need a separate task for those characters; I believe it's not MediaWiki-related.

sbassett changed the visibility from "Custom Policy" to "Public (No Login Required)". Dec 17 2019, 6:21 PM