AbuseFilter (and AntiSpoof?) not catching certain Unicode equivalencies
Closed, Resolved, Public

Description

JonKolbert pointed me to the following diff:

https://en.wikiquote.org/w/index.php?title=Talk:Polish_proverbs&diff=2701236&oldid=2701109

A global AbuseFilter that does something like norm(edit_diff) irlike "encyclopediasupreme" should have caught this, but failed to.

The characters being used there are U+1D41E MATHEMATICAL BOLD SMALL E et al. Those were added to equivset.in (which feeds AntiSpoof and AbuseFilter, according to the docs) in rMLEQ4464b4454b48fe2d79b6b84f0810394d6db6b776.

However, this doesn't seem to have ever been deployed, or else something is choking. I visited Special:AbuseFilter/test and tested the following filter against a dummy edit:

ccnorm("𝐞") irlike "e"

Which did not match. I then tested the following filter, which uses U+FF25 FULLWIDTH LATIN CAPITAL LETTER E:

ccnorm("Ｅ") irlike "e"

Which matched.

So I suspect that either the changes in the linked diff above were never deployed, or that something in AntiSpoof or AbuseFilter is failing to handle character equivalencies outside of the basic multilingual plane.
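To make the second hypothesis concrete, here is a small Python sketch (not AbuseFilter or Equivset code) showing how the two test characters differ at the encoding level. U+1D41E lies outside the basic multilingual plane, so it needs four bytes in UTF-8 and a surrogate pair in UTF-16, while U+FF25 sits inside the BMP. The describe helper is a hypothetical name used only for illustration.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Report where a single character sits in the Unicode code space."""
    cp = ord(ch)
    return {
        "codepoint": f"U+{cp:04X}",
        "name": unicodedata.name(ch),
        # The basic multilingual plane covers U+0000 through U+FFFF.
        "in_bmp": cp <= 0xFFFF,
        "utf8_bytes": len(ch.encode("utf-8")),
        # Each UTF-16 code unit is two bytes; astral characters need two units.
        "utf16_units": len(ch.encode("utf-16-le")) // 2,
    }


bold_e = "\U0001D41E"    # the character from the filter that did not match
fullwidth_e = "\uFF25"   # the character from the filter that matched

for ch in (bold_e, fullwidth_e):
    print(describe(ch))
```

Code that iterates over bytes or 16-bit units rather than whole code points could mishandle the first character while handling the second correctly, which is one way the observed difference might arise. This is only an illustration of the hypothesis, not a diagnosis.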

Event Timeline

kolbert triaged this task as High priority. Nov 23 2019, 7:40 PM
kolbert added projects: AbuseFilter, AntiSpoof.

Some more digging: WMF is on version 1.3.0 of equivset (https://packagist.org/packages/wikimedia/equivset#1.3.0), so I guess this is a request to deploy equivset 1.4.0.

Daimona subscribed.

Yes, you're right that they were added in equivset 1.4.0. There's been a curious increase in the use of characters from those ranges, something which I don't know how to explain. Either way, I had sent https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/AbuseFilter/+/550115/ a few days ago, so I'm gonna tag this task now.

I am not sure if these characters have already been accounted for, but this seems to be getting by the filter as well. https://nl.wikipedia.org/w/index.php?title=Overleg:Sylvester_Stallone&diff=prev&oldid=55148739

https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/libs/Equivset/+/master/data/equivset.in#5484 and other lines cover it.

Correct.

Calling this resolved; it'd be interesting to understand why these characters are popping out so often lately, but that's probably out of scope. To have an idea of the frequency, see e.g. here (requires abusefilter-view-private).

"ɪ" doesn't seem to be covered in the equivset - used to get around filter in https://fr.wikipedia.org/w/index.php?title=Discussion:Boxe&diff=prev&oldid=164879040

That's because it's a completely different character: https://www.fileformat.info/info/unicode/char/026a/index.htm

We could add it as well, but then we'd need to release a new version of equivset. Also, I forgot that we need to patch mediawiki/vendor as well: https://gerrit.wikimedia.org/r/#/c/mediawiki/vendor/+/553325/
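As a rough illustration of why "ɪ" is a different case from the mathematical bold letters, the Python sketch below uses the standard unicodedata module. U+1D41E and U+FF25 carry Unicode compatibility decompositions to plain Latin letters, whereas U+026A has none, so it can only be folded by an explicit table entry. Note the assumptions: equivset is a curated mapping and is not the same thing as NFKC, and EXTRA_LOOKALIKES and both function names are hypothetical, not part of equivset, AntiSpoof, or AbuseFilter.

```python
import unicodedata


def nfkc_fold(text: str) -> str:
    """Apply Unicode NFKC normalization, then case-fold the result."""
    return unicodedata.normalize("NFKC", text).casefold()


# Hypothetical extra look-alike mappings, standing in for a curated table.
# U+026A is LATIN LETTER SMALL CAPITAL I.
EXTRA_LOOKALIKES = {"\u026A": "i"}


def fold_with_table(text: str, table: dict = EXTRA_LOOKALIKES) -> str:
    """Replace table entries first, then apply the generic NFKC fold."""
    replaced = "".join(table.get(ch, ch) for ch in text)
    return nfkc_fold(replaced)


for ch in ("\U0001D41E", "\uFF25", "\u026A"):
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        "decomposition:", repr(unicodedata.decomposition(ch)),
        "nfkc_fold:", repr(nfkc_fold(ch)),
        "fold_with_table:", repr(fold_with_table(ch)),
    )
```

The design point is that generic normalization only covers characters Unicode itself declares compatible; visually similar but distinct letters need a hand-maintained entry, which is why covering "ɪ" would mean a new equivset release.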

sbassett subscribed.

@Urbanecm -

Calling this resolved; it'd be interesting to understand why these characters are popping out so often lately, but that's probably out of scope.

Sounds resolved? 550115 and 553325 are both merged and should be in production. I think a separate bug could be filed for the character frequency issue, if desired. Unless anyone objects, I think we can resolve this task and make it public.

Yes, resolved. I don't think we even need a separate task for those characters; I believe it's not MediaWiki-related.

sbassett changed the visibility from "Custom Policy" to "Public (No Login Required)". Dec 17 2019, 6:21 PM