Create a function for AbuseFilter that can normalize HTML entities to their respective UTF-8 characters
Closed, Resolved (Public)

Description

An AbuseFilter rule that blocks a user's edits based on keywords was evaded by encoding the text as HTML entities.

We should add a function that normalizes any string into its UTF-8 equivalent.

Here's a link to the vandalism

Event Timeline

Huji renamed this task from Vandal uses HTML entity encode to get around AbuseFilter to Create a function for AbuseFilter that can normalize HTML entities to their respective UTF-8 characters. Jun 28 2017, 11:34 PM
Huji triaged this task as High priority.

Marking as High, as I think this can be abused at a large scale.

Also suggesting that the visibility level of this task be changed (hidden from the public) due to the potential for abuse. Once it is changed, I won't be able to see the details of this task any more, so if you need help from me, please contact me directly.

@Huji I'd suggest adding a filter on the Persian Wikipedia that blocks edits containing more than a few entities until this task is resolved. Would that be possible?
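The interim idea above (rejecting edits that carry more than a few entities) could be sketched roughly as follows. This is an illustration in Python, not AbuseFilter rule syntax; the names `CHAR_REF`, `count_char_references`, `exceeds_entity_limit` and the default limit of 3 are all hypothetical.

```python
import re

# Matches decimal (&#1662;), hexadecimal (&#x67E;) and named (&lt;) references.
CHAR_REF = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def count_char_references(text: str) -> int:
    """Count HTML character references that appear in the raw text."""
    return len(CHAR_REF.findall(text))


def exceeds_entity_limit(text: str, limit: int = 3) -> bool:
    """Return True when the text holds more references than the (assumed) limit."""
    return count_char_references(text) > limit
```

A threshold like this only limits the damage: legitimate edits that use a handful of entities would need a higher limit, and a single encoded character can still break up a keyword.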

@MohammadtheEditor I have already done that. But I still want this resolved, as it can be abused in other projects similarly.

MohammadtheEditor raised the priority of this task from High to Unbreak Now!. Jul 9 2017, 10:48 AM

Massive attacks are happening on the Persian Wikipedia, in which the troll uses the Persian unbreakable space (‌) to get around the abuse filter. The troll is using this widely, and there isn't a local way around it. I'm changing the priority to the highest since this is becoming urgent.
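To make the evasion concrete, here is a minimal Python sketch, assuming the character in question is a zero-width format character such as U+200C (zero-width non-joiner). It is not AbuseFilter or AntiSpoof code; `strip_format_chars` and the sample strings are hypothetical.

```python
import unicodedata


def strip_format_chars(text: str) -> str:
    """Drop every character in Unicode category Cf (format), e.g. U+200C."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


keyword = "badword"
# Assumption: the vandal splits the keyword with an invisible U+200C.
evaded = "bad\u200cword"

matches_raw = keyword in evaded                          # False
matches_stripped = keyword in strip_format_chars(evaded)  # True
```

Since the zero-width non-joiner is part of ordinary Persian spelling, stripping of this kind would only make sense on the copy of the text used for matching, not on the saved edit.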

Aklapper lowered the priority of this task from Unbreak Now! to High. Jul 9 2017, 11:51 AM

(Reverting priority change. As annoying and distracting as this is, it does not qualify for "immediately drop anything else you work on".)

I don't know how useful this is for Persian, but the ccnorm function decodes HTML entities before it converts similar-looking characters.

That decodes things like &lt; to <, but it does not transcode things like &#1662; to the letter پ. Interestingly, the markup [[&#1662;&#1585;&#1608;&#1606;&#1583;&#1607;: is interpreted as [[پرونده: (which is the localized form of [[File:) and the image is shown! What this means to me is that MW's parser can already transcode these. I just need to figure out where, and then expose that function to AbuseFilter.

>>> Sanitizer::decodeCharReferences('&#1662;&#1585;&#1608;&#1606;&#1583;&#1607;');
=> "پرونده"
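For readers without a MediaWiki shell, the same effect can be illustrated in Python. This is only an analog built on the standard library's `html.unescape`, not MediaWiki's `Sanitizer`; the wrapper `decode_char_references`, the helper `contains_keyword` and the sample keyword are hypothetical.

```python
import html


def decode_char_references(text: str) -> str:
    """Decode named and numeric HTML character references (Python analog)."""
    return html.unescape(text)


def contains_keyword(text: str, keyword: str) -> bool:
    """Naive keyword check, standing in for a filter condition."""
    return keyword in text


# The word from the shell session above, written as code points.
KEYWORD = "\u067e\u0631\u0648\u0646\u062f\u0647"
encoded = "&#1662;&#1585;&#1608;&#1606;&#1583;&#1607;"
decoded = decode_char_references(encoded)

hit_before = contains_keyword(encoded, KEYWORD)  # False: the entities hide the word
hit_after = contains_keyword(decoded, KEYWORD)   # True: the decoded text matches
```

The point of the sketch is the ordering: a keyword check only sees the word if decoding happens before the comparison.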

@Legoktm should we just modify ccnorm() to pass the text through decodeCharReferences?

Change 406534 had a related patch set uploaded (by Huji; owner: Huji):
[mediawiki/extensions/AntiSpoof@master] Remove invisible characters and normalize HTML entities

https://gerrit.wikimedia.org/r/406534

Change 406966 had a related patch set uploaded (by Huji; owner: Huji):
[mediawiki/extensions/AbuseFilter@master] Introducing sanitize() function

https://gerrit.wikimedia.org/r/406966

Change 406534 abandoned by Huji:
Remove invisible characters and normalize HTML entities

Reason:
Not needed anymore.

https://gerrit.wikimedia.org/r/406534

Huji updated the task description.

Change 406966 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] Introduce sanitize() function

https://gerrit.wikimedia.org/r/406966