Create a function for AbuseFilter that can normalize HTML entities to their respective UTF-8 characters
Closed, Resolved (Public)

Assigned To

Authored By

	MohammadtheEditor
	Jun 28 2017, 5:34 PM

Description

An AbuseFilter rule that blocks a user based on keywords in their edits was evaded by encoding the text as HTML entities.

We should add a function that normalizes any string into its UTF-8 equivalent.

Here's a link to the vandalism
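
To make the evasion concrete, here is a minimal sketch in Python rather than in AbuseFilter's own rule language or MediaWiki's PHP. The keyword list and all function names are hypothetical and exist only for illustration; `html.unescape` stands in for whatever decoding step the filter would use.

```python
import html

# Hypothetical keyword list, for illustration only.
BLOCKED_KEYWORDS = ["badword"]


def naive_match(text: str) -> bool:
    """Return True if any blocked keyword appears in the raw text."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKED_KEYWORDS)


def normalized_match(text: str) -> bool:
    """Decode HTML character references first, then run the same check."""
    return naive_match(html.unescape(text))


def encode_as_entities(text: str) -> str:
    """Encode every character as a decimal numeric character reference."""
    return "".join(f"&#{ord(ch)};" for ch in text)


# The encoded form renders as the keyword in a browser, but the raw
# wikitext no longer contains the keyword as a substring.
evasive = encode_as_entities("badword")
print(naive_match(evasive))
print(normalized_match(evasive))
```

The naive check misses the encoded text, while the check that decodes first catches it.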

Details

	Subject	Repo	Branch	Lines +/-
	Introduce sanitize() function	mediawiki/extensions/AbuseFilter	master	+25 -0
	Remove invisible characters and normalize HTML entities	mediawiki/extensions/AntiSpoof	master	+22 -0


Event Timeline

MohammadtheEditor created this task. Jun 28 2017, 5:34 PM

Restricted Application added a subscriber: Aklapper. Jun 28 2017, 5:34 PM

Marking as high priority, as I think this can be abused at large scale.

Also suggesting that the visibility level of this task be changed (hidden from the public) due to the potential for abuse. Once it is changed, I won't be able to see the details of this task any more, so if you need help from me, please contact me directly.

Huji updated the task description. Jun 28 2017, 11:42 PM

@Huji I'd suggest adding a filter on the Persian Wikipedia that blocks edits containing more than a few entities until this task is resolved. Would that be possible?

@MohammadtheEditor I have already done that. But I still want this resolved, as it can be abused similarly in other projects.

Massive attacks are happening on the Persian Wikipedia, where the troll uses the Persian unbreakable space (‌) to get around the abuse filter. The troll is using this widely and there is no local workaround. I'm changing the priority to the highest since this is becoming urgent.
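
As an illustration of why an invisible character defeats a substring match, the following Python sketch strips Unicode format characters (general category `Cf`, which includes U+200C ZERO WIDTH NON-JOINER) before comparing. It assumes the character in question is a zero-width one, and the function name is hypothetical. This is not what the AntiSpoof or AbuseFilter patches necessarily do.

```python
import unicodedata


def strip_invisible(text: str) -> str:
    """Drop Unicode format characters (category Cf), such as U+200C."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


keyword = "badword"
# Same visible text, with a zero-width non-joiner inserted in the middle.
evasive = "bad\u200cword"

print(keyword in evasive)
print(keyword in strip_invisible(evasive))
```

Because the zero-width non-joiner is part of normal Persian orthography, a filter would presumably apply this only to the copy of the text used for matching, not to the saved edit.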

Restricted Application added subscribers: Jay8g, TerraCodes. Jul 9 2017, 10:48 AM

(Reverting priority change. As annoying and distracting as this is, it does not qualify for "immediately drop anything else you work on".)

I don't know how useful this can be with Persian but ccnorm function decodes HTML entities before similar characters are converted.

In T169122#3420585, @matej_suchanek wrote:

I don't know how useful this can be with Persian but ccnorm function decodes HTML entities before similar characters are converted.

That decodes things like &lt; to <, but it does not transcode numeric references like &#1662; to the letter پ. Interestingly, the link prefix [[پرونده: is interpreted as [[تصویر: (the localized form of [[Image:) and the image is shown! What that tells me is that MW's parser can already transcode these. I just need to figure out where, and then expose that function to AbuseFilter.

>>> Sanitizer::decodeCharReferences('&#1662;&#1585;&#1608;&#1606;&#1583;&#1607;');
=> "پرونده"

@Legoktm should we just modify ccnorm() to pass the text through decodeCharReferences?
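
The idea of decoding character references before the confusable-character step can be sketched as follows. This is a Python analogue, not the AbuseFilter or AntiSpoof code: `html.unescape` stands in for `Sanitizer::decodeCharReferences`, and `toy_ccnorm` with its four-entry table is a hypothetical stand-in for the real `ccnorm()` normalization.

```python
import html

# Hypothetical, deliberately tiny confusable table; the real table used
# by ccnorm() is far larger.
TOY_CONFUSABLES = str.maketrans({"0": "O", "1": "I", "3": "E", "4": "A"})


def toy_ccnorm(text: str) -> str:
    """Uppercase the text and map look-alike digits to letters."""
    return text.upper().translate(TOY_CONFUSABLES)


def decode_then_normalize(text: str) -> str:
    """Decode character references first, then normalize confusables."""
    return toy_ccnorm(html.unescape(text))


# The same references as in the shell session above.
encoded = "&#1662;&#1585;&#1608;&#1606;&#1583;&#1607;"
print(html.unescape(encoded))

# "&#49;" is the digit 1; it only becomes "I" if decoding happens first.
print(decode_then_normalize("w&#49;k&#49;"))
```

Running the normalization without the decoding step leaves the references in place, and their digits get mangled instead of the intended characters being matched, which is why the order of the two steps matters.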

matej_suchanek moved this task from Backlog to Filtering features on the AbuseFilter board. Jan 19 2018, 7:52 PM

Change 406534 had a related patch set uploaded (by Huji; owner: Huji):
[mediawiki/extensions/AntiSpoof@master] Remove invisible characters and normalize HTML entities

https://gerrit.wikimedia.org/r/406534

gerritbot added a project: Patch-For-Review. Jan 28 2018, 11:00 PM

Huji claimed this task. Jan 28 2018, 11:00 PM

Huji added projects: User-Huji, AntiSpoof.

Huji updated the task description. Jan 31 2018, 12:12 AM

Change 406966 had a related patch set uploaded (by Huji; owner: Huji):
[mediawiki/extensions/AbuseFilter@master] Intorducing santize() function

https://gerrit.wikimedia.org/r/406966

Change 406534 abandoned by Huji:
Remove invisible characters and normalize HTML entities

Reason:
Not needed anymore.

https://gerrit.wikimedia.org/r/406534

Huji removed a project: AntiSpoof. Mar 4 2018, 10:19 PM

Huji updated the task description.

Change 406966 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] Introduce sanitize() function

https://gerrit.wikimedia.org/r/406966

Restricted Application added a subscriber: Daimona. Jun 24 2018, 1:53 PM

Huji closed this task as Resolved. Jun 24 2018, 3:38 PM

Daimona removed a project: Patch-For-Review. Jun 24 2018, 4:31 PM

ReleaseTaggerBot added a project: MW-1.32-notes (WMF-deploy-2018-07-10 (1.32.0-wmf.12)). Jun 30 2018, 3:01 PM

matej_suchanek unsubscribed. Jul 11 2018, 2:21 PM
