An AbuseFilter meant to stop a user based on a keyword in his edits was evaded by encoding the text as HTML entities.
We should add a function that normalizes any string into its UTF-8 equivalent.
Here's a link to the vandalism
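To illustrate the evasion described above, here is a minimal sketch in Python rather than AbuseFilter's own rule language. The keyword `spam` and the helper names are hypothetical, since the real filter and keyword are not given in this task. `html.unescape` from the Python standard library stands in for whatever decoding function AbuseFilter would expose.

```python
import html

# Hypothetical blocked keyword; the real filter's keyword is not stated in this task.
BLOCKED = "spam"


def naive_match(text: str) -> bool:
    """Match against the raw wikitext, as the evaded filter effectively did."""
    return BLOCKED in text


def decoded_match(text: str) -> bool:
    """Decode HTML character references first, then match."""
    return BLOCKED in html.unescape(text)


# The keyword written entirely as decimal character references.
evasive = "&#115;&#112;&#97;&#109;"

print(naive_match(evasive))    # the raw text never contains the keyword
print(decoded_match(evasive))  # the decoded text does
```

The raw check misses the encoded edit even though a browser renders it as the plain keyword, which is why the match has to run on the decoded text.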
Marking as high priority, as I think this can be abused at a large scale.
Also suggesting that the visibility level of this task be changed (hidden from the public) due to the potential for abuse. Once changed, I won't be able to see the details of this task any more, so if you need help from me, please contact me directly.
@Huji I'd suggest adding a filter on the Persian Wikipedia that blocks edits containing more than a few entities until this task is resolved. Would that be possible?
@MohammadtheEditor I have already done that. But I still want this resolved, as it can be abused similarly in other projects.
Massive attacks are happening on the Persian Wikipedia, where the troll uses the Persian unbreakable space () to get around the abuse filter. The troll is using this widely and there isn't a local way around it. I'm changing the priority to the highest since this is becoming urgent.
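As a rough sketch of how invisible characters could be removed before matching: the comment above does not name the exact character, so this example assumes it is a zero-width format character (Unicode general category `Cf`), using U+200C as the sample. The function name `strip_invisible` is hypothetical and is not part of AbuseFilter or AntiSpoof.

```python
import unicodedata


def strip_invisible(text: str) -> str:
    """Drop characters in Unicode category Cf (format characters).

    Assumption: the evasion character belongs to this category. An ordinary
    no-break space (U+00A0) is category Zs and would NOT be removed here.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# Hypothetical keyword "spam" split by U+200C ZERO WIDTH NON-JOINER.
sample = "sp\u200cam"

print("spam" in sample)                   # the raw text hides the keyword
print("spam" in strip_invisible(sample))  # the stripped copy reveals it
```

Because the zero-width non-joiner is part of normal Persian spelling, a step like this would only be safe on the copy of the text used for matching, not on the text that gets saved.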
(Reverting priority change. As annoying and distracting as this is, it does not qualify for "immediately drop anything else you work on".)
I don't know how useful this can be with Persian, but the ccnorm function decodes HTML entities before similar characters are converted.
That decodes things like &lt; to < but it does not transcode things like پ to the letter ت . Interestingly, the command [[پرونده: is interpreted as [[تصویر: (which is the localized form of [[Image:) and the image is shown! What this tells me is that MW's parser can already transcode these. I just need to figure out where, and then expose that function to AbuseFilter.
>>> Sanitizer::decodeCharReferences('پرونده'); => "پرونده"
@Legoktm should we just modify ccnorm() to pass the text through decodeCharReferences?
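A small sketch of why decoding would need to run before any normalization step, as the question above proposes. This is Python, not the actual `ccnorm()` code: `normalize` is a hypothetical stand-in that only case-folds, and `html.unescape` stands in for `decodeCharReferences`.

```python
import html


def normalize(text: str) -> str:
    """Hypothetical stand-in for ccnorm(): this sketch only case-folds."""
    return text.casefold()


def decode_then_normalize(text: str) -> str:
    """Decode character references first, so decoded letters get normalized too."""
    return normalize(html.unescape(text))


def normalize_then_decode(text: str) -> str:
    """Wrong order: letters hidden inside references skip normalization."""
    return html.unescape(normalize(text))


encoded = "&#65;BC"  # "ABC" with the first letter written as a character reference

print(decode_then_normalize(encoded))
print(normalize_then_decode(encoded))
```

With decoding first, every character reaches the normalizer. With the order reversed, the letter hidden in the reference comes out un-normalized, which is the kind of gap a filter author would not expect.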
Change 406534 had a related patch set uploaded (by Huji; owner: Huji):
[mediawiki/extensions/AntiSpoof@master] Remove invisible characters and normalize HTML entities
Change 406966 had a related patch set uploaded (by Huji; owner: Huji):
[mediawiki/extensions/AbuseFilter@master] Introducing sanitize() function
Change 406534 abandoned by Huji:
Remove invisible characters and normalize HTML entities
Reason:
Not needed anymore.
Change 406966 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] Introduce sanitize() function