Create a special page to handle additions, removals, changes and logging of spam blacklist entries
Open, Needs Triage · Public

Description

There should be a special page to manage the spam blacklist. Admins should be able to check URLs against the blacklist. I had a URL that was blacklisted and I could not find out which regular expression matched it (OK, I could write a simple Perl script to do this on my computer). There are some other related suggestions about spam, see:

Maybe you should rewrite the entire spam protection mechanism.

Details

Reference
bz4459
bzimport raised the priority of this task from to Low.
bzimport set Reference to bz4459.
bzimport added a subscriber: Unknown Object (MLST).

I finally found that www.2100books.com matches against [0-9]+books\.com. So how do I know which [0-9]+books\.com pages are good and which are evil? Who entered the regexp, and because of which pages? Maybe you can find that out in the version history, but managing the spam blacklist should be as easy as blocking users and pages.
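The "simple perl script" mentioned above is easy to illustrate. Here is a minimal sketch (in Python rather than Perl, with a hypothetical in-memory blacklist) of how to find which blacklist entries match a given URL:

```python
import re

def matching_entries(url, blacklist_lines):
    """Return every blacklist regex that matches the given URL."""
    hits = []
    for line in blacklist_lines:
        pattern = line.strip()
        # Skip comments and blank lines, as in the on-wiki blacklist format.
        if not pattern or pattern.startswith("#"):
            continue
        if re.search(pattern, url):
            hits.append(pattern)
    return hits

blacklist = ["# spam domains", r"[0-9]+books\.com", r"cheap-pills\.example"]
print(matching_entries("http://www.2100books.com/", blacklist))
```

Running this against the example above reports that `[0-9]+books\.com` is the matching entry; a special page could present exactly this kind of lookup.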

accnospamtom wrote:

A rewritten "spam protection mechanism" should definitely be part of every MediaWiki installation. Many sites use this software, but they don't install optional extensions. Sysops have to fight wikispam without proper "weapons", as they usually have no access to the servers.

This feature should be enabled by default (with an empty spam blacklist, which is editable by sysops).

robchur wrote:

Setting product correctly. I have seen a demand for a slightly easier-to-use
version of Spam Blacklist, so it might not be a bad idea to consider this a
separate request. Leaving ambiguous for now.

brian wrote:

(In reply to comment #0)

http://bugzilla.wikimedia.org/show_bug.cgi?id=1505

Or just bug 1505 - it automatically creates a link.

robchur wrote:

*** Bug 4698 has been marked as a duplicate of this bug. ***

mike.lifeguard+bugs wrote:

*** Bug 13805 has been marked as a duplicate of this bug. ***

mike.lifeguard+bugs wrote:

*** Bug 14090 has been marked as a duplicate of this bug. ***

mike.lifeguard+bugs wrote:

If SpamRegex is fixed up, it might fulfil this need; see bug 13811.

mike.lifeguard+bugs wrote:

(In reply to comment #8)

If SpamRegex is fixed up, it might fulfil this need; see bug 13811.

Per bug 13811 comment 14, that's apparently not true. This will probably be fulfilled by AbuseFilter, which Werdna is working on, so I've CCed him.

seth added a comment. Feb 19 2009, 7:12 PM

We don't have a special page yet, but there are tools like http://toolserver.org/~seth/grep_regexp_from_url.cgi which make it possible to search for an entry and for its reason. This tool can be used in MediaWiki:Spamprotectionmatch, e.g., http://de.wikipedia.org/wiki/MediaWiki:Spamprotectionmatch/en.

So afaics the main thing - the difficulty of finding already-blacklisted links - is solved.

mike.lifeguard+bugs wrote:

(In reply to comment #10)

We don't have a special page yet, but there are tools like
http://toolserver.org/~seth/grep_regexp_from_url.cgi which make it possible
to search for an entry and for its reason. This tool can be used in
MediaWiki:Spamprotectionmatch, e.g.,
http://de.wikipedia.org/wiki/MediaWiki:Spamprotectionmatch/en.

So afaics the main thing - the difficulty of finding already-blacklisted
links - is solved.

External tools are *not* sufficient.

mike.lifeguard+bugs wrote:

There are probably-useful notes on http://www.mediawiki.org/wiki/Extension_talk:SpamBlacklist#more_detailed_manual_and_suggestions and certainly on http://www.mediawiki.org/wiki/Regex-based_blacklist

Both AbuseFilter and SpamRegex would need lots of work to be a viable alternative to SpamBlacklist at present. Some of the major concerns with replacing SpamBlacklist with AbuseFilter follow (concerns regarding replacing SpamBlacklist with SpamRegex are discussed on bug 13811):

*Global filters (bug 17811) are really required, since probably 1/3 of our spam blocking as a Wikimedia community happens globally.
**Relatedly, local wikis would need some way to opt-out of blocking individual domains (or individual filters - and you might block multiple domains with a single filter - we do use regex after all :D)

*Also relatedly, we need to provide output for non-WMF wikis - but only the spam-related filters! So, probably some method of categorizing them will be necessary. That'd also be useful because, with several thousand filters, it quickly becomes *very* difficult to search through them all for a particular one - tagging/categorizing of filters and searching within the notes will be needed.
**As well, this assumes that all third parties will install AbuseFilter - which will not happen. So, ideally there would be a compatibility function to provide output at least somewhat equivalent to the output of SpamBlacklist which could be used as input for third party installations.

*Regarding workflow: AbuseFilter is not designed for blocking spam (it is meant to target pattern vandalism), and the workflow reflects that. We need to be able to quickly and painlessly add formulaic filters which do a very small subset of what AbuseFilter is capable of. I had suggested in the past that there could be filter templates for common purposes (such as blocking spam) - users would just fill in the blank and apply the filter.

*Performance: Someone should compare the performance effects of blocking all the domains we're currently blocking with SpamBlacklist using AbuseFilter instead (using one filter for each line of regex vs one filter for the whole thing would also be a useful comparison - is there an impact there? That could affect workflow significantly depending on the answer.)
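As a rough illustration of the comparison suggested above (a generic regex micro-benchmark with made-up domains, not a measurement of AbuseFilter or SpamBlacklist themselves), one could time one-pattern-per-entry matching against a single combined alternation:

```python
import re
import timeit

# Hypothetical domain list standing in for blacklist entries.
domains = [f"spamsite{i}.example" for i in range(1000)]

# One compiled pattern per entry (one-filter-per-line model)...
per_entry = [re.compile(re.escape(d)) for d in domains]
# ...versus a single combined alternation (one-filter-for-everything model).
combined = re.compile("|".join(re.escape(d) for d in domains))

text = "spam link: http://spamsite999.example/buy-now"

t_many = timeit.timeit(lambda: any(p.search(text) for p in per_entry), number=200)
t_one = timeit.timeit(lambda: bool(combined.search(text)), number=200)
print(f"1000 separate patterns: {t_many:.4f}s; one combined pattern: {t_one:.4f}s")
```

The relative numbers depend heavily on the regex engine, so a real comparison would have to be run against AbuseFilter's own evaluator.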

*AbuseFilter can resolve bug 16325 in a user-friendly way: If all_links has whatever.com then present a particular message asking them to remove it (but potentially let them still save the edit or not, depending)

*For authors, showing the edit form after a hit (bug 16757) is important & AbuseFilter would resolve that.

*The AbuseFilter log would resolve bug 1542 nicely (& we are even replicating that to the toolserver).

*Rollback can be exempted easily, which would resolve bug 15450 perfectly.

*AbuseFilter can use new_html to resolve bug 15582 somewhat at least -- someone should figure out how true that statement is, since I'm no expert there. Potentially bug 16610 too?

*If AbuseFilter were modified, it could potentially resolve bug 16466 in an acceptable manner. Bug 14114 too?

*AbuseFilter could potentially resolve bug 16338 and bug 13599, depending on how one sets up the filters.

*AbuseFilter could maybe be modified to allow per-page exceptions (bug 12963)... something like a whitelist filter? Or you could mash that into the original filter, which goes back to the workflow problem.

*AbuseFilter's ccnorm() and/or rmspecials() would resolve the unicode problem (bug 12896) AFAICT -- though that should certainly be tested & verified.

*AbuseFilter's warn function would resolve bug 9416 in a very user-friendly manner.


In summary: AbuseFilter needs to implement global filters, local exemption, backward compatibility with SpamBlacklist on third-party installs, and better filter tagging/searching and other workflow improvements before it can be considered a viable alternative to SpamBlacklist.
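As an aside on the ccnorm()/rmspecials() point above: the following sketch illustrates the *kind* of normalization those AbuseFilter functions perform (case and diacritic folding plus stripping of non-alphanumerics). It is not AbuseFilter's actual algorithm, and real confusable folding (e.g. mapping '1' to 'I') is more involved:

```python
import unicodedata

def normalize(text):
    """Rough ccnorm()/rmspecials()-style normalization:
    fold accents and case, then strip non-alphanumerics.
    (Illustrative only; not AbuseFilter's implementation.)"""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Remove anything that is not a letter or digit (rmspecials-like).
    return "".join(c for c in stripped.upper() if c.isalnum())

print(normalize("v1agra-pílls.com"))  # → "V1AGRAPILLSCOM"
```

Whether this actually resolves the unicode problem (bug 12896) would, as noted, need to be tested against real spam samples.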

What about mw:Extension:Phalanx? Looks like a good tool.

CCing Jack Phoenix as he seems in charge of Phalanx.

hoo added a subscriber: hoo. Edited Jan 10 2015, 12:58 PM

What about mw:Extension:Phalanx? Looks like a good tool.

Phalanx has quite some redundancy with tools we already have (especially AbuseFilter), and it also has quite a few rough edges, AFAIR.

In T6459#968333, @hoo wrote:

Phalanx has quite some redundancy with tools we already have (especially AbuseFilter), and it also has quite a few rough edges, AFAIR.

This. Performance is also a big deal, and when it comes to Phalanx, it's just...not good. Which, I guess, partly explains why Wikia rewrote parts of the Phalanx backend in Scala last year (see https://github.com/Wikia/scala-backend/tree/master/phalanx). Most pre-existing tools -- namely AbuseFilter, GlobalBlocking, SpamBlacklist & TitleBlacklist -- handle most of the tasks Phalanx does, too.

It's probably worth noting that SpamBlacklist hits (user X triggered spam filter on page Y) have been logged to a log viewable on Special:Log since Q3 2013 (see acaf4262d94269e55f9ac45179fc7159c961e346).

Anyway, as a final note on Phalanx, we should be working on improving pre-existing tools to make it redundant, specifically adding account blocking support to GlobalBlocking and whatnot else, but that's a whole different task not relevant to this report.

Restricted Application added a subscriber: Aklapper. Oct 27 2015, 4:59 PM
Meno25 removed a subscriber: Meno25. Feb 22 2016, 7:12 PM
Restricted Application added a subscriber: JEumerus. Feb 22 2016, 7:12 PM
MarcoAurelio raised the priority of this task from Low to Needs Triage. Sep 20 2016, 6:36 AM
MarcoAurelio updated the task description.
seth added a comment. Sep 20 2016, 8:22 AM

As said already (above in this thread), you could use

https://tools.wmflabs.org/searchsbl

But this is, of course, just an external tool.

MarcoAurelio added a subscriber: nichtich.

User renamed.

I think this will be a more modern way to handle URL blacklisting. I very much support this.

I am going to work through a thought experiment here. My suggestion is to rewrite the current spam-blacklist extension (or, better, write a new extension):

  • Take the current AbuseFilter and take out all the code that interprets the rule ('conditions').
  • Make two fields:
    • One text field for regexes that block added external links (the blacklist). It can contain many rules (one on each line).
    • One text field for regexes that override the block (a whitelist overriding this blacklist field; that is generally simpler and cleaner than writing a complex regex, since not everybody is a regex specialist).
  • Add a namespace choice (checkboxes, so one can choose not to blacklist something in one particular namespace; with the addition of 'all', 'content namespaces only' and 'talk namespaces only' options).
  • Add a user-status choice (checkboxes for the different roles, or like the page-protection levels):
    • Some links are fine in discussions but should not be used in mainspace; others are a total no-no.
    • Some image links are fine in the file namespace to tell where the image came from, but are not needed in mainspace.
  • Leave all the other options:
    • A discussion field for evidence (or, better, a talk-page-like function).
    • Enabled/disabled/deleted - if a rule is not needed, turn it off; once it is obsolete, delete it.
    • 'Flag the edit in the edit filter log' - maybe nice to be able to turn this off, to get rid of the real rubbish that doesn't need to be logged.
    • Rate limiting - catch editors who start spamming an otherwise reasonably good link.
    • Warn - could be a replacement for en:User:XLinkBot.
    • Prevent the action - as is the current blacklist/whitelist function.
    • Revoke autoconfirmed - make sure that spammers are caught and checked.
    • Tagging - for combining certain rules to be checked by RC patrollers.
    • I would consider adding a button to auto-block editors on certain typical spambot domains.

This should overall be much more lightweight than the current AbuseFilter (all it does is regex testing, as the spam blacklist does; it does not have to cycle through maybe thousands of AbuseFilters).

One could consider expanding it so that rules can be blocked or enabled on only certain pages, but that sounds complicated to me.

I know that this functionality exists in the current AbuseFilter, but running many regexes through AbuseFilter on every edit is going to be a burden on the servers; this should not be significantly heavier than the current Spam Blacklist.
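The blacklist/whitelist interaction proposed above can be sketched in a few lines (the function name and rules here are hypothetical, chosen only to illustrate the override semantics):

```python
import re

def is_blocked(link, blacklist, whitelist):
    """Block a link if any blacklist regex matches it and no
    whitelist regex overrides the match."""
    if not any(re.search(b, link) for b in blacklist):
        return False
    return not any(re.search(w, link) for w in whitelist)

# Hypothetical rules: block the numeric book-spam domains, but
# exempt one specific path via the whitelist field.
blacklist = [r"[0-9]+books\.com"]
whitelist = [r"2100books\.com/catalog"]

print(is_blocked("http://www.2100books.com/buy", blacklist, whitelist))      # → True
print(is_blocked("http://www.2100books.com/catalog", blacklist, whitelist))  # → False
```

The point of the separate whitelist field is visible here: the exemption is a second simple pattern rather than a negative-lookahead baked into the blacklist regex.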

Ladsgroup added a subscriber: Ladsgroup.

Yes please.

Liuxinyu970226 awarded a token.
1997kB added a subscriber: 1997kB. Sun, Dec 2, 4:13 AM