Page MenuHomePhabricator

Make SpamBlacklist clearly act prior to AbuseFilter on WMF operational wikis
Closed, ResolvedPublic

Description

Background:
Please take these as my studied observations of operations, though not with any technical background. Also noting that these observations are made from WMF wikis, viewing these through the perspective of an administrator at Metawiki, and having global sysop privileges.

Observations:
We have edits that hit both global abuse filters (m:Special:Abuselog and local equivalents) and also appear in local Special:log/spamblacklist.

I would also note that I don't believe this it is always the case in that I believe that sometimes attempts will only log in log/spamblacklist and not hit abusefilters. (I have not tried to test this anywhere)

I queried this years ago, and was informed that the spam blacklist extension loads before the abusefilters extension, so the blacklisting should occur first and obviate the need for action in abusefilters.)

Issue:
For the best use of abusefilters,, having known and already blocked spam appearing in the abuselogs is simply problematic both in terms of significant abuselog noise, but also from knowing whether an edit has hit or not. As an administrator, especially in SWMT, I would prefer to only have to chase down potential spam, not follow known blocked edits

So I am asking that there is some modification so that the spam blacklist is able to act significantly before the abuse filters so that we are not getting this overlap.

Alternatively, is there a means that we can tune abuse filters so that they can ignore edits from urls with domains in the global spam blacklists, knowing that extension spam blacklist is reliable. So some sort of masking faculty.

cc. @Beetstra @MarcoAurelio @MusikAnimal @Legoktm @Chrissymad

Thanks for your consideration

Event Timeline

Aklapper renamed this task from Having spam blacklist clearly acting prior to abuse filters (WMF operational wikis) to Make SpamBlacklist clearly act prior to AbuseFilter on WMF operational wikis.Dec 11 2018, 12:38 PM

IIRC, hooks are executed in the order that they are registered, and AbuseFilter should come first due to alphabetical order; that is why AbuseFilter runs before SpamBlacklist. I'm not sure whether we can change the order (and if we really want it). As for the second proposal, well: for sure we cannot hardcode anything which completely skips AbuseFilter if the edit contains a blacklisted link. We could instead add a variable containing the links on SpamBlacklist so that people may use it on a per-filter basis.

This month is truly a good indicator of how we could make abuse filters useful instead of clagged with superfluous, duplicated hits.

FWIW the spambots are incredibly busy and our base defences seem pissweak, and the reliance on abusefilters and spamblacklist seems troublesome.

@Billinghurst I can try to add the new variable, I think that would be a reasonably quick solution. FWIW I can take a look at the involved filters and try to help directly on-wiki, but unfortunately this isn't a good period :/

Change 481246 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/SpamBlacklist@master] [WIP] Add an AbuseFilter variable with the content of the spam blacklist

https://gerrit.wikimedia.org/r/481246

Change 481572 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] Add a new method and hook for static variables

https://gerrit.wikimedia.org/r/481572

Change 481572 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] Add a new method and hook for static variables

https://gerrit.wikimedia.org/r/481572

Will this be a cheap or an expensive call? Trying to work out where it should be placed in the coding sequence?

Do we also need to book it in for the Tech newsletter?

I am assuming that the variable name is spam-blacklist

Will this be a cheap or an expensive call? Trying to work out where it should be placed in the coding sequence?

It depends, mostly on the blacklist size. Since this variable includes the whole blacklist (i.e. includes the shared one), it'll be pretty large. However, the expensiveness only counts when examinating old entries: during edits, the content of the blacklist is saved in cache and retrieved from there; so, using it once or twice almost makes no difference (if I'm not mistaken).

Do we also need to book it in for the Tech newsletter?

I'd say yes, but the variable still has to be added.

I am assuming that the variable name is spam-blacklist

spam_blacklist, with the underscore.

In the meanwhile, there are two other problems to address. The first one is that, when examinating old edits, this variable will hold the current content of the blacklist, and not the one at the time of the edit. This could be a problem, but there's no easy solution. At least, SpamBlacklist should provide a way to efficiently retrieve an old version of the blacklist, and AFAICS it currently doesn't.
The second one is that currently AbuseFilter lacks a clean way to use this variable: spam_blacklist is an array of strings, and arrays are poorly implemented in AF. A workaround should be something like added_lines irlike str_replace(string(spam_blacklist),'\n','|') but I wouldn't rely on it.

Change 424298 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] Add array-specific functions

https://gerrit.wikimedia.org/r/424298

Change 424298 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] [WIP] Add array-specific functions

https://gerrit.wikimedia.org/r/424298

Change 481246 abandoned by Daimona Eaytoy:
[mediawiki/extensions/SpamBlacklist@master] Add an AbuseFilter variable with the content of the spam blacklist

Reason:
Not working on this.

https://gerrit.wikimedia.org/r/481246

On second thought, I'm not entirely convinced by the spam_blacklist variable. Firstly because AbuseFilter lacks a good way to manipulate arrays (although that could be resolved by the patch above). Secondly because it might be overly expensive. And third, because this variable is so massively large that it would clog the details view of AbuseLog entries.

Instead, what about adding a hack for WMF sites only? Perhaps unsetting the AbuseFilter handler from wgHooks and adding it back at the end? It would be hacky, but perhaps it might work, given that we have no clean way of specifying the "priority" of a hook handler, AFAIK.

IIRC, hooks are executed in the order that they are registered, and AbuseFilter should come first due to alphabetical order; that is why AbuseFilter runs before SpamBlacklist.

No, SpamBlacklist runs first, because it is first in CommonSettings.php. The extension queue is not sorted.

> print_r(MediaWiki\MediaWikiServices::getInstance()->getHookContainer()->getLegacyHandlers('EditFilterMergedContent'));
Array
(
    [0] => Closure Object
...
    [1] => SpamBlacklistHooks::filterMergedContent
    [2] => GadgetHooks::onEditFilterMergedContent
    [3] => ConfirmEditHooks::confirmEditMerged
    [4] => MediaWiki\Extension\AbuseFilter\AbuseFilterHooks::onEditFilterMergedContent
    [5] => CampaignHooks::onEditFilterMergedContent
    [6] => TranslateHooks::validateMessage
    [7] => ScribuntoHooks::validateScript
    [8] => JsonSchemaHooks::onEditFilterMergedContent
    [9] => JsonConfig\JCSingleton::onEditFilterMergedContent
    [10] => NewsletterHooks::onEditFilterMergedContent
    [11] => PageTranslationHooks::tpSyntaxCheckForEditContent
)

SpamBlacklist, like AbuseFilter, does not return false when rejecting an edit, and so does not abort hook processing.

A simple fix would be to check the Status object to see if SpamBlacklist has already rejected the change:

diff --git a/includes/AbuseFilterHooks.php b/includes/AbuseFilterHooks.php
index 10d4ddfa..cbaf383c 100644
--- a/includes/AbuseFilterHooks.php
+++ b/includes/AbuseFilterHooks.php
@@ -169,6 +169,11 @@ class AbuseFilterHooks {
 	) {
 		$startTime = microtime( true );
 
+		if ( !$status->isOK() ) {
+			// Don't filter the edit if it is already rejected by another extension (T211680)
+			return;
+		}
+
 		$filterResult = self::filterEdit( $context, $user, $content, $summary, $slot );
 
 		if ( !$filterResult->isOK() ) {

(untested)

IIRC, hooks are executed in the order that they are registered, and AbuseFilter should come first due to alphabetical order; that is why AbuseFilter runs before SpamBlacklist.

No, SpamBlacklist runs first, because it is first in CommonSettings.php. The extension queue is not sorted.

Oh thank you, I was misremembering.

A simple fix would be to check the Status object to see if SpamBlacklist has already rejected the change:

Seems doable, perhaps we should start with some debug logging to see what's being skipped.

Change 661543 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] Add debug logging for edits presumably prevented by other extensions

https://gerrit.wikimedia.org/r/661543

Change 661543 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] Add debug logging for edits presumably prevented by other extensions

https://gerrit.wikimedia.org/r/661543

Quick sitrep: since the introduction of this change (around Feb 10), there've been roughly 7200 log entries. For all of them, the "fail reason" is "spam-blacklisted-link", i.e. a SpamBlacklist hit. This makes me quite confident that implementing this change would achieve what the task description asks for, without unwanted side effects (probably).

Change https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SpamBlacklist/+/680687 by @Func, which I have just approved (and which is part of a cleanup to make all extensions consistent), will likely have the desired effect here – AbuseFilter's hook will not run if SpamBlacklist's hook has already rejected the edit.

We are again being hammered by spambots, and they are again clogging up the Abuselog despite them being blacklisted days or weeks ago. Could we please see if there could be implementation of the suggested change. There is nothing more despairing in doing the hard work to get controls in place and having to have abuselogs bombed with the spambots. Thanks.

Does this mean that the change MatmaRex linked above made little to no difference here? It should have been live for ~7 branches now.

I am still seeing plentiful spambots in the abuselog. so yes, little to no difference

I think what is happening is that they are hitting the Abusefilter first, and not then hitting the spam blacklist

https://meta.wikimedia.org/wiki/Special:AbuseFilter/examine/log/1408133 spam by user:AngeloCarlton4

No log/spamblacklist entry for AngeloCarlton4
I try from an IP account for a post with just that same IP url and it is blocked.

Does this mean that the change MatmaRex linked above made little to no difference here? It should have been live for ~7 branches now.

I think that change would indeed work in theory. However, r669364 switched SpamBlacklist to use new-style hooks, whereas AF is still using the legacy ones. Legacy handlers are executed before new-style ones, hence AF is now running before SB.

Change 713046 had a related patch set uploaded (by Daimona Eaytoy; author: Daimona Eaytoy):

[mediawiki/extensions/AbuseFilter@master] Switch remaining hooks to the new system

https://gerrit.wikimedia.org/r/713046

Does this mean that the change MatmaRex linked above made little to no difference here? It should have been live for ~7 branches now.

I think that change would indeed work in theory. However, r669364 switched SpamBlacklist to use new-style hooks, whereas AF is still using the legacy ones. Legacy handlers are executed before new-style ones, hence AF is now running before SB.

Yay, regression?

Does this mean that the change MatmaRex linked above made little to no difference here? It should have been live for ~7 branches now.

I think that change would indeed work in theory. However, r669364 switched SpamBlacklist to use new-style hooks, whereas AF is still using the legacy ones. Legacy handlers are executed before new-style ones, hence AF is now running before SB.

Yay, regression?

Well, technically speaking it never worked, since the switch to new-style hooks happened before the patch with proper handling of the EditFilterMergedContent hook.

Change 713046 merged by jenkins-bot:

[mediawiki/extensions/AbuseFilter@master] Switch filterable actions hooks to the new system

https://gerrit.wikimedia.org/r/713046

This should be resolved now, waiting for this week's train to confirm that.

This should be resolved now, waiting for this week's train to confirm that.

I think this missed .19, so unless it's backported, it won't be out till .20

This should be resolved now, waiting for this week's train to confirm that.

I think this missed .19, so unless it's backported, it won't be out till .20

Indeed. I'd rather not backport it -- there shouldn't be any functional change, but it's still nontrivial, and not super-urgent either. So... RemindMe! Two weeks.

group0 is on .20 now, so I wanted to confirm whether my patch actually fixes the issue, and it does:

>>> MediaWiki\MediaWikiServices::getInstance()->getHookContainer()->getHandlers('EditFilterMergedContent');
=> [
     SpamBlacklistHooks {#3208},
     MediaWiki\Extension\AbuseFilter\Hooks\Handlers\FilteredActionsHandler {#3206},
     MediaWiki\Extension\Translate\MessageBundleTranslation\Hooks {#3205},
     GrowthExperiments\Config\ConfigHooks {#5491},
   ]

I'll check again on a group2 wiki once .20 gets there and then close this as resolved.

group0 is on .20 now, so I wanted to confirm whether my patch actually fixes the issue, and it does:

>>> MediaWiki\MediaWikiServices::getInstance()->getHookContainer()->getHandlers('EditFilterMergedContent');
=> [
     SpamBlacklistHooks {#3208},
     MediaWiki\Extension\AbuseFilter\Hooks\Handlers\FilteredActionsHandler {#3206},
     MediaWiki\Extension\Translate\MessageBundleTranslation\Hooks {#3205},
     GrowthExperiments\Config\ConfigHooks {#5491},
   ]

I'll check again on a group2 wiki once .20 gets there and then close this as resolved.

Seeing similar results on meta and enwiki, hence calling this resolved.