Move all the functionality of {Spam,Title}Blacklist extensions into AbuseFilter and retire them
Open, Needs Triage, Public

Event Timeline

Would like to understand to what "functionality" refers. The blacklists/whitelists as lists are very simple, abusefilters are not so.

Would like to understand to what "functionality" refers.

Their existence.

The blacklists/whitelists as lists are very simple, abusefilters are not so.

AF allows much more control, that's true, but to my mind providing a simple single-page editing paradigm as currently provided is sufficiently similar that we can lift+shift for now, and then possibly re-factor later over the next few years (e.g. logging of hit rates, thresholds for activity, more complex response options than pass/fail, CheckUser integration, etc.) if there's demand.

I suppose that I am seeing differences in globality. The blacklists are universal across WMF wikis. AbuseFilter is deployed universally too, but its checks are not global in impact (we have global AFs that do not target large wikis). So we have some "language" issues to address.

Note that the logging would need significant tuning work: something like Special:AbuseLog for global AF is bad enough as it is, without including blacklist hits, which are currently only logged locally.

Also there is still the issue that "title blacklist" is not logged locally and globally, and that has upsides and downsides.

(Of course, all that is detail and probably does not belong here at the top level, just what my brain is contemplating on the immediate.)

In my opinion AF is intended to be deployed to all wikis, including larger ones like enwiki, but wikis may choose to opt out of specific filters, or to opt out of all of them by default and opt in to specific ones, based on local consensus (neither is currently possible, see T45761: Allow local disabling of global AbuseFilters). Otherwise the list of wikis to opt out is very random, as large wikis are not always the more active ones (they are only large in database size).

TitleBlacklist and SpamBlacklist currently use a wiki page to store their contents. Eventually they should be switched to databases (performance should be considered).

Issues that may be solved more easily - T38940, T6459, T14963, T27524(*), T75417, T216803
Issues that may be closed - T209806
(*) Most SBL items may be converted to a linksearch-like syntax (org.wikipedia.en/...)
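
To illustrate the linksearch-like form mentioned in (*), here is a minimal sketch in Python. The function name `to_linksearch_key` is hypothetical, and the output is only the simplified reversed-host shape written above (org.wikipedia.en/...); it is not claimed to be the exact format MediaWiki stores in its link index.

```python
from urllib.parse import urlsplit


def to_linksearch_key(url: str) -> str:
    """Return a reversed-host key such as 'org.wikipedia.en/wiki/Foo'.

    Hypothetical helper for illustration only.
    """
    # urlsplit only fills in .hostname when a scheme or '//' is present,
    # so bare domains like 'example.com' get a '//' prefix first.
    parts = urlsplit(url if "//" in url else "//" + url)
    host = (parts.hostname or "").lower()
    # Reverse the dot-separated labels: en.wikipedia.org -> org.wikipedia.en
    reversed_host = ".".join(reversed(host.split(".")))
    return reversed_host + parts.path
```

With keys in this shape, every entry for a domain and its subdomains shares a common prefix, so a lookup could be a prefix match rather than a regex scan.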

I'm not sure what this task is proposing. Technically the functionality from spam/title blacklist already exists in AbuseFilter, it would be trivial to write a filter which blocks certain links from being added or pages from being created with certain titles.

That aside, I'm not convinced this is the best approach. We have tens of thousands of blacklisted URLs across Wikimedia projects. That's not feasible to include in one filter, nor is it feasible to create individual filters for each URL. The functionality we need to block spam URLs is relatively limited (though there's certainly room to expand on the current all-or-nothing approach), whereas AbuseFilter is a deeply customisable toolset with far too much going on for the relatively simple task of blocking certain URLs.

I'm not sure what this task is proposing. Technically the functionality from spam/title blacklist already exists in AbuseFilter, it would be trivial to write a filter which blocks certain links from being added or pages from being created with certain titles.

Yes, hence "the functionality", not "the equivalent functionality". The latter already exists.

Change 692740 had a related patch set uploaded (by Jforrester; author: Jforrester):

[mediawiki/extensions/AbuseFilter@master] [WIP] Import the SpamBlacklist and TitleBlacklist extensions as "SimpleList"

https://gerrit.wikimedia.org/r/692740

Change 692749 had a related patch set uploaded (by Jforrester; author: Jforrester):

[integration/config@master] Zuul: [mediawiki/extensions/AbuseFilter] Add Scribunto & EventLogging deps

https://gerrit.wikimedia.org/r/692749

I don't think merging the extensions is a good idea. AbuseFilter is already incredibly complex (for good reason). SpamBlacklist and TitleBlacklist are both straightforward to use (a list of regexes), work out of the box, bundled extensions. AbuseFilter still isn't eligible for bundling yet.

That said, if the goal of this ticket is to re-create Phalanx I'm all for it.

Change 692749 merged by jenkins-bot:

[integration/config@master] Zuul: [mediawiki/extensions/AbuseFilter] Add Scribunto & EventLogging deps

https://gerrit.wikimedia.org/r/692749

Mentioned in SAL (#wikimedia-releng) [2021-05-19T16:42:28Z] <James_F> Zuul: [mediawiki/extensions/AbuseFilter] Add Scribunto & EventLogging deps T279275

I don't think merging the extensions is a good idea. AbuseFilter is already incredibly complex (for good reason). SpamBlacklist and TitleBlacklist are both straightforward to use (a list of regexes), work out of the box, bundled extensions.

SpamBlacklist and TitleBlacklist are both limited in what they can do. They have terrible interfaces. They don't get integrated with new action types. The complexity of AF is optional and not required.

AbuseFilter still isn't eligible for bundling yet.

We'll be bundled by the time 1.37 ships.

I don't think merging the extensions is a good idea. AbuseFilter is already incredibly complex (for good reason). SpamBlacklist and TitleBlacklist are both straightforward to use (a list of regexes), work out of the box, bundled extensions. AbuseFilter still isn't eligible for bundling yet.

That said, if the goal of this ticket is to re-create Phalanx I'm all for it.

I mostly second this comment, but I feel the need to expand on a point in particular. Technically speaking, AbuseFilter should already be capable of everything that Spam/TitleBlacklist can do. The difference is that AbuseFilter is much more complex, in that it allows splitting rules into filters, adding lots of conditions and different consequences, all with a visual interface (not with things like <noedit | autoconfirmed |errmsg=titleblacklist-custom-msg>). TB/SB are probably meant to be a lightweight alternative that doesn't require learning a scripting language or coding fine-grained checks. I think merging everything might be fine, but ideally we'd want to do a bit more than just combine the code.

It's also unclear how the SB/TB code would integrate with AF. E.g. how would they interact with the DB schema? Just migrating the special pages as they are doesn't seem useful. The other possibility I can think of is having a "special" filter for the TB (same for the SB). But then I don't think we'd have to import any code, as this can already be done on-wiki. Another thing to keep in mind is that TB/SB have everything in a single page, and each regex can specify what consequences should be taken. This cannot be preserved in AF, i.e. every filter has a fixed set of consequences.

Long story short, I think having a single, centralized tool might be a good idea, but I currently can't think of a way that makes sense.
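
To make the "each regex can specify what consequences should be taken" point concrete, here is a rough Python sketch of how a single list line with an attribute block like the one quoted above could be split into a pattern plus per-entry settings. The `Entry` class and `parse_entry` function are hypothetical names, and the handling of `#` comments is an assumption of this sketch; this is not the TitleBlacklist extension's actual parser.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One list line: a regex plus its own flags and key=value options."""
    pattern: str
    flags: set = field(default_factory=set)
    options: dict = field(default_factory=dict)


# Lazily capture the pattern, then an optional trailing <...> attribute block.
_LINE = re.compile(r"^(?P<pattern>.*?)\s*(?:<(?P<attrs>[^<>]*)>)?\s*$")


def parse_entry(line: str):
    """Parse one line; return None for blank or comment-only lines."""
    # Assumption: everything after '#' is a comment.
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    match = _LINE.match(line)
    entry = Entry(pattern=match.group("pattern"))
    for attr in (match.group("attrs") or "").split("|"):
        attr = attr.strip()
        if not attr:
            continue
        if "=" in attr:
            key, value = attr.split("=", 1)
            entry.options[key.strip()] = value.strip()
        else:
            entry.flags.add(attr)
    return entry
```

Each parsed entry carries its own flags, which is exactly what a filter with one fixed set of consequences cannot express without one filter per flag combination.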

Per my commit message, I was thinking of phases:

  1. Move the current functionality into the repo as-is (this task)
  2. Change the editing experience into a visual editing experience that's simpler than learning regex or scripting language (T6459)
  3. Change the storage from a simple page into a DB table (T279476 and T279477)

At that point, we'd have the ability to fuse the different sources of Filters into different types of filter with different abilities, whilst being consistent about e.g. Unicode normalisation, or triggering actions, or so on.

I don't think merging the extensions is a good idea. AbuseFilter is already incredibly complex (for good reason). SpamBlacklist and TitleBlacklist are both straightforward to use (a list of regexes), work out of the box, bundled extensions.

SpamBlacklist and TitleBlacklist are both limited in what they can do.

This is sometimes a feature, but yes. I think Wikimedia still doesn't have fully global AbuseFilters but SpamBlacklist and TitleBlacklist are fully global.

They have terrible interfaces. They don't get integrated with new action types. The complexity of AF is optional and not required.

Agreed on this. I just don't see how moving the code wholesale into the AbuseFilter repo is a good way to fix these problems. I think it would be better to do an analysis of the features of each extension, figure out how they integrate with AF, and then add that functionality, not just copy code around.

+1 to everything Daimona said.

Long story short, I think having a single, centralized tool might be a good idea, but I currently can't think of a way that makes sense.

I never used Phalanx, but the big limitation of SB/TB is that you can't add additional conditions based on the regex, nor can you pick other consequences besides disallow. So it would be nice if a title was warning-only for users with fewer than 50 edits. Or something. And the AbuseLog is way richer than the very limited SpamBlacklist log we have (and that's a recentish thing too). But managing giant regexes in AbuseFilter is a pain, so SB gets plenty of use that way. I say this as a person who was deeply involved in AF/SB/TB around 2013-2016 but hasn't done much since, so it's possible I'm out of date!

What is the functionality unavailable in AbuseFilter that would make it on par with *Blacklist functionality? Off the top of my head these come to mind:

  • a function to take a (possibly foreign) wiki page with a list of regexes and match the other input against it (optionally case-insensitively)
  • tagging of the regex list entries with user rights and such - that can be replaced with a separate regexlist page for every flag, for some loss of usability
  • intelligent error reporting (I want to know which regex matched)
  • performance (filters are disabled if they match too often, which is not ideal for an anti-spam feature)

The first seems easy to do, the others not so much. I agree that moving the current code/functionality into AbuseFilter as-is doesn't seem useful.
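
For the first and third bullets, a minimal Python sketch of what such a function could look like: it takes the text of a list page, matches some input against every regex on it (optionally case-insensitively), and returns the entry that matched so it can be reported. The names `compile_list` and `first_match` are made up for illustration, fetching the (possibly foreign) page is left out, and nothing here is an existing AbuseFilter function.

```python
import re


def compile_list(page_text: str, ignore_case: bool = True):
    """Turn list-page text (one regex per line) into (source, regex) pairs."""
    flags = re.IGNORECASE if ignore_case else 0
    compiled = []
    for raw in page_text.splitlines():
        # Assumption: '#' starts a comment; blank lines are skipped.
        line = raw.split("#", 1)[0].strip()
        if line:
            compiled.append((line, re.compile(line, flags)))
    return compiled


def first_match(compiled, text: str):
    """Return the source of the first regex that matches text, else None."""
    for source, regex in compiled:
        if regex.search(text):
            return source
    return None
```

Keeping the original source string next to each compiled regex is what makes the "which regex matched" report possible; merging all entries into one big alternation would be faster to evaluate but loses that information.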

"Move everything as-is" means you don't have to do an endless consultation with 1000 wikis' worth of sysops as nothing changes for them; giving them the ability to slowly migrate to proper Filters after as-is, without breaking existing workflows, is exceptionally valuable.

Moving everything as-is messes up git histories, it makes the organization of the code less logical, and it is functionally equivalent to doing nothing, so doing nothing seems preferable to me.

Change 922929 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/AbuseFilter@master] Introduce Special:BlockedExternalDomains

https://gerrit.wikimedia.org/r/922929

Change 922929 merged by jenkins-bot:

[mediawiki/extensions/AbuseFilter@master] Introduce Special:BlockedExternalDomains

https://gerrit.wikimedia.org/r/922929

Change 925767 had a related patch set uploaded (by Jforrester; author: Jforrester):

[mediawiki/extensions/AbuseFilter@master] BlockedExternalDomains: Make this a special right, prohibit direct editing

https://gerrit.wikimedia.org/r/925767

Change 925767 merged by jenkins-bot:

[mediawiki/extensions/AbuseFilter@master] BlockedExternalDomains: Make this a special right, prohibit direct editing

https://gerrit.wikimedia.org/r/925767

Change 929167 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/AbuseFilter@master] Fix broken error reporting in BlockedExternalDomains

https://gerrit.wikimedia.org/r/929167

Change 929167 merged by jenkins-bot:

[mediawiki/extensions/AbuseFilter@master] Fix broken error reporting in BlockedExternalDomains

https://gerrit.wikimedia.org/r/929167

Change 929359 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/AbuseFilter@master] Fix error reporting in BlockedDomainStorage for real

https://gerrit.wikimedia.org/r/929359

Change 929359 merged by jenkins-bot:

[mediawiki/extensions/AbuseFilter@master] Fix error reporting in BlockedDomainStorage for real

https://gerrit.wikimedia.org/r/929359

Change 692740 abandoned by Jforrester:

[mediawiki/extensions/AbuseFilter@master] [WIP] Import the SpamBlacklist and TitleBlacklist extensions as "SimpleList"

Reason:

Done by Amir instead.

https://gerrit.wikimedia.org/r/692740