
Bad regexes make at least some of the blacklist get ignored
Closed, ResolvedPublic

Description

Author: mike.lifeguard+bugs

Description:
If a bad regex exists in the blacklist (you cannot enter new bad regexes, as the page will not save, but old ones may remain), at least part of the blacklist doesn't get applied properly. I had thought that the regexes from each line were joined together into one big regex, and that bad regexes were split out and ignored. However, that would mean all /good/ regexes should still be applied, and that did not occur in this case.

You can see a supposedly-blacklisted domain added here: http://en.wikipedia.org/w/index.php?title=MediaWiki_talk:Spam-blacklist&diff=230441961&oldid=230437845


Version: unspecified
Severity: major
URL: http://en.wikipedia.org/w/index.php?title=MediaWiki_talk:Spam-blacklist&oldid=230596968#bugs_in_list

Details

Reference
bz15099

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:23 PM
bzimport added a project: SpamBlacklist.
bzimport set Reference to bz15099.

As I recall, the way the spam blacklist regex construction works is (roughly as sketched in the code below):

  1. Build all the individual regexes
  2. Compile them into a list of a few reasonably-sized large chunks
  3. Test each of the chunks
  4. For any chunk that fails to parse, split it up into its individual component regexes, so that all the non-failing ones will still be applied.
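For illustration only, here is a minimal PHP sketch of that chunk-then-fallback idea. The function name, the (?:...) wrapper, and the batch size are all made up for the example; this is not the actual SpamRegexBatch code.

```php
<?php
// Sketch of the batching strategy described above (hypothetical names,
// simplified wrapper; not the real SpamRegexBatch implementation).
function buildSafeRegexes( array $lines, int $batchSize = 1000 ): array {
	$regexes = [];
	foreach ( array_chunk( $lines, $batchSize ) as $chunk ) {
		$big = '/(?:' . implode( '|', $chunk ) . ')/Si';
		// preg_match() returns false when the pattern fails to compile.
		if ( @preg_match( $big, '' ) !== false ) {
			$regexes[] = $big; // the whole chunk compiles, use it as-is
			continue;
		}
		// The chunk failed to parse: fall back to per-line regexes and
		// keep only the lines that compile on their own.
		foreach ( $chunk as $line ) {
			$single = '/(?:' . $line . ')/Si';
			if ( @preg_match( $single, '' ) !== false ) {
				$regexes[] = $single;
			}
		}
	}
	return $regexes;
}
```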

Note that a particular instance of a should-be-blacklisted URL sneaking through in an edit doesn't necessarily mean that regexes failed to parse; it could simply mean that the fetching of all the regexes failed on that host at that particular time.

Ok, the problem here is that the result is a valid regex, just not the desired regex. ;)

http://en.wikipedia.org/w/index.php?title=MediaWiki:Spam-blacklist&diff=230442308&oldid=230364450

The previous line ended with a stray \, which turned the group separator | into \|, so it matched just the literal text "|" and broke both adjacent entries.
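To make that concrete, here is a tiny PHP repro with made-up entries and a simplified wrapper (not the extension's real pattern):

```php
<?php
// The first (hypothetical) entry ends in a stray backslash, so the '|'
// used to join the lines becomes '\|', i.e. a literal pipe character.
$lines = [ 'spammy-example\.com\\', 'other-bad-site\.org' ];
$re = '/(?:' . implode( '|', $lines ) . ')/Si';
// $re is now '/(?:spammy-example\.com\|other-bad-site\.org)/Si': still a
// valid regex, but it only matches the literal text
// "spammy-example.com|other-bad-site.org", so neither entry blocks anything.
var_dump( preg_match( $re, 'http://other-bad-site.org/offer' ) ); // int(0)
```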

An explicit check for a final "\" should probably be enough to work around this, since other metacharacters should lead to explicit breakage... hopefully... :)

Perhaps a hack in SpamRegexBatch::buildRegexes() to ensure it doesn't break even on old errors; a hack in SpamRegexBatch::getBadLines() will do to warn when editing the blacklist.
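Something along these lines, presumably (just a sketch with a hypothetical helper name, not the code that actually went into r39114):

```php
<?php
// Flag blacklist lines whose trailing backslash would escape the '|' that
// joins them into one big regex (hypothetical helper, for illustration).
function hasDanglingBackslash( string $line ): bool {
	// An odd number of trailing backslashes means the last one is a stray
	// escape that will swallow whatever character gets concatenated next.
	$trailing = strlen( $line ) - strlen( rtrim( $line, '\\' ) );
	return ( $trailing % 2 ) === 1;
}
// buildRegexes() could silently skip such lines, while getBadLines() could
// report them so the editor sees a warning when saving the blacklist page.
```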

Seems to work in local testing. :D Committed in r39114

mike.lifeguard+bugs wrote:

(In reply to comment #2)

> Note that a particular instance of a should-be-blacklisted URL sneaking through
> in an edit doesn't necessarily mean that regexes failed to parse; it could
> simply mean that the fetching of all the regexes failed on that host at that
> particular time.

Could we make sure that if that happens, the last fetched version of the blacklist is used? A network failure or something shouldn't open the wikis to spamming simply because we couldn't get the blacklist from Meta the last time we tried. If we knew that wasn't already the case, then it'd warrant its own bug.
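If that fallback doesn't exist, a rough sketch of what it could look like (hypothetical helper and cache file, not the extension's actual caching code):

```php
<?php
// Fetch the remote blacklist, but keep and reuse the last successfully
// fetched copy when the network request fails (illustrative sketch only).
function getBlacklistLines( string $url, string $cacheFile ): array {
	$text = @file_get_contents( $url );
	if ( $text !== false ) {
		file_put_contents( $cacheFile, $text );   // refresh last-known-good copy
	} elseif ( is_readable( $cacheFile ) ) {
		$text = file_get_contents( $cacheFile );  // fetch failed: use stale copy
	} else {
		return [];                                // nothing fetched and nothing cached
	}
	return array_filter( array_map( 'trim', explode( "\n", $text ) ) );
}
```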