Whitelist for spam blacklist
Closed, ResolvedPublic
Assigned To

None

Authored By

	• bzimport
	Mar 22 2005, 11:00 PM

Description

Author: silsor

Description:
This has been planned for a while but there was no bug for it.

See whitelist-related material on http://meta.wikimedia.org/wiki/Spam_blacklist

Version: unspecified
Severity: normal
URL: http://meta.wikimedia.org/wiki/Spam_blacklist

Details

Reference: bz1733

Revisions and Commits

Unknown Object (Diffusion Commit)

Related Objects

Mentioned In: T6459: Create a special page to handle additions, removals, changes and logging of spam blacklist entries
T39477: Feedback details text on "View feedback page" cut off for very long string without whitespace
T39476: Form submission error occurs that is not repeatable < OH-AFT>
T39475: Colour Contrast of text ‘See All Comments’ within the button is completely invisible <OH-AFT>
T39474: Colour Contrast of text within the button ‘Post Your Feedback’ after input is hard to read < OH-AFT>

Event Timeline

• bzimport raised the priority of this task to Medium. Nov 21 2014, 8:16 PM

• bzimport added a project: MediaWiki-extensions-General.

• bzimport set Reference to bz1733.

• bzimport added a subscriber: Unknown Object (MLST).

• bzimport created this task. Mar 22 2005, 11:00 PM

silsor wrote:

Make that http://meta.wikimedia.org/wiki/Talk:Spam_blacklist, duh

Can't this already be done with the regular expressions?

silsor wrote:

Not as far as I know.

Couldn't the huge spam list be broken up per domain, for much faster lookup?

As far as I know, the valid TLDs are strictly limited and well known (the list is published by ICANN). So
invalid TLDs (including commercial pseudo-TLDs that have not been approved by ICANN and that use their own DNS
systems or require a client-side DNS patch like NewNet, which is most often privacy-stealing
spyware) can be eliminated immediately. Keep just the ICANN list.

Then break the spam list up per valid TLD, which will also ease its management as the list becomes huge...
Each TLD list should also come in two parts: one using simple string equality (scanned first, and
sorted alphabetically for fast lookup), and a final section using regexps (regexps require too much memory
on the server).
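
The two-part layout described above could look roughly like the following PHP sketch. Everything in it is hypothetical: the function names `binarySearchExact` and `isBlockedInTld`, the sample domains, and the sample pattern are made up for illustration and are not part of the SpamBlacklist extension. The exact-match list is assumed to be sorted with plain byte-wise string comparison.

```php
<?php
// Hypothetical sketch: exact names are checked first via binary search,
// and the regexps are only tried when no exact entry matches.

/** Binary search in a list sorted with strcmp(); true if $needle is present. */
function binarySearchExact(array $sorted, string $needle): bool {
    $lo = 0;
    $hi = count($sorted) - 1;
    while ($lo <= $hi) {
        $mid = intdiv($lo + $hi, 2);
        $cmp = strcmp($sorted[$mid], $needle);
        if ($cmp === 0) {
            return true;
        }
        if ($cmp < 0) {
            $lo = $mid + 1;
        } else {
            $hi = $mid - 1;
        }
    }
    return false;
}

/** $name is the domain without its TLD, e.g. "spamsite" for spamsite.com. */
function isBlockedInTld(string $name, array $exact, array $regexps): bool {
    if (binarySearchExact($exact, $name)) {
        return true;
    }
    foreach ($regexps as $pattern) {
        if (preg_match($pattern, $name) === 1) {
            return true;
        }
    }
    return false;
}

// Made-up per-TLD data for ".com".
$comExact = ['cheap-pills', 'spamsite', 'zzz-casino']; // kept sorted
$comRegexps = ['/^[0-9]{5,}$/'];                        // all-digit names, 5+ digits
```

With this split, the regexp engine is only reached for names that miss the exact list, which is the saving the comment above is after.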

For efficient lookup, it would be useful to reverse the order of the parts of the domain name:
www.xyz.com becomes com.xyz.www, which is then split into physical file folders (or virtual ones in
memory using arrays) if there are multiple exclusions:

com/
  xyz/
    www
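
The reversal itself is a one-liner; a minimal sketch, where the helper name `reversedLabels` is hypothetical and the lowercasing and trailing-dot handling are assumptions added here:

```php
<?php
// Hypothetical helper: split a host name into labels, most significant first.
function reversedLabels(string $host): array {
    $host = strtolower(rtrim($host, '.'));
    return array_reverse(explode('.', $host));
}

// www.xyz.com -> ['com', 'xyz', 'www'], i.e. the path com/xyz/www
$labels = reversedLabels('www.xyz.com');
echo implode('/', $labels), "\n";
```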

For example:

$blacklist = array(
  0, // index 0 is the default rule: block all other non-ICANN TLDs

  'com' => array(1, // pass all .com by default
    'xyz' => array(1, // pass "xyz.com" except the following subdomains:
      'www' => 0, // block this host and its subdomains
      // the other hosts in ".xyz.com" pass as set in the parent rule
    ),
    'spamsite' => 0, // block this domain and all subdomains
    // other simple xxx.com block rules come here...
    '*' => array(1, // using regexps, pass by default
      '/^[0-9]{5,}$/' => 0, // block <numeric>.com with 5 digits or more
    ),
  ),
  'net' => array(1, // pass all .net by default
    // block rules for .net come here
  ),
  'org' => array(1, // pass all .org by default
    // block rules for .org come here
  ),
  'de' => array(1, // pass all .de by default
    // block rules for .de come here
  ),
  'fr' => array(1, // pass all .fr by default
    // block rules for .fr come here
  ),
  // other accepted TLDs come here...

);
Then domain name matching can be performed by simple table lookup, using one domain name part at a time:

If the value is an integer, it gives the blocking rule for the current domain and all its subdomains.
If the value is an array, the first entry at index 0 gives the blocking rule (0=block or 1=pass),
and the other entries contain other domain name parts to scan for exceptions.
If there's no entry for the scanned domain name part in the array, look for a "*" entry. If there is
one, use regexp matching to scan its list from first to last and get the blocking rule.

This will greatly reduce the use of regexps. The array above can easily be built by reading and parsing, once,
a text file where these rules are summarized and maintained.
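
The lookup rules above can be sketched in PHP as follows. This is only an illustration under stated assumptions: the function name `lookupRule`, the `PASS`/`BLOCK` constants, and the small rule table are hypothetical, and the choice to fall back to the "*" entry's own default when no regexp matches is one reading of the description, not something the proposal spells out.

```php
<?php
// Hypothetical sketch of the lookup; 1 = pass, 0 = block, as in the example array.
const PASS = 1;
const BLOCK = 0;

function lookupRule(array $rules, string $host): int {
    $node = $rules;
    $rule = $node[0];                      // default for anything not listed
    $labels = array_reverse(explode('.', strtolower($host)));
    foreach ($labels as $label) {
        // PHP casts the string key "0" to int 0, which is the slot holding
        // the default rule, so a literal "0" label must not be looked up.
        if ($label !== '0' && array_key_exists($label, $node)) {
            $entry = $node[$label];
            if (is_int($entry)) {
                return $entry;             // applies to this domain and all subdomains
            }
            $node = $entry;                // descend; index 0 is the new default
            $rule = $node[0];
            continue;
        }
        if (isset($node['*'])) {
            $star = $node['*'];
            $rule = $star[0];              // default of the regexp section
            foreach ($star as $pattern => $patternRule) {
                if ($pattern !== 0 && preg_match($pattern, $label) === 1) {
                    return $patternRule;   // first matching regexp wins
                }
            }
        }
        return $rule;                      // no more specific entry exists
    }
    return $rule;
}

$rules = [
    BLOCK,                                 // TLDs not listed here are blocked
    'com' => [
        PASS,
        'xyz' => [PASS, 'www' => BLOCK],
        'spamsite' => BLOCK,
        '*' => [PASS, '/^[0-9]{5,}$/' => BLOCK],
    ],
    'org' => [PASS],
];

var_dump(lookupRule($rules, 'www.xyz.com') === BLOCK);  // listed exception
var_dump(lookupRule($rules, 'mail.xyz.com') === PASS);  // inherits xyz.com's default
```

Only names that miss every literal key reach `preg_match`, so most lookups are a handful of array accesses.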

I've implemented a whitelist in r14912. It's editable by
local admins at MediaWiki:Spam-whitelist, and is in the same
format as the blacklist page.
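
How a whitelist page can sit in front of the blacklist might be sketched like this. It is an illustration only, not the code from r14912: the function names `buildRegex` and `isSpamUrl`, the sample entries, and the assumed page format (one regexp fragment per line, with "#" starting a comment) are assumptions made for this sketch.

```php
<?php
// Hypothetical sketch, not the r14912 implementation.
// Assumed page format: one regexp fragment per line, "#" starts a comment.

/** Turn page text into one combined regexp, or null if the page has no entries. */
function buildRegex(string $pageText): ?string {
    $fragments = [];
    foreach (explode("\n", $pageText) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line !== '') {
            $fragments[] = $line;
        }
    }
    if ($fragments === []) {
        return null;
    }
    // "!" is the delimiter so that "/" in fragments needs no escaping;
    // a fragment containing "!" would break this sketch.
    return '!https?://[a-z0-9.\-]*(?:' . implode('|', $fragments) . ')!i';
}

/** A URL is rejected only if it matches the blacklist and not the whitelist. */
function isSpamUrl(string $url, ?string $blacklist, ?string $whitelist): bool {
    if ($whitelist !== null && preg_match($whitelist, $url) === 1) {
        return false;
    }
    return $blacklist !== null && preg_match($blacklist, $url) === 1;
}

// Made-up page contents.
$blacklist = buildRegex("spamsite\\.com   # made-up entry\nexample-pills\\.");
$whitelist = buildRegex("good\\.spamsite\\.com");
```

Checking the whitelist first means a local admin can exempt one host without touching the shared blacklist.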

• bzimport mentioned this in T39474: Colour Contrast of text within the button ‘Post Your Feedback’ after input is hard to read < OH-AFT>. Nov 22 2014, 12:26 AM

• bzimport mentioned this in T39475: Colour Contrast of text ‘See All Comments’ within the button is completely invisible <OH-AFT>.

• bzimport mentioned this in T39476: Form submission error occurs that is not repeatable < OH-AFT>.

• bzimport mentioned this in T39477: Feedback details text on "View feedback page" cut off for very long string without whitespace.

epriestley added a commit: Unknown Object (Diffusion Commit). Mar 4 2015, 8:14 AM

Danny_B edited projects, added SpamBlacklist; removed MediaWiki-extensions-General. Jul 11 2016, 12:25 AM

Danny_B removed a subscriber: • wikibugs-l-list.

Danny_B removed a parent task: T6462: [DO NOT USE] Spam blacklist (tracking) [superseded by #SpamBlacklist]. Jul 11 2016, 12:27 AM

MarcoAurelio mentioned this in T6459: Create a special page to handle additions, removals, changes and logging of spam blacklist entries. Sep 20 2016, 6:36 AM