
Allow $wgSFSIPListLocation to be a url and have proxy support
Open, Normal, Public

Description

You can download IP blacklists and import them using the maintenance/updateBlacklist.php script. StopForumSpam has several lists; we recommend using the "listed_ip_30_all" list. Once you choose the list you want, download and extract it somewhere on your server, then point $wgSFSIPListLocation in the LocalSettings.php file at it. We recommend setting up a nightly cron job to download and extract new versions of the list and subsequently run the updateBlacklist maintenance script.
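For reference, the current local-file setup amounts to a LocalSettings.php line along these lines (the path is purely illustrative), plus the nightly cron described above to refresh the file and re-run maintenance/updateBlacklist.php:

$wgSFSIPListLocation = '/srv/stopforumspam/listed_ip_30_all.txt';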

This really doesn't fit into the WMF way of doing things for production (or beta)...

Allowing $wgSFSIPListLocation to be a URL, and fetching the list from there (with proxy support!), is what's needed.
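A rough sketch of what such a fetch could look like inside the extension, assuming it goes through MediaWiki's HttpRequestFactory and the $wgHTTPProxy setting (the flow and naming here are illustrative, not an actual patch):

use MediaWiki\MediaWikiServices;

// Illustrative only: treat $wgSFSIPListLocation as a URL and fetch it
// through the configured outbound proxy.
$req = MediaWikiServices::getInstance()->getHttpRequestFactory()->create(
    $wgSFSIPListLocation,
    [ 'proxy' => $wgHTTPProxy ]  // e.g. http://webproxy.eqiad.wmnet:8080
);
$status = $req->execute();
if ( $status->isOK() ) {
    $listData = $req->getContent();
    // ...hand $listData to the existing blacklist-loading code
}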

Event Timeline

Reedy created this task. · Jul 8 2019, 9:44 AM
Restricted Application added a subscriber: Aklapper. · Jul 8 2019, 9:44 AM
sbassett added a subscriber: sbassett. (Edited) · Jul 12 2019, 10:24 PM

Looks like it's just an fopen() call. Not sure if it'd be easier to leave $wgSFSIPListLocation as a local file and set up the cron to pick up whatever SFS files we'd like with something like:

# crontab: route outbound requests through the proxy, fetch nightly at midnight
https_proxy=http://webproxy.eqiad.wmnet:8080
0 0 * * * curl https://www.stopforumspam.com/downloads/listed_ip_365_ipv6.zip -o /path/to/local/file

Or if we need to pull down the daily SFS updates and merge those. These files get fairly large btw.

Proxying to their API is probably a really bad idea.

sbassett triaged this task as Normal priority. · Jul 12 2019, 10:24 PM
Reedy added a comment. · Jul 13 2019, 7:10 AM

> Proxying to their API is probably a really bad idea.

I don't mean in real time; just being able to make outbound requests to fetch the list via the proxies we need for outgoing requests (see what we do in extensions like TorBlock, which does the request via cron and shoves it into the object cache).
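For comparison, a loose sketch of that TorBlock-style pattern: a scheduled job fetches the list over the proxy and keeps it in the WAN object cache, and readers only ever hit the cache. The function name and cache key below are made up for illustration:

use MediaWiki\MediaWikiServices;

// Illustrative only: refresh a remote list on a schedule and keep it in the
// object cache, roughly the TorBlock approach.
function sfsGetIpList( $url ) {
    global $wgHTTPProxy;
    $services = MediaWikiServices::getInstance();
    $cache = $services->getMainWANObjectCache();

    return $cache->getWithSetCallback(
        $cache->makeGlobalKey( 'sfs-ip-blacklist' ),
        $cache::TTL_DAY,
        function () use ( $services, $url, $wgHTTPProxy ) {
            // Outbound request goes through the configured proxy
            $req = $services->getHttpRequestFactory()->create(
                $url,
                [ 'proxy' => $wgHTTPProxy ]
            );
            return $req->execute()->isOK() ? $req->getContent() : false;
        }
    );
}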

Reedy added a comment. · Jul 13 2019, 7:15 AM

Though, if the files are sufficiently large... we probably don't want to be putting them in memcached...

Might be worth doing some testing with the size of the resultant object
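One quick way to gauge that (just a sketch; the path reuses the placeholder above) would be to measure the extracted list as it would be stored:

// How big would the cached value be, raw and compressed?
$blob = file_get_contents( '/path/to/local/file' );
printf(
    "raw: %d bytes, gzipped: %d bytes, entries: %d\n",
    strlen( $blob ),
    strlen( gzencode( $blob ) ),
    substr_count( $blob, "\n" )
);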

I assume we'd be interested in the All Site Data files here, namely the IPv4 and IPv6 Combined files. These have the following sizes:

| File | Size compressed (gz or zip) | Size uncompressed (text) |
| listed_ip_1_ipv46 | 29 KB | 95 KB |
| listed_ip_7_ipv46 | 89 KB | 318 KB |
| listed_ip_30_ipv46 | 231 KB | 890 KB |
| listed_ip_90_ipv46 | 538 KB | 2.2 MB |
| listed_ip_180_ipv46 | 878 KB | 3.7 MB |
| listed_ip_365_ipv46 | 1.2 MB | 5.2 MB |

The day indicator (1, 7, 30, 90, etc.) is apparently a "last seen active (causing trouble) within X days" reference for the given list of IPs. I'm not sure what the Download Limit column is; I assumed it was some sort of IP-based throttle per file download, but I've been able to download the files multiple times to my local laptop, which has a static IP.

Anyhow, none of these files are particularly monstrous to download, though there would indeed be concerns about tossing them into a config file or cache. They do seem to be fairly accurate, as I found several spammy IPs from the recent attack (T227416) within these lists (I believe @MarcoAurelio did as well). I've no idea what the false positive rate might be, which is probably something we'd have to test on beta and then maybe a handful of smaller project wikis.

@Reedy - when you return, can we get this deployed to beta? (I've never done that before.)

The SFS extension doesn't support IPv6 yet (T173399), but there were very few IPv6 addresses in the blocklist anyway.

I think downloading on-demand is a bit sketchy, and it requires the SFS website to be up all the time. I'd rather have a cron job that regularly wgets the latest file and unzips it into place. Even if that fails, we'll still have an old blacklist on disk to fall back to.

Reedy added a comment. · Jul 16 2019, 6:11 PM

> I think downloading on-demand is a bit sketchy, and it requires the SFS website to be up all the time. I'd rather have a cron job that regularly wgets the latest file and unzips it into place. Even if that fails, we'll still have an old blacklist on disk to fall back to.

The problem is that if/when it falls out of cache, the file will only exist on one host, and only that host will be able to repopulate the cache.

But as above, I wasn't suggesting doing it on demand for every request; just doing it the way we do for TorBlock.