
Block spider / web crawler on tool labs
Closed, Invalid · Public

Description

Author: metatron

Description:
Tracking ticket to block aggressive spiders and web crawlers on tool labs (tools.wmflabs.org/*)

These spiders should be blocked at the network or proxy level rather than in individual lighttpd configs or in the applications themselves, to avoid wasting resources.


Version: unspecified
Severity: normal

Details

Reference
bz68300

Related Objects

Status     Subtype    Assigned     Task
Invalid               coren
Resolved               yuvipanda

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:33 AM
bzimport set Reference to bz68300.

metatron wrote:

Aggressive species:

- SeznamBot
- SputnikBot
- Sogou web spider
- TweetmemeBot
- kinshoobot
- CCBot
- Scrapy
- Baiduspider
- Yahoo! Slurp

User agents:
"Mozilla/5.0 (compatible; SeznamBot/3.2; +http://fulltext.sblog.cz/)"
"Mozilla/5.0 (compatible; SputnikBot/2.3; +http://corp.sputnik.ru/webmaster)
"Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
"Mozilla/5.0 (compatible; TweetmemeBot/3.0; +http://tweetmeme.com/)"
"kinshoobot (/global; amd64 Linux 3.10.23-xxxx-std-ipv6-64; java 1.8.0_05; Europe/fr) http://kinshoo.net/bot.html"
"CCBot/2.0 (http://commoncrawl.org/faq/)"
"Scrapy/0.22.0 (+http://scrapy.org)"
"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

metatron wrote:

- 360Spider

User Agent:
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1; 360Spider"

Are these still a problem? The proxy can now reject requests based on their User-Agent, so blocking these should be trivial to implement. TweetmemeBot is already blocked.
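
For reference, a minimal sketch of what such a rejection rule could look like, assuming the front proxy is nginx and that the pattern list is maintained by hand. The variable name and the exact set of patterns are illustrative, not the actual proxy configuration:

    # Sketch only: match the user agents reported above and reject them
    # with 403 before the request ever reaches a tool's webservice.
    map $http_user_agent $blocked_spider {
        default                   0;
        "~*SeznamBot"             1;
        "~*SputnikBot"            1;
        "~*Sogou web spider"      1;
        "~*TweetmemeBot"          1;
        "~*kinshoobot"            1;
        "~*CCBot"                 1;
        "~*Scrapy"                1;
        "~*Baiduspider"           1;
        "~*Yahoo! Slurp"          1;
        "~*360Spider"             1;
    }

    server {
        # ... existing tools.wmflabs.org proxy configuration ...
        if ($blocked_spider) {
            return 403;
        }
    }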

A tracking task with only one (resolved) blocking task doesn't make much sense to me :-).

There are certainly some things to improve about how we treat spiders, and one of them is probably that we shouldn't handle this case by case. However, due to the syntax of robots.txt, IIRC allowing bots to access / and the status pages below it requires explicitly disallowing all other paths. So we would need to have nginx generate robots.txt dynamically, listing every tool except those on a whitelist whose maintainers have said: "I guarantee that bots can spider this tool without performance impact."
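
As an illustration of that point, a generated robots.txt would roughly look like the following; the tool names are hypothetical, and in practice there would be one Disallow line per non-whitelisted tool:

    # Sketch of a generated robots.txt: / and the status pages stay
    # crawlable because they are not listed, while every tool that is not
    # on the whitelist gets its own Disallow line. Tool names are made up.
    User-agent: *
    Disallow: /example-heavy-tool/
    Disallow: /another-tool/
    Disallow: /yet-another-tool/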

This sounds like work that would have to be coordinated with a lot of people, and I would put it off until the pain level rises.

Danny_B renamed this task from "[tracking] Block spider / web crawler on tool labs" to "Block spider / web crawler on tool labs". Jul 29 2016, 4:25 PM