
Block spider / web crawler on tool labs
Closed, Invalid · Public

Description

Author: metatron

Description:
Tracking ticket to block aggressive spiders and web crawlers on tool labs (tools.wmflabs.org/*)

These spiders should be blocked at the network or proxy level rather than in individual lighttpd configs or in the applications themselves, to avoid wasting resources.


Version: unspecified
Severity: normal

Details

Reference
bz68300

Related Objects

Status     Subtype    Assigned     Task
Invalid               coren
Resolved               yuvipanda

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:33 AM
bzimport set Reference to bz68300.

metatron wrote:

Aggressive species:

- SeznamBot
- SputnikBot
- Sogou web spider
- TweetmemeBot
- kinshoobot
- CCBot
- Scrapy
- Baiduspider
- Yahoo! Slurp

User agents:
"Mozilla/5.0 (compatible; SeznamBot/3.2; +http://fulltext.sblog.cz/)"
"Mozilla/5.0 (compatible; SputnikBot/2.3; +http://corp.sputnik.ru/webmaster)
"Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
"Mozilla/5.0 (compatible; TweetmemeBot/3.0; +http://tweetmeme.com/)"
"kinshoobot (/global; amd64 Linux 3.10.23-xxxx-std-ipv6-64; java 1.8.0_05; Europe/fr) http://kinshoo.net/bot.html"
"CCBot/2.0 (http://commoncrawl.org/faq/)"
"Scrapy/0.22.0 (+http://scrapy.org)"
"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

metatron wrote:

- 360Spider

User Agent:
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1; 360Spider"

Are these still a problem? The proxy can now reject requests based on their User-Agent, so blocking these should be trivial to implement. TweetmemeBot is already blocked.
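
For reference, a minimal sketch of what such a rejection rule could look like, assuming the front proxy is nginx and that the pattern list is maintained by hand. The variable name and the exact set of patterns are illustrative, not the actual proxy configuration:

    # Sketch only: match the user agents reported above and reject them
    # with 403 before the request ever reaches a tool's webservice.
    map $http_user_agent $blocked_spider {
        default                   0;
        "~*SeznamBot"             1;
        "~*SputnikBot"            1;
        "~*Sogou web spider"      1;
        "~*TweetmemeBot"          1;
        "~*kinshoobot"            1;
        "~*CCBot"                 1;
        "~*Scrapy"                1;
        "~*Baiduspider"           1;
        "~*Yahoo! Slurp"          1;
        "~*360Spider"             1;
    }

    server {
        # ... existing tools.wmflabs.org proxy configuration ...
        if ($blocked_spider) {
            return 403;
        }
    }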

A tracking task with only one (resolved) blocking task doesn't make much sense to me :-).

There are certainly some things to improve about how we treat spiders, and one of them is probably that we shouldn't handle this case by case. However, due to the syntax of robots.txt, IIRC allowing bots to access / and the status pages below it requires explicitly disallowing all other paths. So we would need to have nginx generate robots.txt dynamically, listing every tool except those on a whitelist whose maintainers have said: "I guarantee that bots can spider this tool without performance impact."
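
As an illustration of that point, a generated robots.txt would roughly look like the following; the tool names are hypothetical, and in practice there would be one Disallow line per non-whitelisted tool:

    # Sketch of a generated robots.txt: / and the status pages stay
    # crawlable because they are not listed, while every tool that is not
    # on the whitelist gets its own Disallow line. Tool names are made up.
    User-agent: *
    Disallow: /example-heavy-tool/
    Disallow: /another-tool/
    Disallow: /yet-another-tool/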

This sounds like work that would have to be coordinated with a lot of people, and I would put it off until the pain level rises.

Danny_B renamed this task from "[tracking] Block spider / web crawler on tool labs" to "Block spider / web crawler on tool labs". Jul 29 2016, 4:25 PM