
Shield wmcloud.org and toolforge.org against crawler traffic
Closed, Duplicate · Public

Description

Both wmcloud.org and toolforge.org are suffering from traffic overload.

  • Obviously this is caused by crawlers and bots.
  • Unlike a Wikipedia, these domains offer nothing for search engines or archives to explore.

No request whose User-Agent contains one of the following strings should receive content:

archive.org_bot
AwarioBot
Amazonbot
bingbot
Brightbot
CCBot
ClaudeBot
DataForSeoBot
DotBot
DuckDuckBot
Googlebot
GPTBot
IABot
libwww-perl
MojeekBot
OAI-SearchBot
PerplexityBot
PetalBot
PriEcoBot
SemanticScholarBot
SemrushBot
SeznamBot
Thinkbot
TelegramBot
Twitterbot
YandexBot
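
As a rough illustration of the rule above, here is a minimal Python sketch of a substring check against the User-Agent header. The function name `is_blocked_user_agent` is hypothetical, and treating a missing header as "not blocked" and matching case-sensitively are both assumptions, not something this task specifies.

```python
from typing import Optional

# Substrings copied from the list in this task description.
BLOCKED_UA_SUBSTRINGS = (
    "archive.org_bot", "AwarioBot", "Amazonbot", "bingbot", "Brightbot",
    "CCBot", "ClaudeBot", "DataForSeoBot", "DotBot", "DuckDuckBot",
    "Googlebot", "GPTBot", "IABot", "libwww-perl", "MojeekBot",
    "OAI-SearchBot", "PerplexityBot", "PetalBot", "PriEcoBot",
    "SemanticScholarBot", "SemrushBot", "SeznamBot", "Thinkbot",
    "TelegramBot", "Twitterbot", "YandexBot",
)


def is_blocked_user_agent(user_agent: Optional[str]) -> bool:
    """Return True if the User-Agent contains any listed substring."""
    if not user_agent:
        # Assumption: a missing or empty header is not blocked by this rule.
        return False
    # any() stops at the first matching substring.
    return any(needle in user_agent for needle in BLOCKED_UA_SUBSTRINGS)
```

A caller would answer with no content whenever the function returns True, for example for `libwww-perl/6.68`.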

A thread on the German technical village pump gives more detail.

  • The list was collected one week ago from a current Toolforge log file covering 24 hours.
  • The tool could no longer answer any query.
  • PetalBot (Huawei) in particular caused the overload.
  • After filtering as described, the tool answered faster than ever.

Rather than implementing defensive measures in every single tool, wmcloud.org and toolforge.org should maintain a common solution applied to both domains.

  • xtools@wmcloud is also suffering from overload.
  • In T393487#11024836 it is claimed that “these wikis all have robots.txt files that tell all crawlers to ignore the sites”.
    • Well, obviously not. Otherwise those queries would not have been found in a recent log file.

On the other hand, the IP blocking at BETA should be ended as soon as possible. IP ranges are not a reliable way to distinguish bots from human beings over a period of months.

Event Timeline

Restricted Application added subscribers: Cyberpower678, Aklapper.
  • In T393487#11024836 it is claimed that “these wikis all have robots.txt files that tell all crawlers to ignore the sites”.
    • Well, obviously not. Otherwise those queries would not have been found in a recent log file.

A robots.txt file is only advisory guidance for well-behaved web bots:

The Toolforge front proxy adds a robots.txt file for any tool that does not serve its own:

On the other hand, the IP blocking at BETA should be ended as soon as possible. IP ranges are not a reliable way to distinguish bots from human beings over a period of months.

Beta is actually a third system, which is implemented and managed separately from the Toolforge front proxy and the Cloud VPS web proxy.

libwww-perl/6.68 is the record holder (3 times more visits than PetalBot), but PetalBot was the one that found the really time-consuming URLs.

Nope. I am requesting harsh action for both wmcloud.org and toolforge.org as entire domains.

  • T226688 is dealing with beta.wmcloud.org only.

A robots.txt file is only advisory guidance for well-behaved web bots:

The request says “no User Agent containing one of the following strings shall receive content”.

  • That does not mean: Oh, please, dear crawler, obey our robots.txt declaration!
  • It demands: no content, i.e. an HTTP 403 response.
  • I do not care about good manners.
  • A crawler that disguises itself with a regular browser identification will need separate handling.

Please note Wurgl's code; it exits as soon as one of those strings is found.
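
Wurgl's code is not reproduced in this task. Purely as an illustration of the same idea (reject with 403 at the first matching string, before the tool does any work), here is a minimal Python WSGI sketch. The class name `BlockCrawlers`, the stand-in `demo_app` and the shortened string list are all hypothetical, and this is not how the Toolforge front proxy or the Cloud VPS web proxy is actually implemented.

```python
# Shortened, illustrative subset of the strings listed in this task.
BLOCKED_UA_SUBSTRINGS = ("PetalBot", "libwww-perl", "GPTBot")


class BlockCrawlers:
    """WSGI middleware: answer 403 if the User-Agent contains a blocked string."""

    def __init__(self, app, blocked=BLOCKED_UA_SUBSTRINGS):
        self.app = app
        self.blocked = tuple(blocked)

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        for needle in self.blocked:
            if needle in user_agent:
                # Stop at the first hit; the wrapped tool is never called.
                body = b"Forbidden\n"
                start_response("403 Forbidden", [
                    ("Content-Type", "text/plain; charset=utf-8"),
                    ("Content-Length", str(len(body))),
                ])
                return [body]
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Hypothetical stand-in for an expensive tool."""
    body = b"tool output\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


application = BlockCrawlers(demo_app)
```

Because the check runs before the wrapped application, a blocked crawler never reaches the time-consuming URLs.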

The defense strategy needs a fresh start, since our tools, our services and our use of BETA are significantly impaired for productive work.

  • robots.txt is pointless. Unlike for a Wikipedia, no search engine crawler is needed here. Do not bother with good behaviour; just reject the request.
  • Wurgl's substring method is sufficient, since it solved the problem.
  • If a bot uses a fake agent identification, that particular ID combined with an IP range could be used in a second stage, to avoid collateral damage for humans who use that browser version.
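
The second stage from the last bullet could look roughly like the following Python sketch: a request is rejected only if both the exact agent string and the source IP range match, so humans using the same browser version from other networks are not affected. The rule type `FakeAgentRule`, the function `is_disguised_bot`, the agent string and the networks (reserved documentation ranges) are all hypothetical placeholders, not data from any log.

```python
import ipaddress
from dataclasses import dataclass
from typing import Tuple, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


@dataclass(frozen=True)
class FakeAgentRule:
    """One known fake agent string plus the networks it was seen from."""
    user_agent: str
    networks: Tuple[Network, ...]


# Placeholder rule; the agent string and the network are invented for illustration.
RULES = (
    FakeAgentRule(
        user_agent="Mozilla/5.0 (Example) ExampleBrowser/1.0",
        networks=(ipaddress.ip_network("203.0.113.0/24"),),
    ),
)


def is_disguised_bot(user_agent: str, remote_addr: str, rules=RULES) -> bool:
    """Return True only if the agent string AND the source network both match."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Assumption: an unparsable address is not treated as a bot here.
        return False
    for rule in rules:
        if user_agent == rule.user_agent and any(addr in net for net in rule.networks):
            return True
    return False
```

With this rule, the placeholder agent from 203.0.113.7 is rejected, while the same agent from 198.51.100.7 passes.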

T226688 is dealing with beta.wmcloud.org only.

I do not see a single mention of beta.wmcloud.org in T226688: Block web crawlers from accessing Cloud Services.