
provide a more strict robots.txt at Tool Labs
Closed, ResolvedPublic

Description

See T127066: Bingbot scraping tools? for reason.
The current robots.txt:

User-agent: *
Disallow: /betacommand-dev/
Disallow: /fiwiki-tools/
# Disallow XoviBot.
User-agent: XoviBot
Disallow: /

Many dynamic pages that take a long time to load should be added.

Event Timeline

Bugreporter raised the priority of this task from to Needs Triage.
Bugreporter updated the task description.
Bugreporter added a project: Toolforge.
Bugreporter added a subscriber: Bugreporter.
Restricted Application added subscribers: StudiesWorld, Aklapper.

We had a more restrictive robots.txt in the past (March 2014; cf. T63132) and consensus for it. This changed over the following eleven months, so by the time I committed d76c5f0a398b827f999fd1d8bd123798fdf0f35b, the then-current version had become "allow by default".

IIRC my thoughts regarding robots.txt were:

  1. No tool maintainer will warn ahead of time that their tool may be a problem, so I personally prefer a whitelist of tools whose maintainers have understood the problem and certified that their tools are benign.
  2. The syntax of robots.txt does not seem to allow indexing the top page (/) while disallowing all subpages except /?list, /goodtool1, /goodtool2, etc., so robots.txt would need to be generated dynamically from the list of all tools minus the ones that are benign.
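The dynamic generation described in point 2 could be sketched roughly as follows; the tool names and function name are hypothetical, and a real implementation would read the tool list from the Tool Labs account database:

```python
# Hypothetical sketch: build a robots.txt that disallows every tool
# except those whose maintainers have opted in. Tool names are invented.
def generate_robots_txt(all_tools, whitelisted_tools, crawl_delay=None):
    lines = ["User-agent: *"]
    if crawl_delay is not None:
        lines.append(f"Crawl-delay: {crawl_delay}")
    # One Disallow line per tool that is NOT on the whitelist;
    # whitelisted tools are simply omitted, so crawlers may index them.
    for tool in sorted(set(all_tools) - set(whitelisted_tools)):
        lines.append(f"Disallow: /{tool}/")
    return "\n".join(lines) + "\n"

all_tools = ["goodtool1", "goodtool2", "heavytool"]
print(generate_robots_txt(all_tools, {"goodtool1", "goodtool2"}))
```

This keeps the top page (/) crawlable, since only per-tool prefixes are disallowed.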

Many dynamic pages that take a long time to load should be added.

Do you have a representative sample of such URLs?

See T127066: Bingbot scraping tools? for reason.

That's not a reason; it's pre-emptive optimisation. The actual bug needs to be fixed here; pages getting visits is not a bug.

If the problem turns out to be "too many concurrent requests", robots.txt should contain Crawl-delay: 1.
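For rate limiting rather than blocking, a minimal robots.txt along these lines would ask crawlers to wait one second between requests (note that Crawl-delay is a non-standard directive: Bingbot honours it, Googlebot ignores it):

```
User-agent: *
Crawl-delay: 1
```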

My tools were fine until ~Feb 14 2016. Something changed around that time; it could be the DB failure, but I can't tell. Since then, some (but by far not all) tools started accumulating requests and becoming unresponsive, without the webservice actually dying.

The crawlers might just be a symptom.

OK, most tools seem to be back to normal, except "catscan2" and "glamtools". Could those be added to robots.txt please?
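Appending the two requested entries to the robots.txt quoted in the description would presumably look like this:

```
User-agent: *
Disallow: /betacommand-dev/
Disallow: /fiwiki-tools/
Disallow: /catscan2/
Disallow: /glamtools/
# Disallow XoviBot.
User-agent: XoviBot
Disallow: /
```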

chasemp triaged this task as Medium priority.Apr 4 2016, 2:31 PM
bd808 claimed this task.
bd808 added a subscriber: bd808.

T251628: Serve some default well known files for Toolforge webservices has made the default /robots.txt for all tools disallow compliant crawlers. Tools that wish to be crawled can serve their own /robots.txt content. This has been announced on the cloud-announce mailing list and added to the Toolforge webservice help on Wikitech.
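A blanket disallow-by-default /robots.txt of the kind described (exact content not quoted in this task) is presumably equivalent to:

```
User-agent: *
Disallow: /
```

Any tool serving its own /robots.txt overrides this default for its path.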