Is it feasible to detect robots.txt exclusion on archive.org in order to flag additional dead links?
https://nl.wikipedia.org/w/index.php?title=CouchSurfing&diff=prev&oldid=49481734
https://web.archive.org/web/20120206182825/http://wiki.couchsurfing.org/en/Main_Page
Description
Event Timeline
IABot applies filters when requesting archive copies, so any defective archives must have become blocked retroactively, after the archive was added.
GreenC bot detects robots.txt blocks and tries to find a different archive if one is available. If none is available, it keeps the robots.txt-blocked snapshot, because Wayback management has said they plan to remove that policy block sometime in the near future, hopefully.
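One way a bot could check for this, sketched below under an assumption: the Wayback Machine's availability API (`https://archive.org/wayback/available`) is real, but treating an empty `archived_snapshots` object as a sign of robots.txt exclusion is a heuristic, not a documented guarantee — a page may also simply never have been archived. The function names here (`closest_snapshot`, `check_url`) are illustrative, not part of IABot or GreenC bot.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

WAYBACK_AVAILABILITY_API = "https://archive.org/wayback/available"

def closest_snapshot(api_response):
    """Return the URL of the closest usable snapshot, or None.

    Heuristic: a page with no usable snapshot (possibly excluded by
    robots.txt) comes back with an empty "archived_snapshots" object,
    so None here hints that the archive copy may be unusable.
    """
    snap = api_response.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["url"]
    return None

def check_url(url, timestamp=""):
    """Live check against the availability API (requires network)."""
    query = urlencode({"url": url, "timestamp": timestamp})
    with urlopen(f"{WAYBACK_AVAILABILITY_API}?{query}") as resp:
        return closest_snapshot(json.load(resp))

# Offline examples mimicking the API's response shape:
blocked = {"archived_snapshots": {}}
ok = {"archived_snapshots": {"closest": {
    "available": True,
    "status": "200",
    "url": ("http://web.archive.org/web/20120206182825/"
            "http://wiki.couchsurfing.org/en/Main_Page"),
}}}
```

A bot following the policy described above would, on a `None` result for the original archive URL, retry `check_url` with other timestamps or fall back to a different archive provider before keeping the blocked snapshot.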