Disallow robots from scraping ebooks
Closed, Resolved · Public


To reduce traffic a little (e.g. in the last 12 hours or so there have been 191 hits for book titles from Googlebot or bingbot), we could ask robots not to index any of wsexport, using a robots.txt like this:

User-agent: *
Disallow: /

Would there be problems with this?

If we did this, we should also add rel="nofollow" to the gadget portal links on Wikisources (e.g. change this).
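
As a rough check of what the proposed rule means for a well-behaved crawler, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The path `/some/book` and the helper name `may_fetch` are made up for illustration; only the two robots.txt lines come from the proposal above. Note that this only models crawlers that choose to honour robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The two-line policy proposed above.
PROPOSED = ["User-agent: *", "Disallow: /"]

parser = RobotFileParser()
parser.parse(PROPOSED)  # parses in-memory lines; no network request is made


def may_fetch(user_agent: str, path: str) -> bool:
    """Return True if a robots.txt-compliant crawler would request this path."""
    return parser.can_fetch(user_agent, path)


# "/some/book" is a hypothetical path, not a real wsexport URL.
for agent in ("Googlebot", "bingbot"):
    print(agent, may_fetch(agent, "/some/book"))
```

Both agents come back as not allowed, because `User-agent: *` applies to every crawler and `Disallow: /` covers every path.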

Event Timeline

Restricted Application added a subscriber: Aklapper.

FYI, robots.txt doesn't seem to be respected by many crawlers, including bingbot, which is hitting wsexport frequently. I wonder if the same is true for … Here's a list of UAs that you can probably safely block via .lighttpd.conf:

Thanks @MusikAnimal that's great!

I've added the following to .lighttpd.conf on staging and production sites:

$HTTP["useragent"] =~ ".*(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|Sogou web spider|yandex\.com/bots|twitterbot|TweetmemeBot|SeznamBot|datasift\.com/bot|Googlebot|Yahoo! Slurp|Python-urllib|BehloolBot|MJ12bot|SemrushBot|facebookexternalhit|rcdtokyo\.com|Pcore-HTTP|yacybot|ltx71|RyteBot|bingbot|python-requests|Cloudflare-AMP|Mr\.4x3|Acoo Browser|AcooBrowser|\.NET CLR 2\.0\.50727|Frontera|tigerbot|Slackbot|SemanticScholarBot|FemtosearchBot|BrandVerity|Zuuk crawler|archive\.org_bot|mediawords bot|Qwantify\/Bleriot|Pinterestbot|EarwigBot|Citoid \(Wikimedia|GuzzleHttp|PageFreezer|Java\/|SiteCheckerBot|Re\-re Studio|^R \(|GoogleDocs|WinHTTP|cis455crawler|WhatsApp).*" {
        url.access-deny = ("")
}
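
A pattern like this can be sanity-checked offline before it is deployed. The sketch below is a minimal approximation in Python, assuming that lighttpd's `=~` behaves like an unanchored, case-sensitive regex search. It uses only a shortened subset of the alternation above, and the sample User-Agent strings and the helper name `is_blocked` are illustrative, not taken from the wsexport logs.

```python
import re

# Shortened subset of the alternation in the lighttpd rule; the real list is longer.
# Assumption: lighttpd's =~ is an unanchored, case-sensitive match, which
# re.search approximates here.
BLOCKED_UA = re.compile(
    r".*(Googlebot|bingbot|Baiduspider|python-requests|Java\/|^R \(|WhatsApp).*"
)


def is_blocked(user_agent: str) -> bool:
    """Return True if this User-Agent header would match the deny pattern."""
    return BLOCKED_UA.search(user_agent) is not None


# Illustrative User-Agent strings, not real log entries.
samples = [
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "python-requests/2.21.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0",
]
for ua in samples:
    print(is_blocked(ua), ua)
```

The first two samples match and the browser string does not. The `^R \(` alternative can only match at the very start of the header, since the leading `.*` has to match the empty string for `^` to succeed.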

I'll keep an eye on the logs and see what's getting blocked.

Of the last 5000 requests (from 08/May/2019:06:25 to now) we've blocked 499, i.e. ~10%.
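
For reference, the percentage above works out as follows. This is a trivial sketch; the function name `blocked_share` is made up, and the two numbers are the ones quoted in the comment.

```python
def blocked_share(blocked: int, total: int) -> float:
    """Return the share of blocked requests as a percentage."""
    return 100.0 * blocked / total


print(f"{blocked_share(499, 5000):.2f}%")  # 9.98%
```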

The Cloud team is assigned in the repo and will merge.

Mentioned in SAL (#wikimedia-cloud) [2019-05-14T17:34:32Z] <bd808> Update to 07a15d8 "Block crawlers from wsexport and its staging" (T222684)

bd808 added a subscriber: bd808.
$ curl
User-agent: *
Crawl-delay: 3
Disallow: /betacommand-dev/
Disallow: /fiwiki-tools/

# T133697
Disallow: /persondata/

# T122327
Disallow: /xtools-articleinfo/

# T222684
Disallow: /wsexport/
Disallow: /wsexport-test/

# Disallow XoviBot.
User-agent: XoviBot
Disallow: /