
Disallow robots from scraping ebooks
Closed, Resolved · Public

Description

To reduce traffic a little bit (e.g. in the last 12 hours or so there have been 191 hits for book titles from googlebot or bingbot; see the counting sketch below), we could ask robots not to index any of wsexport.

User-agent: *
Disallow: /

Would there be problems with this?

If we did this, we should also add rel="nofollow" to the gadget portal links on Wikisources (e.g. change this).
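
For context, a count like the 191 googlebot/bingbot hits mentioned above can be pulled from the tool's web server access log. This is only a rough sketch; the log path (~/access.log) and the combined log format (User-Agent in the last quoted field) are assumptions, not a description of the tool's actual setup:

# Total requests from googlebot or bingbot (case-insensitive)
grep -icE 'googlebot|bingbot' ~/access.log

# Breakdown per bot
grep -ioE 'googlebot|bingbot' ~/access.log | tr 'A-Z' 'a-z' | sort | uniq -c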

Event Timeline

Samwilson created this task. May 7 2019, 7:16 AM
Restricted Application added a project: Community-Tech. May 7 2019, 7:16 AM
Restricted Application added a subscriber: Aklapper.
MaxSem claimed this task. May 8 2019, 1:47 AM
MaxSem added a project: Community-Tech-Sprint.
MaxSem moved this task from Ready to In Development on the Community-Tech-Sprint board.
MaxSem moved this task from Backlog to In sprint on the Wikisource Export board. May 8 2019, 1:51 AM
MusikAnimal added a subscriber: MusikAnimal. Edited May 8 2019, 3:58 AM

FYI, https://xtools.wmflabs.org/robots.txt doesn't seem to be respected by many crawlers, including bingbot, which is hitting wsexport frequently. I wonder if the same is true for https://tools.wmflabs.org/robots.txt. Here's a list of UAs that you can probably safely block via .lighttpd.conf: https://wikitech.wikimedia.org/wiki/Tool:XTools#xtools.conf

Thanks @MusikAnimal, that's great!

I've added the following to .lighttpd.conf on staging and production sites:

$HTTP["useragent"] =~ ".*(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|Sogou web spider|yandex\.com/bots|twitterbot|TweetmemeBot|SeznamBot|datasift\.com/bot|Googlebot|Yahoo! Slurp|Python-urllib|BehloolBot|MJ12bot|SemrushBot|facebookexternalhit|rcdtokyo\.com|Pcore-HTTP|yacybot|ltx71|RyteBot|bingbot|python-requests|Cloudflare-AMP|Mr\.4x3|Acoo Browser|AcooBrowser|\.NET CLR 2\.0\.50727|Frontera|tigerbot|Slackbot|SemanticScholarBot|FemtosearchBot|BrandVerity|Zuuk crawler|archive\.org_bot|mediawords bot|Qwantify\/Bleriot|Pinterestbot|EarwigBot|Citoid \(Wikimedia|GuzzleHttp|PageFreezer|Java\/|SiteCheckerBot|Re\-re Studio|^R \(|GoogleDocs|WinHTTP|cis455crawler|WhatsApp).*" {
        url.access-deny = ("")
}

I'll keep an eye on the logs and see what's getting blocked.
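
A quick way to sanity-check the rule (a sketch only; the wsexport URL and the expectation that lighttpd's url.access-deny answers with 403 are assumptions on my part): a request sent with a blocked User-Agent should be refused, while an ordinary browser UA should still get through.

# Should print 403: the UA matches the bingbot pattern in the regex
curl -s -o /dev/null -w '%{http_code}\n' -A 'bingbot/2.0' https://tools.wmflabs.org/wsexport/

# Should print 200: a plain browser-style UA is not matched
curl -s -o /dev/null -w '%{http_code}\n' -A 'Mozilla/5.0' https://tools.wmflabs.org/wsexport/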

Of the last 5000 requests (from 08/May/2019:06:25 to now) we've blocked 499, i.e. ~10%.
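
For anyone wanting to reproduce that kind of figure, here's a rough sketch. It assumes the lighttpd access log sits at ~/access.log in combined log format (status code in the ninth whitespace-separated field, User-Agent in the last quoted field); both the path and the format are assumptions:

# Denied requests among the last 5000 (url.access-deny returns 403)
tail -n 5000 ~/access.log | awk '$9 == 403' | wc -l

# Which user agents are being blocked most often
tail -n 5000 ~/access.log | awk -F'"' '{split($3, s, " "); if (s[1] == 403) print $6}' | sort | uniq -c | sort -rn | head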

The Cloud team is assigned on the repo and will merge the change.

Mentioned in SAL (#wikimedia-cloud) [2019-05-14T17:34:32Z] <bd808> Update to 07a15d8 "Block crawlers from wsexport and its staging" (T222684)

bd808 closed this task as Resolved. May 14 2019, 5:35 PM
bd808 added a subscriber: bd808.
$ curl https://tools.wmflabs.org/robots.txt
User-agent: *
Crawl-delay: 3
Disallow: /betacommand-dev/
Disallow: /fiwiki-tools/

# T133697
Disallow: /persondata/

# T122327
Disallow: /xtools-articleinfo/

# T222684
Disallow: /wsexport/
Disallow: /wsexport-test/

# Disallow XoviBot.
User-agent: XoviBot
Disallow: /