Block web crawlers from accessing Cloud Services
Open, Medium, Public

Description

It is my theory that web crawlers may indirectly impair the performance and stability of the replica databases, among other services.

What I think happens is that the community links to tools on-wiki, and web crawlers then follow those links. The problem is that these tools fire off long-running queries.

XTools is a prime example. We have a long list of UAs for legit web crawlers that we block on the Apache level. See step #12 at https://wikitech.wikimedia.org/w/index.php?title=Tool:XTools#Building_a_new_instance (somewhat out of date). If we didn't do this, XTools would go down due to hitting the max connection limit. I also found these crawlers did not respect https://xtools.wmflabs.org/robots.txt.

We saw a similar issue with WS Export: significant crawler traffic was hogging resources, causing the tool to go down.

For similar reasons I put Tool-global-search behind a login wall. The Cloud Elastic service was still experimental and I didn't want crawlers and bots to impact stability.

I can only assume other tools suffer from crawler traffic, which in turn puts unnecessary load on our infrastructure. In particular, I wonder what the health of the replicas would be after blocking web crawlers.

Event Timeline

Here's the User-Agent block list that @MusikAnimal has seeded the discussion with:

SetEnvIfNoCase User-Agent "(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|Sogou web spider|yandex\.com/bots|twitterbot|TweetmemeBot|SeznamBot|datasift\.com/bot|Googlebot|Yahoo! Slurp|Python-urllib|BehloolBot|MJ12bot|SemrushBot|facebookexternalhit|rcdtokyo\.com|Pcore-HTTP|yacybot|ltx71|RyteBot|bingbot|python-requests|Cloudflare-AMP|Mr\.4x3|Acoo Browser|AcooBrowser|\.NET CLR 2\.0\.50727|Frontera|tigerbot|Slackbot|SemanticScholarBot|FemtosearchBot|BrandVerity|Zuuk crawler|archive\.org_bot|mediawords bot|Qwantify\/Bleriot|Pinterestbot|EarwigBot|Citoid \(Wikimedia|GuzzleHttp|PageFreezer|Java\/|SiteCheckerBot|Re\-re Studio|^R \(|GoogleDocs|WinHTTP|cis455crawler|WhatsApp)" bad_bot=yes
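
To sanity-check which User-Agent strings a pattern like this would catch, something along these lines could be used. It is only a sketch: the pattern is a shortened subset of the list above, `is_blocked_ua` is a hypothetical helper name, and `grep -i` stands in for the case-insensitive matching of `SetEnvIfNoCase` (Apache uses PCRE, so edge cases may differ from grep's extended regex).

```shell
# Shortened subset of the block list above, for illustration only.
UA_PATTERN='(Googlebot|bingbot|SemrushBot|python-requests|Java/)'

# Hypothetical helper: succeeds (exit 0) if the given User-Agent matches.
# -i mirrors SetEnvIfNoCase, -E enables extended regex, -q suppresses output.
is_blocked_ua() {
    printf '%s\n' "$1" | grep -qiE "$UA_PATTERN"
}

is_blocked_ua 'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)' && echo blocked
is_blocked_ua 'Mozilla/5.0 (X11; Linux x86_64; rv:73.0) Gecko/20100101 Firefox/73.0' || echo allowed
```

The first call prints `blocked` and the second prints `allowed`.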

I took this and used it as a grep filter to get some idea of how much traffic it would block:

$ grep -v -E "(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|Sogou web spider|yandex\.com/bots|twitterbot|TweetmemeBot|SeznamBot|datasift\.com/bot|Googlebot|Yahoo! Slurp|Python-urllib|BehloolBot|MJ12bot|SemrushBot|facebookexternalhit|rcdtokyo\.com|Pcore-HTTP|yacybot|ltx71|RyteBot|bingbot|python-requests|Cloudflare-AMP|Mr\.4x3|Acoo Browser|AcooBrowser|\.NET CLR 2\.0\.50727|Frontera|tigerbot|Slackbot|SemanticScholarBot|FemtosearchBot|BrandVerity|Zuuk crawler|archive\.org_bot|mediawords bot|Qwantify\/Bleriot|Pinterestbot|EarwigBot|Citoid \(Wikimedia|GuzzleHttp|PageFreezer|Java\/|SiteCheckerBot|Re\-re Studio|^R \(|GoogleDocs|WinHTTP|cis455crawler|WhatsApp)" access.log.1 | wc -l
4014899
$ wc -l access.log.1
4347045 access.log.1
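
The share of traffic that matched the filter follows from the two counts above. A small sketch of the arithmetic, using only the numbers from this log slice (the variable names are illustrative):

```shell
total=4347045      # wc -l access.log.1
unblocked=4014899  # lines left after grep -v

blocked=$((total - unblocked))
# Shell arithmetic is integer-only, so awk computes the percentage.
pct=$(LC_ALL=C awk -v b="$blocked" -v t="$total" 'BEGIN { printf "%.1f", 100 * b / t }')

echo "$blocked requests matched ($pct% of $total)"
```

Note that the grep above is case-sensitive and matches anywhere on the log line, whereas `SetEnvIfNoCase` matches only the User-Agent header and ignores case, so the figure is best read as an estimate.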

For this particular slice of raw logs for tools.wmflabs.org, this UA restriction would have blocked about 8% of the request traffic (332,146 requests). I did a bit more log slicing and dicing to see which tools would be most affected:

$ grep -E "(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|Sogou web spider|yandex\.com/bots|twitterbot|TweetmemeBot|SeznamBot|datasift\.com/bot|Googlebot|Yahoo! Slurp|Python-urllib|BehloolBot|MJ12bot|SemrushBot|facebookexternalhit|rcdtokyo\.com|Pcore-HTTP|yacybot|ltx71|RyteBot|bingbot|python-requests|Cloudflare-AMP|Mr\.4x3|Acoo Browser|AcooBrowser|\.NET CLR 2\.0\.50727|Frontera|tigerbot|Slackbot|SemanticScholarBot|FemtosearchBot|BrandVerity|Zuuk crawler|archive\.org_bot|mediawords bot|Qwantify\/Bleriot|Pinterestbot|EarwigBot|Citoid \(Wikimedia|GuzzleHttp|PageFreezer|Java\/|SiteCheckerBot|Re\-re Studio|^R \(|GoogleDocs|WinHTTP|cis455crawler|WhatsApp)" access.log.1 | awk '{print $8}' | awk 'BEGIN {FS="[/?%]+"} {print $2}' | sort | uniq -c | sort -hr

   7484 checkwiki
   4834 paws-public
   4291 ru_monuments
   3583 scholia
   3539 openrefine-wikidata
   3522 reasonator
   3505 osm4wiki
   3288 para
   3203 kmlexport
   3186 dispenser
   3185 heritage
   2972 wikidata-externalid-url
   2925 pinyin-wiki
   2469 bing-maps
   2422 wp-world
   2290 freebase
   1788 os
   1770 copyvios
   1734 whois
   1683 denkmalliste
   1581 templatetiger
   1581 citations
   1541 list
   1471 jackbot
   1381 robots.txt
   1290 catnap
   1286 persondata
   1284 guc
   1234 refill
   1230 cluebotng
   1168 sigma
   1165 tools.wmflabs.org
   1092 fist
   1047 vcat
   1004 giftbot
    964 weeklypedia
    862 meta
    834 lexeme-forms
    781 xtools
    708 zoomviewer
    692 magnustools
    691 dewikinews-rss
    690 wikivoyage
    666 wikiloves
    652 ifttt-testing
    637 catscan2
    626 wiwosm
    597 typoscan
    592 commons-
    529 geocommons
    507 dplbot
    474 glamtools
    461 citation-template-filling
    445 autodesc
    417 potd-feed
    396 wikitrends
    389 eranbot
    388 wikisense
    360 templatecount
    330 iabot
    311 afdstats
    310 bibleversefinder
    309 ipcheck
    308 flickr2commons
    294 supercount
    294 manypedia
    293 wsexport
    273 quentinv57-tools
    261 isbn
    260 xtools-articleinfo
    216 mp
    190 flickr
    187 imagemapedit
    179 wikidata-todo
    179 isin
    174 locator
    172 bibleversefinder2
    168 templatetransclusioncheck
    161 redirectviews
    157 commonshelper
    151 awb
    143 blahma
    140 wpcleaner
    140 quick-intersection
    136 dump-torrents
    136
    131 paste
    131 citationhunt
    127 xtools-ec
    126 fountain
    118 mzmcbride
    117 copypatrol
    115 grep
    113 pbbot
    106 sighting
    106 periodibot
    102 wikishootme
     99 lrtool
     94 topviews
     94 stimmberechtigung
     93 magog
     93 hay
     92 video2commons
     91 erwin85
     90 intersect-contribs
     90 fiwiki-tools
     85 osm
     85 betacommand-dev
     85 admin
     81 ldap
     78 croptool
     76 jembot
     75 apple-touch-icon.png
     74 timescale
     74 favicon.ico
     72 massviews
     72 checkpersondata
     71 detox
     71 apersonbot
     66 interaction-timeline
     63 wembedder
     62 mix-n-match
     62 hatjitsu
     61 mathbot
     60 catfood
     60 apple-touch-icon-precomposed.png
     58 dupdet
     55 joanjoc
     51 sge-status
     51 intuition
     50 globalusagecount
     49 mapycz
     48 tfaprotbot
     48 langviews
     44 videoconvert
     43 wikiwatchdog
     43 hub
     41 jawi
     39 parliamentdiagram
     39 lists
     39 img
     37 sal
     37 readmore
     35 siteviews
     35 robin
     35 nppbrowser
     3
     29 dewkin
     28 whodunnit
     28 query2map
     28 pywikibot
     27 ptwikis
     26 templator
     24 userviews
     24 quickstatements
     24 mediawiki-feeds
     23 multichill
     22 render
     22 comprende
     21 pltools
     21 editgroups
     20 random-featured
     20 makeref
     20 itwikinews-rss
     20 ipinfo
     20 excel2wiki
     19 jarry-common
     19 authors
     18 missingtopics
     17 wiki
     16 wikidata-game
     16 projektneuheiten-feed
     16 panoviewer
     16 mediaviews
     15 xtools-pages
     15 wle
     15 images
     15 hashtags
     15 csp-report
     14 sqid
     14 spellcheck
     14 pb
     14 owintes
     13 wlm-stats
     13 krdbot
     13 citer
     13 book2scroll
     12 w-slackbot
     12 wdvd
     12 rightstool
     12 pagecount
     12 newbie-uploads
     12 kasparbot
     12 dawikitool
     12 connectivity
     12 commons-
     10 widar
     10 sowhy
     10 rxy
     10 request
     10 metaviews
     10 etytree
     10 dykautobot
      9 wlm-maps
      9 wiki-todo
      9 tooltranslate
      9 sourcemd
      9 sge-jobs
      9 mormegil
      9 missingpages
      9 liangent-toolserver
      9 ia-upload
      9 geograph2commons
      9 fengtools
      9 dykstats
      9 dexbot
      9 cite-o-meter
      9 catscan3
      9 anagrimes
      8 wmcharts
      8 userview
      8 svgcheck
      8 prop-explorer
      8 oabot
      8 monumental
      8 lp-tools
      8 ipp
      8 freddy2001
      8 editathonstat
      8 convert
      8 citing-bot
      7 wptestblog2
      7 wptestblog
      7 wikidata-exports
      7 trusty-tools
      7 tb-dev
      7 superyetkin
      7 replag
      7 popularpages
      7 h2bot
      7 edgars
      7 bene
      6 zoomable-images
      6 wmcounter
      6 tools.wmfla
      6 thibtools
      6 sql-optimizer
      6 relgen
      6 plus
      6 mediawiki-mirror
      6 map-of-monuments
      6 isbn2wiki
      6 hoo
      6 hauki
      6 derivative
      5 yadkard
      5 wikiradio
      5 wikilint
      5 wikihistory
      5 wikidata-terminator
      5 videotutorials
      5 stewardbots
      5 slumpartikel
      5 replacer
      5 orphantalk
      5 openstack-browser
      5 mrmetadata
      5 languagetool
      5 ip-range-calc
      5 grid-jobs
      5 commons-video-clicks
      5 checker
      4 yadfa
      4 wsm
      4 wlm-us
      4 wikidata-timeline
      4 wikidata-analysis
      4 url2commons
      4 tusc
      4 traffic-grapher
      4 tmp
      4 template-choice.php
      4 tabernacle
      4 static
      4 splinetools
      4 searchsbl
      4 patrolstats
      4 pageviews-test
      4 pagepile
      4 multidesc
      4 locator-tool
      4 limesmap
      4 lexeme-senses

      4 citeplato
      4 bub
      4 bawolff
      4 aivanalysis
      3 wwwroot.rar
      3 wscontest
      3 wikitext-deprecation
      3 wikiinfo
      3 wd-depicts
      3 vendor
      3 url-converter
      3 ukbot
      3 tools.zip
      3 tools.rar
      3 text2hash
      3 swviewer
      3 styleguide
      3 sbot
      3 russbot
      3 ruarbcom
      3 res
      3 pathway-viewer
      3 osm4cgi-bin
      3 oauth-hello-world
      3 not-in-the-other-language
      3 mediaviews-api
      3 integraality
      3 index.php
      3 iluvatarbot
      3 idsgen
      3 E5
      3 dibot
      3 delinker
      3 coord
      3 contact
      3 cobot
      3 cobain
      3 botwatch
      3 blockyquery
      3 bf.rar
      3 backup.rar
      3 author-disambiguator
      3 anomiebot
      2 zimmerbot
      2 yifeibot
      2 yichengtry
      2 ws-google-ocr
      2 ws-cat-browser
      2 wmukevents
      2 wikidata-redirects-conflicts-reports
      2 whichsub
      2 webarchivebot
      2 w
      2 usualsuspects
      2 urbanecmbot
      2 tsreports
      2 translate
      2 tool-db-usage
      2 tedbot
      2 tb
      2 svgworkaroundbot
      2 svgtranslate-test
      2 svg-map-maker
      2 soxred93
      2 snapshots
      2 slow-parse
      2 sign-language-browser
      2 sdm
      2 ruwikisource
      2 revertstat
      2 render-tests
      2 quickcategories
      2 ppp-sparql
      2 or.wikipedia.org
      2 ordia
      2 oojs-ui
      2 nppdash
      2 npp
      2 nikola
      2 montage
      2 meetbot
      2 mc8
      2 mbh
      2 listpages
      2 ktc
      2 jorobot
      2 jogotools
      2 isprangefinder
      2 iplookup
      2 interactoa
      2 inkowik
      2 huggle
      2 historicmaps
      2 hewiki-tools
      2 hashtags-test
      2 grantmetrics-test
      2 ~geohack
      2 genealog
      2 freefiles
      2 filedupes
      2 extreg-wos
      2 dschwen
      2 drtrigonbot
      2 doc
      2 dimastbkbot
      2 deadlinks
      2 dapete
      2 crosswatch
      2 commons-app-stats
      2 comidentgen
      2 cluestuff
      2 blankpages
      2 bibleversef
      2 awmd-stats
      2 artuploader
      2 ads.txt
      1 yunomi
      1 wp-login.php
      1 wikiviewstats2
      1 wikidata-compare
      1 wdpv
      1 wdprop
      1 wd-image-positions
      1 wam
      1 urdusign
      1 tptools
      1 topviews-test
      1 topicmatcher
      1 tool=youtube-channel
      1 tool=wiwosm
      1 tool=pmidtool
      1 tool=pltools
      1 tool=mdann52bot
      1 Tool_Labs_logo_thumb.png
      1 tool=coursestats
      1 tool=bambots
      1 tmg
      1 textcatdemo
      1 svgedit
      1 stereoskopie
      1 status
      1 statistics
      1 speedpatrolling
      1 smv-description-translations
      1 signpostlab
      1 samoabot

      1 pub
      1 proneval-gsoc17
      1 primerpedia
      1 phabricator-bug-status
      1 personabot
      1 pas
      1 pagevie
      1 page
      1 ores-support-checklist
      1 myrcx
      1 mwstew
      1 most-wanted
      1 mohib
      1 meta.wikimedia.org
      1 menubar.js
      1 map-search
      1 maintgraph
      1 main.css
      1 magnus
      1 list&_escaped_fragment_=
      1 lestaty
      1 lcm-dashboard
      1 krinkle-redirect
      1 it-wiki-users-leaflet
      1 itwiki
      1 itsource
      1 isbn-tmptest
      1 isa
      1 inactiveadmins
      1 import-500px
      1 hrwiki
      1 hroest
      1 hgztools
      1 hatjits
      1 hatjit
      1 hatji
      1 hatj
      1 hat
      1 hall-of-fame
      1 ha
      1 gutrs
      1 gridengine-status
      1 gpy
      1 globalsearch
      1 gerrit-patch-uploader
      1 gerakitools
      1 geohack.php
      1 geoha
      1 g
      1 fn
      1 five-million
      1 fatameh
      1 dschwenbot
      1 dockerregistry
      1 disclaim
      1 dataviz
      1 css
      1 contributionsurveyor
      1 churches
      1 cdnjs-beta
      1 cdnjs
      1 bub2
      1 bsaut
      1 bldrwnsch
      1 author=9
      1 author=8
      1 author=7
      1 author=6
      1 author=5
      1 author=4
      1 author=3
      1 author=2
      1 author=15
      1 author=14
      1 author=13
      1 author=12
      1 author=11
      1 author=10
      1 author=1
      1 assets
      1 ash-dev
      1 ascal
      1 api
      1 antigng-bot
      1 admin-beta
      1 adas
      1 actrial
      1 43.461111111111_N_-3.9252777777778_E_type:city

A lot of the crawled URLs are malformed garbage, but many of them are things that could cause database lookups, as @MusikAnimal hypothesized.
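
Some of the odd-looking rows in the list may simply come from how the path-splitting step behaves. The sketch below reuses the field separator from the second awk stage of the pipeline above. The sample paths and the `first_segment` helper name are made up for illustration, and the explanation is an inference from the separator, not something confirmed against the raw logs.

```shell
# Same field separator as the second awk stage in the pipeline above:
# any run of '/', '?' or '%' splits the path, and field 2 is printed.
first_segment() {
    printf '%s\n' "$1" | awk 'BEGIN {FS="[/?%]+"} {print $2}'
}

first_segment '/xtools/ec?user=Example'        # xtools
first_segment '/?tool=wiwosm'                  # tool=wiwosm
first_segment 'http://tools.wmflabs.org/guc/'  # tools.wmflabs.org
first_segment '/'                              # (empty line)
```

Under this reading, rows such as `tool=wiwosm`, `tools.wmflabs.org`, and the row with no name would correspond to query-string requests at the root, absolute-URL request lines, and requests for `/` respectively.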

I think there are several questions to consider related to this:

  • Is blanket blocking at the reverse proxy a good idea in general?
  • Do we want to block all crawlers, or just some set of "bad" crawlers?
  • Would some tools need/want an opt-out switch to allow some/all crawlers?
  • How would we decide if the crawler list needed to be updated?
  • What kind of response would we give to the blocked user-agents?

I suggest that rather than a global block we think about this as a tooling issue -- maybe provide a default 'block everything' robots.txt (or even an actual service block) and a well-documented way for users to manage this.

Bstorm lowered the priority of this task from High to Medium. Feb 25 2020, 5:21 PM

> I suggest that rather than a global block we think about this as a tooling issue -- maybe provide a default 'block everything' robots.txt (or even an actual service block) and a well-documented way for users to manage this.

This is now the case for all tools behind the Toolforge shared proxy following the resolution of T251628: Serve some default well known files for Toolforge webservices. The solution that was used there could also be done for the Cloud VPS proxy (catch 404 for /robots.txt and serve a default version).
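
For reference, a default 'block everything' robots.txt could look roughly like the output of the sketch below. The exact file served by the Toolforge proxy is not quoted in this task, so the content and the `default_robots_txt` helper name are assumptions.

```shell
# Hypothetical helper: print a "block everything" robots.txt body.
default_robots_txt() {
    cat <<'EOF'
User-agent: *
Disallow: /
EOF
}

default_robots_txt
```

As noted in the task description, some crawlers did not respect robots.txt, so a default like this only helps with well-behaved ones.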