
Block web crawlers from accessing Cloud Services
Open, Medium, Public

Description

It is my theory that web crawlers may indirectly impair the performance and stability of the replica databases, among other services.

What I think happens is that the community links to tools on-wiki, and those links are then followed by web crawlers. The problem is that many of these tools fire off long-running queries.

XTools is a prime example. We have a long list of UAs for legit web crawlers that we block at the Apache level. See step #12 at https://wikitech.wikimedia.org/w/index.php?title=Tool:XTools#Building_a_new_instance (somewhat out of date). If we didn't do this, XTools would go down due to hitting the max connection limit. I also found that these crawlers did not respect https://xtools.wmflabs.org/robots.txt.

We saw a similar issue with WS Export. Significant traffic from crawlers was hogging resources, causing the tool to go down.

For similar reasons I put Tool-global-search behind a login wall. The Cloud Elastic service was still experimental and I didn't want crawlers and bots to impact stability.

I can only assume other tools suffer from crawler traffic, which in turn puts unnecessary load on our infrastructure. In particular, I wonder what the health of the replicas would be after blocking web crawlers.

Event Timeline

Here's the User-Agent block list that @MusikAnimal has seeded the discussion with:

SetEnvIfNoCase User-Agent "(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|Sogou web spider|yandex\.com/bots|twitterbot|TweetmemeBot|SeznamBot|datasift\.com/bot|Googlebot|Yahoo! Slurp|Python-urllib|BehloolBot|MJ12bot|SemrushBot|facebookexternalhit|rcdtokyo\.com|Pcore-HTTP|yacybot|ltx71|RyteBot|bingbot|python-requests|Cloudflare-AMP|Mr\.4x3|Acoo Browser|AcooBrowser|\.NET CLR 2\.0\.50727|Frontera|tigerbot|Slackbot|SemanticScholarBot|FemtosearchBot|BrandVerity|Zuuk crawler|archive\.org_bot|mediawords bot|Qwantify\/Bleriot|Pinterestbot|EarwigBot|Citoid \(Wikimedia|GuzzleHttp|PageFreezer|Java\/|SiteCheckerBot|Re\-re Studio|^R \(|GoogleDocs|WinHTTP|cis455crawler|WhatsApp)" bad_bot=yes
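
(For context, the SetEnvIfNoCase line above only tags matching requests; it blocks nothing until it is paired with a deny rule. A minimal companion rule in Apache 2.4 syntax might look like the following; the actual XTools vhost configuration may differ.)

<Location "/">
    <RequireAll>
        Require all granted
        Require not env bad_bot
    </RequireAll>
</Location>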

I took this and used it as a grep filter to get some idea of how much traffic it would block:

$ grep -v -E "(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|Sogou web spider|yandex\.com/bots|twitterbot|TweetmemeBot|SeznamBot|datasift\.com/bot|Googlebot|Yahoo! Slurp|Python-urllib|BehloolBot|MJ12bot|SemrushBot|facebookexternalhit|rcdtokyo\.com|Pcore-HTTP|yacybot|ltx71|RyteBot|bingbot|python-requests|Cloudflare-AMP|Mr\.4x3|Acoo Browser|AcooBrowser|\.NET CLR 2\.0\.50727|Frontera|tigerbot|Slackbot|SemanticScholarBot|FemtosearchBot|BrandVerity|Zuuk crawler|archive\.org_bot|mediawords bot|Qwantify\/Bleriot|Pinterestbot|EarwigBot|Citoid \(Wikimedia|GuzzleHttp|PageFreezer|Java\/|SiteCheckerBot|Re\-re Studio|^R \(|GoogleDocs|WinHTTP|cis455crawler|WhatsApp)" access.log.1 | wc -l
4014899
$ wc -l access.log.1
4347045 access.log.1

For this particular slice of raw logs for tools.wmflabs.org, this UA restriction would have blocked about 8% of the request traffic (332,146 requests). I did a bit more log slicing and dicing to try to see which tools would be most affected:

$ grep -E "(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|Sogou web spider|yandex\.com/bots|twitterbot|TweetmemeBot|SeznamBot|datasift\.com/bot|Googlebot|Yahoo! Slurp|Python-urllib|BehloolBot|MJ12bot|SemrushBot|facebookexternalhit|rcdtokyo\.com|Pcore-HTTP|yacybot|ltx71|RyteBot|bingbot|python-requests|Cloudflare-AMP|Mr\.4x3|Acoo Browser|AcooBrowser|\.NET CLR 2\.0\.50727|Frontera|tigerbot|Slackbot|SemanticScholarBot|FemtosearchBot|BrandVerity|Zuuk crawler|archive\.org_bot|mediawords bot|Qwantify\/Bleriot|Pinterestbot|EarwigBot|Citoid \(Wikimedia|GuzzleHttp|PageFreezer|Java\/|SiteCheckerBot|Re\-re Studio|^R \(|GoogleDocs|WinHTTP|cis455crawler|WhatsApp)" access.log.1 | awk '{print $8}' | awk 'BEGIN {FS="[/?%]+"} {print $2}' | sort | uniq -c | sort -hr

   7484 checkwiki
   4834 paws-public
   4291 ru_monuments
   3583 scholia
   3539 openrefine-wikidata
   3522 reasonator
   3505 osm4wiki
   3288 para
   3203 kmlexport
   3186 dispenser
   3185 heritage
   2972 wikidata-externalid-url
   2925 pinyin-wiki
   2469 bing-maps
   2422 wp-world
   2290 freebase
   1788 os
   1770 copyvios
   1734 whois
   1683 denkmalliste
   1581 templatetiger
   1581 citations
   1541 list
   1471 jackbot
   1381 robots.txt
   1290 catnap
   1286 persondata
   1284 guc
   1234 refill
   1230 cluebotng
   1168 sigma
   1165 tools.wmflabs.org
   1092 fist
   1047 vcat
   1004 giftbot
    964 weeklypedia
    862 meta
    834 lexeme-forms
    781 xtools
    708 zoomviewer
    692 magnustools
    691 dewikinews-rss
    690 wikivoyage
    666 wikiloves
    652 ifttt-testing
    637 catscan2
    626 wiwosm
    597 typoscan
    592 commons-
    529 geocommons
    507 dplbot
    474 glamtools
    461 citation-template-filling
    445 autodesc
    417 potd-feed
    396 wikitrends
    389 eranbot
    388 wikisense
    360 templatecount
    330 iabot
    311 afdstats
    310 bibleversefinder
    309 ipcheck
    308 flickr2commons
    294 supercount
    294 manypedia
    293 wsexport
    273 quentinv57-tools
    261 isbn
    260 xtools-articleinfo
    216 mp
    190 flickr
    187 imagemapedit
    179 wikidata-todo
    179 isin
    174 locator
    172 bibleversefinder2
    168 templatetransclusioncheck
    161 redirectviews
    157 commonshelper
    151 awb
    143 blahma
    140 wpcleaner
    140 quick-intersection
    136 dump-torrents
    136
    131 paste
    131 citationhunt
    127 xtools-ec
    126 fountain
    118 mzmcbride
    117 copypatrol
    115 grep
    113 pbbot
    106 sighting
    106 periodibot
    102 wikishootme
     99 lrtool
     94 topviews
     94 stimmberechtigung
     93 magog
     93 hay
     92 video2commons
     91 erwin85
     90 intersect-contribs
     90 fiwiki-tools
     85 osm
     85 betacommand-dev
     85 admin
     81 ldap
     78 croptool
     76 jembot
     75 apple-touch-icon.png
     74 timescale
     74 favicon.ico
     72 massviews
     72 checkpersondata
     71 detox
     71 apersonbot
     66 interaction-timeline
     63 wembedder
     62 mix-n-match
     62 hatjitsu
     61 mathbot
     60 catfood
     60 apple-touch-icon-precomposed.png
     58 dupdet
     55 joanjoc
     51 sge-status
     51 intuition
     50 globalusagecount
     49 mapycz
     48 tfaprotbot
     48 langviews
     44 videoconvert
     43 wikiwatchdog
     43 hub
     41 jawi
     39 parliamentdiagram
     39 lists
     39 img
     37 sal
     37 readmore
     35 siteviews
     35 robin
     35 nppbrowser
     3
     29 dewkin
     28 whodunnit
     28 query2map
     28 pywikibot
     27 ptwikis
     26 templator
     24 userviews
     24 quickstatements
     24 mediawiki-feeds
     23 multichill
     22 render
     22 comprende
     21 pltools
     21 editgroups
     20 random-featured
     20 makeref
     20 itwikinews-rss
     20 ipinfo
     20 excel2wiki
     19 jarry-common
     19 authors
     18 missingtopics
     17 wiki
     16 wikidata-game
     16 projektneuheiten-feed
     16 panoviewer
     16 mediaviews
     15 xtools-pages
     15 wle
     15 images
     15 hashtags
     15 csp-report
     14 sqid
     14 spellcheck
     14 pb
     14 owintes
     13 wlm-stats
     13 krdbot
     13 citer
     13 book2scroll
     12 w-slackbot
     12 wdvd
     12 rightstool
     12 pagecount
     12 newbie-uploads
     12 kasparbot
     12 dawikitool
     12 connectivity
     12 commons-
     10 widar
     10 sowhy
     10 rxy
     10 request
     10 metaviews
     10 etytree
     10 dykautobot
      9 wlm-maps
      9 wiki-todo
      9 tooltranslate
      9 sourcemd
      9 sge-jobs
      9 mormegil
      9 missingpages
      9 liangent-toolserver
      9 ia-upload
      9 geograph2commons
      9 fengtools
      9 dykstats
      9 dexbot
      9 cite-o-meter
      9 catscan3
      9 anagrimes
      8 wmcharts
      8 userview
      8 svgcheck
      8 prop-explorer
      8 oabot
      8 monumental
      8 lp-tools
      8 ipp
      8 freddy2001
      8 editathonstat
      8 convert
      8 citing-bot
      7 wptestblog2
      7 wptestblog
      7 wikidata-exports
      7 trusty-tools
      7 tb-dev
      7 superyetkin
      7 replag
      7 popularpages
      7 h2bot
      7 edgars
      7 bene
      6 zoomable-images
      6 wmcounter
      6 tools.wmfla
      6 thibtools
      6 sql-optimizer
      6 relgen
      6 plus
      6 mediawiki-mirror
      6 map-of-monuments
      6 isbn2wiki
      6 hoo
      6 hauki
      6 derivative
      5 yadkard
      5 wikiradio
      5 wikilint
      5 wikihistory
      5 wikidata-terminator
      5 videotutorials
      5 stewardbots
      5 slumpartikel
      5 replacer
      5 orphantalk
      5 openstack-browser
      5 mrmetadata
      5 languagetool
      5 ip-range-calc
      5 grid-jobs
      5 commons-video-clicks
      5 checker
      4 yadfa
      4 wsm
      4 wlm-us
      4 wikidata-timeline
      4 wikidata-analysis
      4 url2commons
      4 tusc
      4 traffic-grapher
      4 tmp
      4 template-choice.php
      4 tabernacle
      4 static
      4 splinetools
      4 searchsbl
      4 patrolstats
      4 pageviews-test
      4 pagepile
      4 multidesc
      4 locator-tool
      4 limesmap
      4 lexeme-senses

      4 citeplato
      4 bub
      4 bawolff
      4 aivanalysis
      3 wwwroot.rar
      3 wscontest
      3 wikitext-deprecation
      3 wikiinfo
      3 wd-depicts
      3 vendor
      3 url-converter
      3 ukbot
      3 tools.zip
      3 tools.rar
      3 text2hash
      3 swviewer
      3 styleguide
      3 sbot
      3 russbot
      3 ruarbcom
      3 res
      3 pathway-viewer
      3 osm4cgi-bin
      3 oauth-hello-world
      3 not-in-the-other-language
      3 mediaviews-api
      3 integraality
      3 index.php
      3 iluvatarbot
      3 idsgen
      3 E5
      3 dibot
      3 delinker
      3 coord
      3 contact
      3 cobot
      3 cobain
      3 botwatch
      3 blockyquery
      3 bf.rar
      3 backup.rar
      3 author-disambiguator
      3 anomiebot
      2 zimmerbot
      2 yifeibot
      2 yichengtry
      2 ws-google-ocr
      2 ws-cat-browser
      2 wmukevents
      2 wikidata-redirects-conflicts-reports
      2 whichsub
      2 webarchivebot
      2 w
      2 usualsuspects
      2 urbanecmbot
      2 tsreports
      2 translate
      2 tool-db-usage
      2 tedbot
      2 tb
      2 svgworkaroundbot
      2 svgtranslate-test
      2 svg-map-maker
      2 soxred93
      2 snapshots
      2 slow-parse
      2 sign-language-browser
      2 sdm
      2 ruwikisource
      2 revertstat
      2 render-tests
      2 quickcategories
      2 ppp-sparql
      2 or.wikipedia.org
      2 ordia
      2 oojs-ui
      2 nppdash
      2 npp
      2 nikola
      2 montage
      2 meetbot
      2 mc8
      2 mbh
      2 listpages
      2 ktc
      2 jorobot
      2 jogotools
      2 isprangefinder
      2 iplookup
      2 interactoa
      2 inkowik
      2 huggle
      2 historicmaps
      2 hewiki-tools
      2 hashtags-test
      2 grantmetrics-test
      2 ~geohack
      2 genealog
      2 freefiles
      2 filedupes
      2 extreg-wos
      2 dschwen
      2 drtrigonbot
      2 doc
      2 dimastbkbot
      2 deadlinks
      2 dapete
      2 crosswatch
      2 commons-app-stats
      2 comidentgen
      2 cluestuff
      2 blankpages
      2 bibleversef
      2 awmd-stats
      2 artuploader
      2 ads.txt
      1 yunomi
      1 wp-login.php
      1 wikiviewstats2
      1 wikidata-compare
      1 wdpv
      1 wdprop
      1 wd-image-positions
      1 wam
      1 urdusign
      1 tptools
      1 topviews-test
      1 topicmatcher
      1 tool=youtube-channel
      1 tool=wiwosm
      1 tool=pmidtool
      1 tool=pltools
      1 tool=mdann52bot
      1 Tool_Labs_logo_thumb.png
      1 tool=coursestats
      1 tool=bambots
      1 tmg
      1 textcatdemo
      1 svgedit
      1 stereoskopie
      1 status
      1 statistics
      1 speedpatrolling
      1 smv-description-translations
      1 signpostlab
      1 samoabot

      1 pub
      1 proneval-gsoc17
      1 primerpedia
      1 phabricator-bug-status
      1 personabot
      1 pas
      1 pagevie
      1 page
      1 ores-support-checklist
      1 myrcx
      1 mwstew
      1 most-wanted
      1 mohib
      1 meta.wikimedia.org
      1 menubar.js
      1 map-search
      1 maintgraph
      1 main.css
      1 magnus
      1 list&_escaped_fragment_=
      1 lestaty
      1 lcm-dashboard
      1 krinkle-redirect
      1 it-wiki-users-leaflet
      1 itwiki
      1 itsource
      1 isbn-tmptest
      1 isa
      1 inactiveadmins
      1 import-500px
      1 hrwiki
      1 hroest
      1 hgztools
      1 hatjits
      1 hatjit
      1 hatji
      1 hatj
      1 hat
      1 hall-of-fame
      1 ha
      1 gutrs
      1 gridengine-status
      1 gpy
      1 globalsearch
      1 gerrit-patch-uploader
      1 gerakitools
      1 geohack.php
      1 geoha
      1 g
      1 fn
      1 five-million
      1 fatameh
      1 dschwenbot
      1 dockerregistry
      1 disclaim
      1 dataviz
      1 css
      1 contributionsurveyor
      1 churches
      1 cdnjs-beta
      1 cdnjs
      1 bub2
      1 bsaut
      1 bldrwnsch
      1 author=9
      1 author=8
      1 author=7
      1 author=6
      1 author=5
      1 author=4
      1 author=3
      1 author=2
      1 author=15
      1 author=14
      1 author=13
      1 author=12
      1 author=11
      1 author=10
      1 author=1
      1 assets
      1 ash-dev
      1 ascal
      1 api
      1 antigng-bot
      1 admin-beta
      1 adas
      1 actrial
      1 43.461111111111_N_-3.9252777777778_E_type:city

A lot of the crawled URLs are malformed garbage, but many of them are things that could cause DB lookups, as @MusikAnimal hypothesized.

I think there are several questions to consider related to this:

  • Is blanket blocking at the reverse proxy a good idea in general?
  • Do we want to block all crawlers, or just some set of "bad" crawlers?
  • Would some tools need/want an opt-out switch to allow some/all crawlers?
  • How would we decide if the crawler list needed to be updated?
  • What kind of response would we give to the blocked user-agents?

I suggest that rather than a global block we think about this as a tooling issue -- maybe provide a default 'block everything' robots.txt (or even an actual service block) and a well-documented way for users to manage this.

Bstorm lowered the priority of this task from High to Medium. Feb 25 2020, 5:21 PM

I suggest that rather than a global block we think about this as a tooling issue -- maybe provide a default 'block everything' robots.txt (or even an actual service block) and a well-documented way for users to manage this.

This is now the case for all tools behind the Toolforge shared proxy following the resolution of T251628: Serve some default well known files for Toolforge webservices. The solution that was used there could also be done for the Cloud VPS proxy (catch 404 for /robots.txt and serve a default version).
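
For reference, the "block everything" default suggested above amounts to a two-line robots.txt:

User-agent: *
Disallow: /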

I took this and used it as a grep filter to get some idea of how much traffic it would block:

@bd808 Hi! Would it be possible to re-run the log filter you made in the comment above in 2019, to see how much crawler traffic has increased with the advent of LLM training? Some up-to-date robot names can be seen in this repo.

This seems to be a recurring problem for a number of tools. Unfortunately, user-agent-based filtering doesn't work for many of the AI scrapers (at least the most impactful ones), since they'll reportedly switch user agents if you start blocking them. It's still worth a shot, I think, and I do think it is useful to do this at the Toolforge proxy level where possible. Most tool maintainers don't have the knowledge of what or how to block, nor are they likely to keep up with changes over time. Maintaining a single blocklist would better protect the shared infrastructure, possibly with the ability for tools to opt out.
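
For illustration, a single blocklist at an nginx-style front proxy could look roughly like this. It is only a sketch: the upstream name, the opt-out mechanism and the exact proxy software and configuration layout are assumptions, and the User-Agent strings are taken from the lists discussed in this task.

map $http_user_agent $bad_bot {
    default 0;
    "~*(GPTBot|ClaudeBot|PetalBot|Amazonbot|SemrushBot|CCBot)" 1;
}

server {
    listen 443 ssl;
    server_name ~^(?<tool>[^.]+)\.toolforge\.org$;

    location / {
        # A hypothetical per-tool opt-out list could clear $bad_bot here.
        if ($bad_bot) {
            return 403;
        }
        proxy_pass http://tool-backend;   # hypothetical upstream name
    }
}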

Retelling the story of T400212:
The amount of disturbance has exceeded reasonable limits. BETA cannot be used by regulars, since huge numbers of IP ranges and networks from regular, innocent providers are blocked. Tools cannot answer any more.

Both wmcloud.org and toolforge.org are suffering from traffic overload.

  • Obviously this is caused by crawlers and bots.
  • Unlike a Wikipedia, there is nothing here for search engines or archives to explore.

It should be ensured that no User Agent containing one of the following strings shall receive content:

archive.org_bot
AwarioBot
Amazonbot
bingbot
Brightbot
CCBot
ClaudeBot
DataForSeoBot
DotBot
DuckDuckBot
Googlebot
GPTBot
IABot
libwww-perl
MojeekBot
OAI-SearchBot
PerplexityBot
PetalBot
PriEcoBot
SemanticScholarBot
SemrushBot
SeznamBot
Thinkbot
TelegramBot
Twitterbot
YandexBot

A German technical village pump discussion tells more.

  • The list was collected from a current Toolforge log file, covering 24 h, one week ago.
  • The tool could not answer any query any more.
  • PetalBot (i.e. Huawei) in particular caused the overload.
  • After filtering as described, the tool answered quickly, faster than ever.

Rather than implementing individual defensive measures in every single tool, wmcloud.org and toolforge.org should maintain a common solution applied to both domains.

  • xtools@wmcloud is also suffering from overload.
  • In T393487#11024836 it is claimed that “these wikis all have robots.txt files that tell all crawlers to ignore the sites”.
    • Well, obviously not. Otherwise those queries would not have been found in the recent log file.

On the other hand, the IP blocking at BETA should be terminated as soon as possible. IP ranges are not a good way to distinguish bots from human beings over a period of months.

A robots.txt file is only advisory guidance for well-behaved web bots:

The request says “no User Agent containing one of the following strings shall receive content”.

  • That does not mean: Oh, please, dear crawler, obey our robots.txt declaration!
  • It demands: No content (which is 403).
  • I do not care about good manners.
  • If a crawler disguises itself with a regular browser identification, that will need special action.

Please note Wurgl's code; it exits as soon as one of those strings is found.

Actually a common JSON config should be established:

"bots": [ "archive.org_bot", "awariobot", "amazonbot", ... ]
"others": [ "libwww-perl", "...", ... ]
"fakes": { "mozilla/5.0 (compatible; ie 10; wow64) like gecko/111 firefox/111": "128.0.",
           "mozilla/5.0 (compatible;...": "128.0." }

Then a performant shield should be created like the following pseudo code:

Scan = downcase( UserAgent )
IF instring( "bot", Scan ) THEN
   FOR i = ..., #bots
      IF instring( bots[ i ], Scan ) THEN
         exit 403
      END IF
   END FOR
ELSE
   FOR i = ..., #others
      IF instring( others[ i ], Scan ) THEN
         exit 403
      END IF
   END FOR
   Block = fakes[ Scan ]
   IF Block AND inRange( IPaddress, Block ) THEN
      exit 403
   END IF
END IF

Notes:

  • Not every bot access is undesired; some bots are run by WMF community accounts. Only third-party search engines, archives and AI collectors are to be bounced back.
  • A faked UA that looks like a regular browser can be narrowed by IP range for 12 months to minimize collateral damage for humans using the real browser, and can be dropped once that fake has been replaced by another villain.
  • robots.txt would not be accessible to those bots. Well, does it matter? Before starting this procedure, explicit access to robots.txt could be granted, bailing out early.
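
A minimal PHP sketch of the shield pseudocode above, assuming the hypothetical JSON config proposed here (the config file path and the exact keys are placeholders):

<?php
// Sketch of the shield described above; config path and keys are hypothetical.
$config = json_decode(file_get_contents(__DIR__ . '/bot-blocklist.json'), true);
$scan   = strtolower($_SERVER['HTTP_USER_AGENT'] ?? '');

// True if any of the (lowercase) needles occurs in the user agent.
function matchesAny(string $ua, array $needles): bool {
    foreach ($needles as $needle) {
        if ($needle !== '' && strpos($ua, $needle) !== false) {
            return true;
        }
    }
    return false;
}

if (strpos($scan, 'bot') !== false) {
    if (matchesAny($scan, $config['bots'] ?? [])) {
        http_response_code(403);
        exit;
    }
} else {
    if (matchesAny($scan, $config['others'] ?? [])) {
        http_response_code(403);
        exit;
    }
    // "fakes": browser-like UAs, blocked only when the request also comes
    // from the recorded IP prefix (e.g. "128.0.").
    $range = $config['fakes'][$scan] ?? null;
    if ($range !== null && strpos($_SERVER['REMOTE_ADDR'] ?? '', $range) === 0) {
        http_response_code(403);
        exit;
    }
}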

Anubis (https://anubis.techaro.lol/docs/) is a middleware tool that presents a challenge to suspicious requesters, which costs some computing time and thus discourages unsolicited users. It can be configured with lists of bad, strange or good user agents, sending them a challenge or not depending on this information.
The Beta cluster, or even all WMCS web proxies, could have it in front of all requests, with a default blocking policy to keep out the worst crawlers.
This software is used on well-known websites with similar crawling issues, such as https://git.kernel.org, https://gitlab.gnome.org and https://wiki.archlinux.org.
Another similar tool: https://git.gammaspectra.live/git/go-away

Related task for the XTools tool: T400229: Add Anubis to XTools. @MusikAnimal, any feedback if you've already tested it?
Related task for the WS Export tool: T392768: Can't connect to https://ws-export.wmcloud.org/

I agree robots.txt is useless. I have had everything blocked for years and it doesn't stop anything: https://xtools.wmcloud.org/robots.txt

I have not gotten very far with Anubis for XTools. Anyone interested is welcome to assist at T400229. I do have it running on my 3rd party wiki however, so I think I can confidently say that it works wonders. I saw an immediate halt of nearly all bot activity after I deployed Anubis.

A general note – Anubis needs the real IP address, so most tools will probably not benefit from it if they have their own installation. XTools and WS-Export were given explicit permission to have the X-Forwarded-For header enabled (which exposes the real IP) for counter-abuse purposes.

Note: XTools is now using Anubis in production, and it's worked well. (See conclusions at T400229.)

Currently I have more than 100,000 visits to PHP pages per hour, which is too much to handle.

I added the following code to filter out bots

<?php
  if(substr($_SERVER['HTTP_ACCEPT_LANGUAGE'] ?? '(null)', 0, 2) != 'de') {
    if(isset($userdb)) {
      // Code to keep the bots away
      // At most 20 requests in the last minute (one every 3 seconds)
      // at most 50 requests in the last 5 minutes (one every 6 seconds)
      // at most 400 requests in the last hour (one every 9 seconds)
      // botlock then ends up with about 8-9 entries
      $hashVal = hash('sha3-256',
        ($_SERVER['HTTP_USER_AGENT'] ?? '(null)') . '|' .
        ($_SERVER['HTTP_ACCEPT'] ?? '(null)') . '|' .
        ($_SERVER['HTTP_ACCEPT_LANGUAGE'] ?? '(null)') . '|' .
        ($_SERVER['HTTP_ACCEPT_ENCODING'] ?? '(null)') . '|' .
        ($_SERVER['LC_ALL'] ?? '(null)'));
      $query = 'INSERT INTO botaccess (timestamp, hash) VALUES(' . time() . ", '$hashVal')";
      $userdb->query($query, MYSQLI_USE_RESULT);

      $query = "SELECT 1 FROM botlock WHERE hash = '$hashVal'";
      $result = $userdb->query($query, MYSQLI_USE_RESULT);
      if($row = $result->fetch_object()) {
        // print $hashVal;
        http_response_code(403);
        exit(0);
      }
      $result->close();

      if(mt_rand(0, 100) == 17) {
        $query = 'DELETE FROM botaccess WHERE timestamp < ' . (time() - 3600);
        $userdb->query($query, MYSQLI_USE_RESULT);
        $query = 'SELECT hash FROM botaccess GROUP BY hash HAVING COUNT(*) > 400';
        $hashes = [];
        $result = $userdb->query($query, MYSQLI_USE_RESULT);
        while($row = $result->fetch_object()) {
          $hashes[$row->hash] = true;
        }
        $result->close();
        $query = 'SELECT hash FROM botaccess WHERE timestamp > ' . (time() - 300) . ' GROUP BY hash HAVING COUNT(*) > 50';
        $result = $userdb->query($query, MYSQLI_USE_RESULT);
        while($row = $result->fetch_object()) {
          $hashes[$row->hash] = true;
        }
        $result->close();
        $query = 'SELECT hash FROM botaccess WHERE timestamp > ' . (time() - 60) . ' GROUP BY hash HAVING COUNT(*) > 20';
        $result = $userdb->query($query, MYSQLI_USE_RESULT);
        while($row = $result->fetch_object()) {
          $hashes[$row->hash] = true;
        }
        $result->close();
        $query = 'TRUNCATE botlock';
        $userdb->query($query, MYSQLI_USE_RESULT);
        if(count($hashes) > 0) {
          $query = "INSERT IGNORE INTO botlock (hash) VALUES('" . implode("'), ('", array_keys($hashes)) . "')";
          $userdb->query($query, MYSQLI_USE_RESULT);
        }
      }

      return;
    }
}

In other words: create a hash of the combination of user agent, accepted format, language, encoding and the local language, and record it with a timestamp in a table. Roughly every 100 successful visits, delete all entries older than one hour and add the hashes with excess visits to the list of those to be blocked, usually up to 9 such entries. Since my tool (persondata) addresses German speakers, everyone with language "de" is allowed through. In addition there is a robots.txt disallowing all bots – some behave nicely, some do not.
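
For reference, the queries in that snippet imply roughly the following schema; this is only a sketch, and the actual column types and indexes may differ:

-- Implied by the INSERT/SELECT statements above; definitions are assumptions.
CREATE TABLE botaccess (
  timestamp INT UNSIGNED NOT NULL,   -- Unix time of the request
  hash      CHAR(64)     NOT NULL,   -- sha3-256 of the header combination
  KEY (hash),
  KEY (timestamp)
);

CREATE TABLE botlock (
  hash CHAR(64) NOT NULL,
  UNIQUE KEY (hash)                  -- lets INSERT IGNORE deduplicate
);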

But the webserver still sometimes returns a 301, which smells like the internal forward done by .lighttpd.conf.

I do not really like it, but I see no other way to distinguish a useful access from a nasty (AI) bot; some of those bots have a language of zh-TW, zh-CN or zh-SG set, and some have en-US. But the real solution should be elsewhere, at a level with a visible IP-address.

But the real solution should be elsewhere, at a level with a visible IP-address.

I don't disagree that better blocking tools are needed, but we have also generally entered an era of website content harvesting where the bots are IP hopping very systematically, such that IP blocks are nearly useless. These bots are also using "residential proxies" to mix their traffic in with normal humans as much as they can. In Beta Cluster I recently saw 33,440 separate /24 networks each sending 2 requests to the webservers in a 24 hour period.