
Chinese scraper (?) with multiple IP addresses overloading wsexport
Closed, Resolved, Public

Description

The requests come from a variety of Chinese IP addresses. Requests are of the form

<ip> - - [29/Dec/2015:12:24:55 +0000] "GET /wsexport/tool/book.php?lang=fr&format=pdf-a5&page=La_Fortune_de_Gaspard HTTP/1.1" 200 97999 "-" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36"

and are hitting wsexport every ~20 seconds. As serving each request takes much longer than that, this is effectively killing wsexport.

All are using the same odd user agent:

Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36

which is a year-old Chrome version on Windows XP. As Chrome auto-updates, this suggests to me it's a scraper lying about the user agent.

This user agent is linked to the following IP addresses, all Chinese:

valhallasw@tools-proxy-01:~$ sudo tail -n 100000 /var/log/nginx/access.log | grep wsexport | grep  "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36" | cut -d"-" -f1 | sort |uniq -c
     41 119.188.12.11
     10 119.188.12.7
     36 119.188.50.138
      2 218.26.232.136
     10 218.26.232.164
    111 61.54.24.78

Together these account for 210 of the 320 most recent requests to wsexport.
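
The tally above could also be wrapped in a small reusable script. This is only a sketch: the function names `count_by_ip` and `ua_share` are made up for illustration, and it assumes the access log format shown above, with the client IP as the first field. It uses `grep -F` so the user agent is matched as a literal string rather than as a regular expression.

```shell
#!/bin/sh
# Sketch only: count_by_ip and ua_share are illustrative names,
# not part of any existing tool.

UA='Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36'

# Usage: count_by_ip LOGFILE USER_AGENT
# Prints the number of wsexport requests per client IP, busiest first.
count_by_ip() {
    grep -F 'wsexport' "$1" |      # keep only wsexport requests
        grep -F -- "$2" |          # match the user agent literally
        awk '{ print $1 }' |       # assumed: client IP is the first field
        sort | uniq -c | sort -rn  # count per IP, sort by count descending
}

# Usage: ua_share LOGFILE USER_AGENT
# Prints "<matching> of <total>" for wsexport requests in LOGFILE.
ua_share() {
    total=$(grep -cF 'wsexport' "$1")
    matching=$(grep -F 'wsexport' "$1" | grep -cF -- "$2")
    echo "$matching of $total"
}
```

For example, after saving the tail of the access log to a file, `count_by_ip recent.log "$UA"` lists the per-IP counts and `ua_share recent.log "$UA"` shows what share of wsexport traffic carries this user agent (`recent.log` is a placeholder file name).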

Blocking these requests by user agent is probably the most effective approach.