Maniphest T194541

Investigation: Why is there a Google Proxy API usage spike every 5 days?
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	kaldari
	May 11 2018, 8:20 PM

Description

It looks like we're hitting the daily quota every 5 days exactly due to a regularly timed spike. On April 26, May 1, May 6, and May 11, there were huge spikes in Google Search API usage from Tool Forge resulting in hitting the quota and then being denied service for the rest of the day. The latest spike was from about 1am to 2am (PST) on May 11. This is most likely coming from Earwig's Copyvio Tool (or its API), but we should do some logging and check.

API documentation: https://wikitech.wikimedia.org/wiki/Nova_Resource:Google-api-proxy

Related Objects
Search...

		Status	Subtype	Assigned	Task
		Resolved		CommunityTechBot	T193559 Copyvio detection tool cannot use Google search engine
		Resolved		kaldari	T194541 Investigation: Why is there a Google Proxy API usage spike every 5 days?

Event Timeline

kaldari triaged this task as High priority.May 11 2018, 8:20 PM

kaldari created this task.

Urbanecm moved this task from Backlog to Watching on the User-Urbanecm board.May 11 2018, 8:21 PM

@bd808: Do you know if there's any existing logging for this that we could look at?

kaldari renamed this task from Investigation: Why is there a Google Proxy API usage spike every 5 days to Investigation: Why is there a Google Proxy API usage spike every 5 days?.May 11 2018, 8:22 PM

kaldari updated the task description. (Show Details)

Maybe bots crawling the wiki and following the link in https://en.wikipedia.org/wiki/Template:Copyvios (or elsewhere)? This happens with XTools. We are blocking bot-like UAs in the Apache config.

Edit: It looks like ruwiki links to Copyvios in revision histories: https://ru.wikipedia.org/wiki/MediaWiki:Histlegend

In T194541#4201510, @kaldari wrote:

@bd808: Do you know if there's any existing logging for this that we could look at?

There are access logs for the proxy in /var/log/nginx on google-api-proxy-02.google-api-proxy.eqiad.wmflabs which are kept for ~14 days. We don't have per-consumer API tokens, but I would hope that you can figure things out from the user-agent fields for each request.

I checked the nginx access.log, I only see two UAs (which I think are safe to mention) -- earwigbot and Googlebot. The latter was only a handful of requests. That makes sense, after all, because it's the Copyvios tool that's using this proxy service and not the end user, right? So we'd maybe need to check the Copyvios access.log to see if there's some automated disruption.

In T194541#4201550, @MusikAnimal wrote:

I checked the nginx access.log, I only see two UAs (which I think are safe to mention) -- earwigbot and Googlebot. The latter was only a handful of requests. That makes sense, after all, because it's the Copyvios tool that's using this proxy service and not the end user, right?

Correct. I pulled the user-agent data from all the current logs and EarwigBot has made 98.77% of requests.

If anybody wants to repeat this feat, here's a script to do it:

access.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import bz2
import collections
import fileinput
import gzip
import os
import re


RE_LINE = re.compile(
    r'(?P<ipaddr>[^ ]+) '
    r'(?P<ident>[^ ]+) '
    r'(?P<userid>[^ ]+) '
    r'\[(?P<datetime>[^\]]+) \+0000\] '
    r'"(?P<verb>[^ ]+) (?P<uri>[^ ]*) HTTP/[^"]+" '
    r'(?P<status>\d+) '
    r'(?P<bytes>\d+) '
    r'"(?P<referer>[^"]*)" '
    r'"(?P<ua>[^"]*)"'
)


def parse_log(lines, logpat):
    groups = (logpat.match(line) for line in lines)
    return (g.groupdict() for g in groups if g)


def hook_compressed_text(filename, mode, encoding='utf8'):
    # From https://stackoverflow.com/a/21529243/8171
    ext = os.path.splitext(filename)[1]
    if ext == '.gz':
        return gzip.open(filename, mode + 't', encoding=encoding)
    elif ext == '.bz2':
        return bz2.open(filename, mode + 't', encoding=encoding)
    else:
        return open(filename, mode, encoding=encoding)


if __name__ == '__main__':
    uas = collections.defaultdict(int)
    for r in parse_log(
        fileinput.input(openhook=hook_compressed_text),
        RE_LINE
    ):
        uas[r['ua']] += 1

    for ua, hits in sorted(uas.items()):
        print('{}: {}'.format(ua, hits))

Use it like sudo ./access.py /var/log/nginx/access.log*.

So we'd maybe need to check the Copyvios access.log to see if there's some automated disruption.

That sounds like a good next step.

MusikAnimal added a comment.May 11 2018, 8:53 PM

This comment was removed by MusikAnimal.

OK, so starting around 2018-05-11 at 07:40, someone hammers the tool for two hours copyvio-checking a bit over a thousand AfC drafts. They're not using the API, but the sheer rate definitely makes it look like an automated process. They're checking mostly active drafts, but some declined submissions that haven't been touched in months as well. The URLs all have the same format as the copyvio check link in the submission template, a format which probably wouldn't arise if you were generating the URLs yourself, so I suspect it's some web crawler with a predictable activity pattern. I can't imagine why a person would behave in this manner, nor a real Wikipedia bot.

Unfortunately, I'm not logging user agents. Is that information stored in real nginx access logs somewhere that I can't see? (All I have are uwsgi logs and internal logs from the detection engine.)

@Earwig I still think requiring logins on the copyvios tool would go a long way towards saving us time investigating rogue bots using the tool like this. I don't see any obvious disadvantages to logging in. The user does not have to give the bot permission to access its info every time they login, if that's your concern.

I agree requiring logins might be good, at least for non-API requests. Given this is a power user tool it will help wean out the people/bots just clicking things and not actually using the service.

I'll share a list of crawler/spider user agents that are being blocked in XTools for this same reason -- HTML-only requests apparently originating from on-wiki links -- in case you want to try that first. Similar to Copyvios, XTools does heavy operations and we can't afford these illegitimate requests. I think you are safe to disallow these (even for API requests):

(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|Sogou web spider|yandex\.com/bots|twitterbot|TweetmemeBot|SeznamBot|datasift\.com/bot|Yahoo! Slurp|BehloolBot|MJ12bot)

I don't know if robots.txt works for individual Toolforge tools. We have everything disallowed at https://xtools.wmflabs.org/robots.txt, but the above bots apparently don't honour it.

We also block Python-urllib, which I think may be a legitimate library, but I observed obvious misuse. So if you do block that you might want to whitelist it for API requests.

Robots.txt is not blocking, its advising what the bot should not do. Non-legitimate bots probably won't honour it (they do not want to be nice to your tool).

In T194541#4202608, @Earwig wrote:

Unfortunately, I'm not logging user agents. Is that information stored in real nginx access logs somewhere that I can't see? (All I have are uwsgi logs and internal logs from the detection engine.)

Top 10 hits by UA to the copyvios tool (just a grep):

root@tools-proxy-02:~# grep '/copyvios' /var/log/nginx/access.log* | python3 access.py /dev/stdin | tail
The Knowledge AI: 107
Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/): 111
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0: 177
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.170 Safari/537.36: 183
SQLBot: 301
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0: 307
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36: 353
Mozilla/5.0 (compatible; Yeti/1.1; +http://naver.me/spd): 385
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots): 1882
Mozilla/5.0 (TweetmemeBot/4.0; +http://datasift.com/bot.html) Gecko/20100101 Firefox/31.0: 2849

The access.py was T194541#4201556 plus this patch:

2a3
> # https://phabricator.wikimedia.org/T194541#4201556
12a14
>     r'(?P<host>[^ ]+) '
49c51
<     for ua, hits in sorted(uas.items()):
---
>     for ua, hits in sorted(uas.items(), key=lambda e: e[1]):

OK, I'll try blocking the bot user agents from above and in @MusikAnimal's comment. If that doesn't reduce the rate on the 16th, or we still want to implement additional protections, we'll go for @Niharika's suggestion of requiring logins. (I'm not sure how this would integrate with the API, though.)

Edit: User agents blocked (in uWSGI).

MusikAnimal mentioned this in T193559: Copyvio detection tool cannot use Google search engine.May 14 2018, 1:11 AM

• TBolliger moved this task from New & TBD Tickets to Needs Discussion on the Community-Tech board.May 14 2018, 4:10 PM

MusikAnimal mentioned this in T193809: [Timebox 12hr] Investigation: applying copyvio for page review prioritization.May 14 2018, 8:53 PM

• TBolliger moved this task from Needs Discussion to Older: Tracking Work by Others on the Community-Tech board.May 15 2018, 11:14 PM

Number of queries went down dramatically after the 13th. Guess we'll see if it did the trick on the 18th.

18th is over. Was there any change?

In T194541#4203327, @Earwig wrote:

OK, I'll try blocking the bot user agents from above and in @MusikAnimal's comment. If that doesn't reduce the rate on the 16th, or we still want to implement additional protections, we'll go for @Niharika's suggestion of requiring logins. (I'm not sure how this would integrate with the API, though.)

Edit: User agents blocked (in uWSGI).

Can you share the technical way of UA blocking you used "in uWSGI" with me (you can use my email if you want as it is slightly off topic here).

Looks like there was no increase in tool usage on the 18th, but I don't have the exact number of Google API queries made readily available.

In my ~/www/python/uwsgi.ini, I have this line (described here):

route-user-agent = .*(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|YandexBot|twitterbot|TweetmemeBot|SeznamBot|datasift\.com/bot|BehloolBot|MJ12bot|naver\.me/spd).* break:403 Forbidden

Ideally we would block in nginx, but this is fine for the number of hits the tool gets.

@Earwig, @MusikAnimal: There have been no usage spikes at all since the change was made on the 13th and the overall usage is down dramatically. Nice work both of you!

Liuxinyu970226 unsubscribed.May 23 2018, 3:50 PM

SQL subscribed.Jun 5 2018, 2:25 AM

Edgars2007 subscribed.Jun 7 2018, 9:27 PM

• Vvjjkkii renamed this task from Investigation: Why is there a Google Proxy API usage spike every 5 days? to e3caaaaaaa.Jul 1 2018, 1:12 AM

• Vvjjkkii reopened this task as Open.

• Vvjjkkii removed kaldari as the assignee of this task.