
[Discuss] DOS attacks on ORES. What to do?
Closed, Resolved · Public

Description

There have been some massive spikes in requests to ORES that have made the service unreliable. What's a good way to handle this type of problem?

Event Timeline

Halfak triaged this task as High priority. Oct 16 2016, 10:31 PM
Halfak added a subscriber: akosiaris.

@akosiaris, I figure you'd have some ideas. What do you advise that we do about this?

Do we know that these are malicious attacks? Or are they consumers who are just overloading it?

https://graphite.wikimedia.org/S/Bp

Looks like someone requesting scores for the wp10 model one at a time.

Is there contact information in their user agent? I'd just block them that way (400 or something) until we can talk to them and have them use batching, etc.

I pulled the IP address and created a private subtask to temporarily block or throttle this client: T148356. Since our thread here is a [discussion] and not a spike, I'll mark the netops task as pending a second key-turner.

The traffic doesn't look particularly malicious at all, just reasonably impolite. Its user-agent string is nothing but the slippery "Ruby".
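Something like this against wmf.webrequest will surface the noisy clients. A sketch only, assuming the standard user_agent and ip fields and the right hour partition:

select
    user_agent,
    ip,
    count(*) as requests
from wmf.webrequest
where
    uri_host = 'ores.wikimedia.org'
    and year=2016 and month=10 and day=16 and hour=18  -- adjust to the hour in question
group by user_agent, ip
order by requests desc
limit 20;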

We talked about it in #wikimedia-operations on IRC. It seems it was @Daniel_Mietchen doing 142 edits per minute on Wikidata without the bot flag, which is against the bots policy. I blocked him and explained on his talk page. I hope things are alright now.

The action taken is probably the best we can do in a DoS scenario like this one. Since the user is, by all appearances, a good-faith one, we should unblock him ASAP (I assume they'll ask for a bot flag).

In the meantime, we had already planned to double the capacity of the SCB cluster this week. I'll try to finish that part today.

We were getting upwards of 7900 requests per minute. That couldn't be Daniel's bot, could it?

No, there seems to be a general issue right now. It's clear in #wikimedia-operations.

It looks like we just doubled capacity. And we've filed a follow-up task, T148594: Implement prioritization of request processing.

I think that we're done with this [Discuss]ion for now.

Halfak claimed this task.

I'm still trying to understand the cause of the spike, and would like help interpreting a few things.

-- All ORES web requests logged during the hour of the spike.
select *
from wmf.webrequest
where
    uri_host = 'ores.wikimedia.org'
    and year=2016 and month=10 and day=17 and hour=6
order by
    dt
limit 100000;

-> 23,389 rows selected

Comparing the Hive webrequest counts with Grafana, I believe the Y axis is scaled incorrectly. For example, during the interval [06:20:00, 06:25:00), there are 4,148 requests logged in Hive, while Grafana reports 2,811 + 2,510 = 5,321 requests *per minute* for that period, roughly five times higher than what the webrequest table shows. It seems more likely that the Grafana scale is requests per granularity unit, five minutes in this case, and should be divided by five. The correct load average would then be 4,148 / 5 ≈ 830 requests per minute during the spike.
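A per-minute rollup along these lines makes the comparison concrete (a sketch, assuming dt is an ISO-8601 timestamp string so its first 16 characters identify the minute):

-- Requests per minute during the spike hour, to line up against the Grafana panel.
select
    substr(dt, 1, 16) as minute,
    count(*) as requests
from wmf.webrequest
where
    uri_host = 'ores.wikimedia.org'
    and year=2016 and month=10 and day=17 and hour=6
group by substr(dt, 1, 16)
order by minute;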

Another scary detail is the number of cache hits for WMF precache requests--there should be *zero*, but about a fifth of our responses are cache hits. What cache is this? The webrequest documentation ominously declares,

Cache status (this field is wrong and should be removed)

--however, the supposed cache hits are indeed served in under a millisecond, which makes me think they really are cached at the web frontend--but how? Spot-checking a few, I don't see any earlier requests that would have primed the cache; maybe we should look through the application logs. Varnish caching isn't enabled, either... Maybe we're pulling the cache flag from application response headers?
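In the meantime, a breakdown like this would at least show how often the field claims a hit and how fast those responses were. A sketch only: I don't have a clean filter for the precache traffic, and the field is documented as unreliable, so treat it as exploratory.

-- Claimed cache status vs. response time for ORES traffic during the spike hour.
select
    cache_status,
    count(*) as requests,
    avg(time_firstbyte) as avg_ttfb_seconds,
    max(time_firstbyte) as max_ttfb_seconds
from wmf.webrequest
where
    uri_host = 'ores.wikimedia.org'
    and year=2016 and month=10 and day=17 and hour=6
group by cache_status;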

The TTFB is terribly inconsistent during this entire hour--it's a bad failure mode, with lots of 15-second timeouts. I'd like to have a graph of Celery queue size vs. TTFB. I'll study the graphs at https://grafana.wikimedia.org/dashboard/db/ores
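Until that graph exists, a per-minute TTFB summary from webrequest would at least quantify the inconsistency and count the 15-second timeouts. A sketch, assuming time_firstbyte is in seconds; percentile_approx is a stock Hive UDAF.

select
    substr(dt, 1, 16) as minute,
    percentile_approx(time_firstbyte, array(0.5, 0.95, 0.99)) as ttfb_p50_p95_p99,
    sum(if(time_firstbyte >= 15, 1, 0)) as timeouts
from wmf.webrequest
where
    uri_host = 'ores.wikimedia.org'
    and year=2016 and month=10 and day=17 and hour=6
group by substr(dt, 1, 16)
order by minute;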

I'm suspicious of the "burstiness" explanation for this episode. The trouble starts around 2016-10-16 18:45, when there is a flurry of timeout errors with no obvious corresponding spike in request load:
https://grafana.wikimedia.org/dashboard/db/ores?from=1476641866834&to=1476648723086
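As a sanity check, a per-minute count of requests vs. 5xx responses around that window should show whether the error flurry lines up with any jump in load (a sketch; "errors" here just means 5xx status codes recorded in webrequest):

select
    substr(dt, 1, 16) as minute,
    count(*) as requests,
    sum(if(http_status like '5%', 1, 0)) as errors_5xx
from wmf.webrequest
where
    uri_host = 'ores.wikimedia.org'
    and year=2016 and month=10 and day=16 and hour=18
group by substr(dt, 1, 16)
order by minute;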

Something happens with the worker memory and CPU at around that time--maybe a cron job?

A bit *later*, when the system is recovering from whatever happened, we're hit by something odd: the total request load stays constant, but the "scoring requests" graph shows a higher proportion of... something. It's unclear whether the cache hit rate is going up or down, but it's doing something.

Then we go into a tailspin where response time and timeouts climb for the next few hours. We need a graph of worker queue size. Does our performance tank as soon as we get any backlog at all?