There have been some massive spikes in requests to ORES that have made the service unreliable. What's a good way to handle this type of problem?
I pulled the IP address and created a private subtask to temporarily block or throttle this client: T148356. Since our thread here is a [discussion] and not a spike, I've marked the netops task as pending a second key-turner.
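For the throttling option, a token-bucket limiter keyed on client IP is the usual approach: it absorbs polite bursts but caps sustained request rates. A minimal sketch (the rates and the per-IP mapping are illustrative assumptions, not what netops actually deployed):

```python
import time

class TokenBucket:
    """Token-bucket limiter: allows `rate` requests/sec with bursts up to `burst`."""
    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens proportionally to elapsed time, capped at burst size.
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per client IP (hypothetical in-memory mapping).
buckets = {}
def throttle(ip, rate=10, burst=20):
    bucket = buckets.setdefault(ip, TokenBucket(rate, burst))
    return bucket.allow()
```

In practice this would live at the proxy layer rather than in application code, but the behavior is the same: the impolite client gets 429s once its bucket drains, while everyone else is unaffected.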
The traffic doesn't look malicious at all, just impolite. Its user-agent string is nothing but an unhelpful "Ruby".
Blocking is probably the best available response in a DoS scenario like this one. Since the user appears to be acting in good faith, we should unblock them ASAP (I assume they'll ask for a bot flag).
In the meantime, we had already planned to double the capacity of the SCB cluster this week. I'll try to finish that part today.
I'm still trying to understand the cause of the spike, and would like help interpreting a few things.
SELECT * FROM wmf.webrequest WHERE uri_host = 'ores.wikimedia.org' AND year=2016 AND month=10 AND day=17 AND hour=6 ORDER BY dt LIMIT 100000;
-> 23,389 rows selected
Comparing the Hive webrequest counts with Grafana, I believe the Y axis is scaled incorrectly. For example, during the interval [06:20:00, 06:25:00), there are 4,148 requests logged in Hive. Grafana reports 2,811 + 2,510 = 5,321 requests *per minute* for that period, but that's roughly five times higher than what the webrequest table shows. It seems more likely that the Grafana scale is requests per granularity unit (five minutes in this case) and should be divided by five. The correct load average would be 4,148 / 5 = about 830 requests/minute during the spike.
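To make the scaling argument concrete, here is the arithmetic spelled out. Dividing the Grafana figure by the 5-minute granularity gives ~1,064 requests/minute, which is in the same ballpark as the Hive-derived ~830/minute (the two systems don't count identically, so an exact match isn't expected), whereas taking Grafana at face value implies a rate several times anything Hive records:

```python
# Sanity check of the Grafana-vs-Hive numbers discussed above.
hive_requests = 4148               # webrequest rows in [06:20, 06:25)
grafana_reported = 2811 + 2510     # the two series Grafana labels "per minute"
granularity_minutes = 5

# Hypothesis: Grafana is actually reporting requests per 5-minute bucket.
implied_per_minute = grafana_reported / granularity_minutes
actual_per_minute = hive_requests / granularity_minutes

print(implied_per_minute)  # ~1064 req/min if the hypothesis holds
print(actual_per_minute)   # ~830 req/min from the webrequest table
```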
Another scary detail is the number of cache hits for WMF precache requests--there should be *zero*, but about a fifth of our responses are cache hits. What cache is this? The webrequest documentation ominously declares,
Cache status (this field is wrong and should be removed)
--however, the supposed cache hits are indeed served in under a millisecond, which makes me think they are cached at the web frontend--but how? Spot-checking a few, I don't see any earlier requests that would have primed the cache; maybe we should look through the application logs. Varnish caching isn't enabled, either... Maybe we're pulling the cache flag from the application's response headers?
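The spot check could be automated across the whole hour: for every row flagged as a hit, look for an earlier request to the same URI that could have primed the cache, and flag the hits that have none. A sketch, assuming rows are dicts with `dt`, `uri_path`, and `cache_status` keys (a simplification of the webrequest schema) sorted by timestamp:

```python
from collections import defaultdict

def unexplained_hits(rows):
    """Return rows flagged as cache hits that have no earlier request
    to the same URI (i.e. nothing that could have primed the cache)."""
    seen = defaultdict(list)   # uri_path -> timestamps of earlier requests
    suspects = []
    for row in rows:
        if row['cache_status'] == 'hit' and not seen[row['uri_path']]:
            suspects.append(row)
        seen[row['uri_path']].append(row['dt'])
    return suspects
```

If most "hits" come back as unexplained, that would support the theory that the flag is being misread from response headers rather than reflecting a real frontend cache.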
TTFB is wildly inconsistent during this entire hour, with lots of 15-second timeouts--a bad failure mode.
I'd like to have a graph of Celery queue size vs TTFB. I'll study the graphs at https://grafana.wikimedia.org/dashboard/db/ores
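Even before that graph exists, a quick way to test the suspected relationship is to export the two series for the incident window and correlate them. A sketch with made-up sample values (the real series would come from Graphite/Grafana exports):

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical aligned per-minute samples for the incident window:
queue_depth = [0, 2, 5, 9, 14, 20]
ttfb_ms     = [120, 180, 450, 900, 2500, 15000]
print(round(pearson(queue_depth, ttfb_ms), 2))
```

A strongly positive correlation here would back the "backlog drives latency" reading of the graphs; a weak one would point back at something else, like the worker memory/CPU event.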
I'm suspicious of the "burstiness" explanation for this episode. The trouble starts around 2016-10-16 18:45, when there is a flurry of timeout errors without any obvious corresponding spike in request load:
Something happens with the worker memory and CPU at around that time--maybe a cron job?
A bit *later*, when the system is recovering from whatever happened, we're hit by something odd: the total request load stays constant, but the "scoring requests" graph shows a higher proportion of... something. It's unclear whether the cache hit rate is going up or down, but it's doing something.
Then, we go into a tailspin where response time and timeouts climb for the next few hours. We need a graph of worker queue size. Seems that our performance tanks when we get any backlog at all?