
Address mass overload errors in ORES (July 2018, UW origin)
Closed, Resolved · Public

Description

At 0730 UTC on July 25th, ORES began to be hammered by ~200 requests per second from an IP address within the University of Washington. This resulted in the complete overload of CODFW and a spike in the Overload Error rate graph. See https://grafana.wikimedia.org/dashboard/db/ores?panelId=9&fullscreen&orgId=1&from=1532493325359&to=1532528355313

Event Timeline

Halfak created this task. Jul 25 2018, 2:21 PM
Restricted Application added a subscriber: Aklapper. Jul 25 2018, 2:21 PM
Halfak added a subscriber: akosiaris. Edited · Jul 25 2018, 2:22 PM

I sent out an email to my collaborators at UW asking them about the event.

@akosiaris blocked the originating IP address at ~1410 UTC and the overload seems to have stopped.

Halfak added a subscriber: elukey. Jul 25 2018, 2:23 PM

Mentioned in SAL (#wikimedia-operations) [2018-07-25T16:40:25Z] <akosiaris> remove ORES abuser blocking T200338, let's reevaluate

It was a UW researcher. I'll work with her to continue :)

@Halfak as another follow-up step, I'd also add more monitoring to catch these situations. We noticed the issue because Jaime was watching 503s in logstash for an unrelated issue; an alert would have been better :)
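To make the alerting idea concrete, here is a minimal sketch of a sliding-window check on 503 responses. It is only an illustration: the class name `OverloadAlert`, the window length, and the threshold are hypothetical and do not reflect the actual ORES monitoring setup. It also assumes that responses are recorded in non-decreasing timestamp order.

```python
from collections import deque


class OverloadAlert:
    """Fires when too many 503 responses land inside a sliding time window.

    Hypothetical helper for illustration; thresholds are assumptions.
    """

    def __init__(self, window_seconds, max_errors):
        self.window_seconds = window_seconds
        self.max_errors = max_errors
        self._error_times = deque()  # timestamps of recent 503s, oldest first

    def record(self, timestamp, status_code):
        # Only overload responses (HTTP 503) count towards the alert.
        if status_code == 503:
            self._error_times.append(timestamp)

    def is_firing(self, now):
        # Drop errors that have fallen out of the window, then compare
        # the remaining count against the threshold.
        cutoff = now - self.window_seconds
        while self._error_times and self._error_times[0] <= cutoff:
            self._error_times.popleft()
        return len(self._error_times) > self.max_errors
```

A caller would feed each response into `record()` and poll `is_firing()` periodically; in practice this kind of rule would more likely live in the existing metrics and alerting stack than in application code.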

awight added a subscriber: awight. Jul 25 2018, 5:59 PM

Just a minor note: in the past, overload events like this have resulted in the collapse of ORES worker nodes, but in this case the workers continued to serve results at their maximum capacity. I'm not sure there's anything actually pathological about what happened; this is exactly the behavior we hope the service exhibits.

An alert makes sense because the service is made less available for other clients.

The main follow-up work I'd like to see is that we find a way to hard-throttle good-faith users like this researcher, maybe by limiting the number of parallel connections from a single IP at the network layer.
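As a rough illustration of capping parallel connections per IP, the sketch below tracks in-flight requests per client address and rejects anything over the cap. Note that this is an application-level stand-in, not the network-layer limit suggested above, and the `PerIPConcurrencyLimiter` name and `max_parallel` parameter are hypothetical.

```python
import threading


class PerIPConcurrencyLimiter:
    """Caps the number of in-flight requests per client IP.

    Hypothetical application-level sketch; a real deployment would more
    likely enforce this in a proxy or firewall in front of the service.
    """

    def __init__(self, max_parallel):
        self.max_parallel = max_parallel
        self._in_flight = {}  # ip -> number of requests currently running
        self._lock = threading.Lock()

    def try_acquire(self, ip):
        # Returns True and reserves a slot if the IP is under its cap,
        # otherwise returns False so the caller can reject the request.
        with self._lock:
            current = self._in_flight.get(ip, 0)
            if current >= self.max_parallel:
                return False
            self._in_flight[ip] = current + 1
            return True

    def release(self, ip):
        # Frees one slot; removes the entry once the IP has no requests left.
        with self._lock:
            current = self._in_flight.get(ip, 0)
            if current <= 1:
                self._in_flight.pop(ip, None)
            else:
                self._in_flight[ip] = current - 1
```

A request handler would call `try_acquire(ip)` before doing any work, answer with an error such as HTTP 429 when it returns False, and call `release(ip)` in a `finally` block once the request completes.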

Ladsgroup closed this task as Resolved. Nov 1 2018, 8:00 PM