
[Spec] Tracking and blocking specific IP/user-agent combinations
Closed, Resolved · Public

Description

What if someone is abusing the service? How do we know? How do we take them out?

Event Timeline

Halfak created this task. Jun 16 2016, 1:31 PM
Restricted Application added subscribers: Zppix, Aklapper. Jun 16 2016, 1:31 PM
Halfak added a subscriber: BBlack.

@BBlack, we're looking at building good mechanisms for mitigating DoS attacks in ORES. How is this done elsewhere? Any protips?

It seems like there are a few different purposes for this task:

  1. Gather metrics to analyze use of the service
  2. Prevent accidental (or purposeful) DoS
  3. Be able to ban specific IPs and/or user-agents

My feeling is that these should be independent tasks since they deal with different things (logging, rate-limiting, and blacklisting).

Halfak triaged this task as Medium priority. Jul 5 2016, 2:29 PM

To expand a bit on my previous comment:

Gather metrics to analyze use of the service

This feels like it should live within the app itself (maybe as part of metrics_collectors) so that it can have additional context as necessary for recording usage metrics.
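
For illustration only, here is a minimal sketch of what such an in-app collector could look like; the class and method names are hypothetical, not the actual metrics_collectors interface.

# Hypothetical in-app usage collector; the real metrics_collectors
# interface in ORES may differ.
from collections import Counter
import time


class UsageCollector:
    """Counts scoring requests per (context, model), using the context
    the app already has on hand when a request is scored."""

    def __init__(self):
        self.counts = Counter()
        self.started = time.time()

    def observe(self, context, model):
        # Key only on coarse dimensions; keying on user-agent would
        # blow up the key space (as noted later in this thread).
        self.counts[(context, model)] += 1

    def snapshot(self):
        return {"uptime_seconds": time.time() - self.started,
                "counts": dict(self.counts)}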

Prevent accidental (or purposeful) DoS

This could live in the endpoint configuration (like nginx's limit_req) for a simple implementation. However, this could artificially limit the bandwidth of the infrastructure. I think a more robust solution would be scheduling scoring requests on a round-robin basis with queues per user-agent/IP.
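
For a rough sense of the first option, here is a minimal application-level analogue of limit_req (a per-key token bucket); the rate, burst, and key format are made-up example values, not a proposal for ORES's actual configuration.

# Minimal per-key token-bucket limiter, a rough application-level
# analogue of nginx's limit_req; values below are illustrative only.
import time


class RateLimiter:
    def __init__(self, rate_per_sec=10.0, burst=50.0):
        self.rate = rate_per_sec
        self.burst = burst
        self.buckets = {}  # key -> (tokens, last_seen_timestamp)

    def allow(self, key):
        now = time.time()
        tokens, last = self.buckets.get(key, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        self.buckets[key] = (tokens - 1.0 if allowed else tokens, now)
        return allowed


limiter = RateLimiter()
if not limiter.allow(("198.51.100.7", "revscoring-client/1.0")):
    pass  # e.g. respond 429 Too Many Requests instead of queueing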

Be able to ban specific IPs and/or user-agents

I think we need a terms of use (or similar) to clearly define what constitutes 'abuse' before going down this route, but it should be simple enough to implement using the web endpoint's configuration (Example for nginx: here).

I don't see a nice way that we can use metrics_collector to track individual user-agents. If we collected user-agents as part of the key, we'd end up with too many keys.

Do you think engineering a separate processing queue per IP will be overly complicated? It seems like setting a very generous bound on limit_req would be a good start. E.g. set the limit high enough that normal clients never hit it, but low enough that a single key (IP/user-agent) can't knock over the service single-handedly.

Re. defining abuse, I'm not sure that we need to. If some activity falls outside of that definition but causes trouble for our service, we ought to reserve the right to block requests.

You wouldn't necessarily need a separate processing queue per IP; the number of queues would instead be tuned to the number of concurrent users of the service. Then use that fixed set of queues as buckets for hashed user-agent/IPs. Celery seems to use round-robin between different queues. This would get us in the ballpark of a good solution without the overhead of unique queues per user.

Granted, this would need to be verified as a feature we can depend on before depending on it ;)

So, you're imagining that we'd have workers adaptively draw from a set Q of queues where each queue in Q represents a unique IP/user-agent pair? That sounds very complex. How will the workers know when a new queue is inserted into Q? Won't a new queue effectively jump ahead of the rest of Q?

In the end, I'm still lost on what problem this actually solves.

No, you would have a static number of queues, S, that is picked based on how many concurrent users we expect to get under load. As the requests come in, the user-agent/IP, U, is hashed and the request is sent to queues[U % S].
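
A short sketch of that bucketing, assuming hypothetical queue names (hashlib is used rather than Python's built-in hash(), which is salted per process and would not route consistently):

# Hash a user-agent/IP pair to one of S static queues. Queue names are
# hypothetical; hashlib gives a hash that is stable across processes.
import hashlib

NUM_QUEUES = 16  # S: chosen from expected concurrent users under load
QUEUE_NAMES = ["scoring_%02d" % i for i in range(NUM_QUEUES)]


def pick_queue(ip, user_agent):
    key = ("%s|%s" % (ip, user_agent)).encode("utf-8")
    bucket = int(hashlib.sha1(key).hexdigest(), 16) % NUM_QUEUES
    return QUEUE_NAMES[bucket]


# If the scoring job is a Celery task, a request could then be routed with
# something like score_task.apply_async(args, queue=pick_queue(ip, ua));
# workers consuming all S queues round-robin keep one heavy client from
# starving the others.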

The problem this solves is to not limit the throughput of the system artificially when it still has capacity to process more tasks.

Sorry I wasn't clear. It seems a single queue would not artificially limit the throughput of the system when it still has the capacity to process more tasks. So what problem does this solve beyond what a single queue could do?

It prevents starvation.

Sorry. Can you elaborate?

This is so one user-agent/IP doesn't hog all the resources and other users still are able to use the service.

ores currently routes through cache_misc for termination, so to some degree this falls under general DoS protections for all such services that are handled generically at the outer edge. A few related points:

  1. cache_misc is already relatively-resilient just by the fact that it's a globally-distributed cluster with multiple frontend IPs in different regions, and multiple frontend hosts in each region.
  2. In emergencies (active DoS causing problems), we do have the ability we've used in the past to push out emergency VCL patches to block specific traffic patterns.
  3. Generic ratelimiting (e.g. per client IP) and other similar protection measures for these clusters have been pushed off for post-varnish4, but we do have plans around that.
  4. One of the best defenses you can have is to be sure that unauthenticated URLs are reasonably-cacheable (e.g. images and the root login page itself, etc). If the service emits good cacheability headers, the front edge can absorb a lot of load on cache hits. At a first cursory glance, even https://ores.wikimedia.org/ doesn't look like it's cacheable.
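
To make the last point concrete, here is a minimal Flask sketch of emitting cacheability headers on a static-ish endpoint; the route, body, and max-age are examples rather than ORES's actual code.

# Illustrative Flask handler that lets the cache front edge absorb
# hits; the route, body, and max-age here are examples only.
from flask import Flask, make_response

app = Flask(__name__)


@app.route("/")
def home():
    resp = make_response("<html>...static landing page...</html>")
    # Let shared caches (the Varnish edge) keep this for 12 hours.
    resp.cache_control.public = True
    resp.cache_control.max_age = 43200
    # An ETag lets the edge and browsers revalidate cheaply with 304s.
    resp.add_etag()
    return resp
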
Tgr added a subscriber: Tgr. Feb 13 2017, 8:24 PM

This is so one user-agent/IP doesn't hog all the resources and other users still are able to use the service.

So instead of the abuser starving all other users, they would only starve the 1/S fraction of users who happen to fall in the same IP bucket. That's an improvement (especially if internal IPs have their separate bucket so Wikimedia services cannot be affected by an external user DoSing ORES) but still not ideal.

IMO having a separate queue (or some similar prioritization method) for internal requests (ReviewStream etc.) is a good idea and should be done. Beyond that, this is not a sufficient solution, and we won't need it once a sufficient solution (such as T148997) is in place, so IMO it's not worth doing.

Tgr added a comment. Feb 13 2017, 9:51 PM
  1. One of the best defenses you can have is to be sure that unauthenticated URLs are reasonably-cacheable (e.g. images and the root login page itself, etc). If the service emits good cacheability headers, the front edge can absorb a lot of load on cache hits. At a first cursory glance, even https://ores.wikimedia.org/ doesn't look like it's cacheable.

Given that an ORES response returns data for an arbitrary collection of revisions, that would probably be a bad idea.

  1. One of the best defenses you can have is to be sure that unauthenticated URLs are reasonably-cacheable (e.g. images and the root login page itself, etc). If the service emits good cacheability headers, the front edge can absorb a lot of load on cache hits. At a first cursory glance, even https://ores.wikimedia.org/ doesn't look like it's cacheable.

Given that an ORES response returns data for an arbitrary collection of revisions, that would probably be a bad idea.

We can talk about cache strategies for the complex and/or dynamic cases (there are usually smart ways to do it), but right now ORES doesn't let anything cache, even in the most basic of cases, e.g.:

Static home page at https://ores.wikimedia.org/ is not cacheable at all.
Static logo at https://ores.wikimedia.org/static/images/ores_logo.svg is not cacheable at all.

Many "attacks" or even unintentional abuses go completely un-noticed and un-noteworthy on most of our sites, because we have a robust, globally distrubuted cache front edge that withstand a fairly large about of traffic. But if you disable all caching, at least one easy layer of defense has been taken out of play.

Tgr added a comment. Feb 13 2017, 11:40 PM

Static home page at https://ores.wikimedia.org/ is not cacheable at all.
Static logo at https://ores.wikimedia.org/static/images/ores_logo.svg is not cacheable at all.

That should be fixed when someone has enough free time, but it's a fairly insignificant problem: these URLs don't get huge spikes of bot-generated traffic (this task has been phrased in terms of an "attack", but the real-world problems that led to its creation were over-eager API clients), and they are very cheap to serve (so even if someone is doing an intentional DoS attack, they are not the most exploitable targets).

Compare to https://en.wikipedia.org/wiki/Special:ApiSandbox which is also uncacheable and also gets a tiny amount of traffic compared to the API itself. These kinds of pages are way down on the priority list for "we need to figure out how page X behaves under heavy load", because the chance that they get heavy load is so small.

This comment was removed by Halfak.

https://ores.wikimedia.org/static/images/ores_logo.svg

Returns with: cache-control: public, max-age=43200. How is that not cacheable? I'd like to address the issue since this seems like an easy fix.

Tgr added a comment. Edited Feb 14 2017, 6:23 PM

That's weird. I see no obvious reason why Varnish refuses to cache them. (OTOH https://ores.wikimedia.org/wikimedia-ui-static/MW/mediawiki.min.css is served without Content-Length which makes Varnish reject it since it tries to avoid caching huge responses.)

On the browser side, these resources do get cached properly. What seems slightly wrong (but probably has no impact in practice) is that If-None-Match/If-Modified-Since headers get ignored.

$ curl 'https://ores.wikimedia.org/static/images/ores_logo.svg' -H 'if-none-match: W/"flask-1486669041.2388084-1758-3858640437"' -H 'accept-encoding: gzip, deflate, sdch, br' -H 'accept-language: en-US,en;q=0.8,hu;q=0.6' -H 'cookie: GeoIP=US:CA:Oakland:37.79:-122.12:v4; CP=H1; WMF-Last-Access=14-Feb-2017; WMF-Last-Access-Global=14-Feb-2017' -H 'if-modified-since: Thu, 09 Feb 2017 19:37:21 GMT' --compressed -I
HTTP/1.1 200 OK
Date: Tue, 14 Feb 2017 18:22:28 GMT
Content-Type: image/svg+xml
Content-Length: 1758
Connection: keep-alive
Last-Modified: Thu, 09 Feb 2017 19:30:05 GMT
Cache-Control: public, max-age=43200
Expires: Wed, 15 Feb 2017 06:22:28 GMT
ETag: "flask-1486668605.7975135-1758-3858640437"
Access-Control-Allow-Origin: *
X-Varnish: 54719093, 98344173, 32942769, 1832709
Via: 1.1 varnish-v4, 1.1 varnish-v4, 1.1 varnish-v4, 1.1 varnish-v4
Accept-Ranges: bytes
Age: 0
X-Cache: cp1061 pass, cp2006 pass, cp4001 pass, cp4002 pass
X-Cache-Status: pass
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Analytics: WMF-Last-Access=14-Feb-2017;WMF-Last-Access-Global=14-Feb-2017;https=1

That should be a 304, not a 200. But the response does get cached by the browser, so meh.

Change 344071 had a related patch set uploaded (by BBlack):
[operations/puppet] cache_misc: support WMF-Last-Access-Global cookie

https://gerrit.wikimedia.org/r/344071

Re: 304-vs-200, I was able to get some 304s, but only when I dropped the INM and relied on IMS. It seems like the ETags might be inconsistent between serial requests to ores for the same resource? (The Last-Modified timestamps vary a bit too.)

As for the cacheability issues in general, it's that cache_misc hadn't been updated to account for the relatively-new WMF-Last-Access-Global cookie. We ignore a list of standard non-varying cookies like those on cache_misc (some of our own analytical ones as well as Google's), and then treat any other cookie as cache-busting (as we've never enumerated all the application-specific login cookies for all of the varied services behind cache_misc). The patch above should fix that particular issue.
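
On the 304/ETag side, one way the app could keep ETags consistent across backends is to derive them from the content rather than file mtimes, and to honor the conditional headers explicitly. A sketch under those assumptions (the route and file path are illustrative, not ORES's actual code):

# Illustrative handler returning a content-derived ETag, so the value
# is identical across servers and serial requests, and honoring
# If-None-Match / If-Modified-Since via make_conditional().
import hashlib
from flask import Flask, Response, request

app = Flask(__name__)


@app.route("/static/images/ores_logo.svg")
def logo():
    with open("static/images/ores_logo.svg", "rb") as f:
        body = f.read()
    resp = Response(body, mimetype="image/svg+xml")
    resp.cache_control.public = True
    resp.cache_control.max_age = 43200
    # Hash of the bytes, not the mtime, so every backend agrees.
    resp.set_etag(hashlib.sha1(body).hexdigest())
    # Rewrites the response as a 304 when If-None-Match matches.
    return resp.make_conditional(request)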

Change 344071 merged by Ema:
[operations/puppet] cache_misc: support WMF-Last-Access-Global cookie

https://gerrit.wikimedia.org/r/344071

Re: 304-vs-200, I was able to get some 304s, but only when I dropped the INM and relied on IMS. It seems like the ETags might be inconsistent between serial requests to ores for the same resource? (The Last-Modified timestamps vary a bit too.)

Tracked in T164188.

Tgr added a comment. May 1 2017, 1:46 PM

Getting back on the main track (how to stop non-cacheable excessive usage like the one in T157206): blocking specific IPs or user agents in Varnish should not be problematic, right? Even if the block is only for accessing ORES? While not ideal, that seems like a good enough answer to the question in the task description for now.

(As for detection, ORES getting overloaded is not catastrophic, especially if T146664 is resolved. So relying on the existing overload warning should be fine.)

Restricted Application added a project: User-Ladsgroup. Sep 24 2018, 7:25 AM
Ladsgroup moved this task from Incoming to Done on the User-Ladsgroup board. Oct 2 2018, 8:21 PM