What if someone is abusing the service? How do we know? How do we take them out?
- operations/puppet (production): cache_misc: support WMF-Last-Access-Global cookie
- Resolved (Ladsgroup): T137962 [Spec] Tracking and blocking specific IP/user-agent combinations
- Resolved (Ladsgroup): T160692 Use poolcounter to limit number of connections to ores uwsgi
- Resolved (Ladsgroup): T201823 Implement PoolCounter support in ORES
- Resolved (akosiaris): T203465 Site: 4 VM request for ORES poolcounter
- Resolved (akosiaris): T201824 Spin up a new poolcounter node for ores
- Resolved (Ladsgroup): T201825 Test poolcounter support for ores in beta cluster
- Resolved (Ladsgroup): T201826 Implement support for whitelisting and proxy requests for poolcounter in ORES
- Declined (Ladsgroup): T204897 Add Wiki Education Dashboard and Programs & Events Dashboard to ORES connection whitelist
- Resolved (Tgr): T161029 Forward request data in proxied Action API modules
Mentioned In:
- T164188: Check caching headers in ORES responses
- T162484: DRAFT: Use rate limiting for ORES Action API score retrieval
- T163687: Re-enable ORES data in action API
- T159615: [spec] Active-active setup for ORES across datacenters (eqiad, codfw)
- T148997: Implement parallel connection limit for querying ORES
- T135495: Metrics around distinct user agents that hit ORES

Mentioned Here:
- T160692: Use poolcounter to limit number of connections to ores uwsgi
- T146664: Limit resources used by ORES
- T157206: ORES Overloaded (particularly 2017-02-05 02:25-02:30)
- T164188: Check caching headers in ORES responses
- T148997: Implement parallel connection limit for querying ORES
It seems like there are a few different purposes for this task:
- Gather metrics to analyze use of the service
- Prevent accidental (or purposeful) DoS
- Be able to ban specific IPs and/or user-agents
My feeling is that these should be independent tasks since they deal with different things (logging, rate-limiting, and blacklisting).
To expand a bit on my previous comment:
Gather metrics to analyze use of the service
This feels like it should live within the app itself (maybe as part of metrics_collectors) so that it can have additional context as necessary for recording usage metrics.
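As a rough illustration of what an in-app collector could look like (the class name, counter scheme, and wiki/model key choice here are all invented for the sketch; this is not ORES's actual metrics_collectors API):

```python
from collections import Counter
import threading

class ScoreRequestMetrics:
    """Illustrative in-app usage collector (hypothetical, not ORES's
    real metrics_collectors API): counts requests per (wiki, model) so
    the numbers carry application context that a front-end access log
    line would lack."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = Counter()

    def observe(self, wiki, model):
        """Record one scoring request for the given wiki and model."""
        with self._lock:
            self._counts[(wiki, model)] += 1

    def snapshot(self):
        """Return a point-in-time copy of the counters for reporting."""
        with self._lock:
            return dict(self._counts)

metrics = ScoreRequestMetrics()
metrics.observe("enwiki", "damaging")
metrics.observe("enwiki", "damaging")
metrics.observe("wikidatawiki", "itemquality")
```

The point of keeping it in-process is that the collector can key on things only the app knows (model, wiki, precached vs. on-demand), rather than only what the HTTP layer sees.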
Prevent accidental (or purposeful) DoS
This could live in the endpoint configuration (like nginx's limit_req) for a simple implementation. However, this could artificially limit the bandwidth of the infrastructure. I think a more robust solution would be scheduling scoring requests on a round-robin basis with queues per user-agent/IP.
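For reference, a minimal limit_req setup might look like this (a sketch using nginx's real limit_req_zone/limit_req directives; the zone name, rate, burst, and upstream name are made up, not taken from ORES's actual config):

```nginx
# Sketch only: rate-limit by client IP at 10 req/s with a burst of 20.
http {
    limit_req_zone $binary_remote_addr zone=ores_clients:10m rate=10r/s;

    server {
        location / {
            limit_req zone=ores_clients burst=20 nodelay;
            proxy_pass http://ores_backend;
        }
    }
}
```

This is exactly the "artificial bandwidth limit" trade-off mentioned above: a fixed per-IP rate rejects requests even when the backend has spare capacity.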
Be able to ban specific IPs and/or user-agents
I don't see a nice way that we can use metrics_collector to track individual user-agents. If we collected user-agents as part of the key, we'd end up with too many keys.
Do you think engineering a separate processing queue per IP will be overly complicated? It seems like setting a very high bound on limit_req would be a good start. E.g. the limit is high enough that a single key (IP/user-agent) can't knock over the service single-handedly.
Re. defining abuse, I'm not sure that we do. If some activity falls outside of that definition but causes trouble for our service, we ought to reserve the right to block requests.
You wouldn't necessarily need a separate processing queue per IP, but probably tuned more to the number of concurrent users of the service. Then use those set number of queues as buckets for hashed user-agent/IPs. Celery seems to use round-robin between different queues. This would get us in the ballpark of a good solution, without the overhead of unique queues per user.
Granted, this would need to be verified as a feature we can depend on before depending on it ;)
So, you're imagining that we'd have workers adaptively draw from a set Q of queues where each queue in Q represents a unique IP/user-agent pair? That sounds very complex. How will the workers know when a new queue is inserted into Q? Won't a new queue effectively jump ahead of the rest of Q?
In the end, I'm still lost on what problem this actually solves.
No, you would have a static number of queues, S, that is picked based on how many concurrent users we expect to get under load. As the requests come in, the user-agent/IP key, U, is hashed and the request is sent to queues[hash(U) % S].
The problem this solves is to not limit the throughput of the system artificially when it still has capacity to process more tasks.
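The bucketing step above could be sketched like this (Python; the digest choice, key format, and queue count are assumptions for illustration, not an existing implementation):

```python
import hashlib

NUM_QUEUES = 8  # S: fixed queue count, sized for expected concurrent users

def queue_index(user_key: str, num_queues: int = NUM_QUEUES) -> int:
    """Map a user-agent/IP key U to a stable bucket in [0, num_queues).

    Uses a digest rather than Python's built-in hash() so the mapping
    is stable across worker processes (hash() is randomized per
    interpreter for strings).
    """
    digest = hashlib.sha1(user_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_queues

# The same client always lands in the same bucket, so an abusive
# client can only saturate its own 1/S share of the queues while
# workers keep round-robining across all S of them.
```

Workers would then consume from the S queues round-robin, which is the part the comment above notes Celery appears to support but would need verification.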
Sorry I wasn't clear. It seems a single queue would not artificially limit the throughput of the system when it still has the capacity to process more tasks. So what problem does this solve beyond what a single queue could do?
ores currently routes through cache_misc for termination, so to some degree this falls under general DoS protections for all such services that are handled generically at the outer edge. A few related points:
- cache_misc is already relatively resilient just by virtue of being a globally distributed cluster with multiple frontend IPs in different regions and multiple frontend hosts in each region.
- In emergencies (an active DoS causing problems), we can push out emergency VCL patches to block specific traffic patterns, as we've done in the past.
- Generic rate limiting (e.g. per client IP) and other similar protection measures for these clusters have been pushed off until post-varnish4, but we do have plans around that.
- One of the best defenses you can have is to make sure that unauthenticated URLs are reasonably cacheable (e.g. images and the root login page itself, etc). If the service emits good cacheability headers, the front edge can absorb a lot of load on cache hits. At first glance, even https://ores.wikimedia.org/ doesn't look like it's cacheable.
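As an illustration of what "emit good cacheability headers" could mean for a WSGI app like ORES, here is a stdlib-only sketch (the middleware name and default max-age are invented; real code would need per-route policies, not a blanket default):

```python
class CacheControlMiddleware:
    """Hypothetical WSGI middleware: adds a Cache-Control header to GET
    responses that don't already set one, so the front-edge caches can
    absorb repeat traffic. Sketch only, not ORES's actual code."""

    def __init__(self, app, max_age=43200):
        self.app = app
        self.max_age = max_age

    def __call__(self, environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            is_get = environ.get("REQUEST_METHOD") == "GET"
            has_cc = any(k.lower() == "cache-control" for k, _ in headers)
            if is_get and not has_cc:
                headers = headers + [
                    ("Cache-Control", "public, max-age=%d" % self.max_age)
                ]
            return start_response(status, headers, exc_info)

        return self.app(environ, patched_start_response)
```

Wrapping the app (`app = CacheControlMiddleware(app)`) would make previously header-less static responses cacheable by Varnish without touching each view.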
So instead of the abuser starving all other users, they would only starve the 1/S fraction of users who happen to fall in the same IP bucket. That's an improvement (especially if internal IPs have their separate bucket so Wikimedia services cannot be affected by an external user DoSing ORES) but still not ideal.
IMO having a separate queue (or some similar prioritization method) for internal requests (ReviewStream etc.) is a good idea and should be done. Beyond that, this is not a sufficient solution, and we won't need it once a sufficient solution (such as T148997) is in place, so IMO it's not worth doing.
We can talk about cache strategies for the complex and/or dynamic cases (there are usually smart ways to do it), but right now ORES doesn't let anything cache, even the most basic of cases. e.g.:
- Static home page at https://ores.wikimedia.org/ is not cacheable at all.
- Static logo at https://ores.wikimedia.org/static/images/ores_logo.svg is not cacheable at all.
Many "attacks" or even unintentional abuses go completely unnoticed and unremarked on most of our sites, because we have a robust, globally distributed cache front edge that withstands a fairly large amount of traffic. But if you disable all caching, at least one easy layer of defense has been taken out of play.
That should be fixed when someone has enough free time, but it is a fairly insignificant problem: these URLs don't get huge spikes of bot-generated traffic (this task has been phrased in terms of an "attack", but the real-world problems that led to its creation were over-eager API clients), and they are very cheap to serve (so even if someone is mounting an intentional DoS attack, they are not the most exploitable targets).
Compare to https://en.wikipedia.org/wiki/Special:ApiSandbox which is also uncacheable and also gets a tiny amount of traffic compared to the API itself. These kinds of pages are way down on the priority list for "we need to figure out how page X behaves under heavy load", because the chance that they get heavy load is so small.
Returns with "cache-control: public, max-age=43200". How is that not cacheable? I'd like to address the issue since this seems like an easy fix.
That's weird. I see no obvious reason why Varnish refuses to cache them. (OTOH https://ores.wikimedia.org/wikimedia-ui-static/MW/mediawiki.min.css is served without Content-Length which makes Varnish reject it since it tries to avoid caching huge responses.)
On the browser side, these resources do get cached properly. What seems slightly wrong (but probably has no impact in practice) is that If-None-Match/If-Modified-Since headers get ignored.
$ curl 'https://ores.wikimedia.org/static/images/ores_logo.svg' \
    -H 'if-none-match: W/"flask-1486669041.2388084-1758-3858640437"' \
    -H 'accept-encoding: gzip, deflate, sdch, br' \
    -H 'accept-language: en-US,en;q=0.8,hu;q=0.6' \
    -H 'cookie: GeoIP=US:CA:Oakland:37.79:-122.12:v4; CP=H1; WMF-Last-Access=14-Feb-2017; WMF-Last-Access-Global=14-Feb-2017' \
    -H 'if-modified-since: Thu, 09 Feb 2017 19:37:21 GMT' \
    --compressed -I

HTTP/1.1 200 OK
Date: Tue, 14 Feb 2017 18:22:28 GMT
Content-Type: image/svg+xml
Content-Length: 1758
Connection: keep-alive
Last-Modified: Thu, 09 Feb 2017 19:30:05 GMT
Cache-Control: public, max-age=43200
Expires: Wed, 15 Feb 2017 06:22:28 GMT
ETag: "flask-1486668605.7975135-1758-3858640437"
Access-Control-Allow-Origin: *
X-Varnish: 54719093, 98344173, 32942769, 1832709
Via: 1.1 varnish-v4, 1.1 varnish-v4, 1.1 varnish-v4, 1.1 varnish-v4
Accept-Ranges: bytes
Age: 0
X-Cache: cp1061 pass, cp2006 pass, cp4001 pass, cp4002 pass
X-Cache-Status: pass
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Analytics: WMF-Last-Access=14-Feb-2017;WMF-Last-Access-Global=14-Feb-2017;https=1
That should be a 304, not a 200. But the response does get cached by the browser, so meh.
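The precedence at play here (per RFC 7232, If-None-Match is evaluated before If-Modified-Since) can be sketched as follows. This is a simplified illustration with exact-string validator matching, not ORES's actual code and not a full strong/weak ETag comparison:

```python
def conditional_status(headers, etag, last_modified):
    """Return 304 if the request's cache validators match, else 200.

    Simplified sketch of RFC 7232 semantics: If-None-Match, when
    present, takes precedence and If-Modified-Since is then ignored.
    """
    inm = headers.get("If-None-Match")
    if inm is not None:
        candidates = {tag.strip() for tag in inm.split(",")}
        if "*" in candidates or etag in candidates:
            return 304
        return 200  # stale ETag: serve full response, skip IMS check

    ims = headers.get("If-Modified-Since")
    if ims is not None and ims == last_modified:
        return 304
    return 200
```

Under these rules, an ETag that changes between requests (as observed below) forces 200s even when If-Modified-Since alone would have produced a 304.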
Re: 304-vs-200, I was able to get some 304s, but only when I dropped the INM and relied on IMS. It seems like the ETags might be inconsistent between serial requests to ores for the same resource? (The Last-Modified timestamps vary a bit too.)
As for the cacheability issues in general: cache_misc hadn't been updated to account for the relatively new WMF-Last-Access-Global cookie. On cache_misc we ignore a list of standard non-varying cookies (some of our own analytical ones as well as Google's) and treat any other cookie as cache-busting, since we've never enumerated all of the application-specific login cookies for the varied services behind cache_misc. The patch above should fix that particular issue.
Getting back on the main track (how to stop non-cacheable excessive usage like the one in T157206): blocking specific IPs or user agents in Varnish should not be problematic, right? Even if the block is only for accessing ORES? While not ideal, that seems like a good enough answer to the question in the task description for now.
(As for detection, ORES getting overloaded is not catastrophic, especially if T146664 is resolved. So relying on the existing overload warning should be fine.)
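For concreteness, such a targeted block could look something like this in VCL 4 (a hypothetical sketch: the ACL name, example IP, and user-agent pattern are all made up, and a real emergency patch would go through operations/puppet rather than hand-edited VCL):

```vcl
# Sketch only: block one client IP and one user-agent pattern,
# scoped to requests for ORES, with a synthetic 403 response.
acl ores_blocked_clients {
    "192.0.2.10";
}

sub vcl_recv {
    if (req.http.Host == "ores.wikimedia.org" &&
        (client.ip ~ ores_blocked_clients ||
         req.http.User-Agent ~ "ExampleAbusiveBot")) {
        return (synth(403, "Forbidden"));
    }
}
```

Scoping on req.http.Host keeps the block from affecting other services that share the cache_misc edge.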