
Increase request limits for GETs to /api/rest_v1/
Closed, Resolved (Public)

Description

The REST API at /api/rest_v1/ explicitly targets high-volume use cases, and predominantly exposes cheap GET entry points backed by storage. In benchmarks with a six-node cluster, we sustained over 10k req/s for HTML revisions. Based on this, our guideline for this API is for clients to not send more than 200 req/s.

The recently-merged limits at https://gerrit.wikimedia.org/r/#/c/241643/ are significantly lower than this:

// TBF: "1, 0.02s, 250" == "50/s, burst of 250"
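For clarity, the token-bucket ("TBF") semantics behind that tuple can be illustrated with a minimal sketch (purely illustrative Python, not the actual Varnish vmod code): refilling one token every 0.02s yields a sustained 50 req/s, while a bucket capacity of 250 allows bursts of up to 250 requests.

```python
import time

class TokenBucket:
    """Illustrative token bucket: 1 token per refill_interval, up to capacity tokens."""
    def __init__(self, refill_interval=0.02, capacity=250):
        self.refill_interval = refill_interval  # 1 token / 0.02s -> 50 tokens/s sustained
        self.capacity = capacity                # burst of up to 250 requests
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) / self.refill_interval)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # request passes
        return False      # request would be rejected (e.g. HTTP 429)
```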

We have clients like Google averaging about 15 req/s, and Kiwix hitting 50 req/s when performing an HTML dump, which are both dangerously close to the limit set here. All of those requests are GETs.

To avoid legitimate use cases being blocked, I would like to request increasing the limits applying to /api/rest_v1/ to reflect our intended limits of a sustained rate around 200 req/s. My main concern is about GETs. POST entry points are significantly lower traffic and more expensive, so it would actually be nice to keep the POST limit at around what is currently set in Varnish.

Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description. (Show Details)
GWicke added a project: Traffic.
Restricted Application added subscribers: StudiesWorld, Aklapper.
GWicke renamed this task from Increase request limits for /api/rest_v1/ to Increase request limits for GETs to /api/rest_v1/. Nov 11 2015, 2:41 AM
GWicke set Security to None.
GWicke updated the task description. (Show Details)

I suspect mwoffliner instances have already hit this limit sporadically, because under high load the API seems to simply stop answering from time to time.

I'm hesitant about this. 50/s is considered fairly high; we intend to eventually lower that number as we improve the ratelimiter to avoid special cases in more natural ways, some of which I'll get into below:

  1. The ratelimit we're talking about here only applies to varnish frontend cache requests which result in a pass or miss (in other words, any request that actually involves a fetch from deeper in our infrastructure stack). It does not apply to a frontend cache hit. If we're talking about cheap read-only GETs of cacheable content, we should probably allow varnish to cache that content as well, hopefully reducing the rate of these GETs actually passing through to RB and counting against the ratelimiter in many cases (assuming this content is also used by other clients).
  2. Any kind of client sustaining 50 reqs/sec is not "normal". If we need to carve out an exception for a specific partner (which should be rare across all use cases), we can, and we'll need some communication and coordination with said partner on maintaining a source-IP whitelist for them if necessary. But really, that's a fallback plan if anything. I'd rather avoid ever defining such whitelists and instead rely on:
  3. In the longer-term view of improving the ratelimit functionality, we're looking at applying a different, considerably higher ratelimit to authenticated clients which are using a legitimate session cookie (the details of that are tricky and probably not for this ticket), while going even lower than 50/s for unauthenticated ones. It's not unreasonable to ask high-rate partners to use an account; it's a much better way to identify and track any issues than just parsing/trusting User-Agent strings. In the interim, if it's really an issue, we could temporarily exclude all queries which appear to have a valid session cookie, but the validation part probably won't be ready anytime in the immediate future. This could also dovetail with setting specific per-account ratelimits at the application layer for said authenticated clients. Then they can simply request raises to their per-account settings, rather than request and maintain an IP-based whitelist exception that touches the VCL level (a rough sketch of this combination follows below).
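A rough sketch of how points 1 and 3 could combine (purely hypothetical Python with illustrative limits and names, not the actual VCL or vmod code): the limiter is only consulted for cache misses/passes, and requests carrying a valid session cookie get a higher per-IP budget.

```python
import time
from collections import defaultdict

ANON_LIMIT = 50    # req/s for unauthenticated clients (illustrative)
AUTH_LIMIT = 200   # req/s for clients with a valid session cookie (illustrative)

_windows = defaultdict(lambda: [0.0, 0])  # client_ip -> [window_start, count]

def is_denied(client_ip, has_valid_session, cache_result):
    """Return True if the request should be answered with HTTP 429."""
    if cache_result == "hit":
        return False                      # frontend cache hits never count
    now = time.monotonic()
    start, count = _windows[client_ip]
    if now - start >= 1.0:                # simple fixed 1-second window
        _windows[client_ip] = [now, 1]
        return False
    limit = AUTH_LIMIT if has_valid_session else ANON_LIMIT
    _windows[client_ip][1] = count + 1
    return _windows[client_ip][1] > limit
```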

@BBlack: The basic issue is that we are using a blanket limit across different APIs with vastly different costs. Some batch APIs let you submit 500 expensive logical requests with a single web request, while REST APIs like RESTBase are encouraging clients to perform discrete requests, and focus on making each of those requests really cheap to process. The difference in per-request costs between entry points / APIs is several orders of magnitude.

If we are serious about limiting the damage clients can do without breaking reasonable API use cases, then there is simply no alternative to aligning limits with actual costs. This task is proposing a first step in this direction, by adjusting the limits applying to RESTBase to what is normal for this kind of API.

I also agree that 50 requests per second is very high for other entry points. I have been pushing for lowering the maximum cost of entry points (see T97192, for example), and we should definitely tighten the rate limits for especially expensive end points.

I think we're thinking in similar terms here, but I don't ever really expect Varnish to have knowledge of all of our services to a depth that it can understand the relative costs of various API paths. My long term view is that when it comes to API costs, I expect services to ratelimit authenticated clients on their own. This varnish-level limiting is really just an outer layer of protection against request-rates of unreasonable scale, to protect the inner layers of our architecture from damage in a [D]DoS scenario or due to abuse from completely broken client code.

Once Varnish is able to accurately distinguish (with a good chance of not being fooled) authenticated from unauthenticated requests, this gets a lot easier, as we can set a higher cap on the authenticated ones (where service code may want more discretion, and where we've got a username we can associate traffic with and/or decide to limit to lower or higher rates, force-logout, disable, or ban at the application layer), and a lower cap on the unauthenticated requests which are truly anonymous from this POV.

But for now, we only have one global ratelimit serving both functions. To put 50 reqs/sec in perspective: our total *global* rate on the text clusters at all the datacenters combined, for inbound GET requests at the front edge, averages 34K reqs/sec. A single client IP sustaining a rate of 50 reqs/sec would mean that single client IP communicating with a single edge server is responsible for 0.1% of our total global request rate on a site with millions of users. Somewhere near a rate like that, almost by definition those clients should be rare enough that they need to be identified and whitelisted if they're legitimate (or again, dealt with better by making them authenticate and making varnish cheaply and accurately verify authentication).

Oh, and missing from the final paragraph above: the limiter again applies only to miss/pass traffic, whereas the 34K/sec figure is all requests (~90%+ of which are cache hits). So in ballpark terms, 50/sec on the limiter is more like 1.5% of the traffic that makes it through the first layer of caching.
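For reference, a quick back-of-the-envelope check of those two figures (the ~10% miss/pass share is an assumption derived from the "~90%+ cache hits" above):

```python
total_edge_rate = 34_000   # req/s, global inbound GETs on the text clusters
client_rate = 50           # req/s, one client IP at the limit
miss_pass_share = 0.10     # assumed: ~90%+ of requests are cache hits

print(client_rate / total_edge_rate)                       # ~0.0015 -> roughly 0.1-0.2% of all traffic
print(client_rate / (total_edge_rate * miss_pass_share))   # ~0.015  -> roughly 1.5% of miss/pass traffic
```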

@BBlack: If there was a reasonably clean way to differentiate the limits between the action API & the rest_v1 API, would you be open to me creating a patch to do so?

This varnish-level limiting is really just an outer layer of protection against request-rates of unreasonable scale, to protect the inner layers of our architecture from damage in a [D]DoS scenario or due to abuse from completely broken client code.

My contention is that the effectiveness of this limit for DOS protection is severely compromised by the extremely uneven distribution of costs. In T64615, I documented some examples of very expensive entry points, which let you take out the API with only a few hundred requests total, at rates typical of a mobile browser loading images. While timeouts have improved since, @jcrespo just added another fresh example of the same general issue. All of those end points would be ideal candidates for a DOS (even without distribution), and don't require request rates that would trigger the current limit.

On the other hand, most end points in the REST API can sustain many thousand requests per second. Because of its relatively low costs, we want to encourage high-volume users to hit this API instead of the PHP action API. However, if we offer high-volume users a choice between

a) 50 action API batch requests times 250-500 logical requests each, and
b) 50 rest_v1 API requests with one logical request each,

we'll encourage them to go with the batch API instead, which is exactly what we don't want, as none of those batches are cacheable.
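To make the incentive concrete, here is the rough arithmetic behind that choice (the batch size is the upper end of the range quoted above):

```python
cap = 50            # web requests per second allowed by the limiter
batch_size = 500    # logical requests per action API batch request (upper bound)

print(cap * batch_size)   # 25000 logical requests/s via the action API batch entry points
print(cap * 1)            # 50 logical requests/s via /api/rest_v1/
```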

That said, I think that 50 req/s isn't that far from what we'd want in the REST API in the longer term, once request rates are generally up & caches take over a significant portion of the overall traffic. However, we haven't enabled caching on most end points yet, and are working on a better purging infrastructure as one of our focus areas this quarter. This means that currently most of those hits will reach RESTBase directly, making the 50 req/s limit fairly tight.

@GWicke Please inform me how I can get recentchanges's literally 2 million filter options in restbase and I will personally make sure to redirect the action api calls to restbase.

@jcrespo: I might need some convincing that all of those are absolutely needed.

Is there any evidence (or even credible suspicion) that legitimate clients in the wild are hitting these limits?

fgiunchedi triaged this task as Medium priority. Dec 1 2015, 1:29 PM
fgiunchedi subscribed.
faidon claimed this task.

@faidon
What I can say is that Kiwix mwoffliner instances are hitting the limit, and after having to deal with mass storage limits (IO and space) and RAM limits, now we have to deal with this too... Pretty annoying... If we don't get more servers, this is a serious matter, because we will have to slow down the whole mirroring process and will no longer be able to release WM project snapshots on a monthly basis.

Thanks, @Kelson, this is helpful feedback. Note that the limits have been temporarily reverted since Dec 28th (4c07fac36de29eca061cb1d99d5a48464623a8d4). We'll consider this before re-enabling them and figure out a way for this to be effective without hurting your use case.

@faidon
Great, really happy to hear that these limitations are temporarily off.

@Kelson - does Kiwix mwoffliner use an authenticated session, or is it anonymous? For future rate-limiting plans, it makes a big difference.

@BBlack
No, we are anonymous, but mwoffliner (a command-line tool) requires an email address, which you can find in the HTTP User-Agent it sends.

At the dev summit, @Bianjiang of Google voiced concerns about global request rate limits & the complexity of abiding by those across several projects / teams. The 50 req/s limit is close to what they use while tracking ongoing edits, and leaves little room for bootstrapping / back-filling data sets.

With T78676 and other related efforts (separating content into different APIs), a 50 req/s global limit (a limit by UserAgent?) is not enough even for regular incremental crawling.

We are running an incremental crawling system to make sure our content stays up to date with Wikipedia. Currently, the aggregated update traffic from the major Wikipedia sites (??.wikipedia.org) is ~15-20 req/s. We are in the process of migrating to the Parsoid output to get semantic content parsed from the original wikitext. We need all three parts of the Parsoid result: RDFa/HTML, data-mw, and data-parsoid. Besides these, we are also crawling wikidata.org, which has a much higher update rate.

Another part of the request volume is "bootstrapping" and "back-filling". When some changes are made, on our side or on the Wikipedia side (e.g. we started to crawl the ORES score for every revision), we need to batch-crawl all articles. enwiki has about 10M articles; at 1 req/s it would take us ~100 days (assuming each article only needs 1 request to backfill) ...
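Rough arithmetic behind that estimate (the article count and one-request-per-article are the assumptions stated above):

```python
articles = 10_000_000      # ~10M enwiki articles
rate = 1                   # req/s available for back-filling
print(articles / rate / 86_400)   # ~116 days, i.e. the "~100 days" ballpark above
```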

It would be better to allow a larger limit. In order not to overwhelm Wikipedia's servers, we already have throttling on our side.

Reopening, reflecting the ongoing discussion.

Quick status update: We have since introduced per-entrypoint limits in the REST API. Initially, this is targeted at uncacheable transforms, as well as the pageview API, which is currently low on backend capacity. We are also planning to enforce global request rate limits for the REST API. The global limit will likely remain at the current policy of 200 req/s.

as well as the pageview API, which is currently low on backend capacity.

Correction: the pageview API has been rebuilt since the last comment and it can handle a LOT of traffic. That doesn't mean we should not throttle; every service should have throttling limits that have been tested via performance tests. The pageview API can handle 400 reqs per sec per machine. Please see: https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS/Scaling/LoadTest

GWicke claimed this task.

@BBlack and I looked into this yesterday after the deployment of the more aggressive global limits, and found that legitimate REST API requests were being blocked. To avoid this, Varnish now allows up to 100 REST API cache misses per second (compared to 10/s globally), which matches metrics end points explicitly limited at 100/s per client IP. In practice, it should also allow for the documented 200 requests per second to other end points, as a large percentage of those requests are usually covered by cache hits, which do not count against the Varnish limits. I think the original objective of this task has thus been achieved, and it is time to resolve it.
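As a rough illustration of why a 100 misses/s Varnish limit can still accommodate the documented 200 req/s policy (the cache hit rate below is an assumed value for illustration, not a measurement):

```python
client_rate = 200    # documented sustained req/s per client
hit_rate = 0.6       # assumed fraction of requests answered from the Varnish cache

misses = client_rate * (1 - hit_rate)
print(misses)        # 80 misses/s, which stays under the 100/s per-IP miss limit
```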

As a follow-up to our higher-level discussion yesterday, I created T167906, where I make the case for considering concurrency limiting instead of rate limiting.

which matches metrics end points explicitly limited at 100/s per client IP.

mmm... looking at the pageview API dashboard I can see that some legitimate traffic (spikes we could have handled) seems to have been eaten by these limits: https://grafana.wikimedia.org/dashboard/db/aqs-elukey?orgId=1&from=1494877802254&to=1497469802254#%2Fdashboard%2Fdb%2F

cc @BBlack

Can we please take a second look at these numbers?

mmm... looking at the pageview API dashboard I can see that some legitimate traffic (spikes we could have handled) seems to have been eaten by these limits: https://grafana.wikimedia.org/dashboard/db/aqs-elukey?orgId=1&from=1494877802254&to=1497469802254#%2Fdashboard%2Fdb%2F

@elukey, @Nuria We've added rate limiting on 2017-06-13, while the last datapoint in the 429 graph above is on 2017-06-08, which seems strange?

In any case, I've played with pivot webrequest and found one particularly active client: http://bit.ly/2ru1w41. Those peaks of ~40 requests per minute (sampled 1/128, so possibly somewhere around ~5k per minute in reality) are where some of the rate limiting comes into play.
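Scaling the sampled rate back up, for reference (the sampling factor is the 1/128 stated above):

```python
sampled_per_minute = 40
sampling_factor = 128

per_minute = sampled_per_minute * sampling_factor
print(per_minute, per_minute / 60)   # ~5120 req/min, i.e. roughly 85 req/s on average
```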

Thanks to @JAllemandou's spark superpowers we've confirmed that indeed that specific IP's activity is quite bursty and it is performing more than 1k requests per 10s.

We can change the limits obviously, it's all about finding the per-IP rate that we consider acceptable.

That top client appears to be CrossRefEventDataBot from https://www.crossref.org/services/event-data/ , running on a hosted server at Hetzner in DE.

That top client appears to be CrossRefEventDataBot from https://www.crossref.org/services/event-data/ , running on a hosted server at Hetzner in DE.

Yes, and the 2nd-ranked client by number of requests to /api/rest_v1 is another Hetzner IP (same User-Agent).

Thanks for the prompt response. Given the number of changes, I did not see when these took effect. It is true that we do not see 429s on our end at all times, but we should see them here and there if the throttling changes at the varnish layer are "low enough".

If you look at 404s, however, it looks like the throttling had a positive effect on removing "garbage-y" traffic.

which matches metrics end points explicitly limited at 100/s per client IP.

mmm... looking at the pageview API dashboard I can see that some legitimate traffic (spikes we could have handled) seems to have been eaten by these limits: https://grafana.wikimedia.org/dashboard/db/aqs-elukey?orgId=1&from=1494877802254&to=1497469802254#%2Fdashboard%2Fdb%2F

These metrics are 429s emitted from RESTBase, and not Varnish. They are emitted when clients exceed the configured & documented per-entrypoint request rate limits. In the case of per-article metrics, this is 100 requests per second. If you would like to adjust those limits, or believe that they are not working correctly, then I would encourage you to open a separate task.

These metrics are 429s emitted from RESTBase, and not Varnish.

Right, that is why we should continue to see throttling on the RESTBase end. Please do take a second look at my comments; my point was that the throttling at the varnish layer might have been too "eager", but I was mistaken.