
Enable rate limiting on pageview api
Closed, Resolved · Public · 5 Estimated Story Points

Description

Enable rate limiting on pageview api

Please see: https://github.com/wikimedia/service-runner#rate-limiting

Event Timeline

Milimetric triaged this task as Medium priority.May 16 2016, 8:58 PM
Milimetric moved this task from In Progress to In Code Review on the Analytics-Kanban board.

As mentioned on IRC, you'll probably want to enable global rate limiting by adding a section in config.yaml like this: https://github.com/wikimedia/restbase/pull/613/files#diff-541b6e195e9da580fc0bd42761ee761fR127

The seeds should contain the main IPs of at least one AQS node (the one RB listens on). You can, but don't need to, list all of them.
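
For illustration, a minimal sketch of such a section, following the rate-limiting example in the service-runner README linked in the description; the exact nesting inside RESTBase's config.yaml should be taken from the linked PR, and the seed IP below is only a placeholder:

    ratelimiter:
      type: kademlia      # cluster-wide counters shared via a Kademlia DHT
      seeds:
        - 10.64.0.1       # hypothetical placeholder for a main AQS node IP (the one RB listens on)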

@GWicke: Per our conversation on IRC: "nuria_: it's very unlikely to trigger, tbh; as discussed with @Milimetric, the limits are enforced per worker and are set relatively high. Next step will be to enable kademlia, which makes the limits global for the cluster"

Are these workers the Node workers of the frontend RESTBase (outside the firewall)? Because it seems to me that what we want are limits per cluster.

@GWicke: can you explain a bit what "worker" means in this context?

@Nuria, as described in the documentation, the default backend is a simple in-memory backend. This enforces request limits per service-runner instance, so in AQS's case per physical node (and across service-runner workers).

You are probably interested in limiting rates across the entire API, hence my recommendation to enable the Kademlia backend in the config.

@GWicke: the documentation in this regard is really meager; it seems to me that to understand how rate limiting works you need prior knowledge about the internals of the service. From your comment I understand that workers are independent processes that do not share memory (maybe this is obvious, but making the docs a bit clearer couldn't hurt).

If that is the case, I am not sure why we would ever want rate limiting per worker, given that this rate-limiting configuration lives (from what I can see) outside the configuration that sets the number of workers we have.

We are interested in rate limiting the entire API on a per-method basis, for which it seems Kademlia is the storage and synchronization point for request counts.

Would you be so kind as to provide a sample Kademlia configuration we can use, or to describe in detail in the docs the meaning of the different configuration parameters?

Namely:

  1. seeds: the cluster node seeds. Port 3050 is used by default; an entry can be a plain IP (192.168.88.99) or an explicit address/port pair (address: some.host.com, port: 6030).
  2. listen: optional; the address / port to listen on. Default: localhost:3050, with a random-port fall-back if the port is in use (listen: address: localhost, port: 3050).
  3. interval: the counter update / block interval. Default: 10000 ms (interval: 10000).
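
For reference, these parameters correspond to the sample ratelimiter block in the service-runner README linked in the description; reassembled, that block looks roughly like this (the enclosing ratelimiter / type keys are taken from that README, and where exactly the block sits inside the AQS or RESTBase config is an assumption):

    ratelimiter:
      type: kademlia
      # Cluster node seeds; port 3050 is used by default
      seeds:
        - 192.168.88.99
        - address: some.host.com
          port: 6030
      # Optional: address / port to listen on
      # Default: localhost:3050, with a random-port fall-back if the port is in use
      listen:
        address: localhost
        port: 3050
      # Counter update / block interval; default: 10000 ms
      interval: 10000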

@Nuria, I walked through all of this with @Milimetric, and also gave you a link to an example config in T135240#2302880. My recollection from that conversation is that @Milimetric was planning to deploy this per-node initially, and add the kademlia config in a second step. I'm not sure what the status of his efforts is, so perhaps you could talk to him?

@GWicke: we already synced up on this, and both @Milimetric and I agreed that we want limits for the "entire" API.

What I (mis)understood from the IRC conversation we had was that the "basic kademlia configuration" as described in the documentation was already enabled on the RESTBase cluster that uses the metrics module. So I thought we didn't need to do anything.

Either way, that config isn't in our control, since it's not AQS that is doing the rate limiting in this case.

So do you need us to submit a puppet change to add the kademlia configuration to the RESTBase cluster? I searched the running config and it seems not to be enabled yet: https://github.com/wikimedia/operations-puppet/search?utf8=%E2%9C%93&q=kademlia

@GWicke: are you able to provide a sample Kademlia config?

Nuria set the point value for this task to 5.
Nuria added a subscriber: Milimetric.

@Nuria: Again, T135240#2302880 has a link to a sample config, and instructions on which values to set for the peers.

If you would like us to set up & deploy a config for you, then I think we can do that after setting one up for the regular RESTBase install. The puppet work for that is likely to happen in the next days; @mobrovac should be able to help you with this.

@GWicke:

If you would like us to set up & deploy a config for you, then I think we can do that after setting one up for the regular RESTBase install. The puppet work for that is likely to happen in the next days; @mobrovac should be able to help you with this.

Ahem... I thought the rate limiting for the whole API was ready to be used; is there any puppet change needed?

Ahem... I thought the rate limiting for the whole API was ready to be used; is there any puppet change needed?

Answering my own question: it seems that the distributed hash table that will store the throttling counters needs the names of the actual service machines, so enabling rate limiting will always need puppet if you want to define throttling per API rather than per worker. It seems odd that some of the throttling config lives in service-runner and some in puppet, though.

Gerrit 290264 aims at enabling global rate limiting for RESTBase. Once that is out in prod, we will be able to enforce per-endpoint global rate limits. Note that in the first phase we will only be logging offending requests in order to make sure limiting works as expected. In a second phase, the configured limits will be actively enforced, at which point AQS will stop seeing these requests altogether.

Gerrit 290264 aims at enabling global rate limiting for RESTBase.

This is now live in prod. We'll monitor the logs for a couple of days and then start enforcing the limits.

Rate limiting has been enabled in prod for around 6 hours now, and not a single excess has been logged, all the while AQS kept responding with 500s because Cassandra there cannot handle the load. I would propose reducing the number of requests per second per client to ~10 [1] and seeing what gets logged.

[1] That would amount to a maximum of 40 requests per second per client globally (4 endpoints times 10 requests each).

What I do not understand is what exactly is enabled here.

@Nuria: the DHT is enabled in production right now. All the limits are per endpoint, not per worker.

@Pchelolo: so I should be able to see limits getting logged if I run a test using Apache Bench (ab), correct?

@Nuria: not really; pageviews are Varnish-cached, so with ab only the first request will actually hit the servers. You would need to request a new URI, or alter some random meaningless query parameter, on every request.

Anyway, I would not recommend doing that: 100 req/s is a lot, and given that the AQS Cassandra cluster is not in the best shape even at a normal request rate, hammering it with ab could cause an outage.

We should think about more conservative limits for AQS first, like 10 req/s.

@Pchelolo: well, I was going to "replay" requests to the article endpoint, which is really easy to do as we have them all. Looking at Grafana, it seems that when requests hit RESTBase at more than 30 req/s on the article endpoint, we run into issues with 500s being thrown. Thus 10 req/s seems like an OK limit.

Should I send a PR to change it here: https://github.com/wikimedia/restbase/blob/master/v1/metrics.yaml#L100

Or is there any other place where this limit needs to be changed?

Should I send a PR to change it here: https://github.com/wikimedia/restbase/blob/master/v1/metrics.yaml#L100

Or is there any other place where this limit needs to be changed?

@Nuria: no, this is the only place. Please feel free to send a PR. I pulled 10 req/s out of thin air, so you know better what the best number is for you.

@Pchelolo: the number at which we want to start throttling would be ~25-30 requests per second for the whole cluster (which has three machines, if I am not mistaken).

Thus, if that setting in https://github.com/wikimedia/restbase/blob/master/v1/metrics.yaml#L100 is per cluster, it should be set to '30', correct?

@Nuria: this is a per-endpoint limit. You have 3 endpoints, so 10 would be a good number (3 endpoints × 10 req/s each ≈ your 30 req/s cluster-wide target).

The reduced limits were deployed earlier today; warnings are now logged when the rate is surpassed globally. This dashboard shows the excesses.

@MusikAnimal: I was initially worried about this affecting http://tools.wmflabs.org/massviews/, but since we're already inserting 100 millisecond delays between requests, I think we should be OK.

We plan to start enforcing the limits next week. I think we should start with the currently-set limits (10 reqs/s/endpoint/client). We can then tune them as appropriate later on. @Nuria ok with you?

RB logged a total of 30K excesses in the last 24h. The requests seem to come in bursts. @kaldari: the logs indicate the UA in the last burst was Mozilla; could this be connected to your massviews tool?

@mobrovac: I think the last burst had a UA of Ruby; we know some popular tools on Labs cause bursts too, but there are many automated scripts out there pulling from the API. Regardless, enforcing the current limits seems OK to me. I just want to make sure we will still be logging them, so we know how frequently they trigger.

Let's also document these limits for each API endpoint by adding a list item alongside Stability.

Edit: PR at https://github.com/wikimedia/restbase/pull/622.

Rate limiting has been enabled in production, so resolving. Let's monitor the situation and adjust the limits separately, if needed.

@mobrovac: where can we see how often we are running into throttling limits? The dashboard seems pretty empty since this morning: https://logstash.wikimedia.org/#/dashboard/temp/AVToXWJes_MKeI4jrSeM

Never mind, throttling is no longer being logged. It should be, since we need to know how often it happens, so I have filed a bug about this: https://phabricator.wikimedia.org/T136769

In addition to logging, there are also some metrics for 429 responses. Right now, those are only available globally (see the last graph in https://grafana-admin.wikimedia.org/dashboard/db/pageviews); https://github.com/wikimedia/hyperswitch/pull/46 will refine this to provide per-route metrics for 429s as well.

The rate limiting is breaking my bot.

@Antigng_, could you throttle your bot so that it sends fewer than 10 requests per second? We are trying to make sure that all users of the pageview API get reasonable performance and low error rates, and 10 requests per second per client is roughly what the backend can currently sustain.

@Antigng_: this limit will increase once we get our SSDs set up and make a few more improvements, but as it is, people get 500 errors whenever too many clients go over that limit.

I could reduce the concurrency by lowering the number of threads in the pool (currently 50). But what if another bot task running on the same node exceeds the rate limit?

@Antigng_, request rates are limited per IP address, so multiple bots running on the same host share the quota.

https://lists.wikimedia.org/pipermail/wikitech-l/2016-June/085850.html says that 429 is now returned, so this is a duplicate of T125345#2294741. Thanks for finally addressing my request!