Enable rate limiting on pageview API
Closed, Resolved · Public · 5 Story Points

Description

Enable rate limiting on pageview API

Please see: https://github.com/wikimedia/service-runner#rate-limiting

Nuria created this task. May 13 2016, 3:22 PM
Restricted Application added subscribers: Zppix, Aklapper. May 13 2016, 3:22 PM
Milimetric moved this task from Next Up to In Progress on the Analytics-Kanban board.
Milimetric triaged this task as Normal priority.
Nuria added a comment. May 17 2016, 3:48 AM

Where is this logging to?

As mentioned on IRC, you'll probably want to enable global rate limiting by adding a section in config.yaml like this: https://github.com/wikimedia/restbase/pull/613/files#diff-541b6e195e9da580fc0bd42761ee761fR127

The seeds should contain the main IPs of at least one AQS node (the one RB listens on). You can, but don't need to, list all of them.
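
For illustration, the relevant part of such a config.yaml section might look roughly like this (a sketch only; the seed hostname is a placeholder standing in for an AQS node, and the exact shape is in the pull request linked above):

```yaml
# Illustrative sketch only; the seed hostname below is a placeholder
# standing in for the IP of an AQS node.
ratelimiter:
  type: kademlia
  seeds:
    - aqs-node-1.example.internal   # at least one AQS node; listing all of them is optional
```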

Nuria added a comment (edited). May 18 2016, 4:15 PM

@GWicke: Per our conversation on IRC: "nuria_: it's very unlikely to trigger, tbh as discussed with @Milimetric, the limits are enforced per worker, and are set relatively high, next step will be to enable kademlia, which makes the limits global for the cluster"

Are these workers Node workers of the front-end RESTBase (outside the firewall)? Because it seems to me that what we want are limits per cluster.

Nuria added a comment. May 18 2016, 4:22 PM

@GWicke: can you explain a bit what a "worker" is in this context?

@Nuria, as described in the documentation, the default backend is a simple in-memory backend. This enforces request limits per service-runner instance, so in AQS's case per physical node (and across service-runner workers).

You are probably interested in limiting rates across the entire API, hence my recommendation to enable the Kademlia backend in the config.

Nuria added a comment. May 18 2016, 6:21 PM

@GWicke: the documentation in this regard is really meager; it seems to me that to understand how it works you need prior knowledge of the inner workings of the service. In this case, from your comment I understand that workers are independent processes that do not share memory (maybe this is obvious, but making the docs a bit clearer couldn't hurt).

If that is the case, I am not sure why we would ever want rate limiting per worker, given that this rate-limiting configuration exists (from what I can see) outside of the configuration that sets the number of workers we have.

We are interested in rate limiting the entire API on a per-method basis, for which it seems Kademlia is the storage and synchronization point for request counts.

Would you be so kind as to provide a sample Kademlia configuration we can use, or describe in detail in the docs the meaning of the different configuration parameters?

Namely:

- seeds (cluster node seeds): port 3050 is used by default; entries can be a bare IP (e.g. 192.168.88.99) or an address/port pair (e.g. address: some.host.com, port: 6030)
- listen (optional): the address/port to listen on; default localhost:3050, with a random-port fallback if the port is in use (address: localhost, port: 3050)
- interval: the counter update / block interval; default 10000 ms (interval: 10000)
@Nuria, I walked through all of this with @Milimetric, and also gave you a link to an example config in T135240#2302880. My recollection from that conversation is that @Milimetric was planning to deploy this per-node initially, and add the kademlia config in a second step. I'm not sure what the status of his efforts is, so perhaps you could talk to him?

Nuria added a comment. May 18 2016, 6:50 PM

@GWicke: we already synced up on this, and both @Milimetric and I agreed that we want limits for the "entire" API.

What I (mis)understood from the IRC conversation we had was that the "basic kademlia configuration" as described in the documentation was already enabled on the RESTBase cluster that uses the metrics module. So I thought we didn't need to do anything.

Either way, that config isn't in our control, since it's not AQS that is doing the rate limiting in this case.

So do you need us to submit a puppet change to add the kademlia configuration to the RESTBase cluster? I did search the running config and it seems to not be enabled yet: https://github.com/wikimedia/operations-puppet/search?utf8=%E2%9C%93&q=kademlia

Nuria added a comment. May 19 2016, 3:40 PM

@GWicke: are you able to provide a sample Kademlia config?

Nuria claimed this task. May 19 2016, 4:09 PM
Nuria set the point value for this task to 5.
Nuria added a subscriber: Milimetric.

@Nuria: Again, T135240#2302880 has a link to a sample config, and instructions on which values to set for the peers.

If you would like us to set up & deploy a config for you, then I think we can do that after setting one up for the regular RESTBase install. The puppet work for that is likely to happen in the next days; @mobrovac should be able to help you with this.

Nuria added a comment (edited). May 20 2016, 3:37 PM

@GWicke:

If you would like us to set up & deploy a config for you, then I think we can do that after setting one up for the regular RESTBase install. The puppet work for that is likely to happen in the next days; @mobrovac should be able to help you with this.

Ahem... I thought the rate limiting for the whole API was ready to be used; is there any puppet change needed?

Nuria added a comment. May 20 2016, 8:30 PM

Ahem... I thought the rate limiting for the whole API was ready to be used; is there any puppet change needed?

Answering my own question: it seems that the distributed hash table that will store the throttling counters needs the names of the actual service machines, so enabling rate limiting will always need puppet if you want to define throttling per API, not per worker. It seems odd that some of the throttling config lives in service-runner and some in puppet, though.

Gerrit 290264 aims at enabling global rate limiting for RESTBase. Once that is out in prod, we will be able to enforce per-endpoint global rate limits. Note that in the first phase we will only be logging offending requests in order to make sure limiting works as expected. In a second phase, the configured limits will be actively enforced, at which point AQS will stop seeing these requests altogether.

Gerrit 290264 aims at enabling global rate limiting for RESTBase.

This is now live in prod. We'll monitor the logs for a couple of days and then start enforcing the limits.

Rate limiting has been enabled in prod for about 6 hours now, and not a single excess has been logged, all the while AQS kept responding with 500s because Cassandra there cannot handle the load. I would propose reducing the number of requests per second per client to ~10 [1] and seeing what gets logged.

[1] That would amount to a maximum of 40 requests per second per client globally (4 endpoints times 10 requests each).

Nuria added a comment. May 24 2016, 4:47 PM

What I do not understand is what exactly is enabled here:

@Nuria: the DHT is enabled in production right now. All the limits are per endpoint, not per worker.

Nuria added a comment. May 24 2016, 6:29 PM

@Pchelolo: so I should be able to see limits getting logged if I run a test using Apache Bench, correct?

@Nuria: not really; pageviews are Varnish-cached, so with ab only the first request will actually hit the servers. You would need to request some new URI, or alter some random, meaningless query parameter on every request.

Anyway, I would not recommend doing that - 100 req/s is a lot, and given that the AQS Cassandra cluster is not in the best shape even with a normal request rate, abusing it with ab could cause an outage.

We should think about more conservative limits for AQS first, like 10 req/s.

Nuria added a comment. May 24 2016, 6:48 PM

@Pchelolo: well, I was going to "replay" requests to the article endpoint, which is really easy to do as we have them all. Looking at Grafana, it seems that when requests hit RESTBase at higher than 30 req/s on the article endpoint we run into issues with 500s being thrown. Thus it seems that 10 req/s is an OK limit.

Should I send a PR to change it here: https://github.com/wikimedia/restbase/blob/master/v1/metrics.yaml#L100

or .. is there any other place where this limit needs to be changed?

Should I send a PR to change it here: https://github.com/wikimedia/restbase/blob/master/v1/metrics.yaml#L100

or .. is there any other place where this limit needs to be changed?

@Nuria: no, this is the only place. Please feel free to send a PR; I picked 10 req/s off the top of my head, so you should know better what the best number is for you.
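
For context, the change being discussed is a one-value edit to the per-endpoint limit in RESTBase's v1/metrics.yaml. The snippet below is purely a hypothetical illustration; the key names do not reproduce the actual spec syntax at that line:

```yaml
# Hypothetical illustration only; these key names do NOT match the real
# v1/metrics.yaml syntax. The actual change is simply lowering the
# per-endpoint requests-per-second value referenced at metrics.yaml#L100.
per-article:
  rate_limit:
    requests_per_second: 10   # proposed value; applied per client, cluster-wide
```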

Nuria added a comment. May 24 2016, 7:17 PM

@Pchelolo: the number at which we want to start throttling would be ~25-30 requests per second for the whole cluster (which has three machines, if I am not mistaken).

Thus, if that setting in https://github.com/wikimedia/restbase/blob/master/v1/metrics.yaml#L100 is per cluster, it should be set to '30', correct?

@Nuria: this is a per-endpoint limit. You have 3 endpoints, so 10 would be a good number.

The reduced limits were deployed earlier today, and warnings are now logged when the rate is surpassed globally. This dashboard shows the excesses.

@MusikAnimal: I was initially worried about this affecting http://tools.wmflabs.org/massviews/, but since we're already inserting 100 millisecond delays between requests, I think we should be OK.

We plan to start enforcing the limits next week. I think we should start with the currently-set limits (10 reqs/s/endpoint/client). We can then tune them as appropriate later on. @Nuria ok with you?

RB logged a total of 30K excesses in the last 24h. The requests seem to come in bursts. @kaldari, the logs indicate the UA in the last burst was Mozilla; could this be connected to your Massviews tool?

Nuria added a comment. May 26 2016, 2:39 PM

@mobrovac: I think the last burst had its UA as Ruby; we know some popular tools on Labs cause bursts too, but there are many automated scripts out there pulling from the API. Regardless, enforcing the current limits seems OK to me. I just want to make sure we will still be logging them so we know how frequently they trigger.

GWicke added a comment (edited). May 26 2016, 6:09 PM

Let's also document these limits for each API endpoint by adding a list item alongside Stability.

Edit: PR at https://github.com/wikimedia/restbase/pull/622.
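
As a rough illustration (the wording is hypothetical; the actual text is in the PR above), such a list item in an endpoint description could look like:

```yaml
# Hypothetical wording; the real change is in the restbase PR linked above.
description: |
  Given a project and an article, returns daily pageview counts.
  - Stability: stable
  - Rate limit: 10 requests per second per client
```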

mobrovac closed this task as Resolved. Jun 1 2016, 5:37 AM

Rate limiting has been enabled in production, so resolving. Let's monitor the situation and adjust the limits separately, if needed.

Nuria added a comment. Jun 1 2016, 10:17 PM

@mobrovac: where can we see how often we are running into the throttling limits? The dashboard seems pretty empty since this morning: https://logstash.wikimedia.org/#/dashboard/temp/AVToXWJes_MKeI4jrSeM

Nuria added a comment. Jun 1 2016, 10:27 PM

Never mind, throttling is no longer being logged. It should be, as we need to know how often it happens, so I have filed a bug about this: https://phabricator.wikimedia.org/T136769

In addition to logging, there are also some metrics for 429 responses. Right now, those are only available globally (see the last graph in https://grafana-admin.wikimedia.org/dashboard/db/pageviews); https://github.com/wikimedia/hyperswitch/pull/46 will refine this to provide per-route metrics for 429s as well.

The rate limiting is breaking my bot.

GWicke added a comment (edited). Jun 2 2016, 3:42 AM

@Antigng_, could you throttle your bot so that it sends fewer than 10 requests per second? We are trying to make sure that all users of the pageview API get reasonable performance and low error rates, and 10 requests per second per client is roughly what the backend can currently sustain.

@Antigng_: this limit will increase once we get our SSDs set up and make a few more improvements, but as it is, people are getting 500 errors whenever too many clients go over that limit.

I could reduce the concurrency by lowering the number of threads in the pool. (Current is 50.) But what if another bot task running on the same node exceeds the rate limit?

@Antigng_, request rates are limited per IP address, so multiple bots running on the same host share the quota.

https://lists.wikimedia.org/pipermail/wikitech-l/2016-June/085850.html says that 429 is now returned, so this is a duplicate of T125345#2294741. Thanks for finally addressing my request!

Nuria moved this task from Ready to Deploy to Done on the Analytics-Kanban board. Jul 12 2016, 3:39 PM