
RESTBase/RESTRouter/service-runner rate limiting plans
Closed, Declined · Public

Description

Intro/Backstory

Back in 2016, the then Services team identified the need for RESTBase to have rate-limiting
functionality in order to protect itself and the services it proxies to from overload. The functionality was meant to be
rather generic and implemented in service-runner so that it would be usable by other applications
as well.

Implementation

What was implemented back then was a pluggable ratelimiter
(https://github.com/wikimedia/limitation/commits/master) with a memory backend and a kademlia
backend. The memory backend is mostly useful for local development; in production we
use the kademlia backend. The kademlia backend is a
distributed hash table (DHT)
communicating over UDP on a configurable port. The messages exchanged during the first
negotiation look more or less like the following:

{
  "jsonrpc": "2.0",
  "id": "9875eba68e4ee4c8a2b89ca16f5e3b7935ce7ae7",
  "method": "FIND_NODE",
  "params": {
    "key": "8486d712a57e276a5f35319bb67d3df9491db322",
    "contact": {
      "address": "10.64.16.125",
      "port": 3050,
      "nodeID": "8486d712a57e276a5f35319bb67d3df9491db322",
      "lastSeen": 1570628627206
    }
  }
}

Over the years this has worked reasonably well in production, with just some puppet work
needed to populate the seeds parameter with the nodes that are meant to be in the DHT,
e.g.

ratelimiter:
  type: kademlia
  listen:
    address: 10.64.16.125
    port: 3050
  seeds: ['restbase1016.eqiad.wmnet', 'restbase1018.eqiad.wmnet', 'restbase1019.eqiad.wmnet', 'restbase1020.eqiad.wmnet',
    'restbase1021.eqiad.wmnet', 'restbase1022.eqiad.wmnet', 'restbase1023.eqiad.wmnet', 'restbase1024.eqiad.wmnet',
    'restbase1025.eqiad.wmnet', 'restbase1026.eqiad.wmnet', 'restbase1027.eqiad.wmnet']

There are 3 interesting things in all of this.

  1. It's essentially storing state in memory.
  2. It requires each node to know all the other nodes in order to do so effectively.
  3. The other nodes are populated via the configuration file.

Present day

Fast forward to today and we now have Kubernetes. Applications on Kubernetes are of an
ephemeral nature: pods can get instantiated or go away at any point in time, and it is
almost certain that new pods will have different IPs from the old ones.

Now, to add something more to the mix, if you look at that ratelimiter stanza above, the
listen.address field has an IP address in it. Funnily enough, it's not JUST the listen address, it's
also the IP that gets advertised (in the protocol message above there's an address field
as well). So just putting 0.0.0.0 in that listen.address field won't work, as the node will
be telling other nodes to find it at 0.0.0.0 ;-). But we need to put 0.0.0.0 there, as that's the only way
we currently have to make the software bind on all interfaces.

So, we are in the following predicament: we want to populate the configuration file of
RESTRouter with information we don't have handy, and which is used in two different ways (binding and advertising).

There's more. We also want to populate the seeds field of that structure. We have worked around it in a
way: we are using a DNS A record that returns the IPs of the various pods. This
means that when a pod first starts it will try to reach out to one other pod. From
that point on, in the best case scenario it will fully join the DHT network, but it is
theoretically possible (needs to be proven?) that there are race conditions where a pod
might find itself disconnected from the global DHT because the pod it talked to initially
never answered in time. The pathological case would be to have the global DHT
split into many disjoint DHT networks of, say, 2 to 4 pods which don't really talk to each
other. This could end up being really problematic for our rate limits, as none would really be honored. Also,
as far as I can tell, there is no way to inspect the state of the DHT, meaning minimal visibility into it, making
a case like that pretty difficult to diagnose.

Note that there isn't really any retry, per my tests. Joining is attempted once when the RESTRouter
software starts; if it fails to join a DHT network pretty much immediately, it will not retry.

Solutions

There are some ways that we can address some of the issues, but none would work as is.

It is possible to inject into a pod the IP it was assigned by way of the Kubernetes
downward API, but that essentially means either creating files in a very specific format or injecting
the IP into an environment variable. Neither of the two will work with RESTBase as is,
as it lacks support for either. It should be easy to add, but there is the question
of whether that's prudent, since we would be meddling with that part of the code; there are a number of
other approaches that would make sense as well.
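
To make that concrete, here is a rough sketch of what such support could look like on the Node side, assuming the pod IP gets injected as an environment variable (e.g. via fieldRef: status.podIP in the pod spec). The loadConfig helper, the POD_IP variable name and the js-yaml usage below are illustrative only, not existing RESTBase/service-runner code:

'use strict';

const fs = require('fs');
const yaml = require('js-yaml');

// Hypothetical config-loading step: substitute the pod IP injected by the
// Kubernetes downward API (env var POD_IP, fieldRef: status.podIP) into the
// ratelimiter stanza, so the same address is used for binding and for the
// contact address advertised to the other DHT nodes.
function loadConfig(path) {
    const config = yaml.load(fs.readFileSync(path, 'utf8'));
    const podIP = process.env.POD_IP;
    if (podIP && config.ratelimiter && config.ratelimiter.type === 'kademlia') {
        config.ratelimiter.listen.address = podIP;
    }
    return config;
}

module.exports = loadConfig;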

  • We could also just enumerate the interfaces in Node.js code and grab the first one that is not 127.0.0.1 (see the sketch after this list). In a k8s environment that would work fine; in old-fashioned dev environments with perhaps many interfaces? Not so much.
  • We could just ditch the idea of a shared datastore for the rate limits, calculate some hardcoded, local, in-memory ones and rely on those. Very duct-tape and brittle overall, and with the rate limits living essentially in the RESTBase code repo and not the helm charts, there would be two different places to change every time we want to add/remove capacity.
  • Have RESTBase become Kubernetes-aware, asking the API about the other pods, about its own IP upon initialization and so on. Way, way too involved as far as I am concerned: many new dependencies in the software just for this functionality. Plus, if we are going to go that way and talk to another API (mind you, a fast-moving one with 4 major versions released annually) upon init, we might as well just move the rate-limiting datastore out of RESTBase anyway and store it in Kubernetes (to be clear, let's not do this).
  • We could just ditch the idea of the shared datastore and do what MediaWiki/Thumbor/ORES do: store the state in a different service, aka PoolCounter[1]. This is by far the best approach in my mind. It moves the state outside of Kubernetes, does away with all the crappy complexity, uses a datastore we know is very resilient and designed for this use case, and has had a stable API since 2009.

[1] https://www.mediawiki.org/wiki/PoolCounter
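
For illustration, the interface-enumeration option mentioned in the list above could look roughly like the following in Node (a sketch of the idea, not code that exists anywhere today):

'use strict';

const os = require('os');

// Return the first non-internal IPv4 address, falling back to the loopback
// address. In a single-interface pod this is deterministic; on a dev box
// with several interfaces the "first" one is essentially arbitrary, which
// is exactly the drawback pointed out above.
function firstExternalIPv4() {
    const interfaces = os.networkInterfaces();
    for (const name of Object.keys(interfaces)) {
        for (const iface of interfaces[name]) {
            if (!iface.internal && iface.family === 'IPv4') {
                return iface.address;
            }
        }
    }
    return '127.0.0.1';
}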

Event Timeline

For what it's worth, the PoolCounter approach is probably the saner one long term. And per https://www.mediawiki.org/wiki/PoolCounter, the protocol is simple enough that putting together a PoC to gauge whether it is a valid replacement shouldn't take too much work.

I agree with the PoolCounter solution :)

The kad library that the DHT rate limiter is based on was forked. Since it worked OK, the fork was never actually updated, and in the time that's passed the upstream library has changed names twice and now looks quite abandoned. So moving away from it was inevitable; thus I'm all for the PoolCounter solution.

We could just use some existing Redis-backed limiter, but if we have a backend deployed, I guess why not use it.

So here are some options that we could consider.

Kademlia / DHT

As stated above (and in T200374: Update indirect dependency on github.com/gwicke/kad.git), the current rate-limiting sub-system used by service-runner needs updating. The current version is still maintained and it shouldn't be too hard to update it. That would entail changing some internal ways of doing things, but the interface exposed by service-runner could likely be kept intact. More generally, DHTs are a good fit for dynamic environments (especially considering churn), and using this approach has been shown to be rather light-weight and flexible. There would need to be some work to support it in k8s (the most important point being having one pod that can be contacted by all other pods, even if that pod is about to leave the network). Finally, the updated upstream project requires Node 10, which means we would need to move to it before completing this task.

Pro
  • already in use by service-runner
  • keeps service-runner's rate-limiting interface the same, so no changes needed to clients
  • updated, maintained dependencies are available
  • the principles behind the solution are natively scalable
  • self-contained (no extra infrastructure needed to support it)
  • each service using service-runner's rate limiter has its own rate-limiting network
Contra
  • requires some extra work up-front to make it play well with k8s
  • services using the rate limiter would need to switch to Node 10 first

Redis-backed solution

For the purposes of rate limiting, a Redis-backed solution would also work pretty well. We have been using it in ChangeProp for detecting and blacklisting problematic titles, so we know this solution works at a large scale. The client library we are using there even has support for sliding windows. Since checking and setting values is done asynchronously, using this solution would require us to change the rate-limiting interface exposed by service-runner to clients, while also requiring extra infrastructure to be present (albeit already deployed and readily available in our production environment). Porting the existing CP code shouldn't be much of an effort.
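
To make the interface change concrete, here is a minimal sketch of what an asynchronous, Redis-backed check could look like, using a plain fixed-window counter with ioredis. The function name, key layout and Redis endpoint are assumptions for illustration; this is not the ChangeProp library referred to above:

'use strict';

const Redis = require('ioredis');
// Assumed endpoint; in production this would come from configuration.
const redis = new Redis({ host: '127.0.0.1', port: 6379 });

// Fixed-window limiter: allow at most `limit` hits per `windowSec` seconds
// for a given key. The answer only arrives asynchronously, which is the
// interface change that service-runner clients would have to absorb.
async function isAboveLimit(key, limit, windowSec) {
    const window = Math.floor(Date.now() / 1000 / windowSec);
    const bucket = `ratelimit:${key}:${window}`;
    const hits = await redis.incr(bucket);
    if (hits === 1) {
        // First hit in this window: let the counter expire with the window.
        await redis.expire(bucket, windowSec);
    }
    return hits > limit;
}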

Pro
  • already used in prod by CP
  • known to support a large number of keys
  • it should be trivial to port the code to service-runner
Contra
  • constitutes a breaking change in interface
  • clients would need to be adapted to the new paradigm
  • requires additional infrastructure

PoolCounter

Finally, there is PoolCounter. Its interface is very simple, and it does one job and does it well(TM). Given its specific protocol, we would need a NodeJS client library (libraries currently exist for PHP and Python), but that shouldn't take much effort. Like in the case of Redis, switching to PoolCounter would entail a breaking interface change, as well as external infrastructure, which is readily available in our production environment. The one job that PoolCounter does (and does so well) is being a global mutex daemon: it limits the lock to a client-defined number of processes and blocks the rest until a spot opens up. At the time of this writing, it is not clear to me whether connections can be kept persistent or not, possibly increasing the overhead if clients need to reconnect. In the context of service-runner, this would mean switching from rate-limiting semantics to concurrency-limiting semantics, i.e. instead of tracking how many times a client has accessed a resource, service-runner clients would need to start tracking how many times the resource is being accessed right now.
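
For a rough sense of the effort a NodeJS client would take, a minimal exchange could look like the sketch below, assuming the line-based TCP protocol documented on the PoolCounter page (ACQ4ME to acquire a slot, RELEASE to free it) and a placeholder host/port; this is illustrative only and glosses over connection reuse, error handling and the full set of responses:

'use strict';

const net = require('net');

// Acquire a PoolCounter slot for `key` (at most `workers` concurrent
// holders, at most `maxqueue` waiters, `timeout` seconds), run `work`,
// then release the slot. Response parsing beyond the first line is omitted.
function withPoolCounterLock(key, workers, maxqueue, timeout, work) {
    return new Promise((resolve, reject) => {
        // Placeholder host/port for illustration.
        const sock = net.connect(7531, 'poolcounter.example.org');
        sock.setEncoding('utf8');
        sock.once('data', (line) => {
            if (!line.startsWith('LOCKED')) {
                sock.end();
                return reject(new Error(`PoolCounter refused: ${line.trim()}`));
            }
            Promise.resolve(work()).then((result) => {
                sock.end(`RELEASE ${key}\n`);
                resolve(result);
            }, reject);
        });
        sock.write(`ACQ4ME ${key} ${workers} ${maxqueue} ${timeout}\n`);
    });
}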

Pro
  • battle-tested in production (used extensively by MW and Thumbor)
  • simple interface
Contra
  • no NodeJS library
  • service-runner interface breaking change
  • extra infrastructure needed
  • change of semantics (from rate- to concurrency-limiting)

Conclusion

Given all of the above, my recommendation would be to use a Redis-backed solution. It seems to me that in terms of effort and time required, it has the best cost/benefit ratio: we already use it and have a solid Node library we can use. I would discard the DHT on the basis of the required move to Node 10. PoolCounter, on the other hand, while a viable alternative, would change the semantics of limiting, and changing semantics when trying to solve a problem that is technical (implementational) in nature is always tricky. On top of that, switching from rate limiting to concurrency limiting in the context of our REST APIs is not as trivial as it may seem at first glance: asking for one page might take almost no effort while another might be very expensive, so we would need a more involved approach to calculating sensible limits that would both satisfy clients and not put too much strain on the infrastructure as a whole.

One detail I want to understand about the Redis hypothesis:

  • What happens if the rate-limiting service is unavailable or is lagging?
  • Does the redis implementation use lua code in Redis?

What happens if the rate-limiting service is unavailable or is lagging?

In ChangeProp, where we use it for deduplication, we simply ignore Redis issues. However, here that's not really an option. We can implement in-memory, per-process (possibly per-host, with tables maintained in the master process) limiting as a backup for Redis.
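
As an illustration of that idea, the fallback could be as simple as a per-process table consulted only when the Redis check fails. The names are hypothetical, and isAboveLimit refers to the Redis-backed sketch shown earlier in this task:

'use strict';

// Per-process fallback counters, keyed by rate-limit key and time window.
// Note: old windows are never pruned here; a real implementation would
// evict them (or share the table with the master process for per-host limits).
const localCounters = new Map();

function localIsAboveLimit(key, limit, windowSec) {
    const bucket = `${key}:${Math.floor(Date.now() / 1000 / windowSec)}`;
    const hits = (localCounters.get(bucket) || 0) + 1;
    localCounters.set(bucket, hits);
    return hits > limit;
}

// Consult Redis first; if it is unavailable or too slow, fall back to the
// local table so that at least per-process limits keep being enforced.
async function isAboveLimitWithFallback(key, limit, windowSec) {
    try {
        return await isAboveLimit(key, limit, windowSec);
    } catch (err) {
        return localIsAboveLimit(key, limit, windowSec);
    }
}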

Does the redis implementation use lua code in Redis?

There are several Redis-backed rate limiters available in Node; some use Lua, some do not. The ones using Lua are generally more feature-rich, with multiple sliding windows and more robust consistency guarantees, but if Lua is a problem we can choose a less feature-rich option.

I'm trying to interpret what this task being closed as declined means. Does it mean that the plan is to stick with (and hopefully, someday, update) the rate limiting subsystem currently in use (Kademlia/DHT)?

Hey @Mholloway, we are not porting RESTBase to k8s, so this becomes irrelevant. RESTBase will remain as is until it is sunset.

Ah, I see. My interest is specifically in service-runner as my understanding is that it will continue to be used by most or all Wikimedia Node.js services. I'm currently working on updating an open-source Node.js service for Wikimedia production and I'm wondering (a) if I should plan to incorporate service-runner as part of that work, and (b) if so, whether I can plan to use service-runner's existing rate-limiting facility or I should plan to look elsewhere for that.

service-runner itself is not going anywhere. DHT-based rate limiting, however, is likely to go out of the service-runner package. We might still use it in RESTBase as legacy, but AFAIK it's not used anywhere else, so we should probably pull it out. BTW, it will not work in k8s at all!
How's your new service gonna be exposed? Does it need its own rate limiter?

I think you should continue to utilize service-runner. One of two things will happen on or before we sunset RESTBase: either we'll apply rate-limiting downstream of such services, or we'll integrate a different rate-limiter implementation with service-runner.

I advise against using service-runner's existing rate-limiter. It just won't work in k8s without a major rework.

Ok, to avoid further confusion, I will do T249919

Thanks, everyone. The repo I'm working with is https://github.com/mdholloway/pushd, and based on your comments, I will plan to migrate it to use the Wikimedia node services template (and, of course, service-runner).

The plan is to expose one endpoint publicly for managing push subscriptions and one privately (i.e., internal to the cluster) for triggering notification requests. We'll want rate limiting for both of those before going live, but I don't think we need to know exactly how that rate limiting will be implemented just yet.