Intro/Backstory
Back in 2016, the then Services team identified the need for RESTBase to have rate-limiting
functionality in order to protect both itself and the services it proxies to from overload.
The functionality was designed to be fairly generic and implemented in service-runner so
that it could be used by other applications as well.
Implementation
What was implemented back then was a pluggable ratelimiter
(https://github.com/wikimedia/limitation/commits/master) with a memory and a kademlia
backend. The memory backend is generally useful for local development; in production we
use the kademlia backend. The kademlia backend is a distributed hash table (DHT)
communicating over UDP on a configurable port. Messages exchanged during the initial
negotiation look more or less like the following:
  {
    "jsonrpc": "2.0",
    "id": "9875eba68e4ee4c8a2b89ca16f5e3b7935ce7ae7",
    "method": "FIND_NODE",
    "params": {
      "key": "8486d712a57e276a5f35319bb67d3df9491db322",
      "contact": {
        "address": "10.64.16.125",
        "port": 3050,
        "nodeID": "8486d712a57e276a5f35319bb67d3df9491db322",
        "lastSeen": 1570628627206
      }
    }
  }
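To make the two backends more concrete, here is a minimal sketch of what a memory backend conceptually does: a fixed-window counter kept in process memory. All names here are illustrative and are not the actual API of the limitation library.

```javascript
// Minimal sketch of an in-memory fixed-window rate limiter, conceptually
// similar to the memory backend described above. Names are illustrative,
// not the limitation library's real API.
class MemoryRateLimiter {
  constructor(limit, intervalMs) {
    this.limit = limit;           // max allowed hits per window
    this.intervalMs = intervalMs; // window length in milliseconds
    this.counters = new Map();    // key -> { windowStart, count }
  }

  // Returns true if the request identified by `key` is within the limit.
  isAllowed(key, now = Date.now()) {
    const entry = this.counters.get(key);
    if (!entry || now - entry.windowStart >= this.intervalMs) {
      // New key or expired window: start a fresh window.
      this.counters.set(key, { windowStart: now, count: 1 });
      return true;
    }
    entry.count += 1;
    return entry.count <= this.limit;
  }
}

// Usage: at most 2 requests per second for a given client key.
const limiter = new MemoryRateLimiter(2, 1000);
console.log(limiter.isAllowed('10.0.0.1')); // true
console.log(limiter.isAllowed('10.0.0.1')); // true
console.log(limiter.isAllowed('10.0.0.1')); // false
```

The kademlia backend replaces the local Map with counters shared across the DHT, which is exactly where the state-distribution problems discussed below come from.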
Over the years this has worked rather well in production, with just some puppet work
needed to populate the seeds parameter with the nodes that are meant to be in the DHT,
e.g.
  ratelimiter:
    type: kademlia
    listen:
      address: 10.64.16.125
      port: 3050
    seeds: ['restbase1016.eqiad.wmnet', 'restbase1018.eqiad.wmnet', 'restbase1019.eqiad.wmnet',
            'restbase1020.eqiad.wmnet', 'restbase1021.eqiad.wmnet', 'restbase1022.eqiad.wmnet',
            'restbase1023.eqiad.wmnet', 'restbase1024.eqiad.wmnet', 'restbase1025.eqiad.wmnet',
            'restbase1026.eqiad.wmnet', 'restbase1027.eqiad.wmnet']
There are 3 interesting things to note in all of this:
- It essentially stores state in memory.
- It requires each node to know all the other nodes in order to do so effectively.
- The other nodes are populated via the configuration file.
Present day
Fast forward to today and we now have Kubernetes. Applications on Kubernetes are
ephemeral in nature: pods can be instantiated or go away at any point in time, and it is
almost certain that new pods will have different IPs than the old ones.
Now, to add something more to the mix: if you look at that ratelimiter stanza above, the
listen.address field has an IP address in it. Funnily enough, it's not JUST the listen
address; it's also the IP that gets advertised to other nodes (the protocol message above
contains an address field as well). So simply putting 0.0.0.0 in the listen.address field
won't work, as the node would be telling other nodes to find it at 0.0.0.0 ;-). But we need
to put 0.0.0.0 there, as that's the only way we currently have to make the software bind
on all interfaces.
So, we are in the following predicament: we want to populate the configuration file of
RESTRouter with information we don't have handy, and which is used in 2 different ways.
There's more. We also want to populate the seeds field of that structure. We have worked
around this: we use a DNS A record that is bound to return the IPs of the various pods.
This means that when a pod first starts, it will try to reach out to one other pod. From
that point on, in the best case scenario it will fully join the DHT network, but it is
theoretically possible (needs to be proved?) that there are race conditions where a pod
may find itself disconnected from the global DHT because the pod it initially talked to
never answered in time. The pathological case would be the global DHT splitting into many
disjoint DHT networks of, say, 2 to 4 pods each, which don't really talk to each other.
This could end up being really problematic for our rate limits, as none would really be
honored. Also, as far as I can tell, there is no way to inspect the state of the DHT,
meaning we have minimal visibility into it, which makes a case like that pretty difficult
to diagnose.
Note that, per my tests, there isn't really any retry. Joining is attempted once, when
the restrouter software starts. If it fails to join a DHT network pretty much
immediately, it will not retry.
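For reference, the DNS A record workaround described above can be provided in Kubernetes by a headless Service, which makes the cluster DNS return the individual pod IPs rather than a single virtual service IP. A sketch, with names, labels, and port chosen purely for illustration:

```yaml
# Sketch of a headless Service; name, labels, and port are illustrative.
# clusterIP: None makes cluster DNS return A records for the individual
# pod IPs instead of one virtual service IP.
apiVersion: v1
kind: Service
metadata:
  name: restbase-dht-seeds
spec:
  clusterIP: None           # headless: DNS resolves to pod IPs
  selector:
    app: restbase
  ports:
    - name: ratelimiter
      port: 3050
      protocol: UDP
```

Each new pod resolving that record gets whatever pods happen to exist at that instant, which is exactly why the race conditions above are plausible.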
Solutions
There are some ways we could address some of the issues, but none of them works as-is.
It is possible to inject into a pod the IP it was assigned, by way of the Kubernetes
downward API, but that essentially means either creating files in a very specific format
or injecting the IP into an environment variable. Neither of the 2 will work with
restbase as-is, since it lacks support for either. It should be easy to add, but there is
the question of whether that's prudent, since we would be meddling with that part of the
code, and there are a number of other approaches that would make sense as well.
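For reference, the downward API injection mentioned above looks roughly like this in a container spec. The variable name POD_IP is my own choice, not something restbase understands today:

```yaml
# Fragment of a container spec using the Kubernetes downward API to expose
# the pod's own IP as an environment variable. POD_IP is an arbitrary name
# chosen here for illustration; restbase has no support for reading it yet.
env:
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
```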
- We could also just enumerate the network interfaces in nodejs code and grab the first one that is not 127.0.0.1. In a k8s environment this would work fine; in old-fashioned dev environments with possibly many interfaces? Not so much.
- We could just ditch the idea of the shared datastore for the rate limits, calculate some hardcoded, local, in-memory ones, and rely on those. Very duct-tape and brittle overall; and with the rate limits living in the restbase code repo and not in the helm charts, there would be 2 different places to change every time we want to add/remove capacity.
- Have restbase become Kubernetes-aware: asking the API about the other pods, about its own IP upon initialization, and so on. Way, way too involved as far as I am concerned; many new dependencies in the software just for this functionality. Plus, if we are going to go that way and talk to another API (mind you, a fast-moving one with 4 major versions released annually) upon init, we might as well just move the ratelimiting datastore out of restbase anyway and store it in Kubernetes (to be clear, let's not do this).
- We could just ditch the idea of the shared datastore and do what mediawiki/thumbor/ORES do: store the state in a different service, aka poolcounter[1]. This is by far the best approach in my mind. It moves the state outside of Kubernetes, does away with all the crappy complexity, and uses a datastore that we know is very resilient, was designed for this use case, and has had a stable API since 2009.