Intro/Backstory
Back in 2016, the then Services team identified the need for RESTBase to have rate-limiting
functionality in order to protect both itself and the services it proxies to from overload.
The functionality was designed to be fairly generic and implemented in service-runner so
that it could be used by other applications as well.
Implementation
What was implemented back then was a pluggable ratelimiter
(https://github.com/wikimedia/limitation/commits/master) with a memory and a kademlia
backend. The memory backend is generally useful for local development; in production we
use the kademlia backend. The kademlia backend is a distributed hash table (DHT)
communicating over UDP on a configurable port. Messages exchanged during the initial
negotiation look more or less like the following:
  {
    "jsonrpc": "2.0",
    "id": "9875eba68e4ee4c8a2b89ca16f5e3b7935ce7ae7",
    "method": "FIND_NODE",
    "params": {
      "key": "8486d712a57e276a5f35319bb67d3df9491db322",
      "contact": {
        "address": "10.64.16.125",
        "port": 3050,
        "nodeID": "8486d712a57e276a5f35319bb67d3df9491db322",
        "lastSeen": 1570628627206
      }
    }
  }
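To make the two backends more concrete, here is a minimal sketch of what a memory backend conceptually does: a fixed-window counter kept in process memory. All names here are illustrative and are not the actual API of the limitation library.

```javascript
// Minimal sketch of an in-memory fixed-window rate limiter, conceptually
// similar to the memory backend described above. Names are illustrative,
// not the limitation library's real API.
class MemoryRateLimiter {
  constructor(limit, intervalMs) {
    this.limit = limit;           // max allowed hits per window
    this.intervalMs = intervalMs; // window length in milliseconds
    this.counters = new Map();    // key -> { windowStart, count }
  }

  // Returns true if the request identified by `key` is within the limit.
  isAllowed(key, now = Date.now()) {
    const entry = this.counters.get(key);
    if (!entry || now - entry.windowStart >= this.intervalMs) {
      // New key or expired window: start a fresh window.
      this.counters.set(key, { windowStart: now, count: 1 });
      return true;
    }
    entry.count += 1;
    return entry.count <= this.limit;
  }
}

// Usage: at most 2 requests per second for a given client key.
const limiter = new MemoryRateLimiter(2, 1000);
console.log(limiter.isAllowed('10.0.0.1')); // true
console.log(limiter.isAllowed('10.0.0.1')); // true
console.log(limiter.isAllowed('10.0.0.1')); // false
```

The kademlia backend replaces the local Map with counters shared across the DHT, which is exactly where the state-distribution problems discussed below come from.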
Over the years this has worked rather well in production, with just some puppet work
needed to populate the seeds parameter with the nodes that are meant to be in the DHT,
e.g.
  ratelimiter:
    type: kademlia
    listen:
      address: 10.64.16.125
      port: 3050
    seeds: ['restbase1016.eqiad.wmnet', 'restbase1018.eqiad.wmnet', 'restbase1019.eqiad.wmnet',
            'restbase1020.eqiad.wmnet', 'restbase1021.eqiad.wmnet', 'restbase1022.eqiad.wmnet',
            'restbase1023.eqiad.wmnet', 'restbase1024.eqiad.wmnet', 'restbase1025.eqiad.wmnet',
            'restbase1026.eqiad.wmnet', 'restbase1027.eqiad.wmnet']
There are 3 interesting things to note in all of this:
- It essentially stores state in memory.
- It requires each node to know all the other nodes in order to do so effectively.
- The other nodes are populated via the configuration file.
Present day
Fast forward to today and we now have Kubernetes. Applications on Kubernetes are
ephemeral in nature: pods can be instantiated or go away at any point in time, and it is
almost certain that new pods will have different IPs than the old ones.
Now, to add something more to the mix: if you look at that ratelimiter stanza above, the
listen.address field has an IP address in it. Funnily enough, it's not JUST the listen
address; it's also the IP that gets advertised to other nodes (the protocol message above
contains an address field as well). So simply putting 0.0.0.0 in the listen.address field
won't work, as the node would be telling other nodes to find it at 0.0.0.0 ;-). But we need
to put 0.0.0.0 there, as that's the only way we currently have to make the software bind
on all interfaces.
So, we are in the following predicament: we want to populate the configuration file of
RESTRouter with information we don't have handy, and which is used in 2 different ways.
There's more. We also want to populate the seeds field of that structure. We have worked
around this: we use a DNS A record that is bound to return the IPs of the various pods.
This means that when a pod first starts, it will try to reach out to one other pod. From
that point on, in the best case scenario it will fully join the DHT network, but it is
theoretically possible (needs to be proved?) that there are race conditions where a pod
may find itself disconnected from the global DHT because the pod it initially talked to
never answered in time. The pathological case would be the global DHT splitting into many
disjoint DHT networks of, say, 2 to 4 pods each, which don't really talk to each other.
This could end up being really problematic for our rate limits, as none would really be
honored. Also, as far as I can tell, there is no way to inspect the state of the DHT,
meaning we have minimal visibility into it, which makes a case like that pretty difficult
to diagnose.
Note that, per my tests, there isn't really any retry. Joining is attempted once, when
the restrouter software starts. If it fails to join a DHT network pretty much
immediately, it will not retry.
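For reference, the DNS A record workaround described above can be provided in Kubernetes by a headless Service, which makes the cluster DNS return the individual pod IPs rather than a single virtual service IP. A sketch, with names, labels, and port chosen purely for illustration:

```yaml
# Sketch of a headless Service; name, labels, and port are illustrative.
# clusterIP: None makes cluster DNS return A records for the individual
# pod IPs instead of one virtual service IP.
apiVersion: v1
kind: Service
metadata:
  name: restbase-dht-seeds
spec:
  clusterIP: None           # headless: DNS resolves to pod IPs
  selector:
    app: restbase
  ports:
    - name: ratelimiter
      port: 3050
      protocol: UDP
```

Each new pod resolving that record gets whatever pods happen to exist at that instant, which is exactly why the race conditions above are plausible.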
Solutions
There are some ways we could address some of the issues, but none of them works as-is.
It is possible to inject into a pod the IP it was assigned, by way of the Kubernetes
downward API, but that essentially means either creating files in a very specific format
or injecting the IP into an environment variable. Neither of the 2 will work with
restbase as-is, since it lacks support for either. It should be easy to add, but there is
the question of whether that's prudent, since we would be meddling with that part of the
code, and there are a number of other approaches that would make sense as well.
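For reference, the downward API injection mentioned above looks roughly like this in a container spec. The variable name POD_IP is my own choice, not something restbase understands today:

```yaml
# Fragment of a container spec using the Kubernetes downward API to expose
# the pod's own IP as an environment variable. POD_IP is an arbitrary name
# chosen here for illustration; restbase has no support for reading it yet.
env:
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
```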
- We could also just enumerate the network interfaces in nodejs code and grab the first one that is not 127.0.0.1. In a k8s environment this would work fine; in old-fashioned dev environments with possibly many interfaces? Not so much.
- We could just ditch the idea of the shared datastore for the rate limits, calculate some hardcoded, local, in-memory ones, and rely on those. Very duct-tape and brittle overall; and with the rate limits living in the restbase code repo and not in the helm charts, there would be 2 different places to change every time we want to add/remove capacity.
- Have restbase become Kubernetes-aware: asking the API about the other pods, about its own IP upon initialization, and so on. Way, way too involved as far as I am concerned; many new dependencies in the software just for this functionality. Plus, if we are going to go that way and talk to another API (mind you, a fast-moving one with 4 major versions released annually) upon init, we might as well just move the ratelimiting datastore out of restbase anyway and store it in Kubernetes (to be clear, let's not do this).
- We could just ditch the idea of the shared datastore and do what mediawiki/thumbor/ORES do: store the state in a different service, aka poolcounter[1]. This is by far the best approach in my mind. It moves the state outside of Kubernetes, does away with all the crappy complexity, and uses a datastore that we know is very resilient, was designed for this use case, and has had a stable API since 2009.