Currently, gdnsd's datacenter-level geoip failover is limited in how it conceives of failure scenarios. Regardless of whether we're using automatic or manual mapping of datacenter priority per-client-location, only the clients assigned to the failed datacenter get remapped to new locations. However, an optimal solution would often involve also shifting some load around between the remaining datacenters to make room for some of the incoming clients at better average global latencies. This can be done today manually by editing the map, but is cumbersome and difficult to get right without a ton of latency/load analysis and manual research.
At the core of this problem within gdnsd's code is that the datacenter mappings per-client-location are a fixed failover list which is independent of choices made for other locations. An idea for a smarter, alternative structuring and behavior for the auto-mapping-mode which I would call "dynamic weighted mapping" (?) would look something like this:
- Add configuration parameters to set a weight value for each datacenter in arbitrarily-scaled units. This is meant to indicate the relative client capacity of each datacenter in approximate terms.
- Add a data source for the approximate client weight of each client location (e.g. approximate user count per country). I'm not sure how this would look or where we'd source it from. Perhaps some annual internet population report would serve as a reasonable default source? But many services (ourselves included) won't have actual client weights that map 1:1 with internet population. Possibly we could measure this more directly by having app-layer code output summaries based on geoip of the clients? This doesn't need to be realtime; it's the kind of thing you could get away with updating once a month or so.
- Rather than calculating a full failover ordering per-client-location, calculate a single-depth (no failover) map of the world based on the combination of geographic distance and weighted load. In other words, picture gdnsd's algorithm splitting the planet into bubbles around each datacenter, and dynamically weighting the relative bubble sizes to get the shortest average distances it can without exceeding the load weights. Put another way, each bubble's area would be proportional to its datacenter's weight if the bubbles were drawn on a flat map whose coordinate system was distorted to represent the per-client-location user weight, if that makes any sense.
- On datacenter failure (admin_state or detected, depending on config), recalculate the mapping on the fly and reapply it. This could take multiple seconds for a complex map, but that is quick enough in practice for such an event, and is far more reasonable and efficient than pre-calculating all possible datacenter availability scenarios for large datacenter counts. The net result is that the failed datacenter's bubble vanishes and the adjacent bubbles from the remaining datacenters expand into its territory proportionately. Those adjacent bubbles would probably also shrink on their other edges, giving up room to datacenters that were not adjacent to the failure, so that the load is re-leveled everywhere.
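
To make the steps above a bit more concrete, here is a rough Python sketch of the idea. It is not gdnsd code and not a proposed implementation: the function names, the data layout, and the greedy nearest-first heuristic are all assumptions made for illustration, and a real solver would likely want something closer to an optimal capacitated assignment. Each datacenter gets a share of the total client weight proportional to its configured weight, and locations are handed out closest-pair-first until a datacenter's share is used up.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def compute_map(datacenters, locations, down=()):
    """Return {location: datacenter}: a single-depth map with no failover list.

    datacenters: {name: ((lat, lon), weight)}         (hypothetical layout)
    locations:   {name: ((lat, lon), client_weight)}  (hypothetical layout)
    down:        names of datacenters that are failed or administratively down
    """
    up = {n: v for n, v in datacenters.items() if n not in down}
    if not up:
        raise ValueError("no datacenters available")

    total_clients = sum(w for _, w in locations.values())
    total_dc_weight = sum(w for _, w in up.values())
    # Each live datacenter's share of the client weight, proportional to its weight.
    remaining = {n: total_clients * w / total_dc_weight for n, (_, w) in up.items()}

    # Every (location, datacenter) pair, closest first.
    pairs = sorted(
        (haversine_km(lpos, dpos), loc, dc)
        for loc, (lpos, _) in locations.items()
        for dc, (dpos, _) in up.items()
    )

    result = {}
    for _, loc, dc in pairs:
        lw = locations[loc][1]
        if loc not in result and remaining[dc] >= lw:
            result[loc] = dc
            remaining[dc] -= lw

    # A location too heavy to fit any remaining share goes to the datacenter
    # with the most room left (this can exceed that datacenter's share).
    for loc in locations:
        if loc not in result:
            dc = max(remaining, key=remaining.get)
            result[loc] = dc
            remaining[dc] -= locations[loc][1]
    return result

# Made-up example data: two datacenters and four locations on the equator.
dcs = {"A": ((0.0, 0.0), 1), "B": ((0.0, 90.0), 3)}
locs = {"L1": ((0.0, 10.0), 1), "L2": ((0.0, 20.0), 1),
        "L3": ((0.0, 80.0), 1), "L4": ((0.0, 70.0), 1)}
normal = compute_map(dcs, locs)
after_failure = compute_map(dcs, locs, down=("A",))
```

On failure, the whole map is simply recomputed with the failed datacenter excluded, so the remaining shares are rescaled and locations can move between surviving datacenters rather than only away from the failed one.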
As we get into these more complex scenarios (though this is true even today), it would also be helpful to have gdnsd able to dump some JSON output about current global mappings, which external scripts could combine with other data to generate SVG visualizations of current and alternative scenarios.
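
As a sketch of what that might look like, the Python below serializes a location-to-datacenter map as JSON and has a separate script step turn it into a very crude SVG. The JSON layout, the field names, and both function names are invented here for illustration; gdnsd has no such output today, and a real format would need to be designed around gdnsd's own map structures.

```python
import html
import json
import math

def dump_mapping(mapping, datacenters, locations):
    """Serialize a {location: datacenter} map into a hypothetical JSON layout."""
    doc = {
        "datacenters": {
            n: {"lat": p[0], "lon": p[1], "weight": w}
            for n, (p, w) in datacenters.items()
        },
        "locations": {
            n: {"lat": p[0], "lon": p[1], "weight": w, "datacenter": mapping[n]}
            for n, (p, w) in locations.items()
        },
    }
    return json.dumps(doc, indent=2, sort_keys=True)

def render_svg(doc_json, colors):
    """Draw each location as a circle on a 360x180 equirectangular canvas,
    colored by its assigned datacenter and sized by its client weight."""
    doc = json.loads(doc_json)
    parts = ['<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 360 180">']
    for name, loc in sorted(doc["locations"].items()):
        x = loc["lon"] + 180          # longitude -180..180 -> 0..360
        y = 90 - loc["lat"]           # latitude 90..-90 -> 0..180
        r = 1 + math.sqrt(loc["weight"])
        fill = colors[loc["datacenter"]]
        parts.append(
            f'<circle cx="{x}" cy="{y}" r="{r}" fill="{fill}">'
            f"<title>{html.escape(name)}</title></circle>"
        )
    parts.append("</svg>")
    return "\n".join(parts)

# Made-up example data, in the same shape as the JSON fields above.
example_dcs = {"A": ((0.0, 0.0), 1), "B": ((0.0, 90.0), 3)}
example_locs = {"L1": ((0.0, 10.0), 1), "L2": ((0.0, 80.0), 4)}
example_json = dump_mapping({"L1": "A", "L2": "B"}, example_dcs, example_locs)
example_svg = render_svg(example_json, {"A": "red", "B": "blue"})
```

Keeping the dump as plain data and doing the drawing externally means the same JSON could be rendered for the live map and for a "what if this datacenter fails" scenario and compared side by side.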