
implement better failure-scenario geoip mapping in gdnsd
Closed, Duplicate · Public

Description

Currently, gdnsd's datacenter-level geoip failover is limited in how it conceives of failure scenarios. Regardless of whether we're using automatic or manual mapping of datacenter priority per-client-location, only the clients assigned to the failed datacenter get remapped to other datacenters. However, an optimal solution would often also involve shifting some load around between the remaining datacenters, to make room for the incoming clients at better average global latencies. This can be done today by manually editing the map, but it's cumbersome and difficult to get right without a ton of latency/load analysis and manual research.

At the core of this problem within gdnsd's code is the fact that the datacenter mappings per-client-location are a fixed failover list, independent of the choices made for other locations. A smarter alternative structure and behavior for the auto-mapping mode, which I'd tentatively call "dynamic weighted mapping", would look something like this:

  1. Add configuration parameters to set a weight value for each datacenter in arbitrarily-scaled units. This is meant to indicate the relative client capacity of each datacenter in approximate terms.
  2. Add a data source for approximate client weight per client location (e.g. approximate user count per country). I'm not sure how this would look or where we'd source it from. Perhaps some annual internet population report would serve as a reasonable default source? But many services (ourselves included) won't have actual client weights that map 1:1 with internet population. Possibly we could monitor this more directly and have app-layer code output average summaries based on geoip of the clients. This doesn't need to be anywhere near realtime; it's the kind of thing you could get away with updating once a month or so.
  3. Rather than calculating a full failover ordering per-client-location, calculate a single-depth (no failover) map of the world based on the combination of geographic distance and weighted load (a small sketch of the weight-to-share arithmetic follows this list). In other words, picture gdnsd's algorithm splitting the planet into bubbles around each datacenter, and dynamically weighting the relative bubble sizes to get the shortest average distances it can without exceeding the load weights. Put differently, the bubbles' radii would be proportional to datacenter weight, if they were drawn on a flat map whose coordinate system was distorted to represent the per-client-location user weight, if that makes any sense.
  4. On datacenter failure (admin_state or detected, depending on config), recalculate the mapping on the fly and reapply it. This could take multiple seconds for a complex map, but that's quick enough in practice for such an event, and far more reasonable and efficient than pre-calculating all possible datacenter availability scenarios for large datacenter counts. The net result is that the failed datacenter's bubble vanishes, the adjacent bubbles from remaining datacenters expand into its territory proportionately, and those bubbles probably also shrink on their other edges, giving up territory to datacenters that weren't adjacent to the failure, so that load re-levels everywhere.
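
As a rough illustration of items 1-3, here's a minimal sketch (Python, purely illustrative: the datacenter names and all numbers are made-up placeholders, and this is not an existing or proposed gdnsd config format) of how datacenter weights would translate into a target share of global client load:

```
# Purely illustrative: hypothetical weights, not an existing gdnsd config format.
# Datacenter capacity weights (item 1) -- arbitrary units, only the ratios matter.
dc_weights = {"dc1": 40, "dc2": 25, "dc3": 20, "dc4": 15}

# Approximate client weight per client location (item 2), e.g. users per country.
client_weights = {"US": 280, "IN": 460, "DE": 70, "BR": 110, "JP": 100}

total_clients = sum(client_weights.values())
total_weight = sum(dc_weights.values())

# Each datacenter's target share of global clients, proportional to its weight
# (item 3's load constraint; the geographic "bubble" part is sketched later).
target_load = {dc: total_clients * w / total_weight for dc, w in dc_weights.items()}
print(target_load)   # e.g. dc1's target is 40% of the 1020 total client weight, i.e. 408
```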

As we get into these more complex scenarios (but it's true even today), it would also be helpful to have gdnsd able to dump some JSON output about the current global mappings, which external scripts could combine with other data to generate SVG visualizations of current and alternative scenarios (see the sketch below).
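
No such JSON output exists in gdnsd today, so the shape below is entirely invented; but as a sketch, an external script consuming a hypothetical mapping dump might look like:

```
# Hypothetical consumer of a (not-yet-existing) JSON mapping dump; the file
# name and field names below are invented for illustration only.
import collections
import json

with open("gdnsd-map-dump.json") as f:
    dump = json.load(f)   # e.g. {"mappings": [{"location": "FR", "dc": "dc2", "weight": 60}, ...]}

load_per_dc = collections.Counter()
for m in dump["mappings"]:
    load_per_dc[m["dc"]] += m.get("weight", 1)

print(load_per_dc)   # per-DC summary an external tool could turn into an SVG map
```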

Event Timeline

BBlack created this task. · Apr 1 2015, 2:38 PM
BBlack claimed this task.
BBlack raised the priority of this task to Needs Triage.
BBlack updated the task description.
BBlack added a project: acl*sre-team.
BBlack added a subscriber: BBlack.
Restricted Application added a subscriber: Aklapper. · Apr 1 2015, 2:38 PM
BBlack set Security to None. · Apr 1 2015, 2:38 PM
BBlack added a subscriber: faidon.
BBlack updated the task description. · Apr 1 2015, 2:41 PM
BBlack updated the task description.
BBlack updated the task description. · Apr 1 2015, 2:44 PM
BBlack triaged this task as Normal priority. · Apr 1 2015, 2:52 PM
Krinkle added a subscriber: Krinkle. · Apr 2 2015, 7:37 AM
BBlack added a comment. (Edited) · Jul 7 2015, 2:08 PM

To write down some more-recent thoughts about this subject:

  • Divide the globe into a reasonable grid, let's say on 1/10th-degree boundaries (a 3600x1800 map). Grid granularity could perhaps have a few configurable options. For the rest of this discussion, it's easier to think of the grid as a linear 2-dimensional map of a flat space, but obviously a real implementation would have to account for coordinates on a globe (spherical/great-circle rather than flat Cartesian) when working out distances and spirals, etc.
  • Population input to gdnsd would be in the form of a file which has 3600x1800 cells, each filled with a number. That number represents a load/population estimate per square. You could easily imagine something hooking into our traffic hit stats, doing a geoip2 lookup of the client IP of each hit -> lat/lon, and simply maintaining a hit-count per grid square. Maybe it outputs grid files with sums once an hour on each frontend traffic machine (in case of failure and such), and some process aggregates them into a summed global grid once a week for actual export to gdnsd (to avoid reacting to small patterns with weekday-vs-weekend and such - we want longer-term averages here).
  • On the explicit config side, gdnsd has the coordinates and weight values of each datacenter. Weight value is in terms of approximate traffic-handling capacity. As with the grid population, it's in unitless values that only matter relative to each other.
  • On startup or a DC up/down event, a new runtime map of "which grid squares of clients route to which DCs" must be calculated, hopefully relatively quickly, so we need a reasonably efficient algorithm that gets it approximately right. The idea is as follows (again, thinking in 2-d flat-map terms, but the math and method are possible on a real globe as well; a rough code sketch follows the list below):
    1. Sum all the weights on the grid into a total population for the globe.
    2. Translate DC weights to DC population counts: if DC1 has 1/35th of the summed datacenter weights, it gets 1/35th of the total global population.
    3. Start with a new blank grid map. Step zero is each DC claims the grid square it occupies (corner cases to disallow, warn, or handle: 2 DCs in the same grid square? DC does not have sufficient weight to handle the population of its own grid square?)
    4. From there, each DC claims nearby squares in an outward-spiral pattern from the square it occupies, until it runs out of population count allowed by its weight. Each step of the algorithm iterates the list of DCs and lets each claim one new square along its spiral path if it still can, skipping over any square already claimed by another. A DC is done when it can complete a circle without finding an unclaimed square, or when it runs out of population capacity. At the end, every square should be claimed (setting aside some pragmatic nits about leftover capacity when a DC can't quite claim one last square that's too big, etc.), and the map should give clients the geographically-closest datacenter it can within capacity-weight limits.
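
A rough sketch of the above in Python, just to pin the idea down: the lat/lon-to-grid mapping plus the round-robin spiral claim. None of this is existing gdnsd code; all names are made up, the geometry is flat-grid only, and the spherical-distance and leftover-capacity corner cases already noted are ignored.

```
# Illustrative sketch only -- not gdnsd code. Flat-grid geometry; spherical
# corrections and leftover-capacity corner cases noted above are ignored.

GRID_W, GRID_H = 3600, 1800          # 1/10th-degree grid (longitude x latitude)

def latlon_to_cell(lat, lon):
    """Map a latitude/longitude to an (x, y) cell on the 3600x1800 grid."""
    x = int((lon + 180.0) * 10) % GRID_W
    y = min(int((lat + 90.0) * 10), GRID_H - 1)
    return x, y

def spiral(cx, cy):
    """Yield cells in an outward square spiral from (cx, cy), nearest rings first."""
    yield cx % GRID_W, cy
    r = 1
    while r < max(GRID_W, GRID_H):
        for x in range(cx - r, cx + r + 1):          # top and bottom edges of the ring
            for y in (cy - r, cy + r):
                yield x % GRID_W, y
        for y in range(cy - r + 1, cy + r):          # left and right edges of the ring
            for x in (cx - r, cx + r):
                yield x % GRID_W, y
        r += 1

def build_map(population, datacenters):
    """
    population:  {(x, y): client weight} for inhabited grid cells (the grid input).
    datacenters: {name: {"cell": (x, y), "weight": w}} (the explicit config side).
    Returns {(x, y): dc_name}: each cell assigned to the nearest DC that still
    has capacity, where capacity is the DC's weight-proportional population share.
    """
    total_pop = sum(population.values())
    total_weight = sum(dc["weight"] for dc in datacenters.values())
    # Step 2: translate DC weights into population quotas.
    remaining = {name: total_pop * dc["weight"] / total_weight
                 for name, dc in datacenters.items()}

    claimed = {}
    spirals = {name: spiral(*dc["cell"]) for name, dc in datacenters.items()}
    done = set()

    # Steps 3-4: round-robin passes; each DC claims one unclaimed cell per pass
    # along its spiral, until it exhausts either its quota or its spiral.
    while len(done) < len(datacenters):
        for name in datacenters:
            if name in done:
                continue
            for cell in spirals[name]:
                if cell in claimed or cell not in population:
                    continue                          # skip taken or empty cells
                if population[cell] > remaining[name]:
                    done.add(name)                    # out of capacity
                    break
                claimed[cell] = name
                remaining[name] -= population[cell]
                break                                 # claimed one cell this pass
            else:
                done.add(name)                        # spiral exhausted
    return claimed
```
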
Restricted Application added a subscriber: Matanya. · Jul 7 2015, 2:08 PM
BBlack added a comment. · Jul 7 2015, 2:22 PM

I should have added above: all of that implies that the runtime per-request lookup process involves client-geoip -> coordinates -> gridmap-square -> DC (a tiny sketch follows). That may not jibe well with how we do things today, re-processing a geoip database into the largest supernets we can for edns-client-subnet. Perhaps once we look at this data in practice, practical workarounds that don't hurt that too much will be obvious (e.g. readjusting from the real population map to one that re-groups the populations approximately on subnet boundaries, or basically still translating to a subnet tree and treating subnets as covering the squares where they tend to exist, etc). That process is allowed to be a little more expensive, since it can be cached once per population update rather than recalculated on a runtime datacenter-down event.
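
For concreteness, a tiny sketch of that lookup chain, reusing latlon_to_cell() and the grid map from the earlier sketch; the geoip2 database path is just an assumption, and none of this reflects how gdnsd actually resolves edns-client-subnet today:

```
# Hypothetical lookup chain: client IP -> geoip coords -> grid cell -> DC.
# Uses latlon_to_cell()/build_map() from the sketch above; the .mmdb path is assumed.
import geoip2.database

reader = geoip2.database.Reader("GeoLite2-City.mmdb")

def dc_for_client(ip, gridmap):
    loc = reader.city(ip).location
    if loc.latitude is None or loc.longitude is None:
        return None                      # no usable location for this client
    return gridmap.get(latlon_to_cell(loc.latitude, loc.longitude))
```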

elukey added a subscriber: elukey. · Apr 16 2016, 6:53 AM
ema added a subscriber: ema. · May 23 2016, 8:29 PM
BBlack moved this task from Triage to DNS Infra on the Traffic board. · Sep 30 2016, 2:12 PM
ayounsi added a subscriber: ayounsi. · Nov 8 2017, 4:52 PM