
Investigate using RFC 7838 Alternate Services to better optimize edge connections
Open, Medium, Public

Description

Currently, we route users geographically to our various global edge nodes only at DNS lookup time. The DNS-based routing looks at RFC 7871 edns-client-subnet data if available, and otherwise falls back to locating the user based on the exit IP (facing Wikimedia Auth DNS) of the DNS cache they're using. Either of these IPs (especially the DNS cache IP) may reflect the client's location less accurately than the true HTTP-level client IP, which we only see once the client has connected.

RFC 7838 provides a mechanism by which one can re-target an origin after initial connection establishment. The basic idea behind using it in this way flows something like this:

  1. UA performs DNS lookup through a non-optimal DNS cache.
  2. DNS response gives non-optimal answer with an IP address from esams (Amsterdam).
  3. UA connects directly to esams cache edge.
  4. esams cache sees the true client IP (for the first time), and decides (based on same GeoIP routing we use for the DNS layer) that eqiad would actually be a better choice.
  5. esams cache sets an RFC 7838 Alt-Svc: header indicating the UA should hit eqiad directly instead (using an eqiad-specific hostname which doesn't have to match TLS SNI, e.g. text-lb.eqiad.wikimedia.org).
  6. Expected UA behavior is approximately that it will continue transferring data with esams while also establishing a new connection to eqiad. Once the eqiad connection is established, it will send all future requests directly to eqiad: the remainder of any initial burst of requests that hadn't yet been sent to esams, as well as requests from new navigations in the near future, up to a maximum age we can control and define.
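
As a rough illustration of steps 4 and 5, the header an edge site might emit could be built as below. This is only a sketch under assumptions: the function name, the `SITE_HOSTS` table, the esams hostname (written by analogy with the eqiad example above), the `h2` protocol ID, port 443, and the default max age are all illustrative and not taken from this task.

```python
from typing import Optional

# Hypothetical site -> hostname table. Only the eqiad hostname appears in the
# task text; the esams entry is an assumption made by analogy.
SITE_HOSTS = {
    "eqiad": "text-lb.eqiad.wikimedia.org",
    "esams": "text-lb.esams.wikimedia.org",
}


def alt_svc_header(local_site: str, preferred_site: str,
                   max_age: int = 3600, protocol: str = "h2",
                   port: int = 443) -> Optional[str]:
    """Return an Alt-Svc header value, or None if no redirection is needed.

    preferred_site is whatever the GeoIP routing lookup on the true client IP
    returned; local_site is the site that received the connection.
    """
    if preferred_site == local_site:
        # The client already landed at the best site: send nothing.
        return None
    host = SITE_HOSTS[preferred_site]
    # RFC 7838 syntax: protocol-id="host:port"; ma=<seconds>
    return f'{protocol}="{host}:{port}"; ma={max_age}'


# A client that reached esams but would be better served by eqiad:
print(alt_svc_header("esams", "eqiad", max_age=600))
```

The `ma` value here is a placeholder; choosing a sensible one is among the open questions listed further down.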

There are a bunch of details to sort out here, both in the implementation and in the interface to the UA.

Probably a good starting point would be to define an efficient way for the caches to even perform a GeoDNS-equivalent lookup on the true client IP. We could then start logging on our end how often this reveals mismatches, and investigate some of them to confirm that the technique looks useful. Later, we could try turning it on and observing for perf differentials.
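
The logging-only first phase could look something like the sketch below, which maps the true client IP to a site and counts how often that differs from the site the client actually reached. Everything here is assumed for illustration: the `ROUTES` table uses documentation-only prefixes, the names are hypothetical, and the linear scan stands in for whatever efficient lookup structure the caches would really use.

```python
import ipaddress
from collections import Counter

# Hypothetical routing data: documentation prefixes mapped to sites.
# The real GeoDNS routing data would replace this table.
ROUTES = [
    (ipaddress.ip_network("192.0.2.0/24"), "eqiad"),
    (ipaddress.ip_network("198.51.100.0/24"), "esams"),
    (ipaddress.ip_network("2001:db8::/32"), "eqiad"),
]
DEFAULT_SITE = "eqiad"  # assumed fallback when no prefix matches

# (site reached, site preferred) -> number of connections observed
mismatches = Counter()


def site_for_ip(client_ip: str) -> str:
    """Longest-prefix match of client_ip against ROUTES (linear scan)."""
    addr = ipaddress.ip_address(client_ip)
    best = None
    for net, site in ROUTES:
        if addr.version == net.version and addr in net:
            if best is None or net.prefixlen > best[0].prefixlen:
                best = (net, site)
    return best[1] if best else DEFAULT_SITE


def record_connection(local_site: str, client_ip: str) -> str:
    """Log-only mode: count a mismatch, but do not redirect anything."""
    preferred = site_for_ip(client_ip)
    if preferred != local_site:
        mismatches[(local_site, preferred)] += 1
    return preferred


record_connection("esams", "192.0.2.10")    # mismatch: prefers eqiad
record_connection("esams", "198.51.100.7")  # already at the preferred site
print(dict(mismatches))
```

Counting per (reached, preferred) pair keeps the output small enough to eyeball while still showing which site pairs are worth investigating by hand.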

Implementation details to think about:

  • Efficient pathway for HTTP caches to look up client IPs against the routing data we use for GeoDNS. Efficient enough for once-per-connection sorts of volumes, at least.
  • Selecting a decent ma (max age) parameter for the Alt-Svc: outputs, similar to our DNS TTLs? Shorter? Longer? Does any of the client logic baked into the RFC make longer ones acceptable because failover back through the true origin is specified?
  • How do we keep it efficiently refreshed? Ideally we'd re-send Alt-Svc from the correct site to refresh ma values with every response for a redirected client, but it would seem silly to send them constantly for the (probably most-common by far) case where no redirection was needed. Sending it only in the redirected case requires knowing which minority of a site's clients arrived there via redirection with this mechanism. The RFC says UAs SHOULD send an Alt-Used header which clarifies this, but it's not a requirement. This is one of many parts of this where we'll want to look at the behavior of popular real UAs.
  • Loop avoidance: if we're not careful about some of the low-level details, we could create situations where we loop/bounce a client rapidly between two sites with this mechanism, especially e.g. as a race-case with asynchronous updates of the GeoIP routing databases at the sites.
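
One possible way to combine the refresh and loop-avoidance points is sketched below. It is not a worked-out design: it assumes UAs actually send Alt-Used (which the RFC does not require), that header names have been lowercased by the caller, that hosts are DNS names rather than IPv6 literals, and it adopts a simple policy, chosen here only for illustration, of never redirecting a client that already arrived via Alt-Svc. The function and parameter names are hypothetical.

```python
from typing import Mapping, Optional


def alt_svc_decision(headers: Mapping[str, str], local_host: str,
                     preferred_host: str, max_age: int = 3600) -> Optional[str]:
    """Decide which Alt-Svc value (if any) to attach to a response.

    headers: request headers with lowercased names (an assumption).
    local_host: this site's own site-specific hostname.
    preferred_host: hostname of the site our GeoIP lookup prefers.
    """
    # Alt-Used carries "host" or "host:port"; compare the host part only.
    alt_used_host = headers.get("alt-used", "").split(":")[0].strip().lower()
    arrived_via_alt_svc = alt_used_host == local_host.lower()

    if arrived_via_alt_svc:
        # Redirected client: refresh its ma by re-advertising this site, and
        # never bounce it onward, even if our routing data now disagrees
        # with the site that sent it here.
        return f'h2="{local_host}:443"; ma={max_age}'
    if preferred_host != local_host:
        # Client came in through DNS routing and another site looks better.
        return f'h2="{preferred_host}:443"; ma={max_age}'
    # Common case: correct site, no redirection involved, send nothing.
    return None


eqiad = "text-lb.eqiad.wikimedia.org"
esams = "text-lb.esams.wikimedia.org"  # assumed hostname
# eqiad's routing data disagrees with esams, but the client was redirected
# here, so eqiad refreshes instead of bouncing it back.
print(alt_svc_decision({"alt-used": eqiad + ":443"}, eqiad, esams, 600))
```

With this policy a client is redirected at most once per DNS-routed arrival, at the cost of staying at a now-suboptimal site until its cached alternative expires. UAs that omit Alt-Used are treated like any new arrival, so they get neither the refresh nor the loop protection.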

This mechanism is also potentially useful for sending logged-in traffic more directly towards the core DCs. Especially in certain cases, going through our edge caches can be a net loss today when they get no cache benefits and possibly higher latency. Our classic use-case here is an editor in Australia using the Singapore cache, which adds slightly to their net latency without any caching benefits vs connecting directly to one of the core sites.

Event Timeline

BBlack triaged this task as Medium priority. Oct 29 2018, 5:14 PM
BBlack created this task.
Imarlier subscribed.

@BBlack We're moving this to our radar for now (on the assumption that we don't need to do the actual work), but would love to be involved in scheduling testing/evaluation when it's in a position to be tested.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!