
Offer AuthDNS service over IPv6
Open, Stalled, Low, Public

Description

We've never yet offered IPv6-native authdns, for various historical reasons of variable validity.

I think at this point many of the blockers are behind us: IPv6 on the Internet is considerably more-mature now, an increasing percentage of client traffic is really IPv6, our GeoIP databases for IPv6 seem to be of reasonable quality (and we're also using them to route clients anyways, in cases where IPv4 recursors send us IPv6 edns-client-subnet), etc.

It's still not a quick and easy step and not without risk, but it's within reasonable reach.

We're also working on other AuthDNS improvements concurrently though, and I think it makes sense to get through some of those other transitions first. Chiefly, I think we should transition to our Anycasted IPv4 model first (T98006), and only after that look at adding anycast IPv6 addresses as well. That makes for less churn/noise in changes to our upstream NS sets with registrars (we have hundreds of domains to affect), and fewer concurrent experiments in this space.

Details

Reference
rt3772

Event Timeline

rtimport raised the priority of this task from to Medium. Dec 18 2014, 1:24 AM
rtimport added a project: ops-core.
rtimport set Reference to rt3772.

On Oct 22, 2012, at 8:59 PM, "Andre Klapper via RT" <core-ops at rt>
That's because we do GSLB / geo load balancing based on IPv4-only geoip data. (No, it's not MaxMind). Until that can change and work well enough with IPv6 too, we can't put IPv6 addresses on our NS records and have our DNS auth servers answer on IPv6, unfortunately.
--
Mark Bergsma <mark at wikimedia>
Lead Operations Architect
Wikimedia Foundation

Status changed from 'new' to 'open' by RT_System

Our NS now do respond to 2620:0:861:ed1a::e, 2620:0:860:ed1a::e, 2620:0:862:ed1a::e respectively, but I'm a bit reluctant to put those AAAAs in our NS (I already did for an hour or two, but backed it out soon after that).
The reason is, I'm afraid the quality of GeoIPv6 data might be subpar compared to IPv4 and we might direct people to the wrong DC, worsening the users' experience. This is in combination with how DNS works: it only needs an IPv6-enabled resolver to direct a lot of IPv4-only users to a different continent.
I'll try to get more data on this and perhaps do a switch. I'll keep this open until then.

We still have no news from MaxMind on IPv6 database availability, marking as stalled.

Status changed from 'open' to 'stalled' by faidon

faidon lowered the priority of this task from Medium to Low. Dec 18 2014, 5:27 PM
faidon updated the task description.
faidon changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)".
faidon changed the edit policy from "WMF-NDA (Project)" to "All Users".
faidon set Security to None.
faidon added subscribers: Unknown Object (MLST), Jasper, faidon and 3 others.

GeoIP2 has city resolution for IPv6 now (still of unknown quality). @BBlack has coded support for it in gdnsd that will land in the next release of gdnsd. I've done some work towards integrating the new libraries & geoip code into our infrastructure (packages & puppet code). This is finally, slowly, progressing!

We've now switched to both gdnsd 2.2.0 and GeoIP2, which comes with non-lite City IPv6 support. Next steps are evaluating somehow whether that support is sufficient and then, if so, going ahead and adding AAAAs to our zones and upstream glue records.

What's our evaluation plan here? Do we want to stall on proper IPv6 support in our VCL geoip lookup service first and do comparisons on that data? Or do some kind of direct survey of the two datasets? Or ask MaxMind how they think the relative quality fares?

Even if the V6 data is comparably-good for the V6 internet, we potentially face the additional issue that V6 DNS lookups may route differently than matching V4 user traffic. The scenario would be something like this:

  1. The real client is V4-only (perhaps because their DSL router/modem combo is V4 only because it's an older model). They use the default DNS servers from their ISP (over V4 for client->cache).
  2. Their ISP supports V6 to some degree, and will preferentially send lookups over IPv6 to us from their caches.
  3. Their ISP doesn't support edns-client-subnet (only about 1/3 of our requests have it, so it's not yet common).
  4. Their ISP has significantly different routing to us over IPv6 than IPv4: perhaps they tunnel all their global IPv6 traffic through an exit point in Los Angeles and all their V6 is marked there in MaxMind, but the user is in NYC and v4 would route locally there. This causes their DNS cache request over IPv6 to choose ulsfo for this east-coast user, whereas without authdns AAAA we would've picked the more-appropriate eqiad for them.
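
A minimal sketch of the scenario above, assuming the authdns simply picks the data center nearest to whatever location it can see: the edns-client-subnet location if the resolver sends one, otherwise the geolocation of the resolver's source address. The coordinates and the `pick_dc` helper are illustrative assumptions, not the real gdnsd configuration.

```python
from math import asin, cos, radians, sin, sqrt

# Rough, illustrative coordinates only (assumption, not authoritative data).
DATACENTERS = {
    "eqiad": (39.0, -77.5),   # Ashburn, Virginia area
    "ulsfo": (37.8, -122.4),  # San Francisco area
}


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def nearest_dc(location):
    """Return the name of the data center closest to the given location."""
    return min(DATACENTERS, key=lambda dc: haversine_km(location, DATACENTERS[dc]))


def pick_dc(resolver_location, ecs_location=None):
    """Prefer the edns-client-subnet location; fall back to the resolver's."""
    return nearest_dc(ecs_location if ecs_location is not None else resolver_location)


NYC = (40.7, -74.0)
LOS_ANGELES = (34.05, -118.25)

# Resolver queries over IPv4 and geolocates near the user.
print(pick_dc(NYC))                             # eqiad
# Same resolver queries over IPv6, which MaxMind places at the LA exit point.
print(pick_dc(LOS_ANGELES))                     # ulsfo
# With edns-client-subnet the user's own location wins again.
print(pick_dc(LOS_ANGELES, ecs_location=NYC))   # eqiad
```

The middle case is the problematic one: nothing about the user changed, only the address family the ISP's cache used to reach us.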

I'm honestly not worried all that much about tunnels anymore. In my experience, they're very rare nowadays and especially in this cross-country fashion (Google's 6to4 & Teredo statistics seem to concur).

I don't have any great ideas on how to compare the MaxMind data. Last time I looked up a bunch of RIPE Atlas nodes, since RIPE lists both the IPv4/IPv6 address for each, and found quite a few differences, most of which were of the limited accuracy type (e.g. correctly locating the country but not the city). That said, the Atlas dataset isn't especially great, as it contains a lot of probes located within datacenters and weird address spaces — not exactly an unbiased end-user sample. Perhaps a good approximation of a nameserver address sample, although it's hard to know for sure.
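
A rough sketch of how such a dual-stack comparison could be tallied, assuming we already have one geolocation result per address family for each host. The `Geo` record, the category names, and the sample data are all hypothetical; the actual lookups (for example against a GeoIP2 City database) are left out.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Geo(NamedTuple):
    """Hypothetical, reduced geolocation result for one address."""
    country: Optional[str]
    city: Optional[str]


def classify(v4: Geo, v6: Geo) -> str:
    """Bucket how well the IPv4 and IPv6 results for one host agree."""
    if v4.country is None or v6.country is None:
        return "missing"
    if v4.country != v6.country:
        return "country_mismatch"
    if v4.city is None or v6.city is None or v4.city != v6.city:
        # Country agrees, but the city differs or is absent on one side.
        return "country_only"
    return "match"


def summarize(pairs):
    """Count agreement categories over (v4_result, v6_result) pairs."""
    return Counter(classify(v4, v6) for v4, v6 in pairs)


# Made-up sample data, for illustration only.
sample = [
    (Geo("NL", "Amsterdam"), Geo("NL", "Amsterdam")),
    (Geo("NL", "Amsterdam"), Geo("NL", None)),
    (Geo("DE", "Berlin"), Geo("US", "Los Angeles")),
]
print(summarize(sample))
```

The "country_only" bucket corresponds to the limited-accuracy differences described above; "country_mismatch" is the case that could send users to the wrong DC.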

How do you envision testing this with the VCL GeoIP service? I think we have the same kind of concerns for that one too, unless you have thought of a good idea to test both? I suppose we could create a more controlled (and convoluted) experiment where we asynchronously load resources over separate IPv4-only and IPv6-only hostnames with a unique token for both, to check for parity… but still, we wouldn't know which of the two is the right one.
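
To make the token idea concrete, here is a minimal sketch of the join step, assuming each hit is logged as a `(token, client_ip, region)` tuple where `region` is whatever the geo lookup returned for that address. The record shape and function names are assumptions for illustration. As noted, it only measures parity, not which side is correct.

```python
import ipaddress
from collections import defaultdict


def pair_by_token(hits):
    """Join IPv4 and IPv6 hits that share a token.

    hits: iterable of (token, client_ip, region) tuples (hypothetical format).
    Returns {token: (region_via_v4, region_via_v6)} for tokens seen on both.
    """
    by_token = defaultdict(dict)
    for token, ip, region in hits:
        family = ipaddress.ip_address(ip).version  # 4 or 6
        by_token[token][family] = region
    return {t: (f[4], f[6]) for t, f in by_token.items() if 4 in f and 6 in f}


def parity_rate(paired):
    """Fraction of paired tokens whose two lookups agree, or None if empty."""
    if not paired:
        return None
    agree = sum(1 for v4, v6 in paired.values() if v4 == v6)
    return agree / len(paired)


# Documentation-range addresses and made-up regions, for illustration only.
hits = [
    ("tok1", "192.0.2.10", "US-NY"),
    ("tok1", "2001:db8::10", "US-CA"),
    ("tok2", "192.0.2.20", "DE-BE"),
    ("tok2", "2001:db8::20", "DE-BE"),
    ("tok3", "192.0.2.30", "FR-75"),  # never loaded the IPv6-only resource
]
paired = pair_by_token(hits)
print(paired)
print(parity_rate(paired))
```

Tokens seen on only one address family (IPv4-only clients, mostly) drop out of the comparison rather than counting as a mismatch.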

Perhaps we should just try it and look at our performance metrics from a 10,000 ft view (page load time etc.)? Thoughts? Any other clever ideas?

For the VCL stuff, what I meant is that for IPv6 user traffic, we could compare the runtime lookup we do for Set-Cookie on the IPv6 address to the one done via IPv4-only geoiplookup-lb. Would require some JS support to tie the two together.

Moving forward and checking perf metrics after is an option, too. But unless the change is quite dramatic it will be hard to see it. Rolling forward and back on that wouldn't be quick with registrar involvement. On top of that there's a fairly long TTL smear before every resolver that can switch to IPv6 queries actually does. And during all of this there will of course be continuing deployments on various levels that affect perf as well.

[edit: removed wrong stuff about glue]

BBlack renamed this task from No IPv6 addresses on Wikimedia nameservers ns(0-2).wikimedia.org to Offer AuthDNS service over IPv6. Dec 6 2019, 2:20 PM
BBlack removed faidon as the assignee of this task.
BBlack updated the task description.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

What's the status on this fix? Currently, an IPv6-only client using an IPv6-only DNS resolver will fail to reach Wikimedia services. If their DNS resolver is capable of using a NAT64 translator, that might be a reasonable workaround (as suggested by https://datatracker.ietf.org/doc/draft-momoka-v6ops-ipv6-only-resolver/), but a workaround shouldn't be necessary for foundational websites like Wikipedia.

ssingh added subscribers: cmooney, ssingh.

In discussion with @cmooney, we will be revisiting this task again when Traffic does some other authdns-related work, so removing it from the Traffic-Icebox.

I note that a current draft in the IETF DNSOPS Working Group, aimed to replace RFC3901, draft-momoka-dnsop-3901bis-03 states:

Every authoritative DNS zone SHOULD be served by at least one IPv6-reachable authoritative name server to maintain name space continuity. The delegation configuration (Resolution of the parent, resolution of out-of-bailiwick names, GLUE) MUST not rely on IPv4 connectivity being available.
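
As a rough illustration of the first requirement, the check below takes a mapping of nameserver names to their published addresses and reports which ones an IPv6-only resolver could contact. The function name, hostnames, and addresses are hypothetical (documentation ranges), and the sketch only inspects address families; it does not verify that the servers actually answer.

```python
import ipaddress


def ipv6_reachable_nameservers(ns_addresses):
    """Return the sorted names of nameservers that publish an IPv6 address.

    ns_addresses: {nameserver_name: [address strings]} (hypothetical input,
    e.g. collected from the zone's NS, A, and AAAA records).
    """
    return sorted(
        name
        for name, addrs in ns_addresses.items()
        if any(ipaddress.ip_address(a).version == 6 for a in addrs)
    )


# IPv4-only delegation: an IPv6-only resolver has no server it can query.
v4_only = {
    "ns0.example.org": ["192.0.2.1"],
    "ns1.example.org": ["198.51.100.1"],
}
print(ipv6_reachable_nameservers(v4_only))     # []

# One dual-stack nameserver is enough to satisfy "at least one".
dual_stack = {
    "ns0.example.org": ["192.0.2.1", "2001:db8::1"],
    "ns1.example.org": ["198.51.100.1"],
}
print(ipv6_reachable_nameservers(dual_stack))  # ['ns0.example.org']
```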