Fix ipv6 autoconf issues
Closed, ResolvedPublic

Description

Currently most hosts have both an explicit, static IPv6 address used in DNS and an autoconfigured one based on router advertisements and the local MAC address. Occasionally, depending on the phase of the moon, a host's address configuration ends up in a state where the autoconfigured address takes precedence for outbound traffic. This has caused us all sorts of headaches in the past with e.g. firewall rules and other things that rely on correct IPv6 addressing or correct v6 reverse lookups.

With Interdatacenter-IPsec, the issue becomes more critical, as there wouldn't be any configured IKE security association for the macaddr-based autoconfigured addresses, causing traffic to be sent unencrypted.

Event Timeline

BBlack claimed this task.
BBlack raised the priority of this task to High.
BBlack updated the task description.
BBlack subscribed.

https://gerrit.wikimedia.org/r/#/c/200592/
^ (guess I need to remember to use correct task-ref syntax in PS1 every time)

Change 202725 had a related patch set uploaded (by BBlack):
add_ip6_mapped: enable token-based SLAAC for all jessie/trusty

https://gerrit.wikimedia.org/r/202725

BBlack added a subscriber: faidon.

So, this issue is really complicated when you get into the details. @faidon and I have had several IRC brainstorming conversations about this over the past months that went on for pages and pages, and we have never quite come to agreement about a reasonable solution. I'm going to try to recap here the knowledge we have about the current situation, the known possible solutions, and their various caveats and tradeoffs.

Current Situation:

Commonly, nodes are provisioned with fixed IPv6 addresses via interface::add_ip6_mapped, which is the first 30 lines or so of: https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/interface/manifests/add_ip6_mapped.pp (the rest is testing future directions relevant to later parts of this text). These fixed addresses have a mechanical 1:1 mapping with our IPv4 addressing, and are commonly also configured in our DNS zonefiles.
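As a rough illustration of that 1:1 mapping (a sketch only, not the puppet code itself): judging by the addresses shown in this task, the four decimal octets of the IPv4 address are reused verbatim as the four low hextets of the IPv6 address. The function name `mapped_ip6` is made up for this example, and the prefix is passed in by hand here.

```python
import ipaddress


def mapped_ip6(prefix: str, ipv4: str) -> ipaddress.IPv6Address:
    """Combine a /64 prefix with an IPv4 address, reusing the decimal
    octets as hextets (10.20.0.165 -> <prefix>:10:20:0:165).

    Hypothetical helper for illustration; the real mapping lives in
    interface::add_ip6_mapped.
    """
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("expected a /64 prefix")
    ipaddress.IPv4Address(ipv4)  # raises ValueError on malformed input
    # Decimal digits are also valid hex digits, so "10.20.0.165"
    # can be rewritten textually as the suffix "::10:20:0:165".
    suffix = ipaddress.IPv6Address("::" + ":".join(ipv4.split(".")))
    return ipaddress.IPv6Address(int(net.network_address) | int(suffix))


if __name__ == "__main__":
    print(mapped_ip6("2620:0:862:102::/64", "10.20.0.165"))
```

Note that the hextets only look like the IPv4 octets when written out; numerically `0x165` is not 165, which is fine for a mapping that is meant to be human-readable.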

We do not explicitly configure IPv6 default gateway addresses, as Linux picks that up for us from router advertisements. The router advertisements plus default Linux behavior also cause the creation of unmanaged, autoconfigured IPv6 addresses on these interfaces. The lower 64 bits of these addresses are derived from (but very slightly different than) the interface's link-level MAC address. Thus it is common for a host to look like this in practice:

cp3030 showing IPv4, mapped-IPv6, and autoconf IPv6:

root@cp3030:~# ip addr ls dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq portid 44a8420a1118 state UP group default qlen 1000
    link/ether 44:a8:42:0a:11:18 brd ff:ff:ff:ff:ff:ff
    inet 10.20.0.165/24 brd 10.20.0.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 2620:0:862:102:46a8:42ff:fe0a:1118/64 scope global mngtmpaddr dynamic
       valid_lft 2592000sec preferred_lft 604800sec
    inet6 2620:0:862:102:10:20:0:165/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::46a8:42ff:fe0a:1118/64 scope link
       valid_lft forever preferred_lft forever
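
For reference, the "derived from (but very slightly different than)" relationship between the MAC address and the autoconf address above is the standard modified EUI-64 construction: flip the universal/local bit of the first octet and insert ff:fe in the middle. A minimal sketch, with function names invented for this example:

```python
import ipaddress


def eui64_interface_id(mac: str) -> int:
    """Return the 64-bit modified EUI-64 interface identifier for a MAC."""
    octets = bytes(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    # Flip the universal/local bit (0x02) of the first octet and
    # insert ff:fe between the third and fourth octets.
    eui = bytes([octets[0] ^ 0x02]) + octets[1:3] + b"\xff\xfe" + octets[3:]
    return int.from_bytes(eui, "big")


def slaac_address(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """Combine an advertised /64 prefix with the MAC-derived identifier."""
    net = ipaddress.IPv6Network(prefix)
    return ipaddress.IPv6Address(int(net.network_address) | eui64_interface_id(mac))


if __name__ == "__main__":
    # cp3030's eth0 MAC from the output above
    print(slaac_address("2620:0:862:102::/64", "44:a8:42:0a:11:18"))
```

With cp3030's MAC, 0x44 becomes 0x46, which is why the autoconf and link-local addresses above both end in 46a8:42ff:fe0a:1118.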

IPv6 def gw from adverts:

root@cp3030:~# ip -6 route list exact default dev eth0
default via fe80::1  proto ra  metric 1024  expires 11sec hoplimit 64
default via fe80::fe00:0:0:2  proto ra  metric 1024  expires 10sec hoplimit 64
default via fe80::fe00:0:0:1  proto ra  metric 1024  expires 10sec hoplimit 64

For traffic inbound to the DNS-configured hostname of the machine in question (or the explicit addr the DNS resolves to), none of this is really an issue.

Problem Statement:

The problem comes in when the host selects a default source address for outbound IPv6 traffic. At that point a local algorithm picks one of the two valid addresses in 2620:0:862:102::/64 above: either the autoconf one or the explicit one. If it chooses the autoconf one, we're in trouble because that address is not likely to be configured in e.g. firewall tables, access lists, etc. It also doesn't have a reverse mapping in DNS, and we wouldn't want to map it into our configuration and/or DNS because, again, it's based on the hardware macaddr and thus subject to change if the physical machine or its network card is replaced. Therefore, choosing the autoconf addr as a default outbound source is always a Bad Thing.

If Linux never chose the autoconf address, we wouldn't care, but sometimes it does. This behavior seems to be non-deterministic (but stable for a stable interface config) and depends on timing races during reboot/configuration that affect the order of the two addresses in the address list. When the problem is observed (sourcing from the autoconf addr), we usually just delete and re-add one of the addresses to change the order around and things start working again, but this is very undesirable. We want the configuration to deterministically always send traffic from the explicitly-configured address alone, especially in a world where securing IPsec traffic based on the explicitly-configured address pairs matters.

Proposal 0 - disable address autoconf via sysctl naively:

Attempted at various points in the past. Seems to have race conditions when trying to get it working in sync with interface-up events, and definitely doesn't work out right in corner cases like LVS multi-interface, etc. Proposal 2 further down is a re-distillation of this idea into something that could actually work, on a per-interface basis, at least for physical interfaces.

Proposal 1 - Tokens:

This is my current frontrunner. The idea is that we use the ip token command to configure the interface's token for IPv6 autoconf. This works with all of the autoconf mechanisms in place. The selected token is the lower 64 bits that are used when creating the autoconf address. By setting them to the same value as the fixed address, the two addresses merge in the address table. This can be done in a pre-up before the interface is ever configured so that there are no races, and the truly-fixed address config can be left in place as well, avoiding a potential problem with race conditions on network up with a daemon trying to bind to the autoconf address (which may not exist for a full second or two in that case, while waiting on an advertisement to arrive).

This is the solution already laid out (just for testing on cp1008) at the bottom of the add_ip6_mapped manifest linked earlier (but the comment block there explaining its behavior is actually wrong, so ignore it).
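
To make the "two addresses merge" point concrete, here is a small sketch of the arithmetic, assuming the token is simply the low 64 bits of the mapped address as described above. The helper names are invented for this example; the only real command referenced is `ip token set <token> dev <dev>`, and the code just builds its argument list rather than running it.

```python
import ipaddress


def token_for(ipv4: str) -> str:
    """Token matching the mapped address, e.g. 10.64.0.170 -> ::10:64:0:170."""
    ipaddress.IPv4Address(ipv4)  # raises ValueError on malformed input
    return "::" + ":".join(ipv4.split("."))


def token_command(ipv4: str, dev: str) -> list:
    """Argument list for the ip(8) invocation a pre-up hook could run."""
    return ["ip", "token", "set", token_for(ipv4), "dev", dev]


def autoconf_address(ra_prefix: str, token: str) -> ipaddress.IPv6Address:
    """Address autoconf would form: advertised /64 prefix + token."""
    net = ipaddress.IPv6Network(ra_prefix)
    return ipaddress.IPv6Address(
        int(net.network_address) | int(ipaddress.IPv6Address(token))
    )


if __name__ == "__main__":
    print(" ".join(token_command("10.64.0.170", "eth0")))
    print(autoconf_address("2620:0:861:101::/64", token_for("10.64.0.170")))
```

Since prefix plus token comes out identical to the statically configured address, the autoconf entry and the fixed entry describe the same address rather than two competing ones.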

Pros:

  • Relatively simple to configure
  • Deterministic in all the ways that matter
  • Re-uses the default IPv6 autoconf behaviors, just directs them to do better things.

Cons:

  • Only works for jessie/trusty hosts; precise's kernel and/or ip tools lack ip token support
    • (I think this is ok - it fixes it for the newer hosts we care about, and the rest keep the status quo until the reinstalls they're all due for)
  • Relies on autoconf mechanisms; if these ever break at the routers for a week or more, traffic could be hosed
    • (but note, in all of these solutions we rely on adverts for the default gateway on much shorter timescales, and loss of autoconf could be monitored in icinga)

Proposal 2 - rdisc6 + disable accept_ra_pinfo:

This is fairly well documented in an ugly patch here: https://gerrit.wikimedia.org/r/#/c/203069/

Pros:

  • Disables Linux-level address autoconf at the host on a per-interface basis, uses rdisc6 to find prefix from the same advertisements manually, and only at configure-time, not at runtime.

Cons:

  • Much more complex and fragile, as indicated in the patch: there are all sorts of races to work around with regard to the sysctl settings, and the rdisc6 call and address math have to be wrapped in a local helper script.

Proposal 3 - Disable prefix at router:

We believe that the Juniper configuration can be made to still advertise the default gateway, but not advertise prefix information.

Pros:

  • No more autoconf addrs at all, problem evaporates

Cons:

  • Even today's explicit configuration relies on autoconf in order to initially obtain the prefix. We'd need to explicitly configure the network prefixes (in puppet, etc) for hosts, so that puppet can know (back at the master) and configure the entire address explicitly, which is something we don't even do for IPv4 today.
  • We lose any other secondary use of prefix adverts for other arbitrary hosts/devices. For example, we'd lose standard IPv6 autoconf for laptops plugged into the DC for debugging, any "devices" like PDUs, etc. Probably not a big deal, but it really highlights that we're going against the normal grain of how IPv6 networks should work.

Proposal 4 - DHCPv6

We could deploy a DHCPv6 service in place of autoconf, and set it up to explicitly hand out our mapped addresses based on DNS lookups. This hasn't been investigated deeply. It would probably be combined with disabling prefix adverts at the router, and may turn out complicated and hacky. Kind of an unknown at this point.

The token-based solution (Proposal 1) sounds good to me; it seems like the only barrier to adoption is making a policy decision to go with a proposal which doesn't support Precise, correct?

I tested this by hand and got the expected results:

gage@curium:~$ ip token get
token :: dev eth0
token :: dev eth1
token :: dev eth2
token :: dev eth3
gage@curium:~$ sudo ip token set ::10:64:0:170 dev eth0
gage@curium:~$ ip token get
token ::10:64:0:170 dev eth0
token :: dev eth1
token :: dev eth2
token :: dev eth3

ip addr show dev eth0, initial state:

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 84:2b:2b:fd:be:6d brd ff:ff:ff:ff:ff:ff
    inet 10.64.0.170/22 brd 10.64.3.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 2620:0:861:101:862b:2bff:fefd:be6d/64 scope global mngtmpaddr dynamic
       valid_lft 2591904sec preferred_lft 604704sec
    inet6 2620:0:861:101:10:64:0:170/64 scope global
       valid_lft 2592000sec preferred_lft 604800sec
    inet6 fe80::862b:2bff:fefd:be6d/64 scope link
       valid_lft forever preferred_lft forever

Manually triggered an address update by lowering the lifetimes:

gage@curium:~$ sudo ip addr change 2620:0:861:101:862b:2bff:fefd:be6d/64 dev eth0 valid_lft 10 preferred_lft 10

Afterward the fefd:be6d address disappeared, leaving only the token-based IPv6 global address (with appropriate lifetimes):

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 84:2b:2b:fd:be:6d brd ff:ff:ff:ff:ff:ff
    inet 10.64.0.170/22 brd 10.64.3.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 2620:0:861:101:10:64:0:170/64 scope global
       valid_lft 2592000sec preferred_lft 604800sec
    inet6 fe80::862b:2bff:fefd:be6d/64 scope link
       valid_lft forever preferred_lft forever

Result: with ipsec.conf set to auto=route mode in which connections are not attempted until traffic is sent, ping6 berkelium successfully triggers establishment of ESP transport for IPv6, where previously it failed due to using the MAC-based source address. This fixes the behavior described in the Problem Statement.
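
A check along these lines could be scripted to confirm the same end state on other hosts. This is only a heuristic sketch: it flags global addresses whose interface identifier has ff:fe in the middle, which is what MAC-derived addresses look like, and the function names are invented for this example. Feeding it real `ip addr` output is left out.

```python
import ipaddress


def looks_mac_derived(addr: str) -> bool:
    """Heuristic: True if the low 64 bits carry the ff:fe EUI-64 marker."""
    ip = ipaddress.IPv6Interface(addr).ip
    iid = int(ip) & ((1 << 64) - 1)
    # Bytes 3 and 4 of the interface identifier are ff:fe for EUI-64.
    return (iid >> 24) & 0xFFFF == 0xFFFE


def stray_autoconf(addrs):
    """Return non-link-local addresses that look MAC-derived."""
    return [
        a for a in addrs
        if looks_mac_derived(a) and not ipaddress.IPv6Interface(a).ip.is_link_local
    ]


if __name__ == "__main__":
    # curium's eth0 before the change, as shown above
    before = [
        "2620:0:861:101:862b:2bff:fefd:be6d/64",
        "2620:0:861:101:10:64:0:170/64",
        "fe80::862b:2bff:fefd:be6d/64",
    ]
    print(stray_autoconf(before))
```

The link-local address is deliberately ignored, since it stays MAC-based in both the before and after states.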

BBlack added a parent task: Restricted Task. May 6 2015, 7:23 PM

Change 202725 abandoned by BBlack:
add_ip6_mapped: enable token-based SLAAC for all jessie/trusty

https://gerrit.wikimedia.org/r/202725

The token approach was deployed for all jessie/trusty nodes with add_ip6_mapped.