
keepalived: it doesn't support mixing IPv4 and IPv6 VIPs on the same VRRP instance
Closed, Resolved, Public

Description

Apparently, because of how VRRP itself works, we cannot mix IPv4 and IPv6 VIPs on the same VRRP instance in keepalived.

For anything that needs IPv6 VIPs, we need to either:

  • refactor the keepalived puppet module to have explicit IPv6 support, or
  • move to a different VIP setup, such as BGP-based anycast VIPs
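To illustrate the first option, here is a minimal Python sketch of splitting a mixed VIP list by address family, so that each family can be given its own VRRP instance. The function name `split_vips_by_family` and the example addresses (taken from the documentation ranges) are assumptions for illustration only; this is not how the puppet module is implemented.

```python
import ipaddress


def split_vips_by_family(vips):
    """Group VIP strings (with or without a /prefix) by IP version."""
    groups = {4: [], 6: []}
    for vip in vips:
        # ip_interface accepts both "addr" and "addr/prefix" forms.
        iface = ipaddress.ip_interface(vip)
        groups[iface.version].append(vip)
    return groups


# Hypothetical VIPs, not the real cloudgw addresses.
example_vips = ["192.0.2.10/24", "2001:db8::10/64", "192.0.2.11/24"]

for version, members in split_vips_by_family(example_vips).items():
    if members:
        print(f"one VRRP instance for IPv{version} VIPs: {members}")
```

Each non-empty group would then map to a separate `vrrp_instance` block in the keepalived configuration.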

Event Timeline

Change #1079234 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: keepalived: support separate IPv6 VRRP instance

https://gerrit.wikimedia.org/r/1079234

Change #1079234 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: keepalived: support separate IPv6 VRRP instance

https://gerrit.wikimedia.org/r/1079234

aborrero triaged this task as Medium priority.
aborrero moved this task from Backlog to Doing on the User-aborrero board.

I have detected that there is no VRRP connectivity for the IPv6 addresses.

Also, I have noticed that the daemons are using an unexpected source IPv6 address to send the VRRP announcements:

10:47:07.649135 IP6 fe80::d28e:79ff:fef5:8644 > 2a02:ec80:a100:fe04::2002:1: VRRPv3, Advertisement, (ttl 64), vrid 52, prio 55, intvl 100cs, length 40
10:47:08.638012 IP6 fe80::2eea:7fff:fe7b:e104 > 2a02:ec80:a100:fe04::2003:1: VRRPv3, Advertisement, (ttl 64), vrid 52, prio 47, intvl 100cs, length 40
10:47:08.649258 IP6 fe80::d28e:79ff:fef5:8644 > 2a02:ec80:a100:fe04::2002:1: VRRPv3, Advertisement, (ttl 64), vrid 52, prio 55, intvl 100cs, length 40

Change #1079246 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: fix keepalived IPv6 setting

https://gerrit.wikimedia.org/r/1079246

Change #1079246 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: fix keepalived IPv6 setting

https://gerrit.wikimedia.org/r/1079246

Mentioned in SAL (#wikimedia-cloud) [2024-10-10T12:04:53Z] <arturo> manual network failover in cloudgw because maintenance related to T376879

Still not working. I saw this weird tcpdump capture on cloudgw2003-dev:

12:25:15.430552 vlan2107 Out IP6 2a02:ec80:a100:fe04::2003:1 > 2a02:ec80:a100:fe04::2002:1: VRRPv3, Advertisement, (ttl 64), vrid 52, prio 55, intvl 100cs, length 40
12:25:15.430555 eno1  Out IP6 version error: 8 != 6
12:25:15.430790 eno1  In  IP6 version error: 14 != 6
12:25:15.430792 vlan2107 In  IP6 2a02:ec80:a100:fe04::2002:1 > 2a02:ec80:a100:fe04::2003:1: ICMP6, parameter problem, next header - octet 6, length 88

One thing that might be messing you up is the "authentication" section in /etc/keepalived/keepalived.conf. AFAIK VRRPv3 doesn't support authentication, so those blocks should probably be removed.

I removed the auth section from the config, but it did not make any difference. The VRRP daemons are still unable to communicate, likely because of whatever is causing that parameter problem error.

There is also this warning in the logs:

Oct 10 13:07:05 cloudgw2002-dev Keepalived[3543666]: WARNING - keepalived was built for newer Linux 5.10.84, running on Linux 5.10.0-30-amd64 #1 SMP Debian 5.10.218-1 (2024-06-01)

I have tested downgrading the keepalived version (we were using one from -bpo). The downgraded version also shows a similar message, and the IPv6 issue continues to exist.

At the moment, I have no more clues about what is happening. I'll continue tomorrow.

IPv6 VRRP is all link-local, if I recall correctly. Did you configure it like that?

Mentioned in SAL (#wikimedia-cloud) [2024-10-11T09:51:42Z] <arturo> cloudgw network maintenance related to T376879

Thanks, this was the key for fixing the problems we were observing. I really appreciate you chiming in :-)

It is now working as expected.
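For reference, RFC 5798 specifies that VRRPv3 advertisements over IPv6 use the link-local address of the sending interface as their source. A quick way to check whether an address seen in a capture is link-local is sketched below with Python's standard `ipaddress` module. The helper name `is_link_local_v6` is an assumption for illustration; the two addresses come from the tcpdump output earlier in this task.

```python
import ipaddress


def is_link_local_v6(addr):
    """Return True if addr is an IPv6 link-local address (fe80::/10)."""
    ip = ipaddress.ip_address(addr)
    return ip.version == 6 and ip.is_link_local


# Source and destination taken from the first capture in this task.
for addr in ("fe80::d28e:79ff:fef5:8644", "2a02:ec80:a100:fe04::2002:1"):
    kind = "link-local" if is_link_local_v6(addr) else "not link-local"
    print(f"{addr}: {kind}")
```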