
Some Traffic clusters apparently do not support IPv6
Closed, ResolvedPublic

Description

Greetings!

While importing DNS into Netbox as part of the transition to automation, we discovered that some clusters do not have IPv6 DNS entries. We interpreted this as intentional (omitting the record was the mechanism used to keep potential clients from reaching a machine's IPv6 interfaces when a given service did not support IPv6), so these hosts were not imported into automation.

We are now triaging these clusters for their potential to support IPv6 in the future. Below are the hosts that were left out of IPv6 DNS and that we think your team is responsible for. If you could take some time to add any information you have about supporting IPv6 on these clusters, any specific plans for doing so, or a note that it will not be possible in the foreseeable future, it would be greatly appreciated!

If any of these machines don't belong to you let us know on this ticket or the parent task (T253173), thanks!

  • lvs[2007-2010].codfw.wmnet
  • lvs[1013-1016].eqiad.wmnet
  • lvs[4005-4007].ulsfo.wmnet

Event Timeline

Restricted Application added a subscriber: Aklapper.
Joe triaged this task as Low priority. Jan 5 2021, 7:52 AM
Joe added a subscriber: Joe.

I will let the traffic folks answer as well, but first of all I think you should clarify a bit better the wording of the task. For instance, I struggle to understand what "some clusters do not have IPv6 DNS entries" means in this context. Specifically, you're naming servers in eqiad, codfw and ulsfo, and a quick query returns:

$ for dc in eqiad codfw ulsfo; do N="text-lb.$dc.wikimedia.org"; IP6=$(dig +short -t AAAA $N); echo $N $IP6; done
text-lb.eqiad.wikimedia.org 2620:0:861:ed1a::1
text-lb.codfw.wikimedia.org 2620:0:860:ed1a::1
text-lb.ulsfo.wikimedia.org 2620:0:863:ed1a::1
$ for dc in eqiad codfw ulsfo; do N="upload-lb.$dc.wikimedia.org"; IP6=$(dig +short -t AAAA $N); echo $N $IP6; done
upload-lb.eqiad.wikimedia.org 2620:0:861:ed1a::2:b
upload-lb.codfw.wikimedia.org 2620:0:860:ed1a::2:b
upload-lb.ulsfo.wikimedia.org 2620:0:863:ed1a::2:b

which are the public IPv6 addresses to reach all of our services.

So my guess is you're just not seeing an IPv6 address associated with the aforementioned load balancers?

(Also setting the priority to low as this is not an ongoing production problem, correct me if I'm wrong).

The point of the project is to get as many hosts to have an IPv6 address (and, obviously, to be functional on that address) as we can, and, in general, for it to be default to have IPv6 addresses in DNS. If it's not appropriate for a particular cluster, that's a valid outcome.

In this case the load balancer servers listed above indeed do not have IPv6 DNS. This ticket requests any information needed to add these hosts' IPv6 addresses to our DNS or to prompt the actions required to do so.

Just to be extra clear, I mean host addresses throughout this.

BBlack added a subscriber: BBlack.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all tickets that are neither part of our current planned work nor clearly a recent, higher-priority emergent issue. This is simply one step in a larger task cleanup effort. Further triage of these tickets (and especially, organizing future potential project ideas from them into a new medium) will occur afterwards! For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

@BCornwall (see also on IRC in #wikimedia-traffic from this morning for additional context).
This morning the Icinga alert:

PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes

fired. Looking at the pending diff, all the records were related to the addition of AAAA records to the lvs[2007-2010] hosts.
When changing DNS names in Netbox you also need to run the sre.dns.netbox cookbook to make the changes go live in our DNS system. For more details on how this process works see https://wikitech.wikimedia.org/wiki/DNS/Netbox.

After discussing with @Vgutierrez on IRC, because it's not clear whether we actually need DNS names for the SLAAC mngtmpaddr addresses, and because the interface names were wrong, we've decided to revert some of the changes in Netbox, but in a way that makes it easy to re-add them if they turn out to be needed.
This brought Netbox and prod DNS back in sync and made Icinga happy.
I've just unset the DNS name of the added addresses and saved them so that we can re-add them if needed.

For my reference, to be able to restore them if needed, this is the code I've run in Netbox:

>>> import uuid
>>> request_id = uuid.uuid4()
>>> user = User.objects.get(username='volans')
>>> devices = ['lvs2007', 'lvs2008', 'lvs2009', 'lvs2010']
>>> original_records = []
>>> for name in devices:
...     device = Device.objects.get(name=name)
...     for iface in device.interfaces.all():
...         for addr in iface.ip_addresses.all():
...             if addr.family != 6:
...                 continue
...             original_records.append((addr, addr.dns_name))
...             addr.dns_name = ''
...             log = addr.to_objectchange('update')
...             log.request_id = request_id
...             log.user = user
...             log.save()
...             addr.save()
...
>>> for i in original_records:
...     print(i)
...
(<IPAddress: 2620:0:860:101:10:192:1:7/64>, 'lvs2007.codfw.wmnet')
(<IPAddress: 2620:0:860:1:20a:f7ff:feef:ea40/64>, 'vl2001-enp59s0f0.lvs2007.codfw.wmnet')
(<IPAddress: 2620:0:860:2:20a:f7ff:feef:ea41/64>, 'vl2002-enp59s0f1d1.lvs2007.codfw.wmnet')
(<IPAddress: 2620:0:860:102:20a:f7ff:feef:ea41/64>, 'vl2018-enp59s0f1d1.lvs2007.codfw.wmnet')
(<IPAddress: 2620:0:860:3:20a:f7ff:fef0:320/64>, 'vl2003-enp175s0f0.lvs2007.codfw.wmnet')
(<IPAddress: 2620:0:860:103:20a:f7ff:fef0:320/64>, 'vl2019-enp175s0f0.lvs2007.codfw.wmnet')
(<IPAddress: 2620:0:860:4:20a:f7ff:fef0:321/64>, 'vl2004-enp175s0f1d1.lvs2007.codfw.wmnet')
(<IPAddress: 2620:0:860:104:20a:f7ff:fef0:321/64>, 'vl2020-enp175s0f1d1.lvs2007.codfw.wmnet')
(<IPAddress: 2620:0:860:102:10:192:17:7/64>, 'lvs2008.codfw.wmnet')
(<IPAddress: 2620:0:860:2:20a:f7ff:feef:e660/64>, 'vl2002-enp59s0f0.lvs2008.codfw.wmnet')
(<IPAddress: 2620:0:860:1:20a:f7ff:feef:e661/64>, 'vl2001-enp59s0f1d1.lvs2008.codfw.wmnet')
(<IPAddress: 2620:0:860:101:20a:f7ff:feef:e661/64>, 'vl2017-enp59s0f1d1.lvs2008.codfw.wmnet')
(<IPAddress: 2620:0:860:3:20a:f7ff:feef:edc0/64>, 'vl2003-enp175s0f0.lvs2008.codfw.wmnet')
(<IPAddress: 2620:0:860:103:20a:f7ff:feef:edc0/64>, 'vl2019-enp175s0f0.lvs2008.codfw.wmnet')
(<IPAddress: 2620:0:860:4:20a:f7ff:feef:edc1/64>, 'vl2004-enp175s0f1d1.lvs2008.codfw.wmnet')
(<IPAddress: 2620:0:860:104:20a:f7ff:feef:edc1/64>, 'vl2020-enp175s0f1d1.lvs2008.codfw.wmnet')
(<IPAddress: 2620:0:860:103:10:192:33:7/64>, 'lvs2009.codfw.wmnet')
(<IPAddress: 2620:0:860:3:20a:f7ff:feef:ee00/64>, 'vl2003-enp59s0f0.lvs2009.codfw.wmnet')
(<IPAddress: 2620:0:860:1:20a:f7ff:feef:ee01/64>, 'vl2001-enp59s0f1d1.lvs2009.codfw.wmnet')
(<IPAddress: 2620:0:860:101:20a:f7ff:feef:ee01/64>, 'vl2017-enp59s0f1d1.lvs2009.codfw.wmnet')
(<IPAddress: 2620:0:860:2:20a:f7ff:fef0:b70/64>, 'vl2002-enp175s0f0.lvs2009.codfw.wmnet')
(<IPAddress: 2620:0:860:102:20a:f7ff:fef0:b70/64>, 'vl2018-enp175s0f0.lvs2009.codfw.wmnet')
(<IPAddress: 2620:0:860:4:20a:f7ff:fef0:b71/64>, 'vl2004-enp175s0f1d1.lvs2009.codfw.wmnet')
(<IPAddress: 2620:0:860:104:20a:f7ff:fef0:b71/64>, 'vl2020-enp175s0f1d1.lvs2009.codfw.wmnet')
(<IPAddress: 2620:0:860:104:10:192:49:7/64>, 'lvs2010.codfw.wmnet')
(<IPAddress: 2620:0:860:4:20a:f7ff:fef0:240/64>, 'vl2004-enp59s0f0.lvs2010.codfw.wmnet')
(<IPAddress: 2620:0:860:1:20a:f7ff:fef0:241/64>, 'vl2001-enp59s0f1d1.lvs2010.codfw.wmnet')
(<IPAddress: 2620:0:860:101:20a:f7ff:fef0:241/64>, 'vl2017-enp59s0f1d1.lvs2010.codfw.wmnet')
(<IPAddress: 2620:0:860:2:20a:f7ff:fef0:c10/64>, 'vl2002-enp175s0f0.lvs2010.codfw.wmnet')
(<IPAddress: 2620:0:860:102:20a:f7ff:fef0:c10/64>, 'vl2018-enp175s0f0.lvs2010.codfw.wmnet')
(<IPAddress: 2620:0:860:3:20a:f7ff:fef0:c11/64>, 'vl2003-enp175s0f1d1.lvs2010.codfw.wmnet')
(<IPAddress: 2620:0:860:103:20a:f7ff:fef0:c11/64>, 'vl2019-enp175s0f1d1.lvs2010.codfw.wmnet')
>>>

The actual changes in Netbox can be seen here:
https://netbox.wikimedia.org/extras/changelog/?request_id=edca8766-b89c-4fa1-abf4-fb71e36d2578

I've left my tmux open so that the revert is even easier in case it's needed.
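
For readers without access to that tmux session, a restore could be sketched roughly as below. This is only an illustration under stated assumptions: FakeAddress is a hypothetical stand-in for a Netbox IPAddress object so that the snippet runs outside the Netbox shell, restore_dns_names is a made-up helper name, and the change-log bookkeeping done in the session above (to_objectchange, request_id, user) is left out. The primary_only switch relies on the observation that, in the output above, every per-VLAN name starts with "vl".

```python
from dataclasses import dataclass


@dataclass
class FakeAddress:
    """Hypothetical stand-in for a Netbox IPAddress object."""
    address: str
    dns_name: str = ""
    saved: bool = False

    def save(self):
        # The real object would write to the Netbox database here.
        self.saved = True


def restore_dns_names(records, primary_only=False):
    """Re-apply saved (address, dns_name) pairs and return the changed objects.

    With primary_only=True, names starting with "vl" (the per-VLAN SLAAC
    records in the saved output) are skipped.
    """
    restored = []
    for addr, dns_name in records:
        if primary_only and dns_name.startswith("vl"):
            continue
        addr.dns_name = dns_name
        addr.save()
        restored.append(addr)
    return restored


records = [
    (FakeAddress("2620:0:860:101:10:192:1:7/64"), "lvs2007.codfw.wmnet"),
    (FakeAddress("2620:0:860:1:20a:f7ff:feef:ea40/64"),
     "vl2001-enp59s0f0.lvs2007.codfw.wmnet"),
]
for addr in restore_dns_names(records, primary_only=True):
    print(addr.address, addr.dns_name)
```

In the real Netbox shell the loop would operate on the original_records list directly, and the sre.dns.netbox cookbook would still need to be run afterwards.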

For additional context, this task was supposed to affect only the host's primary IPv6 address, not all of them.
As of now it seems that most LVS hosts do have AAAA records for their primary IPv6 address. The only ones that are missing the IPv6 address are:

lvs[1013-1016,2007-2010]

This would suggest that the LVS setup does actually fully support having AAAA records on the hosts' primary IP addresses. In the general case, though, having some hosts with AAAA records and some without usually means that one of the following is true:

  • the hosts with the AAAA records are actually misconfigured and might be causing live issues for clients connecting to them (such as trying the v6 address first, failing, and then retrying on the v4 one), and we should remove their AAAA records.
  • the hosts with the AAAA records are working fine, demonstrating that the service does actually support IPv6, so we could just add the missing AAAA records for the remaining hosts and be fully IPv6 compatible.

If the hosts that have AAAA records are functioning correctly, as I suspect is the case here, feel free to get in touch with me so that we can add the missing AAAA records to the remaining hosts of the cluster programmatically instead of doing it manually.
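
A quick way to see whether a cluster is mixed in this sense could look like the sketch below. It is a rough illustration, not the tooling used for this task: has_aaaa and split_by_aaaa are made-up names, and socket.getaddrinfo goes through the system resolver (so /etc/hosts entries count too) rather than querying DNS for AAAA records directly. The resolver is passed in as a parameter so the logic can be exercised without network access.

```python
import socket


def has_aaaa(hostname, resolve=socket.getaddrinfo):
    """Return True if the resolver yields at least one IPv6 address."""
    try:
        return bool(resolve(hostname, None, socket.AF_INET6))
    except socket.gaierror:
        # Name does not exist or has no IPv6 address.
        return False


def split_by_aaaa(hostnames, resolve=socket.getaddrinfo):
    """Partition hostnames into (with IPv6, without IPv6), keeping order."""
    with_v6, without_v6 = [], []
    for name in hostnames:
        if has_aaaa(name, resolve):
            with_v6.append(name)
        else:
            without_v6.append(name)
    return with_v6, without_v6
```

A cluster where both returned lists are non-empty is a mixed one and falls under one of the two cases above.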

Thank you for doing that, @Volans ; I apologize for forgetting to run the cookbook.

I'm a little confused here regarding only setting DNS for the real interface: I see other servers with all interfaces having DNS records, e.g. https://netbox.wikimedia.org/dcim/devices/3482/interfaces/... And when @ayounsi kindly fixed up lvs[4005-4007].ulsfo.wmnet they included the virtual interfaces as well (though with a different naming scheme). If the IPv4 addresses have their own DNS name, why not the IPv6? I don't see any mention in the ticket of it only being the primary IP as well...

I got an answer from @Vgutierrez regarding the fact that the interface names in the DNS name didn't match the actual host interface:

> Predictable Network Interfaces naming aren't as predictable as you would imagine
> so in the update from stretch to buster those names changed and probably the DNS records weren't upgraded
> the source of truth regarding IP<->interface mapping for LVS can be found in the puppet repo, hieradata/common/lvs/interfaces.yaml

Am I correct in that, minus my negligence of running the cookbook after changing, the changes are correct?

> fixed up lvs[4005-4007].ulsfo.wmnet

For context: T311290

The issue is twofold:

1/ the LVS hosts use SLAAC IPs on their non-primary interfaces
This is not a standard practice and should be avoided in favor of v4 mapped IPs (or in this case, it could even work without IPs).
However fixing this could be time consuming and impactful, especially as people are working on the future L4LB.
In concrete terms, it means that:

  • the v6 IP there will change if the network card changes, making the DNS name association incorrect (and the record will be deleted at the next run of the PuppetDB import Netbox script)
  • if for some reason router advertisements stop on that vlan, the host will lose its v6 IP (which is more problematic but a different issue)
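
To illustrate why the address follows the network card, here is a small sketch of how a SLAAC address is built from a MAC address. It assumes these hosts use the modified EUI-64 scheme from RFC 4291, which the ff:fe in the middle of addresses such as 2620:0:860:1:20a:f7ff:feef:ea40 suggests. The MAC address in the usage line is inferred by reversing that address and is not taken from the host, and slaac_eui64_address is a made-up helper name.

```python
import ipaddress


def slaac_eui64_address(prefix, mac):
    """Build the SLAAC address for a /64 prefix and a 48-bit MAC address."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("modified EUI-64 expects a /64 prefix")
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    # Flip the universal/local bit of the first octet.
    octets[0] ^= 0x02
    # Insert ff:fe between the two halves of the MAC.
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    return net.network_address + int.from_bytes(interface_id, "big")


print(slaac_eui64_address("2620:0:860:1::/64", "00:0a:f7:ef:ea:40"))
```

Swapping the card changes the MAC, so the lower 64 bits of the address change with it, and any DNS name pointing at the old address becomes stale.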

While it's important to have them for their primary IPs (the ones that resolve when ssh-ing to the host), I could go either way on adding a dns_name to the SLAAC records. But as they don't change often, I have a slight preference for having DNS records for all the live IPs on our network, for the sake of consistency.

2/ we don't have a clear naming convention for unusual interfaces (and Predictable Network Interfaces doesn't help)
With T311290 - https://netbox.wikimedia.org/ipam/ip-addresses/?q=vl1201
I decided to use .wikimedia.org to show that they are public IPs, and what matters here is the vlan they're in, so for example vl1201.lvs4005.wikimedia.org. This has the advantage of not being impacted by PNI.

Happy to go with anything else as long as it's consistent at least across the LVS.

> Thank you for doing that, @Volans ; I apologize for forgetting to run the cookbook.

No problem.

> I'm a little confused here regarding only setting DNS for the real interface: I see other servers with all interfaces having DNS records, e.g. https://netbox.wikimedia.org/dcim/devices/3482/interfaces/... And when @ayounsi kindly fixed up lvs[4005-4007].ulsfo.wmnet they included the virtual interfaces as well (though with a different naming scheme). If the IPv4 addresses have their own DNS name, why not the IPv6? I don't see any mention in the ticket of it only being the primary IP as well...

> I got an answer from @Vgutierrez regarding the fact that the interface names in the DNS name didn't match the actual host interface:

> > Predictable Network Interfaces naming aren't as predictable as you would imagine
> > so in the update from stretch to buster those names changed and probably the DNS records weren't upgraded
> > the source of truth regarding IP<->interface mapping for LVS can be found in the puppet repo, hieradata/common/lvs/interfaces.yaml

> Am I correct in that, minus my negligence of running the cookbook after changing, the changes are correct?

From my PoV and the original goal of this task, the important part is to have the AAAA records on the primary IPv6 of the host (the one returned when doing host $hostname for example). Unless blocked by specific reasons (incompatibilities with IPv6, missing tooling to manage ACLs/grants/etc..) we should aim to have all hosts with both IPv4 and IPv6 primary IPs with related A/AAAA DNS records.
Given that the only lvs without the AAAA records are lvs[1013-1016,2007-2010], it seems reasonable and probably safe to add them given that all the others already have them.

Besides that, for the SLAAC IPv6 addresses that are also mngtmpaddr on the kernel side, I'll leave it to your team and netops to decide what the best course of action is.

Whatever the decision, let me know if you want me to restore your changes so you don't have to redo them manually ;)

Just for completeness, and to use the same wording I'm using for other tasks.

Some clusters managed by the Traffic team have inconsistent AAAA DNS records for the primary IPv6 of the hosts. Some hosts have the AAAA record in the DNS for their primary IPv6 address, some don't.
See https://wikitech.wikimedia.org/wiki/DNS/Netbox#Mixed_clusters for more details about the possible risks of the current setup and the two alternative actions to move forward.

This is the list of the affected clusters and related hosts as of 04/07/2022:

  • lvs*:
    • have the AAAA record: lvs[1017-1020,3005-3007,4005-4007,5001-5003,6001-6003]
    • lack the AAAA record: lvs[1013-1016,2007-2010]
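
The bracketed host lists used throughout this task can be expanded into individual names with a few lines of Python. The sketch below is a simplified illustration that handles a single bracket group with comma-separated numbers and ranges; expand_hosts is a made-up helper name and not part of any existing tooling.

```python
import re
from typing import List


def expand_hosts(expr: str) -> List[str]:
    """Expand e.g. 'lvs[1013-1016,2007-2010]' into individual host names."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        # No bracket group: treat the expression as a single host name.
        return [expr]
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.partition("-")
        end = end or start  # a lone number is a range of one
        width = len(start)  # preserve zero padding, if any
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


for host in expand_hosts("lvs[1013-1016,2007-2010]"):
    print(host)
```

A domain suffix after the brackets is kept, so expand_hosts("lvs[2007-2010].codfw.wmnet") yields the four fully qualified codfw names.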
BCornwall added a subscriber: ssingh.

Thank you for the help, @ssingh, @Volans and @ayounsi

I've added the DNS records to only the primary interfaces of all LVS hosts and committed them via the cookbook. @Volans, I extra-appreciate your consideration in making an easy-revert situation for me. I didn't need it since changing the records was quick enough to do manually, but it was quite nice of you to do.

Thanks @BCornwall for the quick turnaround and fix. I'll close the tmux then given the revert is not needed anymore.

Change 812050 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/netbox-extras@master] reports/network: remove lvs* from no IPv6 list

https://gerrit.wikimedia.org/r/812050

Change 812050 merged by jenkins-bot:

[operations/software/netbox-extras@master] reports/network: remove lvs* from no IPv6 list

https://gerrit.wikimedia.org/r/812050

I've removed the lvs prefix from the no IPv6 cluster list and now the Network report in Netbox confirms there are no lvs hosts there anymore. Thanks all for fixing this.