Some Foundation clusters do not appear to support IPv6
Closed, Resolved | Public

Description

Hello, some of the clusters that we're apparently responsible for may not have IPv6 support. It would be good to at least characterize the problems with these (I don't have any specific knowledge of any of them).

  • auth2001.codfw.wmnet
  • auth1002.eqiad.wmnet
  • ganeti[1001-1022].eqiad.wmnet - waiting for buster
  • ganeti[2001-2024].codfw.wmnet
  • ganeti[3001-3003].esams.wmnet
  • ganeti[4001-4003].ulsfo.wmnet
  • ganeti[5001-5003].eqsin.wmnet
  • ping1001
  • ping2001
  • ping3001

and CC from Analytics

  • kafka-main[2001-2005].codfw.wmnet

Event Timeline

crusnov triaged this task as Medium priority. Jan 4 2021, 6:43 PM

adding netops because ping* offload servers are in their domain, right?

elukey subscribed.

Removing the Analytics tag since kafka-main servers are managed by SRE (it is the codfw cluster for the jobqueue etc..) :)

ayounsi updated the task description.
ayounsi unsubscribed.
jbond updated the task description.

As ganeti[3001-3003].esams.wmnet and ganeti[4001-4003].ulsfo.wmnet already have AAAA records configured, I'm assuming it should be safe to add them to the other ganeti clusters; however, it would be good to have someone more familiar with ganeti confirm, @MoritzMuehlenhoff?

One notable difference is that eqiad/codfw are still on Stretch. It'll probably work fine, but it's a bit of an unknown, so it's better to make the switch when upgrading to Buster. But eqsin can go ahead for sure.

Sounds good thanks

Just wanted to note that currently only esams has the AAAA record; I will plan to enable them for eqsin and ulsfo.

I guess given that the ganeti clusters will wait for Buster, the Kafka cluster is the only one remaining. What needs to be done for this?

It appears the kafka-main2* cluster is indeed listening on ipv6, it just seems to need DNS (especially in the face of the eqiad ones already having this DNS). Is there any particular care that's needed here?

We have a special setting in commons.yaml, kafka_brokers_main, that is used, IIRC, to tell zookeeper which connections to accept, and I see that it already includes kafka-main200[1-3]'s ipv6 addresses, so in theory we should be good. kafka-main200[4,5] have role insetup so they can be done anytime.

Tried a simple telnet and it worked via ipv6:

elukey@kafka-main2001:~$ telnet conf2001.codfw.wmnet 2181
Trying 2620:0:860:101:10:192:0:143...
Connected to conf2001.codfw.wmnet.

The other side effect is that kafka clients may start sending traffic to the ipv6 addresses, but I can't think of any problem on that side. Since it is a delicate cluster I'd do a slow rollout (one host at a time), and also inform ServiceOps about it as an FYI :)
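For anyone wanting to repeat the telnet check above in a scripted way, here is a minimal sketch. It assumes plain Python with only the standard library; the function names `ipv6_addresses` and `can_connect_ipv6` and the injectable `resolver` parameter are made up for illustration and are not part of any existing tooling.

```python
import socket


def ipv6_addresses(host, port, resolver=socket.getaddrinfo):
    """Return the sorted, unique IPv6 addresses that `host` resolves to.

    An empty list means no AAAA record could be resolved. `resolver` is a
    hypothetical hook so the lookup can be replaced in tests.
    """
    try:
        infos = resolver(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the textual address.
    return sorted({info[4][0] for info in infos})


def can_connect_ipv6(host, port, timeout=3.0):
    """Attempt a TCP connection over IPv6 only, similar to the telnet test."""
    for addr in ipv6_addresses(host, port):
        try:
            with socket.create_connection((addr, port), timeout=timeout):
                return True
        except OSError:
            continue
    return False
```

Something like `can_connect_ipv6("conf2001.codfw.wmnet", 2181)` run from a broker would approximate the manual check; it only proves TCP reachability, not that the service behaves correctly over ipv6.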

Beauty. Thanks for the followup. To be clear:

  • We can go ahead and add kafka-main2* ipv6 DNS any time?
  • It is likely we can add kafka-main1* ipv6, but we should do them one host at a time.

For the eqiad ones, how do we proceed in testing them? What should we look out for?

Thanks!

@crusnov the eqiad ones have AAAA records afaics, so we should be good on that side. For the codfw ones, I'd pick one host (say kafka-main2001) and I'd add the AAAA record for it (alerting serviceops first), and then if nothing explodes I'd add the other two (2002 and 2003).
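The one-host-first approach described above can be sketched roughly as follows. This is only an illustration of the ordering; `staged_rollout`, `apply_change` and `is_healthy` are hypothetical names, and the real steps (adding the AAAA record, checking the broker) were done by hand.

```python
def staged_rollout(hosts, apply_change, is_healthy):
    """Apply a change one host at a time, stopping at the first unhealthy host.

    `apply_change(host)` and `is_healthy(host)` are caller-supplied callables
    (assumptions for this sketch). Returns the hosts that were changed.
    """
    changed = []
    for host in hosts:
        apply_change(host)
        changed.append(host)
        if not is_healthy(host):
            # Stop here so the remaining hosts keep the old configuration.
            break
    return changed
```

With `hosts = ["kafka-main2001", "kafka-main2002", "kafka-main2003"]`, the first host acts as the canary: if its health check fails, the other two are never touched.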

@crusnov if you have time let's do it this week or the next!

Yes (thank you for the ping), let's do it first thing tomorrow, I'll ping on IRC when I add the DNS and deploy it.

I have deployed the AAAA DNS for kafka-main2001

All good from the kafka-main2001 side! We can enable it everywhere

@crusnov we are good to deploy the other AAAA records, can we proceed?

elukey claimed this task.
elukey updated the task description.

Added the remaining AAAA records for kafka-main200[2-5]!

Volans reopened this task as Open. Jul 5 2022, 7:56 AM

Some clusters managed by the Infrastructure Foundations team have inconsistent AAAA DNS records for the primary IPv6 of the hosts. Some hosts have the AAAA record in the DNS for their primary IPv6 address, some don't.
See https://wikitech.wikimedia.org/wiki/DNS/Netbox#Mixed_clusters for more details about the possible risks of the current setup and the two alternative actions to move forward.

This is the list of the affected clusters and related hosts as of 2022-07-04:

  • ganeti*:
    • have the AAAA record: ganeti[1024-1032,2025-2030,3001-3003,4001-4004,5002,6001-6004]
    • lack the AAAA record: ganeti[1005-1023,2007-2024,5001,5003]

Updated list of ganeti hosts without AAAA records (all the others have them): ganeti[1009-1022,2009-2024]

Full list of hosts without AAAA records for A:owner-infrastructure-foundations

ganeti[2017-2024].codfw.wmnet,ganeti[1009,1011-1012,1014-1018,1020-1022].eqiad.wmnet,seaborgium.wikimedia.org,serpens.wikimedia.org
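To turn a bracketed host list like the one above into individual hostnames (for example, to feed each one to a DNS check), a hand-rolled expansion could look like this. It is a simplified sketch that handles a single `[...]` group per entry; `split_top_level` and `expand` are illustrative names, and an established node-set library would be the more robust choice in practice.

```python
import re


def split_top_level(expr):
    """Split on commas that are not inside [...] brackets."""
    parts, depth, current = [], 0, []
    for ch in expr:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return [p.strip() for p in parts if p.strip()]


def expand(expr):
    """Expand e.g. 'ganeti[1009,1011-1012].eqiad.wmnet' into full hostnames."""
    hosts = []
    for part in split_top_level(expr):
        m = re.fullmatch(r"([^\[\]]*)\[([^\]]+)\]([^\[\]]*)", part)
        if not m:
            # No bracket group: the entry is already a plain hostname.
            hosts.append(part)
            continue
        prefix, ranges, suffix = m.groups()
        for item in ranges.split(","):
            lo, _, hi = item.strip().partition("-")
            hi = hi or lo
            width = len(lo)  # keep any zero padding used in the range
            for n in range(int(lo), int(hi) + 1):
                hosts.append(f"{prefix}{n:0{width}d}{suffix}")
    return hosts
```

Calling `expand(...)` on the full list above yields one hostname per entry, which makes it easy to loop over them or diff two snapshots of the list.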

@MoritzMuehlenhoff Removing task assignee as this open task has been assigned for more than two years - See the email sent on 2025-05-22.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome!
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!

MoritzMuehlenhoff claimed this task.

These are all done, the remaining Ganeti nodes w/o AAAA records were decommissioned as part of the last hardware refreshes in eqiad and codfw.