Some clusters do not have DNS for IPv6 addresses (TRACKING TASK)
Open, Medium, Public

Description

As part of our ongoing effort to import DNS data into Netbox for automation, we ran a procedure to identify hosts with missing IPv6 DNS entries, on the suspicion that these are potential sources of problems.

The procedure to produce this data was to look at the PuppetDB networking key for each host, select the host's "primary" IP address, do a DNS lookup on the host's FQDN to check for an AAAA record, and report any that were missing.
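The lookup step of that procedure can be sketched roughly as follows. This is a minimal illustration, not the script that produced the lists below: the function names are hypothetical, the PuppetDB step is left out, and `socket.getaddrinfo` goes through the system resolver (so it also honours /etc/hosts) rather than querying the authoritative DNS directly.

```python
import socket


def has_aaaa(fqdn, resolve=socket.getaddrinfo):
    """Return True if the resolver returns at least one IPv6 address for fqdn.

    `resolve` is injectable so the check can be exercised without a network.
    """
    try:
        return len(resolve(fqdn, None, socket.AF_INET6)) > 0
    except socket.gaierror:
        # Resolution failed: no AAAA record (or the name does not exist).
        return False


def missing_aaaa(fqdns, resolve=socket.getaddrinfo):
    """Return the subset of fqdns for which no IPv6 address was found."""
    return [fqdn for fqdn in fqdns if not has_aaaa(fqdn, resolve)]
```

Calling `missing_aaaa(["stat1008.eqiad.wmnet", "scandium.eqiad.wmnet"])` from a host that can resolve these names would return whichever of them still lack an IPv6 address.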

Below is a list of the machines (not VMs) without IPv6 DNS grouped by responsible team:

Foundations:

  • kafka-main[2001-2005].codfw.wmnet

Observability:

  • centrallog2001.codfw.wmnet
  • centrallog1001.eqiad.wmnet
  • graphite2003.codfw.wmnet
  • graphite1004.eqiad.wmnet
  • logstash[2001-2003,2026-2029].codfw.wmnet
  • logstash[1010-1012,1026-1029].eqiad.wmnet
  • mwlog2001.codfw.wmnet
  • mwlog1001.eqiad.wmnet

WMCS:

  • cloudbackup[2001-2002].codfw.wmnet
  • cloudnet2003-dev.codfw.wmnet
  • cloudvirt[2001-2003]-dev.codfw.wmnet
  • cloudvirt[1001-1009,1012-1018,1023-1030].eqiad.wmnet
  • cloudvirt-wdqs[1001-1003].eqiad.wmnet
  • labsdb[1009-1012].eqiad.wmnet
  • labtestvirt2003.codfw.wmnet
  • dbproxy[2001-2004].codfw.wmnet
  • dbproxy[1003,1008,1012-1021].eqiad.wmnet
  • dbstore[1003-1005].eqiad.wmnet

Data Persistence:

  • ms-be[2016-2056].codfw.wmnet - can support
  • ms-be[1016-1026,1028-1059].eqiad.wmnet
  • ms-fe[2005-2008].codfw.wmnet
  • ms-fe[1005-1008].eqiad.wmnet

Serviceops:

Search:

  • elastic[2025-2060].codfw.wmnet
  • elastic[1032-1067].eqiad.wmnet
  • relforge[1001-1002].eqiad.wmnet
  • wdqs[2001-2008].codfw.wmnet
  • wdqs[1004-1013].eqiad.wmnet

???:

  • es[2011-2025].codfw.wmnet
  • es[1015-1025].eqiad.wmnet
  • maps[2001-2004].codfw.wmnet
  • maps[1001-1004].eqiad.wmnet

Fixed, will not proceed, or removed:

  • flerovium.eqiad.wmnet
  • francium.eqiad.wmnet
  • furud.codfw.wmnet
  • heze.codfw.wmnet - being deprecated
  • kafka-jumbo[1007-1009].eqiad.wmnet - ipv6 names are done (https://phabricator.wikimedia.org/T185262), verified in netbox
  • labstore[1004-1005].eqiad.wmnet - ipv6 names verified in netbox
  • mw[2135-2147,2150-2212,2214,2256] - machines offline
  • puppetmaster2003.codfw.wmnet
  • sretest[1001-1002].eqiad.wmnet
  • stat1008.eqiad.wmnet
  • thanos-be[2001-2004].codfw.wmnet - ipv6 names verified in netbox
  • theemin.codfw.wmnet
  • tungsten.eqiad.wmnet

Traffic:

  • lvs[2007-2010].codfw.wmnet
  • lvs[1013-1016].eqiad.wmnet
  • lvs[4005-4007].ulsfo.wmnet

Foundations:

  • auth2001.codfw.wmnet
  • auth1002.eqiad.wmnet
  • ganeti[2001-2024].codfw.wmnet - waiting for upgrade to buster
  • ganeti[1001-1022].eqiad.wmnet
  • ganeti[4001-4003].ulsfo.wmnet
  • ping1001
  • ping2001
  • ping3001

Analytics:

  • an-druid[1001-1002].eqiad.wmnet
  • druid[1007-1008].eqiad.wmnet
  • notebook[1003-1004].eqiad.wmnet (removed)

Machine learning:

  • ores[2001-2009].codfw.wmnet - will be replaced soon
  • ores[1001-1009].eqiad.wmnet
  • oresrdb2002.codfw.wmnet - removed
  • oresrdb[1001-1002].eqiad.wmnet

Serviceops:

  • kubernetes[1007-1014].eqiad.wmnet
  • mc[2019-2036].codfw.wmnet
  • mc[1019-1036].eqiad.wmnet
  • mc-gp[2001-2003].codfw.wmnet
  • mc-gp[1001-1003].eqiad.wmnet
  • mw[2215-2334,2350-2254,2257-2376].codfw.wmnet
  • mw[1261-1279,1281-1290,1293-1413].eqiad.wmnet
  • parse[2001-2020].codfw.wmnet
  • restbase[2009-2023].codfw.wmnet
  • restbase[1016-1030].eqiad.wmnet
  • restbase-dev[1004-1006].eqiad.wmnet
  • scb[2001-2006].codfw.wmnet
  • scb[1001-1004].eqiad.wmnet
  • sessionstore[2001-2003].codfw.wmnet
  • sessionstore[1001-1003].eqiad.wmnet
  • thumbor[2001-2004].codfw.wmnet
  • thumbor[1001-1004].eqiad.wmnet
  • wtp[2001-2020].codfw.wmnet
  • wtp[1025-1048].eqiad.wmnet
  • rdb[2003-2006].codfw.wmnet
  • rdb[1005-1006,1009-1010].eqiad.wmnet
  • scandium.eqiad.wmnet

Data Persistence:

  • db[2071-2140].codfw.wmnet - these are blocked by T270101 and won't be pursued now.
  • db[1074-1139,1141-1149].eqiad.wmnet - as above
  • dbprov[2001-2002].codfw.wmnet - as above
  • dbprov[1001-1002].eqiad.wmnet
  • pc[2007-2010].codfw.wmnet
  • pc[1007-1010].eqiad.wmnet
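The bracketed entries in the lists above are compact host ranges. A minimal sketch of expanding them into individual FQDNs is shown here, assuming a single bracket group per entry with comma-separated numbers or `start-end` ranges; the `expand` helper is hypothetical and is not the tool used to build the lists. Ranges where the end is smaller than the start (as in the 2350-2254 entry in the mw codfw list) are reported as errors rather than silently expanded to nothing.

```python
import re


def expand(expr: str) -> list[str]:
    """Expand e.g. 'rdb[1005-1006,1009].eqiad.wmnet' into individual hostnames."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        # No bracket group: the entry is already a single hostname.
        return [expr]
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.partition("-")
        end = end or start  # a bare number is a range of one
        if int(end) < int(start):
            raise ValueError(f"inverted range {part!r} in {expr!r}")
        width = len(start)  # keep any zero padding from the start value
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts
```

For example, `expand("cloudvirt[2001-2003]-dev.codfw.wmnet")` yields the three `-dev` hosts, and an entry without brackets such as `scandium.eqiad.wmnet` is returned unchanged.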

Event Timeline

When I checked, stat1008 already had an AAAA record. Not sure if someone fixed it or if there is some issue in the script?

Change 599813 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/dns@master] sretest: add AAAA records

https://gerrit.wikimedia.org/r/599813

Change 599813 merged by Jbond:
[operations/dns@master] sretest: add AAAA records

https://gerrit.wikimedia.org/r/599813

Change 599779 merged by Jbond:
[operations/dns@master] AAAA: for flerovium and furud

https://gerrit.wikimedia.org/r/599779

Change 599783 merged by Jbond:
[operations/puppet@production] profile/manifests/dumps: enable ipv6 drop ferm rule for 443

https://gerrit.wikimedia.org/r/599783

Dzahn subscribed.

tungsten has been decom'ed today. checked the box because it's gone

@jbond now that everything is in Netbox, would it make sense to have a Netbox report that shows the hosts that have a primary v4 IP and primary v6 without DNS name? Possibly grouping them by prefix (types).
Non-alerting of course.
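The selection logic for such a report could look roughly like the sketch below. It works on plain dictionaries rather than the Netbox report framework, and the record layout (`name`, `primary_ip4`, `primary_ip6`, `address`, `dns_name`) is an assumption made for illustration. Grouping uses the network derived from each address, which is a simplification of grouping by the prefixes defined in Netbox.

```python
import ipaddress
from collections import defaultdict


def v6_without_dns(devices):
    """Group devices that have primary v4 and v6 IPs but no DNS name on the v6 IP.

    Returns a dict mapping the IPv6 network (as a string) to device names.
    """
    grouped = defaultdict(list)
    for device in devices:
        v4 = device.get("primary_ip4")
        v6 = device.get("primary_ip6")
        if not v4 or not v6:
            continue  # only dual-stack devices are of interest
        if v6.get("dns_name"):
            continue  # the v6 address already has a DNS name
        network = ipaddress.ip_interface(v6["address"]).network
        grouped[str(network)].append(device["name"])
    return dict(grouped)
```

Since the report is meant to be non-alerting, the result would only be listed, for example one line per network with the affected hosts.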

I have rejiggered the list in the task description to cross-reference with responsible teams. @Dzahn @Volans could you fill in some gaps as to who is responsible for the ??? hosts? Thanks.

I have gone through and checked each of the clusters and updated their status. Most have IPv6 but no DNS. Some have been fixed in the intervening time.

scandium is a testing rig for parsoid, so it can move to serviceops (where the parse*/wtp* servers are; it belongs there as well)

heze is a backup server (offsite), so I think that would be data persistence.

maps .. <unknown value>

Can you please add to the task description the steps needed in Netbox to make it generate a DNS record for a server which handles IPv6 fine? We use enable_ip6_mapped by default for practically all servers these days. Is this just a matter of setting "DNS Name" in the interface to the FQDN and running the sre.dns.netbox cookbook or is there more to it?

> Is this just a matter of setting "DNS Name" in the interface to the FQDN and running the sre.dns.netbox cookbook or is there more to it?

It's exactly that. When all the data was imported from the existing DNS into Netbox, the DNS Name was skipped for IPv6 addresses that didn't have an AAAA/PTR record, so that Netbox would generate the same data the manual DNS repo had.

If you do that, make sure that the service is correctly configured to handle traffic on v6 (listen, ferm, grants, etc.).
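One way to sanity-check the "listen" part before adding the AAAA record is to attempt a TCP connection over IPv6 only. The sketch below is a hedged illustration: `reachable_over_v6` is a hypothetical helper, it only covers TCP services, and a successful connection says nothing about ferm rules for other sources or about database grants.

```python
import socket


def reachable_over_v6(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds over IPv6."""
    try:
        # Restrict resolution to IPv6 so an IPv4-only answer cannot succeed.
        infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except OSError:
        return False
    for family, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            continue  # try the next address, if any
    return False
```

Passing the host's IPv6 address literally (instead of its FQDN) makes the check usable before the AAAA record exists.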

> @jbond now that everything is in Netbox, would it make sense to have a Netbox report that shows the hosts that have a primary v4 IP and primary v6 without DNS name? Possibly grouping them by prefix (types).
> Non-alerting of course.

Sorry, missed this, but yes, I think it would.

> Is this just a matter of setting "DNS Name" in the interface to the FQDN and running the sre.dns.netbox cookbook or is there more to it?

Just to clarify: yes, this is correct. However, the risk is that not all services are listening on the IPv6 addresses.

Since the maps servers are being replaced (I think?), perhaps we can cross them off for this project. Am I right that this is happening?

joanna_borun changed the task status from Open to In Progress. Sep 21 2021, 4:03 PM
Volans changed the task status from In Progress to Open. Dec 6 2021, 5:37 PM
Volans moved this task from In Progress to Blocked on the Infrastructure-Foundations board.

Speaking for scandium.eqiad.wmnet: that is a parsoid test server and, while used by the parsoid team, is not critical for production. IMHO it can just be added. For the other servers listed under serviceops, I'm not sure whether they are actually owned by serviceops.

@MoritzMuehlenhoff I see that ganeti[2009-2024] and ganeti[1009-1022] are lacking AAAA records while the rest have them. Can we add them, in line with the rest of the cluster?

> @MoritzMuehlenhoff I see that ganeti[2009-2024] and ganeti[1009-1022] are lacking AAAA records while the rest have them. Can we add them, in line with the rest of the cluster?

I don't see why not. Best to first start with codfw only, maybe.

From sudo cumin ganeti[1009-1022].eqiad.wmnet 'ip -6 addr | grep "scope global" | grep -v dynamic' (and the same in codfw):

Those hosts only have a SLAAC v6 IP: ganeti[2010,2016,2020].codfw.wmnet so the same procedure as on T353254 needs to be done first.
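The same filter can be sketched in Python for anyone post-processing saved `ip -6 addr` output. This is an illustration under assumptions: `static_global_v6` is a hypothetical helper, and the sample addresses in the usage note come from the 2001:db8::/32 documentation range, not from the ganeti hosts. A host for which the function returns an empty list has only SLAAC (dynamic) global addresses.

```python
def static_global_v6(ip_output: str) -> list[str]:
    """Return global-scope IPv6 addresses that are not flagged 'dynamic'.

    Mirrors: ip -6 addr | grep "scope global" | grep -v dynamic
    """
    addresses = []
    for line in ip_output.splitlines():
        fields = line.split()
        # Address lines look like: inet6 <addr>/<len> scope <scope> [flags...]
        if len(fields) < 4 or fields[0] != "inet6":
            continue
        if fields[2:4] == ["scope", "global"] and "dynamic" not in fields[4:]:
            addresses.append(fields[1])
    return addresses
```

For output containing one statically configured address, one SLAAC address (`scope global dynamic mngtmpaddr`) and a link-local address, only the static one is returned.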

> Those hosts only have a SLAAC v6 IP: ganeti[2010,2016,2020].codfw.wmnet so the same procedure as on T353254 needs to be done first.

I'll convert these in January. When that is done, we can proceed with the AAAA records.