Some clusters do not have DNS for IPv6 addresses (TRACKING TASK)
Open, Medium, Public

Description

As part of our ongoing effort to import DNS data into Netbox for automation, we ran a check to identify hosts with missing IPv6 DNS entries, on the suspicion that these hosts are potential sources of problems.

To produce this data, we took the "primary" IP address of each host from its PuppetDB networking key, did a DNS lookup on the host's FQDN to check for an AAAA record, and reported any host where that record was missing.
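The DNS half of that check can be sketched roughly as follows. This is a minimal illustration, not the script that was actually used: it assumes the list of FQDNs has already been extracted from PuppetDB, and the function names `has_aaaa` and `missing_aaaa` are hypothetical. Note that `socket.getaddrinfo` goes through the system resolver, so it can also be satisfied by local sources such as `/etc/hosts`.

```python
import socket
from typing import Callable, Iterable, List


def has_aaaa(fqdn: str, resolve: Callable = socket.getaddrinfo) -> bool:
    """Return True if the name resolves to at least one IPv6 address."""
    try:
        # Restricting the family to AF_INET6 asks only for IPv6 results.
        return len(resolve(fqdn, None, socket.AF_INET6)) > 0
    except socket.gaierror:
        # Resolution failed: treat the host as having no AAAA record.
        return False


def missing_aaaa(fqdns: Iterable[str],
                 resolve: Callable = socket.getaddrinfo) -> List[str]:
    """Return the sorted FQDNs for which no IPv6 address was found."""
    return sorted(name for name in fqdns if not has_aaaa(name, resolve))
```

The `resolve` parameter exists only so the lookup can be swapped out in tests; in normal use, calling `missing_aaaa(list_of_fqdns)` is enough.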

Below is a list of the machines (not VMs) without IPv6 DNS grouped by responsible team:

Foundations:

  • kafka-main[2001-2005].codfw.wmnet

Observability:

  • centrallog2001.codfw.wmnet
  • centrallog1001.eqiad.wmnet
  • graphite2003.codfw.wmnet
  • graphite1004.eqiad.wmnet
  • logstash[2001-2003,2026-2029].codfw.wmnet
  • logstash[1010-1012,1026-1029].eqiad.wmnet
  • thanos-fe[2001-2003].codfw.wmnet
  • mwlog2001.codfw.wmnet
  • mwlog1001.eqiad.wmnet

WMCS:

  • cloudbackup[2001-2002].codfw.wmnet
  • cloudnet2003-dev.codfw.wmnet
  • cloudvirt[2001-2003]-dev.codfw.wmnet
  • cloudvirt[1001-1009,1012-1018,1023-1030].eqiad.wmnet
  • cloudvirt-wdqs[1001-1003].eqiad.wmnet
  • labsdb[1009-1012].eqiad.wmnet
  • labtestvirt2003.codfw.wmnet
  • dbproxy[2001-2004].codfw.wmnet
  • dbproxy[1003,1008,1012-1021].eqiad.wmnet
  • dbstore[1003-1005].eqiad.wmnet

Data Persistence:

  • ms-be[2016-2056].codfw.wmnet - can support
  • ms-be[1016-1026,1028-1059].eqiad.wmnet
  • ms-fe[2005-2008].codfw.wmnet
  • ms-fe[1005-1008].eqiad.wmnet

Serviceops:

  • dumpsdata[1001-1003].eqiad.wmnet
  • rdb[2003-2006].codfw.wmnet
  • rdb[1005-1006,1009-1010].eqiad.wmnet
  • snapshot[1005-1010].eqiad.wmnet
  • scandium.eqiad.wmnet

Search:

  • elastic[2025-2060].codfw.wmnet
  • elastic[1032-1067].eqiad.wmnet
  • relforge[1001-1002].eqiad.wmnet
  • wdqs[2001-2008].codfw.wmnet
  • wdqs[1004-1013].eqiad.wmnet

Traffic:

  • lvs[2007-2010].codfw.wmnet
  • lvs[1013-1016].eqiad.wmnet
  • lvs[4005-4007].ulsfo.wmnet

???:

  • es[2011-2025].codfw.wmnet
  • es[1015-1025].eqiad.wmnet
  • maps[2001-2004].codfw.wmnet
  • maps[1001-1004].eqiad.wmnet

Fixed, will not proceed, or removed:

  • flerovium.eqiad.wmnet
  • francium.eqiad.wmnet
  • furud.codfw.wmnet
  • heze.codfw.wmnet - being deprecated
  • kafka-jumbo[1007-1009].eqiad.wmnet - ipv6 names are done (https://phabricator.wikimedia.org/T185262), verified in netbox
  • labstore[1004-1005].eqiad.wmnet - ipv6 names verified in netbox
  • mw[2135-2147,2150-2212,2214,2256] - machines offline
  • puppetmaster2003.codfw.wmnet
  • sretest[1001-1002].eqiad.wmnet
  • stat1008.eqiad.wmnet
  • thanos-be[2001-2004].codfw.wmnet - ipv6 names verified in netbox
  • theemin.codfw.wmnet
  • tungsten.eqiad.wmnet

Foundations:

  • auth2001.codfw.wmnet
  • auth1002.eqiad.wmnet
  • ganeti[2001-2024].codfw.wmnet - waiting for upgrade to buster
  • ganeti[1001-1022].eqiad.wmnet
  • ganeti[4001-4003].ulsfo.wmnet
  • ping1001
  • ping2001
  • ping3001

Analytics:

  • an-druid[1001-1002].eqiad.wmnet
  • druid[1007-1008].eqiad.wmnet
  • notebook[1003-1004].eqiad.wmnet (removed)

Machine learning:

  • ores[2001-2009].codfw.wmnet - will be replaced soon
  • ores[1001-1009].eqiad.wmnet
  • oresrdb2002.codfw.wmnet - removed
  • oresrdb[1001-1002].eqiad.wmnet

Serviceops:

  • kubernetes[1007-1014].eqiad.wmnet
  • mc[2019-2036].codfw.wmnet
  • mc[1019-1036].eqiad.wmnet
  • mc-gp[2001-2003].codfw.wmnet
  • mc-gp[1001-1003].eqiad.wmnet
  • mw[2215-2334,2350-2254,2257-2376].codfw.wmnet
  • mw[1261-1279,1281-1290,1293-1413].eqiad.wmnet
  • parse[2001-2020].codfw.wmnet
  • restbase[2009-2023].codfw.wmnet
  • restbase[1016-1030].eqiad.wmnet
  • restbase-dev[1004-1006].eqiad.wmnet
  • scb[2001-2006].codfw.wmnet
  • scb[1001-1004].eqiad.wmnet
  • sessionstore[2001-2003].codfw.wmnet
  • sessionstore[1001-1003].eqiad.wmnet
  • thumbor[2001-2004].codfw.wmnet
  • thumbor[1001-1004].eqiad.wmnet
  • wtp[2001-2020].codfw.wmnet
  • wtp[1025-1048].eqiad.wmnet

Data Persistence:

  • db[2071-2140].codfw.wmnet - these are blocked by T270101 and won't be pursued now.
  • db[1074-1139,1141-1149].eqiad.wmnet - as above
  • dbprov[2001-2002].codfw.wmnet - as above
  • dbprov[1001-1002].eqiad.wmnet
  • pc[2007-2010].codfw.wmnet
  • pc[1007-1010].eqiad.wmnet
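
The bracketed ranges used in the lists above, such as `kafka-main[2001-2005].codfw.wmnet`, can be expanded into individual host names with a small helper. This is only a sketch that assumes a single bracket group per entry; the function name `expand` is hypothetical and it is not a replacement for whatever tooling produced the lists.

```python
import re
from typing import List

# One optional prefix, one bracket group, one optional suffix.
_RANGE = re.compile(
    r"^(?P<prefix>[^\[\]]*)\[(?P<body>[^\[\]]+)\](?P<suffix>[^\[\]]*)$"
)


def expand(pattern: str) -> List[str]:
    """Expand 'name[1001-1003,1005].site' into individual host names."""
    match = _RANGE.match(pattern)
    if match is None:
        # No bracket group: the entry is already a single host name.
        return [pattern]
    names = []
    for part in match["body"].split(","):
        start, _, end = part.partition("-")
        end = end or start  # a bare number is a range of one
        if int(end) < int(start):
            raise ValueError(f"descending range {part!r} in {pattern!r}")
        width = len(start)  # keep any zero padding of the first number
        for number in range(int(start), int(end) + 1):
            names.append(
                f"{match['prefix']}{number:0{width}d}{match['suffix']}"
            )
    return names
```

Raising on a descending range is a deliberate choice in this sketch, so that a mistyped range is reported instead of silently expanding to nothing.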

Event Timeline

Change 599769 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/dns@master] puppetmaster2003: add AAAA

https://gerrit.wikimedia.org/r/599769

Change 599769 merged by Jbond:
[operations/dns@master] puppetmaster2003: add AAAA

https://gerrit.wikimedia.org/r/599769

Change 599779 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/dns@master] AAAA: for flerovium and furud

https://gerrit.wikimedia.org/r/599779

Change 599783 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] profile/manifests/dumps: enable ipv6 drop ferm rule for 443

https://gerrit.wikimedia.org/r/599783

Change 599798 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/dns@master] theemin.codfw.wmnet: add AAAA

https://gerrit.wikimedia.org/r/599798

Change 599803 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] mongodb: enable ipv6

https://gerrit.wikimedia.org/r/599803

Change 599798 merged by Jbond:
[operations/dns@master] theemin.codfw.wmnet: add AAAA

https://gerrit.wikimedia.org/r/599798

Change 599803 abandoned by Jbond:
mongodb: enable ipv6

Reason:
mongodb is going away as per moritz comment

https://gerrit.wikimedia.org/r/599803

tungsten.eqiad.wmnet

Currently mongodb is not listening on IPv6; however, mongodb is going away, so we should wait until that work has been completed.

When I checked, stat1008 already had an AAAA record. Not sure whether someone fixed it or there is some issue in the script.

Change 599813 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/dns@master] sretest: add AAAA records

https://gerrit.wikimedia.org/r/599813

Change 599813 merged by Jbond:
[operations/dns@master] sretest: add AAAA records

https://gerrit.wikimedia.org/r/599813

Change 599779 merged by Jbond:
[operations/dns@master] AAAA: for flerovium and furud

https://gerrit.wikimedia.org/r/599779

Change 599783 merged by Jbond:
[operations/puppet@production] profile/manifests/dumps: enable ipv6 drop ferm rule for 443

https://gerrit.wikimedia.org/r/599783

Dzahn added a subscriber: Dzahn.

tungsten has been decom'ed today. checked the box because it's gone

@jbond now that everything is in Netbox, would it make sense to have a Netbox report that shows the hosts that have a primary v4 IP and a primary v6 IP without a DNS name? Eventually grouping them by prefix (types).
Non-alerting of course.
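
To illustrate the filtering and grouping such a report would do, here is a rough sketch over plain dictionaries. It deliberately does not use the Netbox API: the record shape and the key names (`name`, `primary_ip4`, `primary_ip6`, `ip6_dns_name`) are assumptions made for this example only.

```python
import ipaddress
from collections import defaultdict
from typing import Dict, Iterable, List


def v6_without_dns(hosts: Iterable[dict]) -> Dict[str, List[str]]:
    """Group hosts with both primary IPs but no DNS name on the v6 one.

    Each host is a dict with hypothetical keys: 'name', 'primary_ip4',
    'primary_ip6' (CIDR strings or None) and 'ip6_dns_name'.
    """
    report = defaultdict(list)
    for host in hosts:
        # Only hosts that have both a primary v4 and a primary v6 qualify.
        if not host.get("primary_ip4") or not host.get("primary_ip6"):
            continue
        # Hosts whose v6 address already has a DNS name are fine.
        if host.get("ip6_dns_name"):
            continue
        # Group by the network the v6 address belongs to.
        prefix = ipaddress.ip_interface(host["primary_ip6"]).network
        report[str(prefix)].append(host["name"])
    return {prefix: sorted(names) for prefix, names in report.items()}
```

The result maps each IPv6 prefix to the hosts in it that still lack a DNS name, which matches the non-alerting, informational use described above.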

I have rejiggered the list in the task description to cross-reference with responsible teams. @Dzahn @Volans could you fill in some gaps as to who is responsible for the ??? hosts? Thanks.

I have gone through and checked each of the clusters and updated their status. Most have IPv6 but no DNS. Some have been fixed in the intervening time.

scandium is a testing rig for parsoid, so it can move to serviceops (that is where the parse*/wtp* servers are, and it belongs there as well)

heze is a backup server (offsite), so I think that would be data persistence.

maps .. <unknown value>

Can you please add to the task description the steps needed in Netbox to make it generate a DNS record for a server which handles IPv6 fine? We use enable_ip6_mapped by default for practically all servers these days. Is this just a matter of setting "DNS Name" in the interface to the FQDN and running the sre.dns.netbox cookbook, or is there more to it?

Is this just a matter of setting "DNS Name" in the interface to the FQDN and running the sre.dns.netbox cookbook, or is there more to it?

It's exactly that. When all the data was imported from the existing DNS into Netbox, the DNS Name was skipped for any IPv6 address that didn't have an AAAA/PTR record, so that Netbox would generate the same data the manual DNS repo had.

If you do that, make sure that the service is correctly configured to handle traffic on v6 (listen, ferm, grants, etc.).
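
One quick way to sanity-check the "listen" part before adding the DNS name is to attempt a TCP connection to the host's IPv6 address. The sketch below is an assumption-laden illustration (the function name `accepts_tcp_over_ipv6` is hypothetical); it only shows that something accepts connections on that address and port from where it is run, and says nothing about application-level grants.

```python
import socket


def accepts_tcp_over_ipv6(address: str, port: int,
                          timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the IPv6 address succeeds."""
    try:
        # Forcing AF_INET6 ensures the check never falls back to IPv4.
        with socket.socket(socket.AF_INET6, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            sock.connect((address, port))
            return True
    except OSError:
        # Refused, timed out, unreachable, or IPv6 not available locally.
        return False
```

Run from another host, a False result may equally mean a firewall rule is dropping the traffic, so it is a hint for where to look, not a diagnosis.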

@jbond now that everything is in Netbox, would it make sense to have a Netbox report that shows the hosts that have a primary v4 IP and a primary v6 IP without a DNS name? Eventually grouping them by prefix (types).
Non-alerting of course.

Sorry, I missed this, but yes, I think it would.

Is this just a matter of setting "DNS Name" in the interface to the FQDN and running the sre.dns.netbox cookbook, or is there more to it?

Just to clarify: yes, this is correct. However, the risk is that not all services are listening on the IPv6 addresses.

I think the maps servers are being replaced, so perhaps we can cross them off for this project. Am I right that this is happening?

joanna_borun changed the task status from Open to In Progress. Sep 21 2021, 4:03 PM
Volans changed the task status from In Progress to Open. Dec 6 2021, 5:37 PM
Volans moved this task from In Progress to Blocked on the Infrastructure-Foundations board.