
Some WMCS clusters have inconsistent AAAA DNS records for the primary IPv6 of the hosts
Closed, Resolved (Public)

Description

Some clusters managed by the Cloud Services team have inconsistent AAAA DNS records for the primary IPv6 of the hosts. Some hosts have the AAAA record in the DNS for their primary IPv6 address, some don't.
See https://wikitech.wikimedia.org/wiki/DNS/Netbox#Mixed_clusters for more details about the possible risks of the current setup and the two alternative actions to move forward.

This is the list of the affected clusters and related hosts as of 04/07/2022:

  • cloudbackup*: DONE (added AAAA record)
    • have the AAAA record: cloudbackup100[3-4]
    • lack the AAAA record: cloudbackup200[1-2]
  • cloudcephmon*: DONE (added AAAA record)
    • have the AAAA record: cloudcephmon1003,cloudcephmon[2004-2006]-dev
    • lack the AAAA record: cloudcephmon100[1-2]
  • cloudcephosd*: DONE (added AAAA record)
    • have the AAAA record: cloudcephosd200[1-3]-dev,cloudcephosd10[16-34]
    • lack the AAAA record: cloudcephosd[1001-1015]
  • cloudnet*: DONE (cloudnet[1003-1004] don't exist anymore)
    • have the AAAA record: cloudnet[2005-2006]-dev,cloudnet[1005-1006]
    • lack the AAAA record: cloudnet[1003-1004]
  • cloudvirt*: DONE (added AAAA record)
    • have the AAAA record: cloudvirt[1019-1022,1025-1026,1028-1030,1040-1053]
    • lack the AAAA record: cloudvirt[2001-2003]-dev,cloudvirt[1017,1023-1024,1027,1030-1039]
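The host lists above use a compact bracket notation (ranges and comma-separated items inside `[...]`). As a rough illustration only, the following Python sketch expands such expressions into individual hostnames so the two lists of a cluster can be compared. It is a minimal stand-in written for this note, not the tooling actually used to produce the lists, and the function names are made up here.

```python
import re
from itertools import product

_BRACKET = re.compile(r"\[([^\]]+)\]")
# Split on commas that are not inside a [...] group.
_TOP_LEVEL_COMMA = re.compile(r",(?![^\[]*\])")


def expand_bracket(items: str) -> list[str]:
    """Expand '1019-1022,1025' into individual items, keeping zero padding."""
    out = []
    for part in items.split(","):
        if "-" in part:
            start, end = part.split("-", 1)
            width = len(start)
            out.extend(str(n).zfill(width) for n in range(int(start), int(end) + 1))
        else:
            out.append(part)
    return out


def expand_hosts(expr: str) -> list[str]:
    """Expand one expression such as 'cloudcephmon[2004-2006]-dev'."""
    # re.split with a capture group yields: literal, bracket body, literal, ...
    pieces = _BRACKET.split(expr)
    choices = [expand_bracket(p) if i % 2 else [p] for i, p in enumerate(pieces)]
    return ["".join(combo) for combo in product(*choices)]


def expand_list(exprs: str) -> list[str]:
    """Expand a comma-separated list of expressions, as written in the task."""
    return [h for e in _TOP_LEVEL_COMMA.split(exprs) for h in expand_hosts(e)]


have = expand_list("cloudvirt[1019-1022,1025-1026,1028-1030,1040-1053]")
lack = expand_list("cloudvirt[2001-2003]-dev,cloudvirt[1017,1023-1024,1027,1030-1039]")
overlap = sorted(set(have) & set(lack))
```

Expanding both cloudvirt lists this way shows that cloudvirt1030 appears in both the "have" and "lack" ranges as written, so one of the two ranges is presumably off by one.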

Event Timeline

Added dns AAAA record to cloudcephosd10[01-15]

Added AAAA records for cloudbackup and cloudcephmon; the rest will need more careful checking.

@Andrew Can you take a look too? It may be related to the issues you were seeing with auth timeouts.

Clouddb: I know little to nothing about these servers. Most SQL grants are only configured for IPv4, so it's possible that adding AAAA records will cause access failures. Note, though, that this is not a mixed cluster: the clouddb hosts in codfw serve an entirely different purpose from those in eqiad. Bad naming :(

Cloudnet100[34]: It's definitely true that the service running on these is IPv4-only, but since the counterparts in codfw1dev work, it should be safe to add.

It should be safe to add the cloudvirt records.

What's involved in adding these records? Is it just filling in a field in netbox or are there separate dns steps needed?

> What's involved in adding these records? Is it just filling in a field in netbox or are there separate dns steps needed?

As mentioned in the task description, it's all outlined in the wikitech page: https://wikitech.wikimedia.org/wiki/DNS/Netbox#Mixed_clusters which points to https://wikitech.wikimedia.org/wiki/DNS/Netbox#Add_missing_DNS_name_to_the_primary_IPv6_address where there are two different approaches, depending on whether you need to add the record to a few hosts or to many.

@Andrew the outlined process requires two steps. You performed only step 1 (adding the DNS names in Netbox) and not step 2 (running the sre.dns.netbox cookbook).
Skipping step 2 is causing several issues:

  1. The changes have not been deployed to the DNS, so they are not live and we don't know whether they are OK or might cause issues for the services on those hosts.
  2. It caused Icinga to alert:
Thu 22:25:33   icinga-wm| PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
  3. It's currently de facto blocking any other Netbox changes from being propagated via the sre.dns.netbox cookbook, because running it would risk breaking the services on those hosts.
    • In particular, people running the provision or decommissioning workflows expect the DNS diff shown by the cookbook to contain only the affected hosts, not spurious changes. When it doesn't, the operator usually stops and has to ask around whether it's safe to proceed, which blocks them.
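The expectation described in the last point can be sketched as a simple set comparison. This is only an illustration in Python: the function names are hypothetical, and extracting hostnames from the real cookbook diff is deliberately left out because its format is not shown in this task.

```python
def spurious_changes(diff_hosts, expected_hosts):
    """Return hosts that appear in the DNS diff but were not expected by the operator."""
    return sorted(set(diff_hosts) - set(expected_hosts))


def safe_to_proceed(diff_hosts, expected_hosts):
    """True only when the diff touches nothing beyond the expected hosts."""
    return not spurious_changes(diff_hosts, expected_hosts)


# Hypothetical example: an operator provisioning "examplehost1001" finds
# pending AAAA changes for other hosts in the same diff.
pending = ["examplehost1001", "cloudvirt1017", "cloudvirt1023"]
unexpected = spurious_changes(pending, ["examplehost1001"])
```

With uncommitted changes sitting in Netbox, `unexpected` is non-empty for every unrelated run, which is why the operator has to stop and ask.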

I'm raising this to the WMCS folks in EU timezones to see if we can merge those changes and unblock other changes.

FYI the current diff is P33742

@Volans I've now run the sre.dns.netbox cookbook, which completed successfully.

Andrew updated the task description.

Note that the clouddb AAAA records can't be added yet, blocked by T270101

Just so I'm up to speed -- did you remove the AAAA records that I added before my break?

dcaro claimed this task.

> Just so I'm up to speed -- did you remove the AAAA records that I added before my break?

Partially, yes. It turns out that, as you suspected, the clouddb* hosts have grants set only for IPv4 addresses; once the AAAA records existed, clients started using IPv6 by default and failed to connect to the DB server. So I removed the AAAA entries for the clouddb* hosts (more details in T323550: clouddb* hosts with ipv6 access timeout from cumin).

I also linked the task on the DBA side to set up those grants, which will unblock this one (T270101: Grants not working with DB hosts with to ipv6).
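To make the failure mode above concrete: once a name resolves over IPv6, dual-stack clients may connect over IPv6, and IPv4-only grants then reject them. A minimal Python sketch for checking which address families a name resolves to is shown below. The function names are made up for this note, and `socket.getaddrinfo` goes through the system resolver (which may also consult local hosts files), so this approximates but is not identical to querying for an AAAA record directly.

```python
import socket


def address_families(hostname, resolver=socket.getaddrinfo):
    """Return the set of families ('IPv4', 'IPv6') that hostname resolves to."""
    labels = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}
    found = set()
    for family, label in labels.items():
        try:
            if resolver(hostname, None, family):
                found.add(label)
        except socket.gaierror:
            # No address of this family for the name.
            pass
    return found


def risky_for_ipv4_only_grants(hostname, resolver=socket.getaddrinfo):
    """True when the name resolves over IPv6, so clients may connect over IPv6."""
    return "IPv6" in address_families(hostname, resolver)
```

The `resolver` parameter exists only so the check can be exercised without network access; in normal use the default is enough, for example `risky_for_ipv4_only_grants("clouddb1013.eqiad.wmnet")`.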

dcaro removed dcaro as the assignee of this task.
Volans claimed this task.

There are no more inconsistent clusters; all of the above have been resolved.

The only remaining hosts are, I think, expected not to support AAAA records for now:

an-redacteddb1001.eqiad.wmnet,clouddb2002-dev.codfw.wmnet,clouddb[1013-1020].eqiad.wmnet