DNS race in WMCS VPS when deleting and re-provisioning an instance with the same name
Open, Medium, Public

Description

When deleting an instance and re-provisioning one with the same name, DNS answers alternate between NXDOMAIN and the old instance's IP address, 172.16.1.129 (tested about 5 minutes after the new instance was created):

filippo@pontoon-thanos-02:~$ host pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud 
pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud has address 172.16.1.129
filippo@pontoon-thanos-02:~$ host pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud 
pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud has address 172.16.1.129
filippo@pontoon-thanos-02:~$ host pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud 
Host pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud not found: 3(NXDOMAIN)
filippo@pontoon-thanos-02:~$ host pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud 
pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud has address 172.16.1.129
Host pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud not found: 3(NXDOMAIN)
filippo@pontoon-thanos-02:~$ host pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud 
pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud has address 172.16.1.129
Host pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud not found: 3(NXDOMAIN)
filippo@pontoon-thanos-02:~$ host pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud 
Host pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud not found: 3(NXDOMAIN)
filippo@pontoon-thanos-02:~$ host pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud 
pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud has address 172.16.1.129
Host pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud not found: 3(NXDOMAIN)
filippo@pontoon-thanos-02:~$ host pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud 
Host pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud not found: 3(NXDOMAIN)
filippo@pontoon-thanos-02:~$ host pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud 
pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud has address 172.16.1.129
filippo@pontoon-thanos-02:~$

resolv.conf looks like this:

filippo@pontoon-thanos-02:~$ cat /etc/resolv.conf 
## THIS FILE IS MANAGED BY PUPPET
##
## source: modules/base/resolv.conf.labs.erb
## from:   base::resolving

domain monitoring.eqiad1.wikimedia.cloud
search monitoring.eqiad1.wikimedia.cloud eqiad1.wikimedia.cloud 
nameserver 208.80.154.143
nameserver 208.80.154.24
options timeout:1 ndots:1

The instance's new address is 172.16.0.213. After a little while it starts appearing in the answers, although the old IP is still returned sometimes:

filippo@pontoon-thanos-02:~$ host pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud 
pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud has address 172.16.1.129
filippo@pontoon-thanos-02:~$ host pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud 
pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud has address 172.16.0.213
filippo@pontoon-thanos-02:~$ host pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud 
pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud has address 172.16.0.213
filippo@pontoon-thanos-02:~$ host pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud 
pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud has address 172.16.1.129
filippo@pontoon-thanos-02:~$ host pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud 
pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud has address 172.16.1.129

I couldn't find documentation on how long the new address is expected to take to fully propagate. What is the expectation? Thanks!
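
Since resolv.conf lists two nameservers, one way to narrow this down might be to query each of them directly and count the answers per server. The following is only a rough sketch: the helper names (`classify`, `query`, `tally`) are made up for illustration, and it assumes the `host` utility is installed and accepts the server as its second positional argument.

```python
import re
import subprocess
from collections import Counter

# Matches the answer lines printed by `host`, e.g.
# "name.example has address 172.16.1.129".
ADDRESS_RE = re.compile(r"has address (\S+)")


def classify(host_output):
    """Reduce the output of one `host` run to a single label."""
    addresses = ADDRESS_RE.findall(host_output)
    if addresses:
        # An address wins even if a later query in the same run got NXDOMAIN.
        return ",".join(sorted(set(addresses)))
    if "NXDOMAIN" in host_output:
        return "NXDOMAIN"
    return "other"


def query(name, nameserver, timeout=5.0):
    """Ask one specific nameserver for the A record of `name`."""
    try:
        result = subprocess.run(
            ["host", "-t", "A", name, nameserver],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,  # `host` exits non-zero on NXDOMAIN; that is expected
        )
    except (OSError, subprocess.TimeoutExpired):
        return "error"
    return classify(result.stdout)


def tally(name, nameservers, attempts=10):
    """Count the labels returned by each nameserver over several attempts."""
    return {
        ns: Counter(query(name, ns) for _ in range(attempts))
        for ns in nameservers
    }
```

Calling `tally("pontoon-ms-be-01.monitoring.eqiad1.wikimedia.cloud", ["208.80.154.143", "208.80.154.24"])` would return one counter per nameserver, which should make it visible whether the stale and NXDOMAIN answers come from one recursor or from both.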

Event Timeline

Restricted Application added a subscriber: Aklapper.
Andrew triaged this task as Medium priority. Oct 20 2020, 4:28 PM
Andrew subscribed.

This is a known issue with asynchronous DNS record creation and removal. The attached bug is the proper solution; I hope to work on that after we move forward a few more OpenStack releases.