
Some labs instances IP have multiple PTR entries in DNS
Closed, ResolvedPublic

Description

krenair@bastion-01:~$ host 10.68.18.65
65.18.68.10.in-addr.arpa domain name pointer testlabs-createtest2.testlabs.eqiad.wmflabs.
65.18.68.10.in-addr.arpa domain name pointer bastion-02.bastion.eqiad.wmflabs.
krenair@bastion-01:~$ ping bastion-02
PING bastion-02.bastion.eqiad.wmflabs (10.68.18.65) 56(84) bytes of data.
64 bytes from bastion-02.bastion.eqiad.wmflabs (10.68.18.65): icmp_seq=1 ttl=64 time=0.480 ms
64 bytes from bastion-02.bastion.eqiad.wmflabs (10.68.18.65): icmp_seq=2 ttl=64 time=0.492 ms
krenair@tools-bastion-01:~$ ping bastion-02
PING bastion-02.eqiad.wmflabs (10.68.18.65) 56(84) bytes of data.
64 bytes from testlabs-createtest2.testlabs.eqiad.wmflabs (10.68.18.65): icmp_seq=1 ttl=64 time=0.403 ms
64 bytes from testlabs-createtest2.testlabs.eqiad.wmflabs (10.68.18.65): icmp_seq=2 ttl=64 time=0.582 ms

Also this one:

krenair@bastion-01:~$ host 10.68.17.12
12.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-22792.contintcloud.eqiad.wmflabs.
12.17.68.10.in-addr.arpa domain name pointer analytics203.analytics.eqiad.wmflabs.
krenair@bastion-01:~$ ping analytics203
PING analytics203.eqiad.wmflabs (10.68.17.12) 56(84) bytes of data.
64 bytes from ci-jessie-wikimedia-22792.contintcloud.eqiad.wmflabs (10.68.17.12): icmp_seq=1 ttl=64 time=1.60 ms
64 bytes from ci-jessie-wikimedia-22792.contintcloud.eqiad.wmflabs (10.68.17.12): icmp_seq=2 ttl=64 time=0.487 ms
krenair@tools-bastion-01:~$ ping analytics203
PING analytics203.eqiad.wmflabs (10.68.17.12) 56(84) bytes of data.
64 bytes from analytics203.analytics.eqiad.wmflabs (10.68.17.12): icmp_seq=1 ttl=64 time=0.420 ms
64 bytes from analytics203.analytics.eqiad.wmflabs (10.68.17.12): icmp_seq=2 ttl=64 time=0.405 ms

Event Timeline

Krenair raised the priority of this task from to Needs Triage.
Krenair updated the task description. (Show Details)
Krenair subscribed.
yuvipanda renamed this task from "10.68.18.65 resolves to two different instances" to "RDNS for 10.68.18.65 resolves to two different instances". Oct 10 2015, 9:25 PM
yuvipanda set Security to None.
chasemp subscribed.
Krenair renamed this task from "RDNS for 10.68.18.65 resolves to two different instances" to "RDNS for some labs instance IPs resolve to multiple different instances". Feb 9 2016, 3:48 PM
Krenair updated the task description. (Show Details)
Krenair added a subscriber: Andrew.
hashar renamed this task from "RDNS for some labs instance IPs resolve to multiple different instances" to "Some labs instances IP have multiple PTR entries in DNS". Feb 11 2016, 8:17 PM
krenair@bastion-01:~$ host 10.68.16.66
;; Truncated, retrying in TCP mode.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-53082.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-75884.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-74560.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-69512.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-78428.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-67335.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-75977.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-80668.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-69168.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-78397.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-78788.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-67383.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-62884.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-59032.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer sm-puppetmaster-trusty2.servermon.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-74567.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-64645.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-76015.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-76473.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-78683.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-77527.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-67103.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-65765.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-68088.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-76857.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-64125.contintcloud.eqiad.wmflabs.

So far there has been one instance in testlabs and quite a lot in contintcloud.
@hashar: please point me to the code that sets these contintcloud instances up and deletes them.

The system is Nodepool, which requests creation and deletion of instances via the OpenStack API endpoint. It creates hundreds of instances per day, which would explain why it shows up more often.

Date        # of creation requests
2016-04-07  701
2016-04-08  649
2016-04-09  196
2016-04-10  185
2016-04-11  733
2016-04-12  1095
2016-04-13  919
2016-04-14  932
2016-04-15  633
2016-04-16  102
2016-04-17  137
2016-04-18  788
2016-04-19  1112
2016-04-20  736
2016-04-21  797

I have no idea how DNS is provisioned

https://wikitech.wikimedia.org/wiki/Nodepool

<andrewbogott> Krenair: It's a hack, but I tend to put those things in sink plugins, since sink is already in charge of cleaning up dns entries.
<andrewbogott> (The designate people were just talking about how sink sometimes drops things so they're moving that to a more tightly integrated system… but we won't be using that anytime soon)
<Krenair> is that why we have an issue with broken PTR records?
<andrewbogott> Krenair: It could be from leaks, yes.

From T126518

It is back around :(

[21:50:04]  <mutante>	dzahn@bastion-restricted-01:~$ host 10.68.16.66
[21:50:04]  <mutante>	;; Truncated, retrying in TCP mode.
[21:50:04]  <mutante>	66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-53082.contintcloud.eqiad.wmflabs.
[21:50:06]  <mutante>	66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-77527.contintcloud.eqiad.wmflabs.
[21:50:09]  <mutante>	66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-75884.contintcloud.eqiad.wmflabs.
[21:50:09]  <mutante>	< 27 lines >

I have no idea whether forward entries leak as well. Maybe we could dump the contintcloud.eqiad.wmflabs zone and see how many entries there are? It should be fewer than 20, the quota for that tenant.

I can't AXFR though, so there is little I can investigate without brute-forcing DNS requests:

$ dig AXFR contintcloud.eqiad.wmflabs. @labs-ns0.wikimedia.org.
Transfer failed.
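Short of an AXFR, the only way in is to walk the reverse zone address by address. A rough sketch of such a brute-force scan (assuming the `host` utility is on the path, and using 10.68.16.0/24 purely as an example range) that flags addresses with more than one PTR:

```python
import ipaddress
import subprocess


def duplicate_ptrs(ptr_map):
    """Keep only the addresses that resolved to more than one PTR name."""
    return {ip: names for ip, names in ptr_map.items() if len(names) > 1}


def lookup_ptrs(ip):
    """Collect every PTR name that `host` prints for one address."""
    out = subprocess.run(['host', ip], capture_output=True, text=True).stdout
    return [line.split()[-1] for line in out.splitlines()
            if 'domain name pointer' in line]


if __name__ == '__main__':
    # Example range only; the labs fixed-IP range is larger than one /24.
    ptr_map = {str(ip): lookup_ptrs(str(ip))
               for ip in ipaddress.ip_network('10.68.16.0/24').hosts()}
    for ip, names in sorted(duplicate_ptrs(ptr_map).items()):
        print(ip, '->', ', '.join(names))
```

At one query per address this stays well under any reasonable rate limit, but it only covers the ranges you think to scan, unlike a zone dump.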

There are relatively many LDAP connection failures in the sink log. That fits with the fact that our designate setup is subject to periodic OOMs. I need to have another look at the sink code and verify that other plugins still run if one of them errors out.

It would also be useful to know if we are leaking A records that correspond to the leaked PTR records.

I wrote a stupid resolver for the A records:

#!/usr/bin/env python2

import argparse
import dns.resolver
from time import sleep

p = argparse.ArgumentParser()
p.add_argument('--delay', type=float, default=1.0,
               help='Delay between DNS queries')
p.add_argument('range', help='Range of ID. Ex: 85-90')
opts = p.parse_args()

try:
    (start, stop) = [int(border) for border in opts.range.split('-')]
except ValueError:
    p.error("Invalid range should be: <start>-<end>")

fqdn_template = 'ci-jessie-wikimedia-%s.contintcloud.eqiad.wmflabs.'

print "Start: %s" % (fqdn_template % start)
print "Stop: %s" % (fqdn_template % stop)

print "Querying DNS for A records ..."
for index, host_id in enumerate(xrange(start, stop), start=1):
    fqdn = fqdn_template % host_id
    try:
        answers = dns.resolver.query(fqdn, 'A')
        print "Found %s" % fqdn
    except dns.resolver.NXDOMAIN:
        continue
    finally:
        sleep(opts.delay)
print "Did %s queries. Done." % index

Running it right now from deployment-tin and range 70000-85000.

Out of 15000 A entries, only one leaked:

$ python blam.py --delay 0.1 70000-85000
Start: ci-jessie-wikimedia-70000.contintcloud.eqiad.wmflabs.
Stop:  ci-jessie-wikimedia-85000.contintcloud.eqiad.wmflabs.
Querying DNS for A records ...
Found ci-jessie-wikimedia-70569.contintcloud.eqiad.wmflabs.
Did 15000 queries. Done.

While looking at T99072: Fix 'unknown's in shinken I found another:

krenair@tools-bastion-03:~$ host 10.68.16.97
97.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-59441.contintcloud.eqiad.wmflabs.
97.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-47624.contintcloud.eqiad.wmflabs.
97.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-49202.contintcloud.eqiad.wmflabs.
97.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-52995.contintcloud.eqiad.wmflabs.
97.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-59269.contintcloud.eqiad.wmflabs.
97.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-51209.contintcloud.eqiad.wmflabs.
97.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-64225.contintcloud.eqiad.wmflabs.
97.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-51020.contintcloud.eqiad.wmflabs.
97.16.68.10.in-addr.arpa domain name pointer tools-worker-1011.tools.eqiad.wmflabs.
97.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-47309.contintcloud.eqiad.wmflabs.
97.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-60973.contintcloud.eqiad.wmflabs.

This is also a case of T134025

krenair@tools-bastion-03:~$ host 10.68.17.58
;; Truncated, retrying in TCP mode.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-104356.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-117884.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-87761.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-115559.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-112332.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-50058.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-118839.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-117685.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-111507.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-72171.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-96703.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-117105.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-94967.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-117256.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-103395.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-108704.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-117855.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-112002.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-106514.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-120291.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-102473.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-113082.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-100550.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-119443.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer deployment-salt02.deployment-prep.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-112763.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-115063.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-114045.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-87738.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-118180.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-110276.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-87655.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-88878.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-111738.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-120337.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-104783.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-94612.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-46539.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-85918.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-118867.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-108158.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-119834.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-107152.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-98053.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-83531.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-102838.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-109855.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-114052.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-113735.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-103081.contintcloud.eqiad.wmflabs.
krenair@mira:~$ host 10.68.17.146
146.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-140638.contintcloud.eqiad.wmflabs.
146.17.68.10.in-addr.arpa domain name pointer testlabs-horizontest-84926088-3f8c-4db2-9f92-2592cf9a4fcf.extdist.eqiad.wmflabs.
146.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-139291.contintcloud.eqiad.wmflabs.
146.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-154196.contintcloud.eqiad.wmflabs.
<mutante> 121.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-47938.contintcloud.eqiad.wmflabs.
<mutante> 121.16.68.10.in-addr.arpa domain name pointer petscan1.petscan.eqiad.wmflabs.

Until the DNS leak is identified, entries will keep leaking. It is quite easy to retrieve all of them from the Designate database, so there is no need to reply with any PTR entries you might find.

Ok, except @Andrew just told me to tack these on... :)

otto@deployment-kafka03:~$ host 10.68.16.138
138.16.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-87841.contintcloud.eqiad.wmflabs.
138.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-90123.contintcloud.eqiad.wmflabs.
138.16.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-89281.contintcloud.eqiad.wmflabs.
138.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-88514.contintcloud.eqiad.wmflabs.
138.16.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-87903.contintcloud.eqiad.wmflabs.
138.16.68.10.in-addr.arpa domain name pointer deployment-kafka03.deployment-prep.eqiad.wmflabs.
138.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-89145.contintcloud.eqiad.wmflabs.
138.16.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-91084.contintcloud.eqiad.wmflabs.
138.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-90612.contintcloud.eqiad.wmflabs.

@Ottomata Fair call sorry :-)

Nodepool spawns instances with an incremental ID to give some indication about the progress:

Time (UTC)           ID
2016-03-14           50000
2016-03-22           60000
2016-04-04           70000
2016-04-20           80000
2016-05-04           90000
2016-05-13           100000
2016-05-31           125000
2016-06-17           150000
2016-06-21 7am       152745
2016-06-26 midnight  159569
2016-07-11 midnight  175997

From the IDs pasted previously, maybe that got partially fixed or mitigated? The designate database definitely has a ton of entries for the contintcloud project. We could get an exhaustive list with an SQL query there.

I've been thinking maybe we should try nodepool in labtest (running at a much smaller scale) so we can take a closer look at this...

> I've been thinking maybe we should try nodepool in labtest (running at a much smaller scale) so we can take a closer look at this...

One could write a script that spawns and deletes instances in a loop against labtest. https://pypi.python.org/pypi/shade should make that straightforward :]
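A sketch of such a loop using shade; the cloud name, image, and flavor below are placeholders, not real labtest values:

```python
import time


def stress_names(prefix, count):
    """Names for the instances the loop will create and then delete."""
    return ['%s-%04d' % (prefix, i) for i in range(count)]


if __name__ == '__main__':
    import shade  # pip install shade; credentials come from clouds.yaml

    cloud = shade.openstack_cloud(cloud='labtest')  # placeholder cloud name
    for name in stress_names('dnsleak-test', 50):
        server = cloud.create_server(name, image='debian-jessie',
                                     flavor='m1.small', wait=True)
        cloud.delete_server(server.id, wait=True)
        time.sleep(1)  # give designate-sink time to process both notifications
```

After a run like this, every PTR left in the reverse zone for those names would be a confirmed stage-1 leak.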

> there are relatively many ldap connection failures in the sink log. That fits with the fact that our designate setup is subject to periodic OOMs. I need to have another look at the sink code and verify that other plugins still run if one of them errors out.

I'm not sure about what happens when nova_ldap errors out, but nova_fixed_multi should be catching everything and dumping it to the log.

Here are some relatively recent issues on labtestservices:

root@labtestservices2001:/var/log/designate# zcat designate-sink.log.*.gz | grep -v "File \"" | grep -v "     " | grep -v Traceback | grep -v "\[req-"| grep -o "oslo_messaging.notify.dispatcher [^$].*" | sort -d | uniq
oslo_messaging.notify.dispatcher LDAPError: LDAP connection invalid
oslo_messaging.notify.dispatcher NotFound: Instance 18b898ae-818a-4b5e-81fe-79b76ee71d38 could not be found. (HTTP 404)
oslo_messaging.notify.dispatcher NotFound: Instance 32e47f2b-1b0e-4bd2-8721-a5c4fb768e2d could not be found. (HTTP 404)
oslo_messaging.notify.dispatcher NotFound: Instance 353d926f-a863-4d36-94be-4317a9a2e68a could not be found. (HTTP 404)
root@labtestservices2001:/var/log/designate#

I've written a script to hopefully purge the vast majority of problematic entries in T120797

We looked into it last night but weren't able to find the cause. We do know that the last instance to leave a reverse DNS entry behind was deleted around 2016-09-08 22:46 (@madhuvishy ran select max(nova.instances.deleted_at) from records join recordsets on records.recordset_id = recordsets.id left join nova.instances on replace(nova.instances.uuid, '-', '') = records.managed_resource_id where records.domain_id = '8d114f3c815b466cbdd49b91f704ea60' and recordsets.name like '%.10.in-addr.arpa.' and recordsets.type = 'PTR' and nova.instances.deleted_at is not null; for me, which eventually returned 2016-09-08 22:46:45 after running for 10-15 minutes). You'd expect the DNS entry to be deleted within a minute of that.
Unfortunately @madhuvishy found nothing useful in the logs around that time, and I'm not sure whether it's now a historical thing that needs a one-off cleanup or an ongoing issue.

Here's the extent of the issue in labtest:

mysql:root@localhost [designate]> select recordsets.name, records.data, nova.instances.deleted_at from records join recordsets on records.recordset_id = recordsets.id left join nova.instances on replace(nova.instances.uuid, '-', '') = records.managed_resource_id where records.domain_id = '9b60f3abd64b4e309d6f7535811b0fa8' and recordsets.name like '%.10.in-addr.arpa.' and recordsets.type = 'PTR' and nova.instances.deleted_at is not null;
+----------------------------+-----------------------------------------------------+---------------------+
| name                       | data                                                | deleted_at          |
+----------------------------+-----------------------------------------------------+---------------------+
| 67.16.196.10.in-addr.arpa. | puppettest101.labtestproject.codfw.labtest.         | 2016-05-16 15:24:28 |
| 19.16.196.10.in-addr.arpa. | admin-test-instance.admin.codfw.labtest.            | 2016-08-05 16:03:39 |
| 44.16.196.10.in-addr.arpa. | digtest103.labtestproject.codfw.labtest.            | 2016-05-12 19:27:21 |
| 17.16.196.10.in-addr.arpa. | control.labtestproject.codfw.labtest.               | 2016-05-11 19:20:19 |
| 26.16.196.10.in-addr.arpa. | nettest101.labtestproject.codfw.labtest.            | 2016-05-11 14:45:56 |
| 53.16.196.10.in-addr.arpa. | starttest102.labtestproject.codfw.labtest.          | 2016-05-13 20:45:16 |
| 7.16.196.10.in-addr.arpa.  | inst109.labtestproject.codfw.labtest.               | 2016-04-21 15:24:21 |
| 22.16.196.10.in-addr.arpa. | overquota-instances-2.labtestproject.codfw.labtest. | 2016-08-23 20:02:30 |
| 11.16.196.10.in-addr.arpa. | dns-test-110.labtestproject.codfw.labtest.          | 2016-07-22 16:31:47 |
| 13.16.196.10.in-addr.arpa. | inst115.labtestproject.codfw.labtest.               | 2016-04-21 20:00:08 |
| 59.16.196.10.in-addr.arpa. | test106.labtestproject.codfw.labtest.               | 2016-05-14 12:48:31 |
| 18.16.196.10.in-addr.arpa. | testing.mediawiki-core-team.codfw.labtest.          | 2016-07-21 14:08:29 |
| 52.16.196.10.in-addr.arpa. | sshtest.labtestproject.codfw.labtest.               | 2016-05-13 19:57:26 |
| 25.16.196.10.in-addr.arpa. | overquota-instances-4.labtestproject.codfw.labtest. | 2016-08-23 20:02:29 |
| 26.16.196.10.in-addr.arpa. | quota-test.admin.codfw.labtest.                     | 2016-08-07 15:53:56 |
| 73.16.196.10.in-addr.arpa. | hostconsoletest1.labtestproject.codfw.labtest.      | 2016-07-21 14:08:07 |
| 60.16.196.10.in-addr.arpa. | test107.labtestproject.codfw.labtest.               | 2016-05-14 12:48:31 |
| 57.16.196.10.in-addr.arpa. | test104.labtestproject.codfw.labtest.               | 2016-05-14 12:48:31 |
| 24.16.196.10.in-addr.arpa. | overquota-instances-3.labtestproject.codfw.labtest. | 2016-08-23 20:02:30 |
| 21.16.196.10.in-addr.arpa. | quotatest.admin.codfw.labtest.                      | 2016-08-05 16:05:22 |
| 46.16.196.10.in-addr.arpa. | ldapfix.labtestproject.codfw.labtest.               | 2016-05-12 22:08:52 |
| 29.16.196.10.in-addr.arpa. | horizon-launch-test.admin.codfw.labtest.            | 2016-09-12 15:06:10 |
| 58.16.196.10.in-addr.arpa. | test105.labtestproject.codfw.labtest.               | 2016-05-14 12:48:31 |
| 28.16.196.10.in-addr.arpa. | quota-test-3.admin.codfw.labtest.                   | 2016-08-07 15:54:04 |
| 41.16.196.10.in-addr.arpa. | nettest115.labtestproject.codfw.labtest.            | 2016-05-12 22:08:52 |
| 2.16.196.10.in-addr.arpa.  | test101.labtestproject.codfw.labtest.               | 2016-05-13 21:21:11 |
| 16.16.196.10.in-addr.arpa. | test3.labtestproject.codfw.labtest.                 | 2016-05-11 19:20:26 |
| 42.16.196.10.in-addr.arpa. | digtest101.labtestproject.codfw.labtest.            | 2016-05-12 19:27:21 |
+----------------------------+-----------------------------------------------------+---------------------+
28 rows in set (0.01 sec)

The script was run against real-labs in T120797 and most existing problem cases should be gone now

Alex can you do the magic SELECT again and see whether DNS entries are still being leaked?

> Alex can you do the magic SELECT again and see whether DNS entries are still being leaked?

I don't have access to run the query posted above in real labs (only labtest, where we don't create/delete instances often enough to catch this); only ops can do that right now.

This continues to cause issues. Clush doesn't work from tools-puppetmaster-02, at least partially because:

Oct 31 20:16:55 tools-elastic-01 sshd[32448]: reverse mapping checking getaddrinfo for ci-trusty-wikimedia-163713.contintcloud.eqiad.wmflabs [10.68.18.245] failed - POSSIBLE BREAK-IN ATTEMPT!

Change 319090 had a related patch set uploaded (by Andrew Bogott):
nova_fixed_multi: Change a bunch of debug messages to warnings

https://gerrit.wikimedia.org/r/319090

Change 319090 merged by Andrew Bogott:
nova_fixed_multi: Change a bunch of debug messages to warnings

https://gerrit.wikimedia.org/r/319090

Apologies if I'm repeating previous comments...

This issue is produced in two stages:

  1. designate records are leaked (at which point no one notices or cares)
  2. new designate records are created which re-use IPs that were used by the leaked records in step 1

As far as I can tell, most investigation of this issue has focused on the collisions from step 2, since that's what actually causes us trouble. Part of the difficulty with debugging is that nothing is going wrong at step 2 -- the bug happened back at step 1, possibly months or even years ago.

The fact that we periodically clean up collisions and then find new collisions doesn't necessarily mean that step 1 is still happening. It probably still is, but it's hard to be sure.

So, I've written an extremely ugly script to find all records on IPs that are not currently assigned to actual instances, and clean them up. I'll then go on to monitor the list of stage-1 leaks for the next while and see if I can find any patterns there.
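The core of that cleanup is a set difference between designate's PTR records and the fixed IPs nova currently has assigned. A hypothetical sketch of the comparison (the two inputs would come from the designate and nova databases; the records below are just examples):

```python
def leaked_records(ptr_records, active_ips):
    """Return PTR records whose address is not held by any live instance.

    ptr_records: iterable of (ip, fqdn) pairs from the designate records table.
    active_ips:  collection of fixed IPs nova currently has assigned.
    """
    active = set(active_ips)
    return [(ip, fqdn) for ip, fqdn in ptr_records if ip not in active]


# Toy example: the first record's address no longer maps to any instance.
records = [('10.68.16.66', 'ci-jessie-wikimedia-53082.contintcloud.eqiad.wmflabs.'),
           ('10.68.16.67', 'sm-puppetmaster-trusty2.servermon.eqiad.wmflabs.')]
print(leaked_records(records, {'10.68.16.67'}))
```

Note that this only finds stage-1 leaks on unassigned IPs; a stale extra PTR on a live IP (a stage-2 collision) additionally requires matching the FQDN against the instance's current name, which is presumably part of why the real script is uglier.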

Change 319759 had a related patch set uploaded (by Andrew Bogott):
Designate nova_fixed_multi plugin: avoid race conditions

https://gerrit.wikimedia.org/r/319759

Change 319759 merged by Andrew Bogott:
Designate nova_fixed_multi plugin: avoid race conditions

https://gerrit.wikimedia.org/r/319759

We've gone more than a week without leaks. It seems unrealistic to close this, but there's nothing left to do at the moment.

elukey@deployment-aqs03:~$ dig -x 10.68.17.125 +short
ci-jessie-wikimedia-505374.contintcloud.eqiad.wmflabs.
deployment-aqs03.deployment-prep.eqiad.wmflabs.

is it normal? :D

Holdover or new? I'm not sure.

From my digging:

Date      ID
May 9th   652785
June 8th  692016

So I guess 505374 is a few months old.

I (finally) wrote a script to hunt and kill leaked dns records:

https://gerrit.wikimedia.org/r/#/c/358124/

It killed a few duplicates and a whole lot of leaks that were not yet duplicated and so, presumably, went unnoticed. The next step is probably to get the fullstack test to monitor post-deletion DNS cleanup and alert on leaks. Maybe it checks already; I'm not sure.
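One shape such a fullstack check could take: after deleting the test instance, poll the reverse zone and alert if the PTR outlives a grace period. A sketch with the lookup function injected, since the actual fullstack test's DNS query mechanism isn't described here:

```python
import time


def ptr_gone(lookup, ip, timeout=60, interval=5):
    """Poll lookup(ip) until it returns no PTR names or the timeout expires.

    lookup is any callable returning the list of PTR names for an address.
    Returns True if the record disappeared in time, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if not lookup(ip):
            return True
        if time.monotonic() >= deadline:
            return False  # leak: the PTR survived the grace period
        time.sleep(interval)
```

A False result here is exactly the stage-1 leak described above, caught minutes after it happens instead of months later when the IP is reused.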

Andrew lowered the priority of this task from High to Medium. Jun 13 2017, 8:13 PM

Right now I'm just checking periodically to see if there are new leaks.

@Andrew are there still a lot of leaks happening? The current Nodepool ID is 806287.

This no longer happens as a matter of course, but any time designate-sink locks up we leak things for the duration. Sink is surprisingly fragile, and it has broken a few times with recent firewall changes.

I'll run a cleanup job to get the most recent batch of leaks, and then things should be clean again until we break things. I don't think there's a great near-term solution for the issue; in theory the communication model between nova and sink is re-engineered in future versions to avoid stuff like this.

This is as fixed as it's going to be. Any time there's a designate outage I need to run the dnsleaks script to clean up.