
Some labs instances IP have multiple PTR entries in DNS
Closed, ResolvedPublic

Description

krenair@bastion-01:~$ host 10.68.18.65
65.18.68.10.in-addr.arpa domain name pointer testlabs-createtest2.testlabs.eqiad.wmflabs.
65.18.68.10.in-addr.arpa domain name pointer bastion-02.bastion.eqiad.wmflabs.
krenair@bastion-01:~$ ping bastion-02
PING bastion-02.bastion.eqiad.wmflabs (10.68.18.65) 56(84) bytes of data.
64 bytes from bastion-02.bastion.eqiad.wmflabs (10.68.18.65): icmp_seq=1 ttl=64 time=0.480 ms
64 bytes from bastion-02.bastion.eqiad.wmflabs (10.68.18.65): icmp_seq=2 ttl=64 time=0.492 ms
krenair@tools-bastion-01:~$ ping bastion-02
PING bastion-02.eqiad.wmflabs (10.68.18.65) 56(84) bytes of data.
64 bytes from testlabs-createtest2.testlabs.eqiad.wmflabs (10.68.18.65): icmp_seq=1 ttl=64 time=0.403 ms
64 bytes from testlabs-createtest2.testlabs.eqiad.wmflabs (10.68.18.65): icmp_seq=2 ttl=64 time=0.582 ms

Also this one:

krenair@bastion-01:~$ host 10.68.17.12
12.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-22792.contintcloud.eqiad.wmflabs.
12.17.68.10.in-addr.arpa domain name pointer analytics203.analytics.eqiad.wmflabs.
krenair@bastion-01:~$ ping analytics203
PING analytics203.eqiad.wmflabs (10.68.17.12) 56(84) bytes of data.
64 bytes from ci-jessie-wikimedia-22792.contintcloud.eqiad.wmflabs (10.68.17.12): icmp_seq=1 ttl=64 time=1.60 ms
64 bytes from ci-jessie-wikimedia-22792.contintcloud.eqiad.wmflabs (10.68.17.12): icmp_seq=2 ttl=64 time=0.487 ms
krenair@tools-bastion-01:~$ ping analytics203
PING analytics203.eqiad.wmflabs (10.68.17.12) 56(84) bytes of data.
64 bytes from analytics203.analytics.eqiad.wmflabs (10.68.17.12): icmp_seq=1 ttl=64 time=0.420 ms
64 bytes from analytics203.analytics.eqiad.wmflabs (10.68.17.12): icmp_seq=2 ttl=64 time=0.405 ms

Event Timeline

Krenair raised the priority of this task from to Needs Triage.
Krenair updated the task description. (Show Details)
Krenair subscribed.
yuvipanda renamed this task from "10.68.18.65 resolves to two different instances" to "RDNS for 10.68.18.65 resolves to two different instances". Oct 10 2015, 9:25 PM
yuvipanda set Security to None.
chasemp subscribed.
Krenair renamed this task from "RDNS for 10.68.18.65 resolves to two different instances" to "RDNS for some labs instance IPs resolve to multiple different instances". Feb 9 2016, 3:48 PM
Krenair updated the task description. (Show Details)
Krenair added a subscriber: Andrew.
hashar renamed this task from "RDNS for some labs instance IPs resolve to multiple different instances" to "Some labs instances IP have multiple PTR entries in DNS". Feb 11 2016, 8:17 PM
krenair@bastion-01:~$ host 10.68.16.66
;; Truncated, retrying in TCP mode.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-53082.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-75884.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-74560.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-69512.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-78428.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-67335.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-75977.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-80668.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-69168.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-78397.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-78788.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-67383.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-62884.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-59032.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer sm-puppetmaster-trusty2.servermon.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-74567.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-64645.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-76015.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-76473.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-78683.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-77527.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-67103.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-65765.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-68088.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-76857.contintcloud.eqiad.wmflabs.
66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-64125.contintcloud.eqiad.wmflabs.

So far there has been one instance in testlabs and quite a lot in contintcloud.
@hashar: please point me to the code that sets these contintcloud instances up and deletes them.

The system is Nodepool, which requests creation and deletion of instances via the OpenStack API endpoint. It creates hundreds of instances per day, which would explain why it shows up more often.

Date        # of creation requests
2016-04-07  701
2016-04-08  649
2016-04-09  196
2016-04-10  185
2016-04-11  733
2016-04-12  1095
2016-04-13  919
2016-04-14  932
2016-04-15  633
2016-04-16  102
2016-04-17  137
2016-04-18  788
2016-04-19  1112
2016-04-20  736
2016-04-21  797

I have no idea how DNS is provisioned

https://wikitech.wikimedia.org/wiki/Nodepool

<andrewbogott> Krenair: It's a hack, but I tend to put those things in sink plugins, since sink is already in charge of cleaning up dns entries.
<andrewbogott> (The designate people were just talking about how sink sometimes drops things so they're moving that to a more tightly integrated system… but we won't be using that anytime soon)
<Krenair> is that why we have an issue with broken PTR records?
<andrewbogott> Krenair: It could be from leaks, yes.

From T126518

It is back around :(

[21:50:04]  <mutante>	dzahn@bastion-restricted-01:~$ host 10.68.16.66
[21:50:04]  <mutante>	;; Truncated, retrying in TCP mode.
[21:50:04]  <mutante>	66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-53082.contintcloud.eqiad.wmflabs.
[21:50:06]  <mutante>	66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-77527.contintcloud.eqiad.wmflabs.
[21:50:09]  <mutante>	66.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-75884.contintcloud.eqiad.wmflabs.
[21:50:09]  <mutante>	< 27 lines >

I have no idea whether forward entries leak as well. Maybe we could dump the contintcloud.eqiad.wmflabs zone and see how many entries there are? It should be fewer than 20, the quota for that tenant.

I can't AXFR though, so there is little I can investigate without brute-forcing DNS requests:

$ dig AXFR contintcloud.eqiad.wmflabs. @labs-ns0.wikimedia.org.
Transfer failed.
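Short of an AXFR, the only way in is to walk the reverse zone address by address. A rough sketch of such a brute-force scan (assuming the `host` utility is on the path, and using 10.68.16.0/24 purely as an example range) that flags addresses with more than one PTR:

```python
import ipaddress
import subprocess


def duplicate_ptrs(ptr_map):
    """Keep only the addresses that resolved to more than one PTR name."""
    return {ip: names for ip, names in ptr_map.items() if len(names) > 1}


def lookup_ptrs(ip):
    """Collect every PTR name that `host` prints for one address."""
    out = subprocess.run(['host', ip], capture_output=True, text=True).stdout
    return [line.split()[-1] for line in out.splitlines()
            if 'domain name pointer' in line]


if __name__ == '__main__':
    # Example range only; the labs fixed-IP range is larger than one /24.
    ptr_map = {str(ip): lookup_ptrs(str(ip))
               for ip in ipaddress.ip_network('10.68.16.0/24').hosts()}
    for ip, names in sorted(duplicate_ptrs(ptr_map).items()):
        print(ip, '->', ', '.join(names))
```

At one query per address this stays well under any reasonable rate limit, but it only covers the ranges you think to scan, unlike a zone dump.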

There are relatively many LDAP connection failures in the sink log. That fits with the fact that our designate setup is subject to periodic OOMs. I need to have another look at the sink code and verify that other plugins still run if one of them errors out.

It would also be useful to know if we are leaking A records that correspond to the leaked PTR records.

I wrote a stupid resolver for the A records:

#!/usr/bin/env python2

import argparse
import dns.resolver
from time import sleep

p = argparse.ArgumentParser()
p.add_argument('--delay', type=float, default=1.0,
               help='Delay between DNS queries')
p.add_argument('range', help='Range of ID. Ex: 85-90')
opts = p.parse_args()

try:
    (start, stop) = [int(border) for border in opts.range.split('-')]
except ValueError:
    p.error("Invalid range should be: <start>-<end>")

fqdn_template = 'ci-jessie-wikimedia-%s.contintcloud.eqiad.wmflabs.'

print "Start: %s" % (fqdn_template % start)
print "Stop: %s" % (fqdn_template % stop)

print "Querying DNS for A records ..."
for index, host_id in enumerate(xrange(start, stop), start=1):
    fqdn = fqdn_template % host_id
    try:
        answers = dns.resolver.query(fqdn, 'A')
        print "Found %s" % fqdn
    except dns.resolver.NXDOMAIN:
        continue
    finally:
        sleep(opts.delay)
print "Did %s queries. Done." % index

Running it right now from deployment-tin and range 70000-85000.

Out of 15000 A entries, only one leaked:

$ python blam.py --delay 0.1 70000-85000
Start: ci-jessie-wikimedia-70000.contintcloud.eqiad.wmflabs.
Stop:  ci-jessie-wikimedia-85000.contintcloud.eqiad.wmflabs.
Querying DNS for A records ...
Found ci-jessie-wikimedia-70569.contintcloud.eqiad.wmflabs.
Did 15000 queries. Done.

While looking at T99072: Fix 'unknown's in shinken I found another:

krenair@tools-bastion-03:~$ host 10.68.16.97
97.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-59441.contintcloud.eqiad.wmflabs.
97.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-47624.contintcloud.eqiad.wmflabs.
97.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-49202.contintcloud.eqiad.wmflabs.
97.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-52995.contintcloud.eqiad.wmflabs.
97.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-59269.contintcloud.eqiad.wmflabs.
97.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-51209.contintcloud.eqiad.wmflabs.
97.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-64225.contintcloud.eqiad.wmflabs.
97.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-51020.contintcloud.eqiad.wmflabs.
97.16.68.10.in-addr.arpa domain name pointer tools-worker-1011.tools.eqiad.wmflabs.
97.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-47309.contintcloud.eqiad.wmflabs.
97.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-60973.contintcloud.eqiad.wmflabs.

This is also a case of T134025

krenair@tools-bastion-03:~$ host 10.68.17.58
;; Truncated, retrying in TCP mode.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-104356.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-117884.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-87761.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-115559.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-112332.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-50058.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-118839.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-117685.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-111507.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-72171.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-96703.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-117105.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-94967.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-117256.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-103395.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-108704.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-117855.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-112002.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-106514.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-120291.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-102473.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-113082.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-100550.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-119443.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer deployment-salt02.deployment-prep.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-112763.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-115063.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-114045.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-87738.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-118180.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-110276.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-87655.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-88878.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-111738.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-120337.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-104783.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-94612.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-46539.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-85918.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-118867.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-108158.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-119834.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-107152.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-98053.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-83531.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-102838.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-109855.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-114052.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-113735.contintcloud.eqiad.wmflabs.
58.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-103081.contintcloud.eqiad.wmflabs.
krenair@mira:~$ host 10.68.17.146
146.17.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-140638.contintcloud.eqiad.wmflabs.
146.17.68.10.in-addr.arpa domain name pointer testlabs-horizontest-84926088-3f8c-4db2-9f92-2592cf9a4fcf.extdist.eqiad.wmflabs.
146.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-139291.contintcloud.eqiad.wmflabs.
146.17.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-154196.contintcloud.eqiad.wmflabs.
<mutante> 121.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-47938.contintcloud.eqiad.wmflabs.
<mutante> 121.16.68.10.in-addr.arpa domain name pointer petscan1.petscan.eqiad.wmflabs.

Until the DNS leak is identified, entries will keep leaking. It is quite easy to retrieve all of them from the Designate database, so there is no need to reply with any PTR entries you might find.

Ok, except @Andrew just told me to tack these on... :)

otto@deployment-kafka03:~$ host 10.68.16.138
138.16.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-87841.contintcloud.eqiad.wmflabs.
138.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-90123.contintcloud.eqiad.wmflabs.
138.16.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-89281.contintcloud.eqiad.wmflabs.
138.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-88514.contintcloud.eqiad.wmflabs.
138.16.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-87903.contintcloud.eqiad.wmflabs.
138.16.68.10.in-addr.arpa domain name pointer deployment-kafka03.deployment-prep.eqiad.wmflabs.
138.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-89145.contintcloud.eqiad.wmflabs.
138.16.68.10.in-addr.arpa domain name pointer ci-trusty-wikimedia-91084.contintcloud.eqiad.wmflabs.
138.16.68.10.in-addr.arpa domain name pointer ci-jessie-wikimedia-90612.contintcloud.eqiad.wmflabs.

@Ottomata Fair call sorry :-)

Nodepool spawns instances with an incremental ID to give some indication about the progress:

Time (UTC)           ID
2016-03-14           50000
2016-03-22           60000
2016-04-04           70000
2016-04-20           80000
2016-05-04           90000
2016-05-13           100000
2016-05-31           125000
2016-06-17           150000
2016-06-21 7am       152745
2016-06-26 midnight  159569
2016-07-11 midnight  175997

From the IDs pasted previously, maybe that got partially fixed or mitigated? The designate database definitely has a ton of entries for the contintcloud project. We could get an exhaustive list with an SQL query there.

I've been thinking maybe we should try nodepool in labtest (running at a much smaller scale) so we can take a closer look at this...

> I've been thinking maybe we should try nodepool in labtest (running at a much smaller scale) so we can take a closer look at this...

One could write a script that spawns and deletes instances in a loop against labtest. https://pypi.python.org/pypi/shade should make that straightforward :]
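A sketch of such a loop using shade; the cloud name, image, and flavor below are placeholders, not real labtest values:

```python
import time


def stress_names(prefix, count):
    """Names for the instances the loop will create and then delete."""
    return ['%s-%04d' % (prefix, i) for i in range(count)]


if __name__ == '__main__':
    import shade  # pip install shade; credentials come from clouds.yaml

    cloud = shade.openstack_cloud(cloud='labtest')  # placeholder cloud name
    for name in stress_names('dnsleak-test', 50):
        server = cloud.create_server(name, image='debian-jessie',
                                     flavor='m1.small', wait=True)
        cloud.delete_server(server.id, wait=True)
        time.sleep(1)  # give designate-sink time to process both notifications
```

After a run like this, every PTR left in the reverse zone for those names would be a confirmed stage-1 leak.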

> there are relatively many ldap connection failures in the sink log. That fits with the fact that our designate setup is subject to periodic OOMs. I need to have another look at the sink code and verify that other plugins still run if one of them errors out.

I'm not sure about what happens when nova_ldap errors out, but nova_fixed_multi should be catching everything and dumping it to the log.

Here are some relatively recent issues on labtestservices:

root@labtestservices2001:/var/log/designate# zcat designate-sink.log.*.gz | grep -v "File \"" | grep -v "     " | grep -v Traceback | grep -v "\[req-"| grep -o "oslo_messaging.notify.dispatcher [^$].*" | sort -d | uniq
oslo_messaging.notify.dispatcher LDAPError: LDAP connection invalid
oslo_messaging.notify.dispatcher NotFound: Instance 18b898ae-818a-4b5e-81fe-79b76ee71d38 could not be found. (HTTP 404)
oslo_messaging.notify.dispatcher NotFound: Instance 32e47f2b-1b0e-4bd2-8721-a5c4fb768e2d could not be found. (HTTP 404)
oslo_messaging.notify.dispatcher NotFound: Instance 353d926f-a863-4d36-94be-4317a9a2e68a could not be found. (HTTP 404)
root@labtestservices2001:/var/log/designate#

I've written a script to hopefully purge the vast majority of problematic entries in T120797

We looked into it last night but weren't able to find the cause. We do know that the last instance to leave a reverse DNS entry behind was deleted around 2016-09-08 22:46 (@madhuvishy ran select max(nova.instances.deleted_at) from records join recordsets on records.recordset_id = recordsets.id left join nova.instances on replace(nova.instances.uuid, '-', '') = records.managed_resource_id where records.domain_id = '8d114f3c815b466cbdd49b91f704ea60' and recordsets.name like '%.10.in-addr.arpa.' and recordsets.type = 'PTR' and nova.instances.deleted_at is not null; for me, which eventually returned 2016-09-08 22:46:45 after running for 10-15 minutes). You'd expect the DNS entry to be deleted within a minute of that.
Unfortunately @madhuvishy found nothing useful in the logs around that time, and I'm not sure whether it's now a historical thing that needs a one-off cleanup or an ongoing issue.

Here's the extent of the issue in labtest:

mysql:root@localhost [designate]> select recordsets.name, records.data, nova.instances.deleted_at from records join recordsets on records.recordset_id = recordsets.id left join nova.instances on replace(nova.instances.uuid, '-', '') = records.managed_resource_id where records.domain_id = '9b60f3abd64b4e309d6f7535811b0fa8' and recordsets.name like '%.10.in-addr.arpa.' and recordsets.type = 'PTR' and nova.instances.deleted_at is not null;
+----------------------------+-----------------------------------------------------+---------------------+
| name                       | data                                                | deleted_at          |
+----------------------------+-----------------------------------------------------+---------------------+
| 67.16.196.10.in-addr.arpa. | puppettest101.labtestproject.codfw.labtest.         | 2016-05-16 15:24:28 |
| 19.16.196.10.in-addr.arpa. | admin-test-instance.admin.codfw.labtest.            | 2016-08-05 16:03:39 |
| 44.16.196.10.in-addr.arpa. | digtest103.labtestproject.codfw.labtest.            | 2016-05-12 19:27:21 |
| 17.16.196.10.in-addr.arpa. | control.labtestproject.codfw.labtest.               | 2016-05-11 19:20:19 |
| 26.16.196.10.in-addr.arpa. | nettest101.labtestproject.codfw.labtest.            | 2016-05-11 14:45:56 |
| 53.16.196.10.in-addr.arpa. | starttest102.labtestproject.codfw.labtest.          | 2016-05-13 20:45:16 |
| 7.16.196.10.in-addr.arpa.  | inst109.labtestproject.codfw.labtest.               | 2016-04-21 15:24:21 |
| 22.16.196.10.in-addr.arpa. | overquota-instances-2.labtestproject.codfw.labtest. | 2016-08-23 20:02:30 |
| 11.16.196.10.in-addr.arpa. | dns-test-110.labtestproject.codfw.labtest.          | 2016-07-22 16:31:47 |
| 13.16.196.10.in-addr.arpa. | inst115.labtestproject.codfw.labtest.               | 2016-04-21 20:00:08 |
| 59.16.196.10.in-addr.arpa. | test106.labtestproject.codfw.labtest.               | 2016-05-14 12:48:31 |
| 18.16.196.10.in-addr.arpa. | testing.mediawiki-core-team.codfw.labtest.          | 2016-07-21 14:08:29 |
| 52.16.196.10.in-addr.arpa. | sshtest.labtestproject.codfw.labtest.               | 2016-05-13 19:57:26 |
| 25.16.196.10.in-addr.arpa. | overquota-instances-4.labtestproject.codfw.labtest. | 2016-08-23 20:02:29 |
| 26.16.196.10.in-addr.arpa. | quota-test.admin.codfw.labtest.                     | 2016-08-07 15:53:56 |
| 73.16.196.10.in-addr.arpa. | hostconsoletest1.labtestproject.codfw.labtest.      | 2016-07-21 14:08:07 |
| 60.16.196.10.in-addr.arpa. | test107.labtestproject.codfw.labtest.               | 2016-05-14 12:48:31 |
| 57.16.196.10.in-addr.arpa. | test104.labtestproject.codfw.labtest.               | 2016-05-14 12:48:31 |
| 24.16.196.10.in-addr.arpa. | overquota-instances-3.labtestproject.codfw.labtest. | 2016-08-23 20:02:30 |
| 21.16.196.10.in-addr.arpa. | quotatest.admin.codfw.labtest.                      | 2016-08-05 16:05:22 |
| 46.16.196.10.in-addr.arpa. | ldapfix.labtestproject.codfw.labtest.               | 2016-05-12 22:08:52 |
| 29.16.196.10.in-addr.arpa. | horizon-launch-test.admin.codfw.labtest.            | 2016-09-12 15:06:10 |
| 58.16.196.10.in-addr.arpa. | test105.labtestproject.codfw.labtest.               | 2016-05-14 12:48:31 |
| 28.16.196.10.in-addr.arpa. | quota-test-3.admin.codfw.labtest.                   | 2016-08-07 15:54:04 |
| 41.16.196.10.in-addr.arpa. | nettest115.labtestproject.codfw.labtest.            | 2016-05-12 22:08:52 |
| 2.16.196.10.in-addr.arpa.  | test101.labtestproject.codfw.labtest.               | 2016-05-13 21:21:11 |
| 16.16.196.10.in-addr.arpa. | test3.labtestproject.codfw.labtest.                 | 2016-05-11 19:20:26 |
| 42.16.196.10.in-addr.arpa. | digtest101.labtestproject.codfw.labtest.            | 2016-05-12 19:27:21 |
+----------------------------+-----------------------------------------------------+---------------------+
28 rows in set (0.01 sec)

The script was run against real-labs in T120797 and most existing problem cases should be gone now

Alex can you do the magic SELECT again and see whether DNS entries are still being leaked?

> Alex can you do the magic SELECT again and see whether DNS entries are still being leaked?

I don't have access to run the query posted above in real labs (only labtest, where we don't create/delete instances often enough to catch this); only ops can do that right now.

This continues to cause issues. Clush doesn't work from tools-puppetmaster-02, at least partially because:

Oct 31 20:16:55 tools-elastic-01 sshd[32448]: reverse mapping checking getaddrinfo for ci-trusty-wikimedia-163713.contintcloud.eqiad.wmflabs [10.68.18.245] failed - POSSIBLE BREAK-IN ATTEMPT!

Change 319090 had a related patch set uploaded (by Andrew Bogott):
nova_fixed_multi: Change a bunch of debug messages to warnings

https://gerrit.wikimedia.org/r/319090

Change 319090 merged by Andrew Bogott:
nova_fixed_multi: Change a bunch of debug messages to warnings

https://gerrit.wikimedia.org/r/319090

Apologies if I'm repeating previous comments...

This issue is produced in two stages:

  1. designate records are leaked (at which point no one notices or cares)
  2. new designate records are created which re-use IPs that were used by the leaked records in step 1

As far as I can tell, most investigation of this issue has focused on the collisions from step 2, since that's what actually causes us trouble. Part of the difficulty with debugging is that nothing is going wrong at step 2 -- the bug happened back at step 1, possibly months or even years ago.

The fact that we periodically clean up collisions and then find new collisions doesn't necessarily mean that step 1 is still happening. It probably still is, but it's hard to be sure.

So, I've written an extremely ugly script to find all records on IPs that are not currently assigned to actual instances, and clean them up. I'll then go on to monitor the list of stage-1 leaks for the next while and see if I can find any patterns there.
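The core of that cleanup is a set difference between designate's PTR records and the fixed IPs nova currently has assigned. A hypothetical sketch of the comparison (the two inputs would come from the designate and nova databases; the records below are just examples):

```python
def leaked_records(ptr_records, active_ips):
    """Return PTR records whose address is not held by any live instance.

    ptr_records: iterable of (ip, fqdn) pairs from the designate records table.
    active_ips:  collection of fixed IPs nova currently has assigned.
    """
    active = set(active_ips)
    return [(ip, fqdn) for ip, fqdn in ptr_records if ip not in active]


# Toy example: the first record's address no longer maps to any instance.
records = [('10.68.16.66', 'ci-jessie-wikimedia-53082.contintcloud.eqiad.wmflabs.'),
           ('10.68.16.67', 'sm-puppetmaster-trusty2.servermon.eqiad.wmflabs.')]
print(leaked_records(records, {'10.68.16.67'}))
```

Note that this only finds stage-1 leaks on unassigned IPs; a stale extra PTR on a live IP (a stage-2 collision) additionally requires matching the FQDN against the instance's current name, which is presumably part of why the real script is uglier.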

Change 319759 had a related patch set uploaded (by Andrew Bogott):
Designate nova_fixed_multi plugin: avoid race conditions

https://gerrit.wikimedia.org/r/319759

Change 319759 merged by Andrew Bogott:
Designate nova_fixed_multi plugin: avoid race conditions

https://gerrit.wikimedia.org/r/319759

We've gone more than a week without leaks. It seems unrealistic to close this, but there's nothing left to do at the moment.

elukey@deployment-aqs03:~$ dig -x 10.68.17.125 +short
ci-jessie-wikimedia-505374.contintcloud.eqiad.wmflabs.
deployment-aqs03.deployment-prep.eqiad.wmflabs.

is it normal? :D

Holdover or new? I'm not sure.

From my digging:

Date      ID
May 9th   652785
June 8th  692016

So I guess 505374 is a few months old.

I (finally) wrote a script to hunt and kill leaked dns records:

https://gerrit.wikimedia.org/r/#/c/358124/

It killed a few duplicates and a whole lot of leaks that were not yet duplicated and so, presumably, went unnoticed. The next step is probably to get the fullstack test to monitor post-deletion DNS cleanup and alert on leaks. Maybe it checks already; I'm not sure.
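One shape such a fullstack check could take: after deleting the test instance, poll the reverse zone and alert if the PTR outlives a grace period. A sketch with the lookup function injected, since the actual fullstack test's DNS query mechanism isn't described here:

```python
import time


def ptr_gone(lookup, ip, timeout=60, interval=5):
    """Poll lookup(ip) until it returns no PTR names or the timeout expires.

    lookup is any callable returning the list of PTR names for an address.
    Returns True if the record disappeared in time, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if not lookup(ip):
            return True
        if time.monotonic() >= deadline:
            return False  # leak: the PTR survived the grace period
        time.sleep(interval)
```

A False result here is exactly the stage-1 leak described above, caught minutes after it happens instead of months later when the IP is reused.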

Andrew lowered the priority of this task from High to Medium. Jun 13 2017, 8:13 PM

Right now I'm just checking periodically to see if there are new leaks.

@Andrew are there still a lot of leaks happening? The current Nodepool ID is 806287.

This no longer happens as a matter of course, but any time designate-sink locks up we leak things for the duration. Sink is surprisingly fragile, and it has broken a few times with recent firewall changes.

I'll run a cleanup job to get the most recent batch of leaks, and then things should be clean again until we break things. I don't think there's a great near-term solution for the issue; in theory the communication model between nova and sink is re-engineered in future versions to avoid stuff like this.

This is as fixed as it's going to be. Any time there's a designate outage I need to run the dnsleaks script to clean up.