cloudcontrol1003/Check for VMs leaked by the nova-fullstack test is CRITICAL
Closed, Resolved · Public

Assigned To

Authored By

	dcaro
	Jan 20 2021, 1:58 PM

Description

It seems that since ~10:00 UTC today the VMs have stopped being reachable through ssh. Looking into it.

Details

	Subject	Repo	Branch	Lines +/-
	Revert "Discard the non-whitelisted 172.16.0.0/12 traffic"	operations/homer/public	master	+2 -1
	cr/firewall.conf: cloud-in4: introduce ACL for novafullstack	operations/homer/public	master	+24 -1

Related Objects

Status	Assigned	Task
Open	None	T272395 Cloud: reduce NAT exceptions from cloud to production
		Restricted Task
Resolved	dcaro	T272486 cloudcontrol1003/Check for VMs leaked by the nova-fullstack test is CRITICAL

Event Timeline

dcaro created this task. Jan 20 2021, 1:58 PM

Restricted Application edited projects, added cloud-services-team (Kanban); removed cloud-services-team. Jan 20 2021, 1:58 PM

Restricted Application added a subscriber: Aklapper.

dcaro claimed this task. Jan 20 2021, 1:59 PM

Jan 20 10:53:56 cloudcontrol1003 nova-fullstack[42910]: 2021-01-20 10:53:56,233 INFO cloudvps.novafullstack.cloudcontrol1003.instances.count => 1 1611140036
Jan 20 10:53:56 cloudcontrol1003 nova-fullstack[42910]: 2021-01-20 10:53:56,234 INFO cloudvps.novafullstack.cloudcontrol1003.instances.max => 11 1611140036
Jan 20 10:53:56 cloudcontrol1003 nova-fullstack[42910]: 2021-01-20 10:53:56,417 INFO Creating fullstackd-20210120105355
Jan 20 10:54:28 cloudcontrol1003 nova-fullstack[42910]: 2021-01-20 10:54:28,319 INFO cloudvps.novafullstack.cloudcontrol1003.verify.creation => 31.9 1611140068
Jan 20 10:54:28 cloudcontrol1003 nova-fullstack[42910]: 2021-01-20 10:54:28,319 INFO Resolving fullstackd-20210120105355.admin-monitoring.eqiad1.wikimedia.cloud from ['208.80.154.143', '208.80.154.24']
Jan 20 10:54:28 cloudcontrol1003 nova-fullstack[42910]: 2021-01-20 10:54:28,348 INFO cloudvps.novafullstack.cloudcontrol1003.verify.dns => 0.03 1611140068
Jan 20 10:54:28 cloudcontrol1003 nova-fullstack[42910]: 2021-01-20 10:54:28,348 INFO SSH to 172.16.2.115
Jan 20 11:09:30 cloudcontrol1003 nova-fullstack[42910]: 2021-01-20 11:09:30,212 ERROR fullstackd-20210120105355 failed, leaking
Jan 20 11:09:30 cloudcontrol1003 nova-fullstack[42910]: Traceback (most recent call last):
Jan 20 11:09:30 cloudcontrol1003 nova-fullstack[42910]: File "/usr/local/sbin/nova-fullstack", line 629, in main
Jan 20 11:09:30 cloudcontrol1003 nova-fullstack[42910]: args.ssh_timeout)
Jan 20 11:09:30 cloudcontrol1003 nova-fullstack[42910]: File "/usr/local/sbin/nova-fullstack", line 187, in verify_ssh
Jan 20 11:09:30 cloudcontrol1003 nova-fullstack[42910]: raise Exception("SSH for {} timed out".format(address))
Jan 20 11:09:30 cloudcontrol1003 nova-fullstack[42910]: Exception: SSH for 172.16.2.115 timed out
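
For anyone wanting to reproduce the symptom by hand: the log above shows roughly 15 minutes between "SSH to 172.16.2.115" (10:54:28) and the timeout error (11:09:30). Below is a minimal sketch of a poll-until-deadline TCP check on port 22. It is not the actual `verify_ssh` code from nova-fullstack; the function name `wait_for_tcp_port`, its parameters, and the 900-second default are assumptions chosen to match the log timing, and it only tests that the port accepts a TCP connection, not that an SSH login works.

```python
import socket
import time


def wait_for_tcp_port(address, port=22, timeout=900.0, interval=5.0,
                      connect_timeout=3.0):
    """Poll (address, port) until a TCP connect succeeds or `timeout` elapses.

    Hypothetical helper, not taken from nova-fullstack.
    Returns the elapsed seconds on success, raises TimeoutError otherwise.
    """
    start = time.monotonic()
    deadline = start + timeout
    while True:
        try:
            # A successful three-way handshake is enough for this check.
            with socket.create_connection((address, port),
                                          timeout=connect_timeout):
                return time.monotonic() - start
        except OSError:
            # Covers refused, unreachable, and silently dropped packets
            # (the last one shows up as a connect timeout).
            pass
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(
                "TCP {}:{} not reachable after {:.0f}s".format(
                    address, port, timeout))
        time.sleep(min(interval, remaining))
```

Run from a cloudcontrol node against a VM address, something like this would be expected to raise while the firewall rule is in place, and to return quickly from a host that is not affected (such as the bastion).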

The issue seems to occur only when trying to reach the VMs from the cloudcontrol
nodes (at least 1003, still looking); from my laptop I can ssh:

03:05 PM ~/Work/wikimedia/wmcs-ansible  (master|✚ 4…6) 
dcaro@vulcanus$ ssh fullstackd-20210120133811.admin-monitoring.eqiad1.wikimedia.cloud
The authenticity of host 'fullstackd-20210120133811.admin-monitoring.eqiad1.wikimedia.cloud (<no hostip for proxy command>)' can't be established.
ECDSA key fingerprint is SHA256:8i8SidqLUz5ubWBBUp4TxurYRNj65tfFz1bGwSsXJ8Y.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'fullstackd-20210120133811.admin-monitoring.eqiad1.wikimedia.cloud' (ECDSA) to the list of known hosts.
Creating directory '/home/dcaro'.
Linux fullstackd-20210120133811 4.19.0-11-amd64 #1 SMP Debian 4.19.146-1 (2020-09-17) x86_64
Debian GNU/Linux 10 (buster)
The last Puppet run was at Wed Jan 20 13:56:32 UTC 2021 (9 minutes ago). 
Last puppet commit: (eec021d363) Jcrespo - Ad

This is what a traceroute (tcp, port 22) looks like from cloudcontrol1003:

traceroute to fullstackd-20210120133811.admin-monitoring.eqiad1.wikimedia.cloud (172.16.4.188), 30 hops max, 60 byte packets
 1  ae1-1001.cr1-eqiad.wikimedia.org (208.80.154.2)  0.269 ms  0.240 ms  0.179 ms
 2  irb-1102.cloudsw1-c8-eqiad.wikimedia.org (208.80.154.211)  11.634 ms  11.635 ms  11.622 ms
 3  185.15.56.244 (185.15.56.244)  0.987 ms  0.525 ms  0.512 ms
 --- all *** from here ---

This is what it looks like from a bastion (works):

dcaro@bastion-restricted-eqiad1-01:~$ sudo traceroute fullstackd-20210120133811.admin-monitoring.eqiad1.wikimedia.cloud -p 22 -T
traceroute to fullstackd-20210120133811.admin-monitoring.eqiad1.wikimedia.cloud (172.16.4.188), 30 hops max, 60 byte packets
 1  fullstackd-20210120133811.admin-monitoring.eqiad1.wikimedia.cloud (172.16.4.188)  0.546 ms  0.508 ms  0.478 ms

Change 657345 had a related patch set uploaded (by David Caro; owner: David Caro):
[operations/homer/public@master] Revert "Discard the non-whitelisted 172.16.0.0/12 traffic"

https://gerrit.wikimedia.org/r/657345

gerritbot added a project: Patch-For-Review. Jan 20 2021, 2:31 PM

Change 657345 merged by jenkins-bot:
[operations/homer/public@master] Revert "Discard the non-whitelisted 172.16.0.0/12 traffic"

https://gerrit.wikimedia.org/r/657345

Dcaroest mentioned this in rOHPUb1c0f03ccc5c: Revert "Discard the non-whitelisted 172.16.0.0/12 traffic". Jan 20 2021, 2:39 PM

It ended up being a firewall change. I reverted it for now, though the long-term
solution will have to come later (@aborrero will probably fix the mess xd).
The firewall rules have been applied (gerrit merge + cumin1001$ homer cr*-eqiad* commit) and
ssh is working again, so the tests will pass on the next run.
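
As a quick sanity check on why the test VMs were affected: both VM addresses seen in this task (172.16.2.115 and 172.16.4.188) fall inside the 172.16.0.0/12 range named in the title of the reverted change, while the router hops from the traceroute do not. The sketch below only checks range membership with Python's `ipaddress` module; the actual filter terms are not quoted in this task, so the name `in_range` and the idea that the rule matched purely on this prefix are assumptions.

```python
import ipaddress

# Range taken from the title of the reverted change.
DISCARDED_RANGE = ipaddress.ip_network("172.16.0.0/12")


def in_range(address, network=DISCARDED_RANGE):
    """Return True if `address` belongs to `network` (hypothetical helper)."""
    return ipaddress.ip_address(address) in network


if __name__ == "__main__":
    # VM addresses and router hops copied from the logs and traceroutes above.
    for addr in ("172.16.2.115", "172.16.4.188",
                 "208.80.154.2", "185.15.56.244"):
        print(addr, in_range(addr))
```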

dcaro closed this task as Resolved. Jan 20 2021, 3:10 PM

Change 657358 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/homer/public@master] cr/firewall.conf: cloud-in4: introduce ACL for novafullstack

https://gerrit.wikimedia.org/r/657358

Change 657358 merged by jenkins-bot:
[operations/homer/public@master] cr/firewall.conf: cloud-in4: introduce ACL for novafullstack

https://gerrit.wikimedia.org/r/657358

Mentioned in SAL (#wikimedia-cloud) [2021-01-21T11:30:18Z] <arturo> merging core router firewall changes https://gerrit.wikimedia.org/r/c/operations/homer/public/+/657358 (T272486, T209082)

aborrero added a parent task: Restricted Task. Jan 21 2021, 11:30 AM

jenkins-bot mentioned this in rOHPU77daf1277ae8: cr/firewall.conf: cloud-in4: introduce ACL for novafullstack. Jan 21 2021, 11:44 AM

aborrero mentioned this in T272587: cloud: current nova-fullstack mechanism requires cloudcontrol nodes to access individual VMs. Jan 21 2021, 12:10 PM
