
cloudcontrol1003/Check for VMs leaked by the nova-fullstack test is CRITICAL
Closed, Resolved, Public

Description

Since about 10:00 UTC today, the VMs seem to have stopped being reachable over SSH. Looking into it.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Jan 20 10:53:56 cloudcontrol1003 nova-fullstack[42910]: 2021-01-20 10:53:56,233 INFO cloudvps.novafullstack.cloudcontrol1003.instances.count => 1 1611140036
Jan 20 10:53:56 cloudcontrol1003 nova-fullstack[42910]: 2021-01-20 10:53:56,234 INFO cloudvps.novafullstack.cloudcontrol1003.instances.max => 11 1611140036
Jan 20 10:53:56 cloudcontrol1003 nova-fullstack[42910]: 2021-01-20 10:53:56,417 INFO Creating fullstackd-20210120105355
Jan 20 10:54:28 cloudcontrol1003 nova-fullstack[42910]: 2021-01-20 10:54:28,319 INFO cloudvps.novafullstack.cloudcontrol1003.verify.creation => 31.9 1611140068
Jan 20 10:54:28 cloudcontrol1003 nova-fullstack[42910]: 2021-01-20 10:54:28,319 INFO Resolving fullstackd-20210120105355.admin-monitoring.eqiad1.wikimedia.cloud from ['208.80.154.143', '208.80.154.24']
Jan 20 10:54:28 cloudcontrol1003 nova-fullstack[42910]: 2021-01-20 10:54:28,348 INFO cloudvps.novafullstack.cloudcontrol1003.verify.dns => 0.03 1611140068
Jan 20 10:54:28 cloudcontrol1003 nova-fullstack[42910]: 2021-01-20 10:54:28,348 INFO SSH to 172.16.2.115
Jan 20 11:09:30 cloudcontrol1003 nova-fullstack[42910]: 2021-01-20 11:09:30,212 ERROR fullstackd-20210120105355 failed, leaking
Jan 20 11:09:30 cloudcontrol1003 nova-fullstack[42910]: Traceback (most recent call last):
Jan 20 11:09:30 cloudcontrol1003 nova-fullstack[42910]: File "/usr/local/sbin/nova-fullstack", line 629, in main
Jan 20 11:09:30 cloudcontrol1003 nova-fullstack[42910]: args.ssh_timeout)
Jan 20 11:09:30 cloudcontrol1003 nova-fullstack[42910]: File "/usr/local/sbin/nova-fullstack", line 187, in verify_ssh
Jan 20 11:09:30 cloudcontrol1003 nova-fullstack[42910]: raise Exception("SSH for {} timed out".format(address))
Jan 20 11:09:30 cloudcontrol1003 nova-fullstack[42910]: Exception: SSH for 172.16.2.115 timed out
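
For readers unfamiliar with the check, the failure in the log above can be approximated with a small polling loop. This is only a minimal sketch, not the actual `verify_ssh` code from `/usr/local/sbin/nova-fullstack`: the function name `wait_for_ssh_port` and the `port` and `interval` parameters are hypothetical, and it only tests that the TCP port accepts connections, which may be weaker than what the real script verifies.

```python
import socket
import time


def wait_for_ssh_port(address, timeout, port=22, interval=1.0):
    """Poll until a TCP connection to address:port succeeds, or raise on timeout.

    Hypothetical helper for illustration; not the nova-fullstack implementation.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A completed TCP handshake only shows the port is reachable,
            # not that an SSH login would succeed.
            with socket.create_connection((address, port), timeout=interval):
                return
        except OSError:
            # Covers refused connections, unreachable hosts and connect timeouts.
            if time.monotonic() >= deadline:
                # Same message format as the exception in the traceback above.
                raise Exception("SSH for {} timed out".format(address))
            time.sleep(interval)
```

When packets are silently dropped along the path, as the later traceroute suggests, every connection attempt runs into its own connect timeout, so a loop like this only gives up once the overall deadline passes (about 15 minutes between "SSH to 172.16.2.115" and the error in the log above).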

The issue seems to occur only when reaching the VMs from the cloudcontrol
nodes (at least from 1003, still looking); from my laptop I can SSH in:

03:05 PM ~/Work/wikimedia/wmcs-ansible  (master|✚ 4…6) 
dcaro@vulcanus$ ssh fullstackd-20210120133811.admin-monitoring.eqiad1.wikimedia.cloud
The authenticity of host 'fullstackd-20210120133811.admin-monitoring.eqiad1.wikimedia.cloud (<no hostip for proxy command>)' can't be established.
ECDSA key fingerprint is SHA256:8i8SidqLUz5ubWBBUp4TxurYRNj65tfFz1bGwSsXJ8Y.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'fullstackd-20210120133811.admin-monitoring.eqiad1.wikimedia.cloud' (ECDSA) to the list of known hosts.
Creating directory '/home/dcaro'.
Linux fullstackd-20210120133811 4.19.0-11-amd64 #1 SMP Debian 4.19.146-1 (2020-09-17) x86_64
Debian GNU/Linux 10 (buster)
The last Puppet run was at Wed Jan 20 13:56:32 UTC 2021 (9 minutes ago). 
Last puppet commit: (eec021d363) Jcrespo - Ad

This is what a traceroute (TCP, port 22) looks like from cloudcontrol1003:

traceroute to fullstackd-20210120133811.admin-monitoring.eqiad1.wikimedia.cloud (172.16.4.188), 30 hops max, 60 byte packets
 1  ae1-1001.cr1-eqiad.wikimedia.org (208.80.154.2)  0.269 ms  0.240 ms  0.179 ms
 2  irb-1102.cloudsw1-c8-eqiad.wikimedia.org (208.80.154.211)  11.634 ms  11.635 ms  11.622 ms
 3  185.15.56.244 (185.15.56.244)  0.987 ms  0.525 ms  0.512 ms
 --- all *** from here ---

This is what it looks like from a bastion (works):

dcaro@bastion-restricted-eqiad1-01:~$ sudo traceroute fullstackd-20210120133811.admin-monitoring.eqiad1.wikimedia.cloud -p 22 -T
traceroute to fullstackd-20210120133811.admin-monitoring.eqiad1.wikimedia.cloud (172.16.4.188), 30 hops max, 60 byte packets
 1  fullstackd-20210120133811.admin-monitoring.eqiad1.wikimedia.cloud (172.16.4.188)  0.546 ms  0.508 ms  0.478 ms

Change 657345 had a related patch set uploaded (by David Caro; owner: David Caro):
[operations/homer/public@master] Revert "Discard the non-whitelisted 172.16.0.0/12 traffic"

https://gerrit.wikimedia.org/r/657345

Change 657345 merged by jenkins-bot:
[operations/homer/public@master] Revert "Discard the non-whitelisted 172.16.0.0/12 traffic"

https://gerrit.wikimedia.org/r/657345

It ended up being a firewall change. I reverted it for now, though the long-term
solution will have to come later (@aborrero will probably fix the mess xd).
The firewall rules are applied (gerrit merge + cumin1001$ homer cr*-eqiad* commit) and
SSH is working again, so the tests will pass on the next run.
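
As a quick sanity check on why that change is relevant here, the VM addresses seen in this task can be tested against the prefix named in the reverted commit title. This is a minimal sketch using Python's standard `ipaddress` module; the helper name `in_discarded_prefix` is made up for illustration, and the actual match conditions of the router rule are not shown in this task, so this only confirms that the addresses fall inside 172.16.0.0/12.

```python
import ipaddress

# Prefix taken from the title of the reverted change.
DISCARDED_PREFIX = ipaddress.ip_network("172.16.0.0/12")


def in_discarded_prefix(address):
    """Return True if the address belongs to 172.16.0.0/12 (hypothetical helper)."""
    return ipaddress.ip_address(address) in DISCARDED_PREFIX


# The two VM addresses from the log and traceroute, plus the first router hop.
for addr in ("172.16.2.115", "172.16.4.188", "208.80.154.2"):
    print(addr, in_discarded_prefix(addr))
```

Both VM addresses are inside the prefix, while the router hop address is not.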

Change 657358 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/homer/public@master] cr/firewall.conf: cloud-in4: introduce ACL for novafullstack

https://gerrit.wikimedia.org/r/657358

Change 657358 merged by jenkins-bot:
[operations/homer/public@master] cr/firewall.conf: cloud-in4: introduce ACL for novafullstack

https://gerrit.wikimedia.org/r/657358

aborrero added a parent task: Restricted Task. · Jan 21 2021, 11:30 AM