It seems that since ~10:00 UTC today the VMs stopped being reachable through ssh; looking into it.
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T272395 Cloud: reduce NAT exceptions from cloud to production
Restricted Task | | |
Resolved | | dcaro | T272486 cloudcontrol1003/Check for VMs leaked by the nova-fullstack test is CRITICAL
Event Timeline
Jan 20 10:53:56 cloudcontrol1003 nova-fullstack[42910]: 2021-01-20 10:53:56,233 INFO cloudvps.novafullstack.cloudcontrol1003.instances.count => 1 1611140036
Jan 20 10:53:56 cloudcontrol1003 nova-fullstack[42910]: 2021-01-20 10:53:56,234 INFO cloudvps.novafullstack.cloudcontrol1003.instances.max => 11 1611140036
Jan 20 10:53:56 cloudcontrol1003 nova-fullstack[42910]: 2021-01-20 10:53:56,417 INFO Creating fullstackd-20210120105355
Jan 20 10:54:28 cloudcontrol1003 nova-fullstack[42910]: 2021-01-20 10:54:28,319 INFO cloudvps.novafullstack.cloudcontrol1003.verify.creation => 31.9 1611140068
Jan 20 10:54:28 cloudcontrol1003 nova-fullstack[42910]: 2021-01-20 10:54:28,319 INFO Resolving fullstackd-20210120105355.admin-monitoring.eqiad1.wikimedia.cloud from ['208.80.154.143', '208.80.154.24']
Jan 20 10:54:28 cloudcontrol1003 nova-fullstack[42910]: 2021-01-20 10:54:28,348 INFO cloudvps.novafullstack.cloudcontrol1003.verify.dns => 0.03 1611140068
Jan 20 10:54:28 cloudcontrol1003 nova-fullstack[42910]: 2021-01-20 10:54:28,348 INFO SSH to 172.16.2.115
Jan 20 11:09:30 cloudcontrol1003 nova-fullstack[42910]: 2021-01-20 11:09:30,212 ERROR fullstackd-20210120105355 failed, leaking
Jan 20 11:09:30 cloudcontrol1003 nova-fullstack[42910]: Traceback (most recent call last):
Jan 20 11:09:30 cloudcontrol1003 nova-fullstack[42910]: File "/usr/local/sbin/nova-fullstack", line 629, in main
Jan 20 11:09:30 cloudcontrol1003 nova-fullstack[42910]: args.ssh_timeout)
Jan 20 11:09:30 cloudcontrol1003 nova-fullstack[42910]: File "/usr/local/sbin/nova-fullstack", line 187, in verify_ssh
Jan 20 11:09:30 cloudcontrol1003 nova-fullstack[42910]: raise Exception("SSH for {} timed out".format(address))
Jan 20 11:09:30 cloudcontrol1003 nova-fullstack[42910]: Exception: SSH for 172.16.2.115 timed out
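The traceback above shows `verify_ssh` giving up about 15 minutes after the "SSH to 172.16.2.115" line. To reproduce that kind of check by hand from a cloudcontrol node, a minimal sketch could look like the following. It is not the real nova-fullstack code: `wait_for_tcp` and its parameters are hypothetical names, and it only tests that the TCP port accepts connections, it does not log in over SSH.

```python
import socket
import time


def wait_for_tcp(address, port=22, timeout=5.0, interval=0.5):
    """Poll until a TCP connection to (address, port) succeeds.

    Hypothetical helper, not the real verify_ssh: it only checks that the
    port accepts connections and never authenticates over SSH.
    Returns the seconds elapsed, or raises once `timeout` has passed.
    """
    start = time.monotonic()
    while True:
        try:
            # A completed TCP handshake is enough to show the path is open.
            with socket.create_connection((address, port), timeout=interval):
                return time.monotonic() - start
        except OSError:
            # Refused, unreachable or silently dropped: retry until the deadline.
            if time.monotonic() - start >= timeout:
                raise Exception("SSH for {} timed out".format(address))
            time.sleep(interval)
```

Calling `wait_for_tcp("172.16.2.115", timeout=30)` from the affected host would be expected to raise while the packets are being dropped, and to return quickly from a host that can reach the VM.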
The issue seems to happen only when trying to reach the VMs from the cloudcontrol
nodes (at least from 1003, still looking); from my laptop I can ssh:
03:05 PM ~/Work/wikimedia/wmcs-ansible (master|✚ 4…6)
dcaro@vulcanus$ ssh fullstackd-20210120133811.admin-monitoring.eqiad1.wikimedia.cloud
The authenticity of host 'fullstackd-20210120133811.admin-monitoring.eqiad1.wikimedia.cloud (<no hostip for proxy command>)' can't be established.
ECDSA key fingerprint is SHA256:8i8SidqLUz5ubWBBUp4TxurYRNj65tfFz1bGwSsXJ8Y.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'fullstackd-20210120133811.admin-monitoring.eqiad1.wikimedia.cloud' (ECDSA) to the list of known hosts.
Creating directory '/home/dcaro'.
Linux fullstackd-20210120133811 4.19.0-11-amd64 #1 SMP Debian 4.19.146-1 (2020-09-17) x86_64
Debian GNU/Linux 10 (buster)
The last Puppet run was at Wed Jan 20 13:56:32 UTC 2021 (9 minutes ago).
Last puppet commit: (eec021d363) Jcrespo - Ad
This is what a traceroute (TCP, port 22) looks like from cloudcontrol1003:
traceroute to fullstackd-20210120133811.admin-monitoring.eqiad1.wikimedia.cloud (172.16.4.188), 30 hops max, 60 byte packets
 1  ae1-1001.cr1-eqiad.wikimedia.org (208.80.154.2)  0.269 ms  0.240 ms  0.179 ms
 2  irb-1102.cloudsw1-c8-eqiad.wikimedia.org (208.80.154.211)  11.634 ms  11.635 ms  11.622 ms
 3  185.15.56.244 (185.15.56.244)  0.987 ms  0.525 ms  0.512 ms
--- all *** from here ---
This is what it looks like from a bastion (works):
dcaro@bastion-restricted-eqiad1-01:~$ sudo traceroute fullstackd-20210120133811.admin-monitoring.eqiad1.wikimedia.cloud -p 22 -T
traceroute to fullstackd-20210120133811.admin-monitoring.eqiad1.wikimedia.cloud (172.16.4.188), 30 hops max, 60 byte packets
 1  fullstackd-20210120133811.admin-monitoring.eqiad1.wikimedia.cloud (172.16.4.188)  0.546 ms  0.508 ms  0.478 ms
Change 657345 had a related patch set uploaded (by David Caro; owner: David Caro):
[operations/homer/public@master] Revert "Discard the non-whitelisted 172.16.0.0/12 traffic"
Change 657345 merged by jenkins-bot:
[operations/homer/public@master] Revert "Discard the non-whitelisted 172.16.0.0/12 traffic"
It ended up being a firewall change. I reverted it for now, though the long-term
solution will have to come later (@aborrero will probably fix the mess xd).
The firewall rules are applied (gerrit merge + cumin1001$ homer cr*-eqiad* commit) and
ssh is working again, so the tests should pass on the next run.
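The reverted change is titled "Discard the non-whitelisted 172.16.0.0/12 traffic". As a quick sanity check, the sketch below confirms that both VM addresses seen in this task fall inside that range while the router hops do not. The helper name `in_cloud_range` is hypothetical, and how the router filter actually matches (source or destination, and which exceptions exist) is not shown in this task, so this only illustrates the address arithmetic.

```python
import ipaddress

# Range named in the title of the reverted change.
CLOUD_RANGE = ipaddress.ip_network("172.16.0.0/12")

# Addresses copied from the logs and traceroutes in this task.
ADDRESSES = {
    "fullstackd-20210120105355": "172.16.2.115",
    "fullstackd-20210120133811": "172.16.4.188",
    "ae1-1001.cr1-eqiad (hop 1)": "208.80.154.2",
    "last answering hop (hop 3)": "185.15.56.244",
}


def in_cloud_range(address):
    """Return True if the address belongs to 172.16.0.0/12 (hypothetical helper)."""
    return ipaddress.ip_address(address) in CLOUD_RANGE


for name, address in ADDRESSES.items():
    print(name, address, in_cloud_range(address))
```

Both fullstackd addresses are inside the range, which is consistent with a discard rule on 172.16.0.0/12 affecting the nova-fullstack SSH check from cloudcontrol1003.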
Change 657358 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/homer/public@master] cr/firewall.conf: cloud-in4: introduce ACL for novafullstack
Change 657358 merged by jenkins-bot:
[operations/homer/public@master] cr/firewall.conf: cloud-in4: introduce ACL for novafullstack
Mentioned in SAL (#wikimedia-cloud) [2021-01-21T11:30:18Z] <arturo> merging core router firewall changes https://gerrit.wikimedia.org/r/c/operations/homer/public/+/657358 (T272486, T209082)