Please fix my screw-up - unbreak SSH access to deployment-maps03 VM
Closed, Resolved, Public

Description

I was looking at some T153468 problems and noticed that Puppet no longer appeared to be managing ferm on this machine. I went to remove ferm, so I ran iptables-save > iptables.bak (in /root) and then iptables -F, and promptly realised that I had locked myself out of the instance: I hadn't noticed that the default INPUT policy was DROP. In the past I think I would've been able to use salt to fix it, and in prod there are the serial consoles, but I don't know what the present workaround for this is in labs. Any chance someone can fix this?
Here's a copy of the iptables.bak file:

# Generated by iptables-save v1.4.21 on Sat Sep 22 18:18:15 2018
*raw
:PREROUTING ACCEPT [89369:45802432]
:OUTPUT ACCEPT [79604:17226958]
-A PREROUTING -p tcp -m tcp --dport 6379 -j NOTRACK
-A OUTPUT -p tcp -m tcp --sport 6379 -j NOTRACK
COMMIT
# Completed on Sat Sep 22 18:18:15 2018
# Generated by iptables-save v1.4.21 on Sat Sep 22 18:18:15 2018
*filter
:INPUT DROP [21:3714]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [79604:17226958]
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -m pkttype --pkt-type multicast -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp ! --tcp-flags FIN,SYN,RST,ACK SYN -j DROP
-A INPUT -p icmp -j ACCEPT
-A INPUT -s 10.68.17.232/32 -p tcp -m tcp --dport 22 -j ACCEPT
-A INPUT -s 10.68.18.65/32 -p tcp -m tcp --dport 22 -j ACCEPT
-A INPUT -s 10.68.18.66/32 -p tcp -m tcp --dport 22 -j ACCEPT
-A INPUT -s 10.68.18.68/32 -p tcp -m tcp --dport 22 -j ACCEPT
-A INPUT -s 10.68.18.91/32 -p tcp -m tcp --dport 9042 -j ACCEPT
-A INPUT -s 10.68.18.91/32 -p tcp -m tcp --dport 9160 -j ACCEPT
-A INPUT -s 10.68.21.205/32 -p tcp -m tcp --dport 22 -j ACCEPT
-A INPUT -s 10.68.20.135/32 -p tcp -m tcp --dport 22 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 6533 -j ACCEPT
-A INPUT -s 10.68.18.91/32 -p tcp -m tcp --dport 7000 -j ACCEPT
-A INPUT -s 10.68.18.91/32 -p tcp -m tcp --dport 7199 -j ACCEPT
-A INPUT -s 10.68.16.210/32 -j ACCEPT
-A INPUT -s 10.68.18.91/32 -p tcp -m tcp --dport 5432 -j ACCEPT
-A INPUT -s 10.196.0.0/24 -p tcp -m tcp --dport 9100 -j ACCEPT
-A INPUT -s 10.196.16.0/21 -p tcp -m tcp --dport 9100 -j ACCEPT
-A INPUT -s 10.196.32.0/24 -p tcp -m tcp --dport 9100 -j ACCEPT
-A INPUT -s 10.196.48.0/24 -p tcp -m tcp --dport 9100 -j ACCEPT
-A INPUT -s 10.68.0.0/24 -p tcp -m tcp --dport 9100 -j ACCEPT
-A INPUT -s 10.68.16.0/21 -p tcp -m tcp --dport 9100 -j ACCEPT
-A INPUT -s 10.68.32.0/24 -p tcp -m tcp --dport 9100 -j ACCEPT
-A INPUT -s 10.68.48.0/24 -p tcp -m tcp --dport 9100 -j ACCEPT
-A INPUT -s 10.68.18.91/32 -p tcp -m tcp --dport 6379 -j ACCEPT
-A INPUT -s 10.68.18.66/32 -p tcp -m tcp --dport 22 -j ACCEPT
-A INPUT -s 10.68.18.68/32 -p tcp -m tcp --dport 22 -j ACCEPT
-A INPUT -s 10.68.21.105/32 -p tcp -m tcp --dport 22 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 6534 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 6535 -j ACCEPT
COMMIT
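
For reference, the dump above shows why the flush locked the session out: the filter table's INPUT chain has a default policy of DROP, so once the ACCEPT rules are gone nothing new gets in. A minimal sketch of checking a saved dump for such chains before flushing is below; the function name drop_policy_chains is hypothetical, and it assumes the standard iptables-save text format shown above.

```shell
# Hypothetical helper: list chains whose default policy is DROP
# in an iptables-save dump. Usage: drop_policy_chains /root/iptables.bak
drop_policy_chains() {
    awk '
        # A line such as "*filter" starts a new table section.
        /^\*/ { table = substr($0, 2) }
        # A line such as ":INPUT DROP [21:3714]" declares a chain and its policy.
        /^:/  { if ($2 == "DROP") print table, substr($1, 2) }
    ' "$1"
}
```

Any line it prints (one "table chain" pair per line) names a chain that will drop traffic after a bare iptables -F.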
Krenair created this task. Sep 22 2018, 6:27 PM
Restricted Application added a subscriber: Aklapper. Sep 22 2018, 6:27 PM
Krenair updated the task description. Sep 22 2018, 6:34 PM
Andrew closed this task as Resolved. Sep 23 2018, 1:02 AM
Andrew claimed this task.

That VM was OOM and killing processes right and left, so it's possible we were locked out by sshd dying or something else unrelated to the iptables change. In any case, rebooting has restored ssh access.

Mentioned in SAL (#wikimedia-cloud) [2018-09-23T01:05:02Z] <andrewbogott> rebooted deployment-maps03; OOM and also T205195

Oh, and to answer your main question -- there isn't a great workaround for accessing VMs when ssh stops working. Salt was good for that but was also a lot of trouble to maintain and I almost never miss it. We also for a while had a remote-console system set up but it was /also/ more trouble than it was worth (it broke a lot, and was very hard to do securely). So now we just fall back on mounting the drive and tinkering with it when we get desperate.

That VM was OOM and killing processes right and left, so it's possible we were locked out by sshd dying or something else unrelated to the iptables change. In any case, rebooting has restored ssh access.

It happened pretty much the instant that iptables -F returned. Would be a big coincidence. Thank you!

Krenair added a comment. Edited Sep 23 2018, 1:57 PM

I've now run iptables -P INPUT ACCEPT, iptables -F and apt-get remove ferm. Running puppet again shows that ferm has not been reinstalled, confirming that ferm was not managed by puppet. (Stuff still works there btw :))
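
The ordering matters here: the policy is opened first and the rules are flushed afterwards. A rough sketch of that sequence as a reusable function follows. The name safe_flush and the IPTABLES override (used only for dry runs) are assumptions for illustration, not standard tooling, and it only touches the filter table, as a plain iptables -F does.

```shell
# Sketch: set the built-in filter chains to ACCEPT before flushing, so a
# DROP default policy cannot cut off the current SSH session.
# IPTABLES is a hypothetical override, e.g. IPTABLES="echo iptables" for a dry run.
safe_flush() {
    ipt=${IPTABLES:-iptables}
    for chain in INPUT FORWARD OUTPUT; do
        # Left unquoted on purpose so "echo iptables" splits into two words.
        $ipt -P "$chain" ACCEPT || return 1
    done
    # Only reached if every policy change succeeded.
    $ipt -F
}
```

Running it with IPTABLES="echo iptables" prints the commands instead of executing them, which is a cheap way to check the order before doing it for real as root.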

I've been thinking about how rebooting fixed this. I think that, because ferm was still installed, the reboot triggered ferm to put its iptables rules back. (Still, I don't want ferm installed somewhere that puppet is not keeping it up to date; it just strikes me as a liability the next time we do something like replace a bastion.)