
codfw1dev unavailable?
Closed, Resolved (Public)


I'm failing to connect to - it resolves successfully but is not responding to SSH?
The security group rules look fine. Is it ferm or the codfw1dev networking in general?
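
To separate "the name resolves" from "SSH answers", a probe along these lines may help. It is only a rough sketch: `probe_ssh` is a hypothetical helper, not an existing tool, and port 22 and the 3-second timeout are assumptions. A `tcp_failure` result does not say whether ferm, the security groups, or routing is at fault; it only confirms that DNS is not the problem.

```python
import socket


def probe_ssh(host: str, port: int = 22, timeout: float = 3.0) -> str:
    """Classify reachability as 'dns_failure', 'tcp_failure', or 'ok'."""
    try:
        # Step 1: name resolution only; nothing is sent to the target yet.
        addrinfo = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"

    # Step 2: attempt a TCP handshake against each resolved address.
    for family, socktype, proto, _canonname, sockaddr in addrinfo:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return "ok"
        except OSError:
            # Refused, timed out, or unroutable; try the next address.
            continue
    return "tcp_failure"
```

Calling `probe_ssh(hostname)` from outside the deployment and again from a neighbouring VM would show whether the failure depends on where the connection originates.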

Event Timeline

Andrew added a subscriber: Andrew.

Previously this was a nova-compute issue, solved with

Now it appears to be a networking issue; hosts can reach other VMs but not contact outside servers (including the name server):

root@cloudvirt2001-dev:~# virsh console 47e414aa-03ec-4c7c-b632-9b6cf5e37119
Connected to domain i-00000462
Escape character is ^]

root@bastion-codfw1dev-01:~# cat /etc/resolv.conf 
## source: modules/base/resolv.conf.labs.erb
## from:   base::resolving

options timeout:2 ndots:1
root@bastion-codfw1dev-01:~# ping
PING ( 56(84) bytes of data.
--- ping statistics ---
8 packets transmitted, 0 received, 100% packet loss, time 183ms

root@bastion-codfw1dev-01:~# ping
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=64 time=0.561 ms
64 bytes from icmp_seq=2 ttl=64 time=0.642 ms
--- ping statistics ---

This may be related to some firewall changes made on Friday; in any case I'm going to drop this in @aborrero's lap at least until I'm back to work on Tuesday.

(Note that I also tested with a fully external IP, for and it can't reach that either)
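
For repeating the internal-versus-external comparison across several targets, the ping summary line can be parsed rather than read by eye. This is a minimal sketch assuming the iputils summary format shown in the transcript above; `parse_ping_summary` and `PingSummary` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class PingSummary(NamedTuple):
    transmitted: int
    received: int
    loss_percent: float


# Matches e.g. "8 packets transmitted, 0 received, 100% packet loss, time 183ms".
# The optional group covers the "+N errors," field that ping sometimes prints.
_SUMMARY_RE = re.compile(
    r"(\d+) packets transmitted, (\d+) received,"
    r"(?: \+\d+ errors,)? ([\d.]+)% packet loss"
)


def parse_ping_summary(output: str) -> Optional[PingSummary]:
    """Return the ping summary found in `output`, or None if there is none."""
    match = _SUMMARY_RE.search(output)
    if match is None:
        return None
    return PingSummary(
        transmitted=int(match.group(1)),
        received=int(match.group(2)),
        loss_percent=float(match.group(3)),
    )
```

A target that returns `received == 0` while a neighbouring VM answers normally matches the pattern reported here: internal traffic works, outbound traffic does not.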

Neutron might not be doing proper SNAT. Investigating.

Mentioned in SAL (#wikimedia-cloud) [2020-03-10T13:55:04Z] <arturo> [codfw1dev] rebooting cloudnet2003-dev into linux kernel 4.14 for testing stuff related to T247135

Andrew triaged this task as Medium priority. Mar 10 2020, 4:06 PM
Andrew raised the priority of this task from Medium to Needs Triage.
Andrew moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Mentioned in SAL (#wikimedia-cloud) [2020-03-10T17:02:11Z] <arturo> [codfw1dev] deleting address scopes, bad interaction with our custom NAT setup T247135

I confirm that address scopes have a bad interaction with our setup. I was using address scopes as part of the BGP configuration.

I can now see neutron doing SNAT:

# inside netns
root@cloudnet2003-dev:~ # conntrack -E -j -p icmp
    [NEW] icmp     1 30 src= dst= type=8 code=0 id=25513 [UNREPLIED] src= dst= type=0 code=0 id=25513 mark=67108864

# main netns
aborrero@cloudnet2003-dev:~ $ sudo tcpdump -i br-external icmp
09:33:18.126335 IP > ICMP echo request, id 25519, seq 1, length 64

The packet never returns, which may indicate a filtering problem related to T246887: CloudVPS: introduce filtering for neutron BGP addresses.
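
The two signals read off the conntrack event above (the `[UNREPLIED]` flag, and a reply tuple whose destination differs from the original source, which is what SNAT looks like) can also be checked programmatically. This is a sketch only: `parse_conntrack_event` and `looks_like_snat` are hypothetical helpers, and the addresses in the usage note are documentation-range placeholders, not the real ones from this deployment.

```python
from typing import Dict

# Keys that belong to a connection tuple; they appear once for the original
# direction and once for the reply direction in a conntrack event line.
_TUPLE_KEYS = {"src", "dst", "sport", "dport", "type", "code", "id"}


def parse_conntrack_event(line: str) -> Dict[str, object]:
    """Split one `conntrack -E` line into event, protocol, flags and tuples."""
    tokens = line.split()
    event = tokens[0].strip("[]")
    protocol = tokens[1]
    flags = []
    original: Dict[str, str] = {}
    reply: Dict[str, str] = {}
    extra: Dict[str, str] = {}
    for token in tokens[2:]:
        if token.startswith("[") and token.endswith("]"):
            flags.append(token.strip("[]"))
        elif "=" in token:
            key, _, value = token.partition("=")
            if key not in _TUPLE_KEYS:
                extra[key] = value      # e.g. mark=...
            elif key not in original:
                original[key] = value   # first occurrence: original direction
            else:
                reply[key] = value      # second occurrence: reply direction
    return {"event": event, "protocol": protocol, "flags": flags,
            "original": original, "reply": reply, "extra": extra}


def looks_like_snat(parsed: Dict[str, object]) -> bool:
    """True when the reply is addressed to something other than the original source."""
    return parsed["original"].get("src") != parsed["reply"].get("dst")
```

For a line such as `[NEW] icmp 1 30 src=192.0.2.10 dst=198.51.100.7 type=8 code=0 id=25513 [UNREPLIED] src=198.51.100.7 dst=203.0.113.5 type=0 code=0 id=25513 mark=67108864`, `looks_like_snat` returns `True` and `flags` contains `UNREPLIED`: the source address was rewritten, but no reply has come back.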

The BGP-related filter has been dropped. You should now be able to reach floating IPs from the internet, and VMs should have full connectivity.

As of right now:

arturo@endurance:~ $ ssh -i .ssh/wmf_cloud_root_arturo root@
Enter passphrase for key '.ssh/wmf_cloud_root_arturo': 
Linux bastion-codfw1dev-02 4.19.0-8-amd64 #1 SMP Debian 4.19.98-1 (2020-01-26) x86_64
Debian GNU/Linux 10 (buster)
The last Puppet run was at Wed Mar 18 11:07:33 UTC 2020 (10 minutes ago). 
Last puppet commit: (641fe4e349) Jbond - debdeploy: add libGraphicsMagick-Q16 as a lib for graphicsmagick
Last login: Mon Mar  9 20:03:25 2020
root@bastion-codfw1dev-02:~# apt-get update
Hit:1 buster InRelease
Hit:2 buster-updates InRelease                                               
Get:3 buster-wikimedia InRelease [34.4 kB]                             
Hit:4 buster-backports InRelease                                      
Hit:5 buster/updates InRelease                       
Get:6 buster-wikimedia/main Sources [24.8 kB]
Get:7 buster-wikimedia/main amd64 Packages [36.7 kB]
Fetched 95.9 kB in 1s (161 kB/s)
Reading package lists... Done

Thanks Arturo. Confirmed I can get in, the bastion has internet access, and I can SSH through to internal instances and connect out from those too.