codfw1dev unavailable?
Closed, Resolved · Public

Description

I can't connect to bastion-codfw1dev-01.codfw1dev.wmcloud.org: the name resolves successfully, but the host is not responding to SSH.
The security group rules look fine. Is the problem ferm, or codfw1dev networking in general?

Event Timeline

Krenair created this task. Mar 6 2020, 10:26 PM
Restricted Application added a subscriber: Aklapper. Mar 6 2020, 10:26 PM
Andrew assigned this task to aborrero. Edited Mar 9 2020, 9:32 PM
Andrew added a subscriber: Andrew.

Previously this was a nova-compute issue, solved with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/578378/.

Now it appears to be a networking issue; hosts can reach other VMs but cannot reach outside servers (including the name server):

root@cloudvirt2001-dev:~# virsh console 47e414aa-03ec-4c7c-b632-9b6cf5e37119
Connected to domain i-00000462
Escape character is ^]

root@bastion-codfw1dev-01:~# cat /etc/resolv.conf 
## THIS FILE IS MANAGED BY PUPPET
##
## source: modules/base/resolv.conf.labs.erb
## from:   base::resolving

domain bastioninfra-codfw1dev.codfw1dev.cloud
search bastioninfra-codfw1dev.codfw1dev.cloud codfw1dev.cloud 
nameserver 208.80.153.78
nameserver 208.80.153.78
options timeout:2 ndots:1
root@bastion-codfw1dev-01:~# ping 208.80.153.78
PING 208.80.153.78 (208.80.153.78) 56(84) bytes of data.
^C
--- 208.80.153.78 ping statistics ---
8 packets transmitted, 0 received, 100% packet loss, time 183ms

root@bastion-codfw1dev-01:~# ping 172.16.128.19
PING 172.16.128.19 (172.16.128.19) 56(84) bytes of data.
64 bytes from 172.16.128.19: icmp_seq=1 ttl=64 time=0.561 ms
64 bytes from 172.16.128.19: icmp_seq=2 ttl=64 time=0.642 ms
^C
--- 172.16.128.19 ping statistics ---

This may be related to some firewall changes made on Friday; in any case I'm going to drop this in @aborrero's lap at least until I'm back at work on Tuesday.

(Note that I also tested with a fully external IP, 216.58.192.196 for google.com, and it can't reach that either.)
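The internal-versus-external comparison above can be repeated with a small helper. This is only a sketch: the function names `ping_loss` and `probe` are hypothetical, and the `-c`/`-W` flags assume the iputils `ping` shipped with Debian.

```shell
#!/bin/sh
# Sketch: compare reachability of an internal VM and an external host.
# ping_loss and probe are hypothetical helper names, not existing tooling.

# Read ping output on stdin and print the packet-loss percentage
# taken from the summary line ("..., 100% packet loss, ...").
ping_loss() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Send 3 echo requests to the address in $1 (2 s reply timeout each,
# iputils semantics assumed) and print "<address> loss=<percent>%".
probe() {
    loss=$(ping -c 3 -W 2 "$1" 2>/dev/null | ping_loss)
    echo "$1 loss=${loss:-unknown}%"
}
```

Running `probe 172.16.128.19` and then `probe 208.80.153.78` from the VM console should show the same pattern as the manual pings: low loss for the neighbouring VM and full loss for the name server while the problem persists.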

neutron might not be doing proper SNAT. Investigating.

Mentioned in SAL (#wikimedia-cloud) [2020-03-10T13:55:04Z] <arturo> [codfw1dev] rebooting cloudnet2003-dev into linux kernel 4.14 for testing stuff related to T247135

Andrew triaged this task as Medium priority. Mar 10 2020, 4:06 PM
Andrew raised the priority of this task from Medium to Needs Triage.
Andrew moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Mentioned in SAL (#wikimedia-cloud) [2020-03-10T17:02:11Z] <arturo> [codfw1dev] deleting address scopes, bad interaction with our custom NAT setup T247135

I confirm that address scopes have a bad interaction with our setup. I was using address scopes as part of the BGP configuration.

I can now see neutron doing SNAT:

# inside netns
root@cloudnet2003-dev:~ # conntrack -E -j -p icmp
    [NEW] icmp     1 30 src=172.16.128.14 dst=8.8.8.8 type=8 code=0 id=25513 [UNREPLIED] src=8.8.8.8 dst=185.15.57.1 type=0 code=0 id=25513 mark=67108864

# main netns
aborrero@cloudnet2003-dev:~ $ sudo tcpdump -i br-external icmp
09:33:18.126335 IP 185.15.57.1 > dns.google: ICMP echo request, id 25519, seq 1, length 64

The packet never returns, which may indicate a filtering problem related to T246887: CloudVPS: introduce filtering for neutron BGP addresses.
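In the conntrack event above, SNAT is visible because the destination expected on the reply (185.15.57.1) differs from the original source (172.16.128.14). A rough way to pull that mapping out of `conntrack -E` output is sketched below; `snat_mapping` is a hypothetical helper name, and the parsing assumes the one-line event format shown in this task.

```shell
#!/bin/sh
# Sketch: report source NAT from conntrack event lines read on stdin.
# snat_mapping is a hypothetical helper name, not an existing tool.

# For each line, take the first src= (original sender) and the last dst=
# (address the reply is expected to target). If they differ, the source
# was translated, so print "<original> -> <translated>".
snat_mapping() {
    awk '{
        orig = ""; reply = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^src=/ && orig == "") orig = substr($i, 5)
            if ($i ~ /^dst=/) reply = substr($i, 5)
        }
        if (orig != "" && reply != "" && orig != reply) print orig " -> " reply
    }'
}
```

Inside the router namespace this could be fed with `conntrack -E -j -p icmp | snat_mapping`. Flows that are not translated produce no output, which is what the earlier symptom would have looked like.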

Mentioned in SAL (#wikimedia-cloud) [2020-03-11T12:50:56Z] <arturo> [codfw1dev] several tests creating/deleting address scopes (T244727 T247135 T246887 T245606)

aborrero closed this task as Resolved. Mar 18 2020, 11:29 AM

The BGP-related filter has been dropped. You should now be able to reach floating IPs from the internet, and VMs should have full connectivity.

As of right now:

arturo@endurance:~ $ ssh -i .ssh/wmf_cloud_root_arturo root@185.15.57.2
Enter passphrase for key '.ssh/wmf_cloud_root_arturo': 
Linux bastion-codfw1dev-02 4.19.0-8-amd64 #1 SMP Debian 4.19.98-1 (2020-01-26) x86_64
Debian GNU/Linux 10 (buster)
The last Puppet run was at Wed Mar 18 11:07:33 UTC 2020 (10 minutes ago). 
Last puppet commit: (641fe4e349) Jbond - debdeploy: add libGraphicsMagick-Q16 as a lib for graphicsmagick
Last login: Mon Mar  9 20:03:25 2020
root@bastion-codfw1dev-02:~# apt-get update
Hit:1 http://deb.debian.org/debian buster InRelease
Hit:2 http://deb.debian.org/debian buster-updates InRelease                                               
Get:3 http://apt.wikimedia.org/wikimedia buster-wikimedia InRelease [34.4 kB]                             
Hit:4 http://mirrors.wikimedia.org/debian buster-backports InRelease                                      
Hit:5 http://security.debian.org buster/updates InRelease                       
Get:6 http://apt.wikimedia.org/wikimedia buster-wikimedia/main Sources [24.8 kB]
Get:7 http://apt.wikimedia.org/wikimedia buster-wikimedia/main amd64 Packages [36.7 kB]
Fetched 95.9 kB in 1s (161 kB/s)
Reading package lists... Done

Thanks, Arturo. Confirmed: I can get in, the bastion has internet access, and I can SSH through to internal instances and connect out from those too.