
cloudcontrol2001-dev can't reach cloud-vps public IPs
Closed, ResolvedPublic

Description

I'm not sure if this is by design or accident, but I'm noticing this when trying to reach the codfw1dev bastion:

andrew@cloudcontrol2005-dev:~$ telnet 185.15.57.2 22
Trying 185.15.57.2...
Connected to 185.15.57.2.
Escape character is '^]'.
SSH-2.0-OpenSSH_7.9p1 Debian-10+deb10u2

vs

root@cloudcontrol2001-dev:~# telnet 185.15.57.2 22
Trying 185.15.57.2...

This isn't an immediate issue (I'll just move the fullstack tests to a different cloudcontrol for now) but if we aren't planning to support that connectivity in the future we'll need to figure out a different way to test VM creation.

Related Objects

Status      Assigned
Resolved    aborrero
Resolved    aborrero
Resolved    aborrero
Resolved    aborrero
Resolved    aborrero
Resolved    aborrero
Resolved    aborrero
Resolved    aborrero
Resolved    cmooney
Resolved    aborrero
Resolved    aborrero
Invalid     None
Resolved    aborrero
Resolved    aborrero
Open        None
Resolved    aborrero
Invalid     aborrero
Resolved    aborrero
Resolved    fgiunchedi
Resolved    aborrero
Invalid     aborrero
Resolved    aborrero
Resolved    cmooney

Event Timeline

Change 921085 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] codfw1dev: move nova-fullstack test to cloudcontrol2005-dev

https://gerrit.wikimedia.org/r/921085

Change 921085 merged by Andrew Bogott:

[operations/puppet@production] codfw1dev: move nova-fullstack test to cloudcontrol2005-dev

https://gerrit.wikimedia.org/r/921085

Change 921087 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] move nova-fullstack test to cloudcontrol2005-dev and cloudcontrol1007

https://gerrit.wikimedia.org/r/921087

Change 921087 merged by Andrew Bogott:

[operations/puppet@production] move nova-fullstack test to cloudcontrol2005-dev and cloudcontrol1007

https://gerrit.wikimedia.org/r/921087

I think what happens is:

  • cloudcontrol2001-dev tries to contact bastion.bastioninfra-codfw1dev.codfw1dev.wmcloud.org or 185.15.57.2 using its wmnet interface with address 10.192.20.9.
  • I see such packets circulating through cloudgw:
aborrero@cloudgw2003-dev:~ $ sudo tcpdump -i vlan2120 host 10.192.20.9
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on vlan2120, link-type EN10MB (Ethernet), snapshot length 262144 bytes
06:49:41.185689 IP cloudcontrol2001-dev.codfw.wmnet.46706 > bastion.bastioninfra-codfw1dev.codfw1dev.wmcloud.org.ssh: Flags [S], seq 2560999656, win 42340, options [mss 1460,sackOK,TS val 98800932 ecr 0,nop,wscale 9], length 0
06:49:41.186527 IP bastion.bastioninfra-codfw1dev.codfw1dev.wmcloud.org.ssh > cloudcontrol2001-dev.codfw.wmnet.46706: Flags [S.], seq 56257161, ack 2560999657, win 43440, options [mss 1460,sackOK,TS val 2371059922 ecr 98800932,nop,wscale 9], length 0
  • however, I suspect this return traffic is not allowed by the filtering policy in the operations/homer/public.git repo for the switches; in particular, both cloudgw200X-dev servers are connected to asw (a quick check from the cloudcontrol side is sketched below).
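One way to confirm that is to watch for the returning SYN-ACKs on the cloudcontrol itself; a minimal sketch, where eno1 as the wmnet-facing interface is an assumption:

root@cloudcontrol2001-dev:~# tcpdump -ni eno1 'host 185.15.57.2 and tcp port 22'
# if the switch filter drops the return traffic, only the outgoing SYNs appear here,
# while the SYN-ACKs visible on cloudgw never arrive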

What I would like to happen is:

  • as a first action item, evaluate whether cloudgw servers need to be connected to asw (instead of only cloudsw)
  • second, think about how to integrate cloudgw with the cloud-private subnet, so this flow travels natively over cloud-private instead of the production circuit.
  • third, evaluate updates to the filtering policy in homer

We will need some help from @cmooney to deal with all that.

How was this working before? My guess is that, since cloudcontrol2001-dev previously had a public IPv4 address in the wikimedia.org domain, the flow was hitting a different policy in operations/homer/public.git.

Prod private hosts don't have "internet" access, except through the proxies, and bastion.bastioninfra-codfw1dev.codfw1dev.wmcloud.org/185.15.57.2 is considered "internet". Prod public hosts, on the other hand, do have that access (as they directly have a public IP).

One option to solve this is to use the proxies, as SSH support was implemented in this commit, but IIRC that was already a workaround, so I'm not sure it's the cleanest path forward.
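For illustration only, SSH through the HTTP proxy would look roughly like the sketch below; the proxy hostname and port are assumptions, not values confirmed in this task:

# hypothetical ~/.ssh/config entry on a prod private host
Host 185.15.57.2
    ProxyCommand nc -X connect -x webproxy.codfw.wmnet:8080 %h %p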

Another option is to either trunk cloud-private to cloudgw*, or cloud-instances to cloudcontrol* (assuming cloudcontrol2001 needs access to a VM and not just the bastion host). Either would have the benefit of keeping the traffic local to the cloud realm.

Could you tell us more about that flow?

Another option is to either trunk cloud-private to cloudgw*, or cloud-instances to cloudcontrol* (assuming cloudcontrol2001 needs access to a VM and not just the bastion host). Either would have the benefit of keeping the traffic local to the cloud realm.

Could you tell us more about that flow?

Yes I think this is the way forward.

The nova-fullstack process that this ticket refers to is a monitoring thing that exercises the whole openstack VM lifecycle, including SSH'ing to a newly created VM, deleting it, etc.

Change 922104 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: refactor to set up routes for the cloud realm independently from keepalived

https://gerrit.wikimedia.org/r/922104

Change 922105 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: refactor vlan interfaces to use interface::tagged

https://gerrit.wikimedia.org/r/922105

Change 922106 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: add cloud-private subnet support

https://gerrit.wikimedia.org/r/922106

Let's take a step back.

Current issue

How was this working before? My guess is that, since cloudcontrol2001-dev previously had a public IPv4 address in the wikimedia.org domain, the flow was hitting a different policy in operations/homer/public.git.

Yes, traffic from the public IPs of those boxes is routed via the cloudsw (cloud vrf) towards the VIP shared by the cloudgw's (208.80.153.190, currently cloudgw2003-dev). The cloudgw routes this forward to the VIP shared by the cloudnet hosts (185.15.57.10, currently cloudnet2005-dev); it's not blocking that in the nft forward chain.

As Arzhel points out, the issue is that return traffic from the cloudnet (185.15.57.2/32) is not permitted to the 10.x range by the cloud-in filter on the core routers. It gets to the core router and no further; that is where the traffic breaks down.

What is the plan??

This is something I think we need to discuss in our meeting on Friday; I've added an item to the doc for it. In general we have two basic options:

  1. We treat this traffic as 'internet-bound' traffic, as the destination IP is public, and consider it like any other external IP.
    1. The proxy is probably the way to provide internet access, as needed, from prod realm / cloud-hosts vlan on 10.x
  2. We treat this traffic as internal cloud realm traffic, and the cloudcontrol servers should use their connection to the cloud-private vlan to send it

I think option 2 makes more sense here. Assuming we are going that way, then:

  • The cloudgw's are, and will remain, where traffic for IPs on or behind the cloudnet / Neutron routers is sent from the cloud vrf on switches.
  • The cloudgw's will ultimately have a leg in cloud-private; this will be their connection to the cloud vrf on the switches, and will replace vlan 2120 (208.80.153.184/29).
    • As per the plan, they will announce the cloud-instance VM range to the cloudsw over this vlan in BGP, making it reachable from the cloud vrf on the switches (a rough sketch of that announcement follows this list)
    • CloudGW is responsible for filtering / controlling what is allowed between cloud-private and the VM ranges
    • The CloudGW should also announce IP space covering 185.15.57.2, and any other networks it is routing to cloudnet, acting as the bridge between cloud-private and all networks managed by Neutron
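Purely as an illustration of that announcement, a minimal Bird 2 sketch is below; the ASNs, neighbor address, and exact prefixes are assumptions, not the real template:

# hypothetical bird.conf fragment on a cloudgw host
protocol static cloud_ranges {
    ipv4;
    route 172.16.128.0/24 unreachable;   # cloud-instances VM range (assumed)
    route 185.15.57.0/29 unreachable;    # public range routed via cloudnet (from this task)
}

protocol bgp to_cloudsw {
    local as 64605;                      # assumed private ASN for cloudgw
    neighbor 172.20.5.1 as 64710;        # assumed cloudsw address on cloud-private
    ipv4 {
        import none;
        export where proto = "cloud_ranges";
    };
}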

Immediate steps

I'm not sure what should be done in the short term to fix this. We probably shouldn't rush the changes to the cloudgw connectivity without careful planning; it'll need a new Bird template, as there is more going on (VRFs etc.).

In general I think we need to move more cautiously here anyway. The pattern of making network changes, then waiting for other team members to hit problems and open individual troubleshooting tickets for each, isn't working well.

Hopefully we can address that in the meeting on Friday and come away with a better-connected plan where everyone knows what to expect. In the meantime we can see what we can do to unblock things.

@aborrero I've -1'd that proposed change; I think we're getting a little ahead of ourselves.

What I've done for now is added a manual static route on cloudcontrol2001-dev:

cmooney@cloudcontrol2001-dev:~$ sudo ip route add 185.15.57.0/29 via 172.20.5.1

That's for the /29 range that is routed via cloudgw at the moment. This allows the connection to work:

cmooney@cloudcontrol2001-dev:~$ telnet 185.15.57.2 22
Trying 185.15.57.2...
Connected to 185.15.57.2.
Escape character is '^]'.
SSH-2.0-OpenSSH_7.9p1 Debian-10+deb10u2
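As an illustrative follow-up (not captured in the task), the path the kernel now picks can be confirmed with ip route get:

cmooney@cloudcontrol2001-dev:~$ ip route get 185.15.57.2
# should show the next hop 172.20.5.1 from the static route above rather than the default wmnet gateway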

We can discuss the wider problem when you're back.

Change 923324 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloud_private: route the whole cloud public IPv4 space to cloudsw

https://gerrit.wikimedia.org/r/923324

Change 923324 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloud_private: route the whole cloud public IPv4 space to cloudsw

https://gerrit.wikimedia.org/r/923324

topranks> Cathal Mooney Pings are being blocked by 185.15.57.5 itself it seems:
1:39 PM https://www.irccloud.com/pastebin/TZm8TF4e/
1:40 PM i.e. they are getting there but it's sending unreachable messages back
1:40 PM traffic does seem to get beyond the cloudgw
1:41 PM https://www.irccloud.com/pastebin/Dki06Xhv/
1:46 PM They seem to be making it to cloudnet/neutron, which is generating the rejects:
1:46 PM https://www.irccloud.com/pastebin/PZTwUGW0/
1:47 PM Not sure if that helps. What I can say is that nothing here is using the 172.20.x addressing, or this is not being affected by the new cloud-private networking.
1:47 PM cloudweb, cloudgw and cloudnet are on their existing addresses that they were prior to starting any of this
1:49 PM Seems there is a NAT rule to forward this traffic to/from VM IP 172.16.128.97
1:49 PM But that IP is unreachable from the cloudnet for some reason
1:50 PM root@cloudnet2005-dev:/home/cmooney# ip neigh show 172.16.128.97
1:50 PM 172.16.128.97 dev qr-21e10025-d4 FAILED
1:51 PM It can ping other VMs so I think the issue isn't with cloudnet2005 connection to the instance vlan
1:51 PM https://www.irccloud.com/pastebin/vSa9SoOM/
1:53 PM TL;DR - I don't think this is a physical network issue, and it's not using any of the new components
1:57 PM cloudnet2005-dev can't reach VM tools-codfw1dev-bastion-2 for some reason

Just to clarify, the above log relates to an issue connecting to manila-sharecontroller.cloudinfra-codfw1dev (185.15.57.5) from cloudweb2002-dev (208.80.153.41).

That is unrelated to the problems discussed in this task, which are about the moved cloudcontrol hosts and the new networking (cloud-private) introduced as part of the cloudlb project. The issue seems to be that tools-codfw1dev-bastion-2 (172.16.128.97) is unreachable. The health of that VM should be checked, but either way I'd say it's best to open a separate task about that, and keep this one about the cloudcontrol / networking issue (which I believe is resolved).

Mentioned in SAL (#wikimedia-cloud) [2023-06-05T09:40:59Z] <arturo> [codfw1dev] rebooting bastion-codfw1dev-02 (no IP address in the main interface) T336963

aborrero triaged this task as High priority.Jun 5 2023, 9:49 AM
aborrero moved this task from Inbox to Soon! on the cloud-services-team board.

The most likely problem here is that rabbitmq's new addresses are unreachable by cloudvirts. This prevents all nova-compute bits (including network setup) from being effective.

The solution is to add cloud-private to the hypervisors.

The most likely problem here is that rabbitmq's new addresses are unreachable by cloudvirts. This prevents all nova-compute bits (including network setup) from being effective.

The solution is to add cloud-private to the hypervisors.

Tracking this work in T338125: cloudvirt: connect them to cloud-private
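An illustrative way to verify that from a hypervisor (the rabbitmq address is a placeholder and 5672 is the default AMQP port; both are assumptions, not values from this task):

aborrero@cloudvirt2001-dev:~ $ nc -zv <rabbitmq-cloud-private-address> 5672
# a timeout or "no route to host" here would confirm the new addresses are unreachable from the cloudvirts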

Now seeing this on cloudvirt2001-dev:

2023-06-06 12:28:08.643 2390 WARNING keystoneauth.identity.generic.base [None req-3c9c712d-4ae9-47f5-9e48-eadd683596a4 - - - - - -] Failed to discover available identity versions when contacting https://openstack.codfw1dev.wikimediacloud.org:25357/v3. Attempting to parse version from URL.: keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to https://openstack.codfw1dev.wikimediacloud.org:25357/v3: HTTPSConnectionPool(host='openstack.codfw1dev.wikimediacloud.org', port=25357): Max retries exceeded with url: /v3 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f6a72942fa0>: Failed to establish a new connection: [Errno 113] EHOSTUNREACH'))

However, the connection is allowed network-wise:

aborrero@cloudvirt2001-dev:~ $ telnet openstack.codfw1dev.wikimediacloud.org 25357
Trying 185.15.57.24...
Connected to openstack.codfw1dev.wikimediacloud.org.
Escape character is '^]'.
GET /v3/
Connection closed by foreign host.

The problem may be a missing ACL elsewhere.

OK, the problem now seems to be that designate is misbehaving and the VMs (which are otherwise running fine) can't do basic things like DNS or LDAP.

OK, the problem now seems to be that designate is misbehaving and the VMs (which are otherwise running fine) can't do basic things like DNS or LDAP.

This can be explained by asymmetric routing between the cloud VMs and the DNS servers:

aborrero@cloudservices2005-dev:~ $ sudo tcpdump -i any -n host nat.cloudgw.codfw1dev.wikimediacloud.org
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
15:16:09.070525 eno1  In  IP 185.15.57.1.42602 > 208.80.153.50.53: 36017+ A? ntp-1.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud. (70)
15:16:09.070526 eno1  In  IP 185.15.57.1.42602 > 208.80.153.50.53: 60075+ AAAA? ntp-1.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud. (70)

aborrero@cloudservices2005-dev:~ $ ip route get 185.15.57.1 from 208.80.153.50
185.15.57.1 from 208.80.153.50 via 172.20.5.1 dev vlan2151 uid 18194 
    cache 

Note how packets arrive via eno1 but are expected to be routed back using vlan2151.

This suggests that we should get these 2 done:

and until then we won't get a smooth experience with VM networking in codfw1dev.

Note how packets arrive via eno1 but are expected to be routed back using vlan2151

FWIW, whether the traffic routes out via the asw (eno1) or cloudsw (eno2), it should still be forwarded back through the cloudgw via the same interface and with the same packet header 5-tuple, and thus match the right NAT translation etc. So the asymmetric routing itself should not break this.

The problem is the uRPF packet filter on the cloudservices nodes. This is enabled on all relevant interfaces, which means the device drops inbound traffic on eno1 with a source IP address of 185.15.57.1, as its route to that network is via vlan2151@eno2. Disabling it allowed the return traffic through:

sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.eno1.rp_filter=0
sysctl -w net.ipv4.conf.eno2.rp_filter=0
sysctl -w net.ipv4.conf.vlan2151.rp_filter=0

After that, the traffic is allowed in on eno1, and we see replies go out the other interface:

root@cloudservices2005-dev:~# tcpdump -i eno2 -l -p -nn host 185.15.57.1
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eno2, link-type EN10MB (Ethernet), snapshot length 262144 bytes
15:49:40.199493 IP 208.80.153.50.53 > 185.15.57.1.44661: 53317 NXDomain 0/1/0 (166)
15:49:40.490718 IP 208.80.153.50.53 > 185.15.57.1.60758: 39222 1/0/0 A 185.15.57.24 (72)
15:49:40.490756 IP 208.80.153.50.53 > 185.15.57.1.60758: 64212 0/1/0 (117)
15:49:40.609770 IP 208.80.153.50.53 > 185.15.57.1.33504: 34421 1/0/0 A 185.15.57.24 (72)
15:49:40.609796 IP 208.80.153.50.53 > 185.15.57.1.33504: 56396 0/1/0 (117)

DNS now works where it previously failed:

root@cloudnet2005-dev:~# dig +noall +answer -b 172.16.128.1 www.google.ie @208.80.153.50
www.google.ie.		292	IN	A	142.251.116.94

What made this confusing is that you can see the packets coming in on eno1 in the tcpdump, but they are actually dropped by these rp_filter rules before reaching the DNS, NTP, or other daemons.
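If disabling uRPF on these interfaces is to be kept rather than left as a runtime-only change, a minimal sketch of persisting it is below; the file name and scope are assumptions, not what was actually deployed (in practice this would presumably go through Puppet):

# hypothetical /etc/sysctl.d/99-cloudservices-rp-filter.conf
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.eno1.rp_filter = 0
net.ipv4.conf.eno2.rp_filter = 0
net.ipv4.conf.vlan2151.rp_filter = 0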

This works now! Please reopen if required.