
CloudVPS: enable BGP in the neutron transport network
Open, Stalled, Medium, Public

Description

This task is to track the work for enabling BGP between the neutron virtual router and core routers.

We will do first codfw1dev and if everything works out as expected, then eqiad1.

Docs: https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Network_refresh#BGP_in_the_transport_network

Event Timeline

jbond triaged this task as Medium priority. Feb 19 2020, 1:43 PM
jbond subscribed.

Mentioned in SAL (#wikimedia-cloud) [2020-02-20T13:33:43Z] <arturo> [codfw1dev] disable puppet in cloudnet servers to hack neutron.conf for tests related to T245606

Mentioned in SAL (#wikimedia-cloud) [2020-02-20T13:35:43Z] <arturo> [codfw1dev] disable puppet in cloudcontrol servers to hack neutron.conf for tests related to T245606

Additional tests related to this are blocked on missing backported packages for the stretch-pike combo: python3-os-ken and neutron-dynamic-routing (pike == v11). I already contacted the upstream (Debian) folks to see if we can move this forward.

Mentioned in SAL (#wikimedia-cloud) [2020-02-21T11:49:26Z] <arturo> [codfw1dev] rename neutron address scope no-nat to bgp (T245606)

Mentioned in SAL (#wikimedia-cloud) [2020-02-21T11:51:36Z] <arturo> [codfw1dev] create a neutron subnet pool for each subnet object we have and manually update the DB to inter-associate them (T245606)

python3-os-ken is not required for OpenStack Pike. The Ryu driver is used instead (and it is available).
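
For reference, selecting the Ryu driver happens in the BGP dragent config; a minimal sketch of bgp_dragent.ini (the router-id value here is illustrative):

[BGP]
# use the Ryu BGP speaker driver shipped with neutron-dynamic-routing
bgp_speaker_driver = neutron_dynamic_routing.services.bgp.agent.driver.ryu.driver.RyuBgpDriver
# router-id for the speaker, typically an address local to the agent host
bgp_router_id = 208.80.153.190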

When trying to create the bgp speaker I found this issue:

2020-02-21 11:54:04.537 6951 ERROR neutron.api.v2.resource [req-81b09e12-046b-421c-86f1-09c9197d1474 novaadmin admin - default default] create failed: No details.: OperationalError: (_mysql_exceptions.OperationalError) (1054, "Unknown column 'project_id' in 'field list'") [SQL: u'INSERT INTO bgp_speakers (project_id, id, name, local_as, advertise_floating_ip_host_routes, advertise_tenant_networks, ip_version) VALUES (%s, %s, %s, %s, %s, %s, %s)'] [parameters: ('admin', '0a06502d-b164-4fc6-8e67-ddefda69b005', 'bgpspeaker', u'64711', 1, 1, 4)]

which may again indicate a version mismatch between the library and the database schema.

This happened because the last time we did the schema upgrade we didn't have the Python libs for BGP. Once the Python libs for BGP are installed, the schema should be updated again with sudo neutron-db-manage upgrade head on cloudcontrol2002-dev.
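
A sketch of that invocation (the exact --config-file arguments depend on the local setup):

root@cloudcontrol2002-dev:~# neutron-db-manage --config-file /etc/neutron/neutron.conf upgrade head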

Mentioned in SAL (#wikimedia-cloud) [2020-02-21T12:46:27Z] <arturo> [codfw1dev] created bgpspeaker for AS64711 (T245606)
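
For the record, a speaker like that would be created with something along these lines (a sketch reconstructed from the SAL entry above):

root@cloudcontrol2001-dev:~# neutron bgp-speaker-create --local-as 64711 --ip-version 4 bgpspeaker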

Mentioned in SAL (#wikimedia-cloud) [2020-02-21T12:48:17Z] <arturo> [codfw1dev] running root@cloudcontrol2001-dev:~# neutron bgp-speaker-network-add bgpspeaker wan-transport-codfw (T245606)

Apparently the config makes sense; this is what the BGP speaker would advertise:

root@cloudcontrol2001-dev:~# neutron bgp-speaker-advertiseroute-list bgpspeaker
+-----------------+----------------+
| destination     | next_hop       |
+-----------------+----------------+
| 185.15.57.2/32  | 208.80.153.190 |
| 172.16.128.0/24 | 208.80.153.190 |
+-----------------+----------------+

which seems correct:

  • a floating IP allocated to an instance (there is another floating IP which is not advertised because it is not associated with a VM)
  • the flat LAN network for VM instances

I've been reading the linked proposal and noticed this:
"the internal flat network CIDR. This is 172.16.0.0/21 in eqiad1 and 172.16.128.0/24 in codfw1dev."
I see hieradata/codfw/profile/openstack/codfw1dev/neutron.yaml has that as a /24, but I see several other references in puppet to it being a /21?

Fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/574400 thanks!

Mentioned in SAL (#wikimedia-cloud) [2020-02-24T10:56:57Z] <arturo> [codfw1dev] root@cloudcontrol2001-dev:~# neutron bgp-peer-create --peer-ip 208.80.153.185 --remote-as 65002 bgppeer (T245606)

Mentioned in SAL (#wikimedia-cloud) [2020-02-24T10:59:32Z] <arturo> [codfw1dev] root@cloudcontrol2001-dev:~# neutron bgp-speaker-peer-add bgpspeaker bgppeer (T245606)

Mentioned in SAL (#wikimedia-cloud) [2020-02-24T12:06:57Z] <arturo> [codfw1dev] root@cloudcontrol2001-dev:~# neutron bgp-peer-delete 17b8c2a3-f0ce-4d50-a265-18ccac703c61 (T245606)

Mentioned in SAL (#wikimedia-cloud) [2020-02-24T12:09:12Z] <arturo> [codfw1dev] root@cloudcontrol2001-dev:~# neutron bgp-peer-create --peer-ip 208.80.153.186 --remote-as 65002 cr1-codfw (T245606)

Mentioned in SAL (#wikimedia-cloud) [2020-02-24T12:09:22Z] <arturo> [codfw1dev] root@cloudcontrol2001-dev:~# neutron bgp-peer-create --peer-ip 208.80.153.187 --remote-as 65002 cr2-codfw (T245606)

Mentioned in SAL (#wikimedia-cloud) [2020-02-24T12:16:17Z] <arturo> [codfw1dev] root@cloudcontrol2001-dev:~# neutron bgp-speaker-peer-add bgpspeaker cr1-codfw (T245606)

Mentioned in SAL (#wikimedia-cloud) [2020-02-24T12:16:22Z] <arturo> [codfw1dev] root@cloudcontrol2001-dev:~# neutron bgp-speaker-peer-add bgpspeaker cr2-codfw (T245606)

NOTE: apparently the neutron BGP implementation doesn't support ingesting routes using BGP, only advertising them. In our neutron setup, the default transport route is set when defining the subnet object.
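
To make that concrete: the default route on the transport side comes from the gateway_ip attribute of the subnet object, which can be checked with something like this (a sketch; the subnet name is a placeholder and the gateway value is illustrative):

root@cloudcontrol2001-dev:~# neutron subnet-show <transport-subnet> | grep gateway_ip
| gateway_ip | 208.80.153.185 |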

For the router-to-cloudnet traffic, we should only establish the BGP sessions over the transport network; doing it over the hosts VLAN would be very hackish (multihop, and it would not detect a failure of the transport network).

About "doesn't support ingesting routes using BGP" this is also a limitation in term of failovers.
As in an ideal situation, inbound BGP would replace VRRP, if a router were to fail it would stop advertising a default route.
So current situation is still a step forward but not the full solution yet.

Change 574452 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/dns@master] codfw: cloudnet: allocate addresses in the cloud transport network

https://gerrit.wikimedia.org/r/574452

Change 574452 merged by Arturo Borrero Gonzalez:
[operations/dns@master] codfw: cloudnet: allocate addresses in the cloud transport network

https://gerrit.wikimedia.org/r/574452

I'm seeing this in the OpenStack BGP speaker:

2020-03-03 12:29:28.322 2724 ERROR neutron_dynamic_routing.services.bgp.agent.bgp_dragent [req-c01d353e-7b54-431b-8892-3030c2bb0fe2 - - - - -] Call to driver for BGP Speaker 0ef14753-efb6-483d-8ebf-a21262ded8d5 add_bgp_peer has failed with exception 'auth_type'.

Searching around, I found an upstream patch with a potential fix for OpenStack Pike: https://review.opendev.org/#/c/545783/

I submitted a merge request for the Debian package here:
https://salsa.debian.org/openstack-team/services/neutron-dynamic-routing/-/merge_requests/1

Anyway, we might upgrade to OpenStack Queens soon, which should include the fix.

Fixed package is neutron-bgp-dragent_11.0.0-2~bpo9+1 and friends.

hey @ayounsi, do you have to enable anything on your side for BGP to work? I see something weird; I get a "no route to host" error here:

aborrero@cloudnet2002-dev:~ 255 $ ip a show dev eno2.2120
7: eno2.2120@eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-external state UP group default qlen 1000
    link/ether 30:e1:71:55:a2:41 brd ff:ff:ff:ff:ff:ff
    inet 208.80.153.188/29 scope global eno2.2120
       valid_lft forever preferred_lft forever
aborrero@cloudnet2002-dev:~ $ ip r
default via 10.192.20.1 dev eno1 onlink 
10.192.20.0/24 dev eno1 proto kernel scope link src 10.192.20.10 
208.80.153.184/29 dev eno2.2120 proto kernel scope link src 208.80.153.188 
aborrero@cloudnet2002-dev:~ $ ip r get 208.80.153.187
208.80.153.187 dev eno2.2120 src 208.80.153.188 
    cache 
aborrero@cloudnet2002-dev:~ $ telnet 208.80.153.187 179
Trying 208.80.153.187...
telnet: Unable to connect to remote host: No route to host
aborrero@cloudnet2002-dev:~ $ sudo tcpdump -i eno2.2120 host 208.80.153.187
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eno2.2120, link-type EN10MB (Ethernet), capture size 262144 bytes
12:32:28.907382 ARP, Request who-has ae2-2120.cr2-codfw.wikimedia.org tell eno2-2120.cloudnet2003-dev.wikimedia.org, length 42
12:32:29.931374 ARP, Request who-has ae2-2120.cr2-codfw.wikimedia.org tell eno2-2120.cloudnet2003-dev.wikimedia.org, length 42
12:32:30.955419 ARP, Request who-has ae2-2120.cr2-codfw.wikimedia.org tell eno2-2120.cloudnet2003-dev.wikimedia.org, length 42
[..]

The ARP reply is produced and reaches my server; I just don't know yet what's going on, or which interface the reply packet is using:

aborrero@cloudnet2002-dev:~  $ sudo tcpdump -i any host 208.80.153.187
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
12:38:26.090417 ethertype ARP, ARP, Request who-has ae2-2120.cr2-codfw.wikimedia.org tell eno2-2120.cloudnet2002-dev.wikimedia.org, length 28
12:38:26.090827 ethertype ARP, ARP, Reply ae2-2120.cr2-codfw.wikimedia.org is-at a8:d0:e5:e3:87:c7 (oui Unknown), length 42
12:38:26.090829 ARP, Reply ae2-2120.cr2-codfw.wikimedia.org is-at a8:d0:e5:e3:87:c7 (oui Unknown), length 42
12:38:26.090834 ARP, Reply ae2-2120.cr2-codfw.wikimedia.org is-at a8:d0:e5:e3:87:c7 (oui Unknown), length 42

This is probably related to adding the IP address to my vlan interface inside the bridge.

Forget the last 2 comments. I think I can assign the address to the bridge device instead of the VLAN device, and everything should work as expected.
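
Since eno2.2120 is enslaved to br-external (note "master br-external" in the ip output above), traffic for that subnet actually flows through the bridge, so the address belongs on the bridge device. Roughly, something like this (a sketch; addresses taken from the output above):

root@cloudnet2002-dev:~# ip addr del 208.80.153.188/29 dev eno2.2120
root@cloudnet2002-dev:~# ip addr add 208.80.153.188/29 dev br-external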

Change 577232 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/dns@master] codfw: cloudnet: refresh extra address FQDN

https://gerrit.wikimedia.org/r/577232

Mentioned in SAL (#wikimedia-cloud) [2020-03-05T13:06:37Z] <arturo> [codfw1dev] upgrade neutron-dynamic-routing packages in cloudnet200X-dev and cloudcontrol200X-dev servers to 11.0.0-2~bpo9+1 (T245606)

Mentioned in SAL (#wikimedia-cloud) [2020-03-05T13:07:18Z] <arturo> [codfw1dev] move the extra IP address for BGP in cloudnet200x-dev servers from eno2.2120 to the br-external bridge device (T245606)

Change 577232 merged by Arturo Borrero Gonzalez:
[operations/dns@master] codfw: cloudnet: refresh extra address FQDN

https://gerrit.wikimedia.org/r/577232

[edit protocols bgp]
     group Netflow { ... }
+    /* T245606 */
+    group Cloud {
+        import BGP_Cloud_in;
+        family inet {
+            unicast {
+                prefix-limit {
+                    maximum 50;
+                    teardown 80;
+                }
+            }
+        }
+        family inet6 {
+            unicast {
+                prefix-limit {
+                    maximum 50;
+                    teardown 80;
+                }
+            }
+        }
+        export BGP_Default;
+        peer-as 64711;
+        neighbor 208.80.153.188 {
+            description cloudnet2002-dev;
+        }
+        neighbor 208.80.153.189 {
+            description cloudnet2003-dev;
+        }
+    }
[edit policy-options]
    prefix-list fundraising-codfw4 { ... }
+   prefix-list cloud {
+       172.16.128.0/21;
+       185.15.57.0/24;
+   }
[edit policy-options]
+   policy-statement BGP_Cloud_in {
+       term address {                  
+           from {
+               family inet;
+               protocol bgp;
+               prefix-list-filter cloud orlonger;
+           }
+           then accept;
+       }
+       then reject;
+   }
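
Once pushed, the sessions and the routes received from the cloudnet speakers can be verified on the routers with standard Junos commands, e.g. (a sketch):

user@cr1-codfw> show bgp summary
user@cr1-codfw> show route receive-protocol bgp 208.80.153.188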

Mentioned in SAL (#wikimedia-operations) [2020-03-05T14:09:28Z] <XioNoX> push BGP to Cloud on cr1-codfw - T245606

Mentioned in SAL (#wikimedia-cloud) [2020-03-05T14:24:28Z] <arturo> [codfw1dev] we just enabled BGP session between cloudnet2xxx-dev and cr1-codfw (T245606)

Mentioned in SAL (#wikimedia-operations) [2020-03-05T14:25:34Z] <XioNoX> push BGP to Cloud on cr2-codfw - T245606

Reviewing this setup again, I just noticed this static route:

route 185.15.57.0/29 next-hop 208.80.153.190;

That CIDR covers both the floating IP range and the routing_source_ip address.

Additionally, if you see this:

root@cloudcontrol2001-dev:~# neutron bgp-speaker-advertiseroute-list bgpspeaker
+-----------------+----------------+
| destination     | next_hop       |
+-----------------+----------------+
| 185.15.57.2/32  | 208.80.153.190 | <--- floating IP
| 172.16.128.0/24 | 208.80.153.190 | <--- internal network
+-----------------+----------------+

You can see that Neutron generates a prefix to be advertised via BGP for each individual floating IP, but it does not generate a route for the routing_source_ip address.
This means we would need to keep the static route in the core routers, and this whole BGP setup would give us very little benefit (the main one was to avoid maintaining static routes in the core routers).

Wait, I think I may have a solution for this.

Mentioned in SAL (#wikimedia-cloud) [2020-03-18T10:55:36Z] <arturo> [codfw1dev] deleting BGP agent, undoing changes we did for T245606

BGP and firewall filter config removed from codfw's router.

We decided to drop the BGP project for now.

We collected valuable information about the setup, how it works, and what we require, so next time we decide to introduce this (or do further research) it should be much simpler.

Change 580940 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/dns@master] codfw: openstack: drop unused br-external FQDNs for cloudnet servers

https://gerrit.wikimedia.org/r/580940

Change 580940 merged by Arturo Borrero Gonzalez:
[operations/dns@master] codfw: openstack: drop unused br-external FQDNs for cloudnet servers

https://gerrit.wikimedia.org/r/580940

Reopening this per IRC, and given this is a prod/WMCS task affecting prod in major ways.

First of all, it'd be great to hear a little bit more information on why this was abandoned. Was this infeasible or too complex with Neutron, a change of direction, etc.? Would love to hear what those learnings were too!

In terms of the requirements that were driving the task (separating the networks, moving WMCS under a "customer" model), they are still very much there, and they are a hard requirement from a prod & SRE team perspective.

If Neutron is limited in some way, we can supplement it with hardware gear (e.g. a pair of dedicated routers). That's something we can totally do, and have been thinking about, and it may make sense from other perspectives as well (= provide network routing for "labs-support" type networks). Before we go through that, it would be great for me to better understand the constraints at play here, though, hence my questions above :)

Hope this all makes sense and happy to discuss further!

  • Neutron BGP is outbound only, so we would still need to keep the VRRP VIP between cr1 and cr2 and a static route from cloud -> core
  • Neutron BGP doesn't allow setting up BGP on an interface managed by Neutron, in this case the cloudnet sub-interface with a leg on the transport subnet
  • This means Arturo had to create sub-interfaces manually, only for BGP
    • Those interfaces are only for BGP; client traffic still goes through the Neutron-managed VIP shared between the two cloudnets
    • They can't have iptables rules, as Neutron manages iptables but doesn't manage those interfaces

TLDR:

  • The good side: we can get rid of the static routes from cores -> cloudnet-VIP
  • It requires extra firewall policies on the core side (tech debt) to protect those non-iptables-able IPs/interfaces
  • It doesn't improve failover as the cloudnet and VRRP VIPs need to stay

I *think* the current Neutron BGP implementation is meant for cloudvirts to peer with the top-of-rack switches, not for cloudnets to peer with the outside world.

I hope what @ayounsi said helps clarify the situation, @faidon. Some additional info about the setup we tested can be found here: https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Network_refresh#BGP_in_the_transport_network
BGP is not what we need right now to solve our shortcomings; I believe there are other approaches that could yield a better short/mid-term benefit and get us closer to the customer model you mention.

But I honestly feel that this requires a more "high level" talk and sync. As @bd808 mentioned on IRC, we are just not ready for a complete hardware split. As of today, I don't think we have the human capacity to manage an increased amount of hardware (servers, switches, routers, whatever). But I would love to :-)

You mention having a couple of dedicated routers. I totally agree that it could improve the situation in a somewhat incremental way. It is something I've proposed several times already; my last brain dump is here: https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Network_refresh#intermediate_router/firewall From my point of view, again, we should be ready to talk about what this involves from the L2/L3 isolation point of view: installing servers, monitoring, switches, VLANs, etc.

Summary: +1 to the idea of having a couple of dedicated routers (Linux boxes in this case)

BGP is quite a slow protocol; you might want to tweak some timers or combine it with BFD.
If BGP is giving you too much hassle, you might want to consider switching to OSPF. On the Juniper side you should be able to run it in a separate routing-instance to keep it away from your main OSPF process.
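
For illustration, on the Junos side BFD can be enabled per group or per neighbor with something like the following (a sketch; the interval values are illustrative, and it assumes the cloudnet end of the session can speak BFD as well):

[edit protocols bgp group Cloud]
+    bfd-liveness-detection {
+        minimum-interval 300;
+        multiplier 3;
+    }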

aborrero changed the task status from Open to Stalled. Apr 9 2020, 11:02 AM

We are not planning on working on this anytime soon.