cloudgw: replace keepalived with bird
Open, Stalled, Low, Public

Description

The initial cloudgw implementation used keepalived / VRRP to implement the virtual IPs for the different network gateways.

However, we have recently realized that an anycast/BGP setup would be a bit more robust and would let all the cloudgw servers act as active nodes.

This ticket is to track the work to replace keepalived with bird and switch to BGP instead.

Event Timeline

Change 922104 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: refactor to set up routes independently from keepalived

https://gerrit.wikimedia.org/r/922104

Change 922105 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: refactor vlan interfaces to use interface::tagged

https://gerrit.wikimedia.org/r/922105

Change 922106 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: codfw: add cloud-private subnet support

https://gerrit.wikimedia.org/r/922106

Change 963000 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Move 185.15.57.8/29 to netbox-controlled DNS records

https://gerrit.wikimedia.org/r/963000

Change 963000 merged by Cathal Mooney:

[operations/dns@master] Move 185.15.57.8/29 to netbox-controlled DNS records

https://gerrit.wikimedia.org/r/963000

Change 922104 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: refactor to set up routes independently from keepalived

https://gerrit.wikimedia.org/r/922104

Change 963279 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: interfaces: set up cloudgw <-> cloudnet routes in the right interface

https://gerrit.wikimedia.org/r/963279

Change 963279 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: interfaces: set up cloudgw <-> cloudnet routes in the right interface

https://gerrit.wikimedia.org/r/963279

Change 963283 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: put cloud-realm routes back under keepalived control

https://gerrit.wikimedia.org/r/963283

Change 963283 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: put cloud-realm routes back under keepalived control

https://gerrit.wikimedia.org/r/963283

Change 963311 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: move routes out of keepalived into interfaces

https://gerrit.wikimedia.org/r/963311

Change 922105 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: refactor interfaces setting to use the base module

https://gerrit.wikimedia.org/r/922105

Change 963311 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: move routes out of keepalived into interfaces

https://gerrit.wikimedia.org/r/963311

I think @cmooney can continue this work in the future.

aborrero changed the task status from Open to Stalled. Sep 4 2024, 2:34 PM
aborrero triaged this task as Low priority.

Happy to advise on how to set up Bird on the cloudgw side, but I'm not gonna start merging patches and over-stepping the mark :)

It's a quick thing to set up on the cloudsw whenever we are ready.

One bit of advice for the cloudgw side: if we implement this, we should add a high-metric blackhole route to the cloud-vrf table:

ip route add vrf vrf-cloudgw blackhole default metric 9999

Without that, if the default route from the cloudsw was somehow not present and traffic came in to the cloudgw, the packet could end up being routed via the default table (i.e. wmf prod land). This is due to how the "ip rules" that select the VRF work: the next ip rule is used if no matching route is found in the VRF table. Unlikely, but best we are covered.
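
To make the fall-through concrete, here is a rough Python model of the lookup order described above. It is only a sketch under simplifying assumptions: the "ip rules" are reduced to an ordered list of tables, and the destination address and the route labels ("via cloudsw", "via prod gateway") are made up for illustration. It does not talk to the kernel or reflect the real cloudgw routing tables.

```python
from dataclasses import dataclass, field
from ipaddress import ip_address, ip_network
from typing import List, Optional, Tuple


@dataclass
class Route:
    prefix: str
    action: str      # illustrative label, e.g. "via cloudsw" or "blackhole"
    metric: int = 0


@dataclass
class Table:
    name: str
    routes: List[Route] = field(default_factory=list)

    def lookup(self, dst: str) -> Optional[Route]:
        """Return the best matching route, or None if nothing matches."""
        addr = ip_address(dst)
        matches = [r for r in self.routes if addr in ip_network(r.prefix)]
        if not matches:
            return None
        # Longest prefix wins; among equal prefixes, the lowest metric wins.
        return min(
            matches,
            key=lambda r: (-ip_network(r.prefix).prefixlen, r.metric),
        )


def resolve(tables: List[Table], dst: str) -> Tuple[Optional[str], str]:
    """Consult the tables in rule order; a table with no match falls through."""
    for table in tables:
        route = table.lookup(dst)
        if route is not None:
            return table.name, route.action
    return None, "unreachable"


def build_tables(cloudsw_default: bool, blackhole: bool) -> List[Table]:
    """Build the VRF table followed by the main table (hypothetical contents)."""
    vrf = Table("vrf-cloudgw")
    if cloudsw_default:
        vrf.routes.append(Route("0.0.0.0/0", "via cloudsw", metric=0))
    if blackhole:
        vrf.routes.append(Route("0.0.0.0/0", "blackhole", metric=9999))
    main = Table("main", [Route("0.0.0.0/0", "via prod gateway")])
    return [vrf, main]


DST = "203.0.113.10"  # documentation address, assumed for the example

if __name__ == "__main__":
    for has_default, has_blackhole in [(True, False), (False, False), (False, True)]:
        table, action = resolve(build_tables(has_default, has_blackhole), DST)
        print(f"default={has_default} blackhole={has_blackhole}: {table} / {action}")
```

In this model, removing the cloudsw default without the blackhole in place makes the lookup land in the main table, while the metric-9999 blackhole keeps the packet inside the VRF table and drops it. While the cloudsw default is present, its lower metric means the blackhole is never selected.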

fgiunchedi renamed this task from cloudgw: replace keepalived with BGP to cloudgw: replace keepalived with bird. Tue, Feb 24, 10:04 AM
fgiunchedi updated the task description.