
cloudlb: create PoC on codfw
Closed, Resolved (Public)

Description

This task is to track work to create a cloudlb project proof-of-concept in codfw.

We agreed on re-using the old cloudgw2001-dev server for early testing before we get proper hardware for it.

Details

Related Gerrit patches: 32 in operations/puppet (production), 2 in operations/homer/public (master), and 1 in operations/dns (master). The individual change subjects and numbers appear in the event timeline below.

Related Objects

58 linked tasks: 49 Resolved, 6 Open, 3 Invalid.

Event Timeline

There are a very large number of changes, so older changes are hidden.

+1 on the VIP ranges, we can reserve them at least. I think the public IPs are what we're most interested in for now, so we can focus on those.

Whatever public range we use for the VIPs, it makes sense to add a static route on the cloudvirts, and other cloud servers, for that same public range via the cloud-private subnet. This will mean traffic goes directly from cloud hosts to cloudlb over the cloud-private VLAN, rather than following the default route out in the production realm through the CRs and then back to the switches via the cloud VRF. i.e.
ip route add 185.15.57.24/29 via 172.20.5.1
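And to sanity-check on a host that traffic towards the VIP range then takes the cloud-private path (interface name and exact output will vary, this is just illustrative):

    # add the static and check which way an address in the VIP range routes
    ip route add 185.15.57.24/29 via 172.20.5.1
    ip route get 185.15.57.26
    # expect something like: 185.15.57.26 via 172.20.5.1 dev <cloud-private vlan interface> ...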

I'm slightly on the fence here and I'm wondering if we shouldn't/couldn't do that routing on the cloudsw instead (between the vrfs).
Multiple routes on servers eventually end up with asymmetric routing one day or another (e.g. if the traffic originates from a different IP than the interface's own on the server).

I'm slightly on the fence here and I'm wondering if we shouldn't/couldn't do that routing on the cloudsw instead (between the vrfs).

I feel route-leaking on the cloudsw would be a total violation of the whole concept of having two realms, and a dangerous and complicated config to start adding.

Multiple routes on servers eventually end up with asymmetric routing one day or another (e.g. if the traffic originates from a different IP than the interface's own on the server).

I don't think this is much of a concern here. Ultimately with hosts connected to multiple networks, as we have, it'll always be something to consider (statics or not), but the additional routes seem simple and straightforward to me. It's been part of the design for this since day one (see https://w.wiki/6WPR).

There really is no other way to support the separate cloud subnets per rack (a complexity for WMCS we are insisting on) while also keeping the inter-rack cloud-private traffic within the cloud realm. Of course there are things like network namespaces or VRFs on the hosts, but I think that is a serious additional layer of complexity, and I suspect more likely to result in strange routing situations.

I can't see how in normal circumstances we'd ever get asymmetric routing here, unless someone got very creative on the hosts.

If it does happen, things simply won't work: the 172.20.0.0/16 networks will not be reachable from the prod realm. We should add a term to the 'labs-in' filter on the CRs to block traffic from the cloud hosts' 10.x prod-realm IPs to the public VIP ranges, just as an extra protection.
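A rough sketch of what such a term could look like (Junos-style; the term name is made up here, and the real source list would be the specific cloud host subnets rather than all of 10/8):

    /* illustrative extra safeguard term for the 'labs-in' filter */
    term block-cloud-prod-to-cloud-vips {
        from {
            source-address {
                10.0.0.0/8;           /* placeholder for cloud hosts' prod-realm IPs */
            }
            destination-address {
                185.15.57.24/29;      /* cloud public VIP range */
            }
        }
        then {
            discard;
        }
    }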

@ayounsi just looking at the bird anycast template in puppet, I think the "vips_filter" is potentially not going to allow /32s from 185.15.57.24/29 or 172.20.254.0/24 to be announced?

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/bird/templates/bird_anycast.conf.erb#52

On the Bird config, we should use MEDs instead of prepending to the secondary node

Sorry, not sure how I missed this comment before. We'll need to use prepending here, as there is EBGP between cloudsw-c8/d5 and cloudsw-e4/f4, and MED is non-transitive so it won't work. We'll need 2 prepends so c8/d5 will see a route from a primary connected to e4/f4 as better (1 AS-hop away vs 2).
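Something like this in the Bird export filter on the secondary node, for example (a rough sketch only; the filter name and the ASN are placeholders, not what's in puppet):

    # Bird 2 sketch: on the secondary cloudlb, prepend our own ASN twice on
    # export so the primary's path (1 AS-hop closer) stays preferred.
    filter cloudlb_vip_export
    {
        # only the /32 VIPs carved out of 185.15.57.24/29
        if net ~ [ 185.15.57.24/29{32,32} ] then {
            bgp_path.prepend(64710);   # placeholder private ASN
            bgp_path.prepend(64710);
            accept;
        }
        reject;
    }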

But first, do we need to distinguish between a primary and secondary router?

@aborrero correct me if I'm wrong, but primary/secondary is a requirement here, is it not? I can imagine a scenario:

  • An external connection comes in from the internet to cr1-eqiad, gets handed off to cloudsw1-c8-eqiad and sent to the directly connected cloudlb1001
  • Some external change on the wider internet (transit etc.) causes subsequent packets in this flow to arrive at cr2-eqiad
  • cr2-eqiad in turn sends the traffic to cloudsw1-d5-eqiad, which in turn sends it to the directly connected cloudlb1002
  • When the packets hit cloudlb1002 it has no knowledge of the existing backend server and no NAT conntrack for the flow
  • Traffic gets sent to a different backend server as a result, breaking the session?

Unless there is a way to synchronize the HAproxy states between the two cloudlbs? That would allow active/active, as no matter which cloudlb the traffic arrives at, it'll always go to the same backend.

Given these differences I'm wondering if we aren't better off defining a different $config_template for the cloudlbs, separate from the one we use for the anycast hosts? I can work on that if needed.

Yes, that scenario breakdown seems correct to me. Apparently HAproxy supports state synchronization, even though I don't think we have ever used it. Per the linked docs, it feels simple to configure, but I know from other stuff (netfilter and conntrackd) that it can introduce complexity later when debugging failures, etc.
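For reference, a minimal sketch of what that could look like (stick-table synchronization via an HAproxy 'peers' section; note this syncs the client-to-backend mappings, not live TCP sessions, and all names, addresses and table parameters below are placeholders):

    # sketch of HAproxy stick-table sync between the two cloudlbs
    # (peer names must match each host's hostname; IPs/ports are placeholders)
    peers cloudlb_pair
        peer cloudlb1001 172.20.5.2:10000
        peer cloudlb1002 172.20.5.3:10000

    backend openstack-api
        # stickiness table shared between the peers, keyed on client IP
        stick-table type ip size 100k expire 30m peers cloudlb_pair
        stick on src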

So do you agree with my read of the multi-master situation?

  • if using multi-master: the BGP implementation is simpler (it reuses most of the anycast profile/templates in puppet), but it involves a new and untested HAproxy setup
  • if not using multi-master: it involves forking the BGP profile/templates in puppet away from the anycast code, but keeps the HAproxy setup simpler.

Change 904518 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] profile::bird::anycast: add template parameter

https://gerrit.wikimedia.org/r/904518

Change 904745 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Bird: POC use a different ASN for Cloud hosts

https://gerrit.wikimedia.org/r/904745

I sent https://gerrit.wikimedia.org/r/904745 to implement a different ASN. But thinking more about it I'm not sure this is needed.
We won't do anything with this information, the ASN won't propagate outside of the WMCS realm, and it adds special cases. Happy to discuss.

@ayounsi just looking at the bird anycast template in puppet, I think the "vips_filter" is potentially not going to allow /32s from 185.15.57.24/29 or 172.20.254.0/24 to be announced?

Indeed! We can get rid of that safeguard before it grows out of hand. We already filter on the network side, and the prefixes we want to advertise are explicitly defined in Puppet, so AFAIK there is no risk of a rogue prefix.

On the MED/prepending routing question, I was reading this task as being focused on the codfw POC, where all the servers are on the same switch, so it's out of scope here, but indeed something to take into consideration when we extend this to eqiad.

Change 904754 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Bird: remove anycast subnet filter

https://gerrit.wikimedia.org/r/904754

I sent https://gerrit.wikimedia.org/r/904745 to implement a different ASN. But thinking more about it I'm not sure this is needed.
We won't do anything with this information, the ASN won't propagate outside of the WMCS realm, and it adds special cases. Happy to discuss.

Yeah, that's OK; as you say, it won't appear on the CRs. If we have reason in the future to use a unique one, we can.

@ayounsi just looking at the bird anycast template in puppet, I think the "vips_filter" is potentially not going to allow /32s from 185.15.57.24/29 or 172.20.254.0/24 to be announced?

Indeed! We can get rid of that safeguard before it grows out of hand. We already filter on the network side, and the prefixes we want to advertise are explicitly defined in Puppet, so AFAIK there is no risk of a rogue prefix.

Sounds good, and yeah the filter already explicitly matches the /32s, so it's just an additional safeguard in case someone adds a /32 from the wrong range. Fairly safe to remove I think.

On the MED/prepending routing question, I was reading this task as being focused on the codfw POC, where all the servers are on the same switch, so it's out of scope here, but indeed something to take into consideration when we extend this to eqiad.

Yeah, fair enough. We could potentially add a var for 'prepend', in addition to the 'deterministic' one in your patch. To be discussed again. We could also make cloudsw1-c8-eqiad and cloudsw1-d5-eqiad route reflectors, and change the EBGP sessions to cloudsw1-e4-eqiad and cloudsw1-f4-eqiad to IBGP, making them clients of those; in which case MED would work.

Typically I've always preferred pre-pends, as they are very obvious when looking at routes and work regardless of EBGP/IBGP. But having two separate ways to express preference is maybe not ideal, and I know we use MED on PyBal etc. already.

For now let's proceed with the existing anycast config, to validate the concept and function of the load-balancer side. We can tweak the setup to prepare for the additional challenges that having multiple cloudsw in eqiad brings, once we are happy with the basics.

Change 904745 abandoned by Ayounsi:

[operations/puppet@production] Bird: POC use a different ASN for Cloud hosts

Reason:

https://gerrit.wikimedia.org/r/904745

Change 868731 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudlb: introduce BGP setup by means of bird

https://gerrit.wikimedia.org/r/868731

Change 904754 merged by Ayounsi:

[operations/puppet@production] Bird: remove anycast subnet filter

https://gerrit.wikimedia.org/r/904754

Change 903622 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloud_private_subnet: add route to public IPv4 range

https://gerrit.wikimedia.org/r/903622

Change 903623 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloud_private_subnet: codfw: relocate some hiera

https://gerrit.wikimedia.org/r/903623

fnegri raised the priority of this task from Medium to High. (Apr 12 2023, 2:52 PM)

@aborrero great work on the Bird anycast. I can see the conf is there, and I added a basic BGP peering on the cloudsw1-b1-codfw side to peer with cloudlb2001-dev.

Unfortunately the session has not established. The reason for this is that the bird template has tried to create 2 BGP sessions, one to each of the core routers in codfw, rather than a single session to the cloudsw itself on 172.20.5.1:

root@cloudlb2001-dev:/etc/bird# grep neighbor bird.conf 
    neighbor 208.80.153.192 external;
    neighbor 208.80.153.193 external;

This comes directly from /hieradata/codfw/profile/bird.yaml. Ultimately we need a way to specify the list of neighbors differently to accommodate this scenario. I need to dig a little deeper; it seems in drmrs the doh600x VMs know to peer directly with the single top-of-rack switch, but there is no bird.yaml in hieradata for drmrs.
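For illustration, something like a per-host hiera override might do it (the file path and key name here are guesses, not what's currently in puppet):

    # hieradata/hosts/cloudlb2001-dev.yaml (hypothetical key name)
    profile::bird::anycast::neighbors_list:
      - 172.20.5.1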

Another thing that is not being set is the IP to announce. It's defaulting to the 203.0.113.1/32 dummy IP, rather than say 185.15.57.24/32.

Change 916464 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] clod_private_subnet: fix BGP neighbors

https://gerrit.wikimedia.org/r/916464

Change 916464 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] clod_private_subnet: fix BGP neighbors

https://gerrit.wikimedia.org/r/916464

Another thing that is not being set is the IP to announce. It's defaulting to the 203.0.113.1/32 dummy IP, rather than say 185.15.57.24/32.

I couldn't find anywhere in puppet or the config files to set this up, beyond the VIP address on loopback with scope global lo:anycast, which puppet already does.
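For example, this is how it shows up on the host (illustrative check):

    # verify the VIP is present on loopback
    ip address show dev lo label lo:anycast
    # expect something like: inet 185.15.57.24/32 scope global lo:anycast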

Manually running the check command from /etc/anycast-healthchecker.d/hc-vip-openstack.codfw1dev.wikimediacloud.org.conf

Returns:

connect to address 185.15.57.24 and port 443: Connection refused
HTTP CRITICAL - Unable to open TCP socket

Change 916519 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudlb: update BGP anycast-healthcheck

https://gerrit.wikimedia.org/r/916519

Change 916519 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudlb: update BGP anycast-healthcheck

https://gerrit.wikimedia.org/r/916519

Manually running the check command from /etc/anycast-healthchecker.d/hc-vip-openstack.codfw1dev.wikimediacloud.org.conf

Returns:

connect to address 185.15.57.24 and port 443: Connection refused
HTTP CRITICAL - Unable to open TCP socket

It took me a while to discover this was the problem. I documented it here for posterity: https://wikitech.wikimedia.org/wiki/Anycast#VIP_not_being_announced_by_BGP

TODO:

  • proper VIP healthcheck, possibly checking that HAproxy has more than 0 backends alive (see the sketch below)
  • route for return traffic. Today, traffic reaching the VIP doesn't return via the same link (it follows the default route in the wiki prod realm).
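For the first item, the idea would be roughly to count servers reported UP via the HAproxy admin socket, something like this (the socket path here is a guess):

    # rough equivalent of the intended check: count live servers
    echo "show stat" | socat stdio /run/haproxy/haproxy.sock \
        | awk -F, '$1 !~ /^#/ && $2 != "FRONTEND" && $2 != "BACKEND" && $18 ~ /^UP/' \
        | wc -l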

Change 917302 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudlb: introduce haproxy check for the BGP VIP

https://gerrit.wikimedia.org/r/917302

Change 917302 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudlb: introduce haproxy check for the BGP VIP

https://gerrit.wikimedia.org/r/917302

Change 917329 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] haproxy: check_haproxy: introduce new check mode --check=someup

https://gerrit.wikimedia.org/r/917329

Change 917369 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Add policy for cloudsw BGP peering to cloudlb and other cloud servers

https://gerrit.wikimedia.org/r/917369

Change 917369 merged by jenkins-bot:

[operations/homer/public@master] Add policy for cloudsw BGP peering to cloudlb and other cloud servers

https://gerrit.wikimedia.org/r/917369

Change 917329 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] haproxy: check_haproxy: introduce new check mode --check=someup

https://gerrit.wikimedia.org/r/917329

Change 918419 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudlb: disable HAproxy config for IPv6

https://gerrit.wikimedia.org/r/918419

Change 918419 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudlb: use dnsquery::lookup()

https://gerrit.wikimedia.org/r/918419

Change 918517 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudlb: haproxy: drop support for IPv6

https://gerrit.wikimedia.org/r/918517

Change 918517 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudlb: haproxy: drop support for IPv6

https://gerrit.wikimedia.org/r/918517

Change 918523 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudlb: haproxy: http-service.cfg.erb: fix template

https://gerrit.wikimedia.org/r/918523

Change 918523 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudlb: haproxy: http-service.cfg.erb: fix template

https://gerrit.wikimedia.org/r/918523

Change 919292 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] network: introduce cloud-private-b1-codfw subnet

https://gerrit.wikimedia.org/r/919292

Change 919292 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] network: introduce cloud-private-b1-codfw subnet

https://gerrit.wikimedia.org/r/919292

Change 919298 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] network: data: add cloud codfw1dev 185.15.57.24/29

https://gerrit.wikimedia.org/r/919298

Change 919298 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] network: data: add cloud codfw1dev 185.15.57.24/29

https://gerrit.wikimedia.org/r/919298

Change 919342 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Andrew Bogott):

[operations/puppet@production] Openstack galera/mariadb grants: allow access via haproxy nodes

https://gerrit.wikimedia.org/r/919342

Change 919342 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Openstack galera/mariadb grants: allow access via haproxy nodes

https://gerrit.wikimedia.org/r/919342

Change 919342 merged by Andrew Bogott:

[operations/puppet@production] Openstack galera/mariadb grants: allow access via haproxy nodes

https://gerrit.wikimedia.org/r/919342

Change 919352 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudservices: codfw1dev: enable cloud-private subnet

https://gerrit.wikimedia.org/r/919352

Change 920291 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Add a new aggregate network for the cloud-private 'supernet'

https://gerrit.wikimedia.org/r/920291

Change 920291 abandoned by Cathal Mooney:

[operations/puppet@production] Add a new aggregate network for the cloud-private 'supernet'

Reason:

we're gonna deal with this another way and review when cloudlb poc is done

https://gerrit.wikimedia.org/r/920291

Change 919352 merged by Andrew Bogott:

[operations/puppet@production] cloudservices: codfw1dev: enable cloud-private subnet

https://gerrit.wikimedia.org/r/919352

Change 923551 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloud_private_subnet: split BGP code into separate profile

https://gerrit.wikimedia.org/r/923551

Change 923552 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloud_private_subnet::bgp: set up route lookup rule only for /32 VIPs

https://gerrit.wikimedia.org/r/923552

Change 923551 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloud_private_subnet: split BGP code into separate profile

https://gerrit.wikimedia.org/r/923551

Change 923552 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloud_private_subnet::bgp: set up route lookup rule only for /32 VIPs

https://gerrit.wikimedia.org/r/923552

Change 924526 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/dns@master] wikimediacloud.org: adjust openstack.codfw1dev FQDN

https://gerrit.wikimedia.org/r/924526

Change 924526 merged by Arturo Borrero Gonzalez:

[operations/dns@master] wikimediacloud.org: adjust openstack.codfw1dev FQDN

https://gerrit.wikimedia.org/r/924526

Change 904518 abandoned by Arturo Borrero Gonzalez:

[operations/puppet@production] profile::bird::anycast: add template parameter

Reason:

not required at the moment

https://gerrit.wikimedia.org/r/904518

Mentioned in SAL (#wikimedia-cloud) [2023-06-12T11:57:29Z] <arturo> [codfw1dev] refresh various occurrences of old FQDNs in instance puppet via horizon (T324992)

Change 929666 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Disable multihop BGP for cloud hosts connected directly to cloudsw

https://gerrit.wikimedia.org/r/929666

Change 929666 merged by Cathal Mooney:

[operations/puppet@production] Disable multihop BGP for cloud hosts connected directly to cloudsw

https://gerrit.wikimedia.org/r/929666

Change 936235 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudlb: codfw: use someup check for haproxy BGP check

https://gerrit.wikimedia.org/r/936235

Change 936235 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudlb: codfw: use someup check for haproxy BGP check

https://gerrit.wikimedia.org/r/936235

Change 940321 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] acme_chief: openstack-codf1dev: drop cloudcontrol access

https://gerrit.wikimedia.org/r/940321

Change 940321 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] acme_chief: openstack-codf1dev: drop cloudcontrol access

https://gerrit.wikimedia.org/r/940321