
Configure Anycast load-balancing ceph radosgw services on the data-engineering cluster
Closed, ResolvedPublic

Description

We wish to enable the Ceph Object Gateway, which is an S3 and Swift compatible object storage interface for Ceph.

This will be a new HTTPS service, served by the radosgw service on each of the five cephosd100* servers.

We will need this service to be load-balanced across the servers.

Whilst we normally use LVS for load-balancing, this Ceph Object Storage gateway service is a little different because it is intended to be able to support relatively high-throughput transfers. Therefore we wish to avoid making the LVS servers the bottleneck by routing all traffic through those hosts.

The preferred solution would be for us to use Anycast for load-balancing. This will cause the cephosd servers themselves to announce the availability of the service using BGP. This will avoid any unnecessary network hops between the client and the rados gateway.

Event Timeline

Gehel triaged this task as Low priority.Oct 18 2023, 8:49 AM
Gehel moved this task from Incoming to Misc on the Data-Platform-SRE board.
Gehel subscribed.

We're not working on object storage at the moment; we might reopen these tickets if object storage becomes a priority again.

BTullis raised the priority of this task from Low to High.Aug 16 2024, 12:49 PM

I am raising the priority of this, since we now want to integrate the Airflow logs with the S3 interface of our Ceph cluster.

BTullis renamed this task from Configure load-balancing approriate for ceph radosgw services on the data-engineering cluster to Configure Anycast load-balancing ceph radosgw services on the data-engineering cluster.Aug 19 2024, 10:37 AM
BTullis updated the task description. (Show Details)

@cmooney @ayounsi - Are you happy in principle for us to set up this new anycast service for the Ceph Object Gateway, using the steps outlined here? https://wikitech.wikimedia.org/wiki/Anycast#Deploying_a_new_service

@cmooney - Sorry to trouble you, but might you have any update on the feasibility of this request for our Ceph/S3 service, please?

I recall that we spoke about this configuration and the various pros and cons of using Anycast (instead of LVS), but I'm not sure if we're supposed to wait for a documentation update or for specific guidance before proceeding. Feel free to suggest any alternatives, if you think that there is a different solution that would better support this service. Thanks.

@cmooney - Sorry to trouble you, but might you have any update on the feasibility of this request for our Ceph/S3 service, please?

Ben, my apologies I'd a half-written reply here I never posted.

@cmooney @ayounsi - Are you happy in principle for us to set up this new anycast service for the Ceph Object Gateway, using the steps outlined here? https://wikitech.wikimedia.org/wiki/Anycast#Deploying_a_new_service

In principle yes, that is definitely a possibility. As I see it, the main advantage of doing so, rather than going through our existing LVS, would be to not have a single LVS be a bottleneck in the traffic path, given there may be significant upload flows from the dse-k8s-worker nodes to the ceph ones. As discussed, there are however some downsides to going that route compared with using the LVS, mainly these elements:

| Feature | LVS behaviour | Direct BGP Anycast behaviour |
|---|---|---|
| Traffic balance | Balances flows equally amongst the pool, and can be weighted. | Traffic from clients is load-balanced when the length of the BGP paths from client to server is equal. In practice this means it usually works well for external (internet) traffic - at our edge all host servers are the same BGP path length away. Internal flows, however, will tend to go to the closest server announcing the IP. For instance, if a client and server are in the same rack, all the traffic from that client will go to that server, rather than being sent across the DC to another server. |
| Disruption when pool members change | Minimises the disruption to existing flows as much as possible through use of the maglev scheduler. | When deciding which server to send a flow to, the routers hash the IP header of the flow and mod the result against the set of servers announcing the route. If the set of servers changes, almost *all* flows get re-mapped to a different server, even those that were not being sent to the removed node. Adding nodes to the group causes similar disruption. |
| Draining a realserver | LVS can continue to announce the service IP while sending no new flows to a given server, draining it gracefully. | As long as a host announces a route, traffic - for both existing and new flows - will be routed to it by the switches. There is no way to stop new connection attempts reaching an IP without also stopping existing flows. |
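The re-mapping disruption in the middle row can be illustrated with a toy model of the routers' hash-mod-N next-hop selection. This is only a sketch - real routers hash more header fields, and some platforms support consistent-hashing ECMP that behaves better - but it shows why withdrawing one route re-maps most flows, not just the drained server's share:

```python
import hashlib

def pick_server(flow, servers):
    """Toy ECMP: hash the flow identifier and mod against the announcing set."""
    h = int(hashlib.md5(flow.encode()).hexdigest(), 16)
    return servers[h % len(servers)]

servers = [f"cephosd100{i}" for i in range(1, 6)]
flows = [f"10.64.0.{i}:4{i:04d}" for i in range(200)]  # made-up client flows

before = {f: pick_server(f, servers) for f in flows}

# One server withdraws its route (e.g. cephosd1005 stops announcing).
after = {f: pick_server(f, servers[:-1]) for f in flows}

moved = sum(1 for f in flows if before[f] != after[f])
print(f"{moved}/{len(flows)} flows re-mapped")  # typically around 80% with mod-N
```

With maglev-style consistent hashing, by contrast, only the flows that were on the withdrawn server would need to move (about 20% here).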

But as long as you are aware of that and happy to proceed it is ok. The instructions on the wikitech page are a little out-of-date, thankfully our automation is a little better these days. I can work with you on it and add the required new templates to our homer automation, as well as the Bird profile configured for the servers. I'll update the docs as we go.

We should probably review the setup when our new L4LB - Liberica - is available. It works active/active between the LBs, so it can potentially scale horizontally across multiple nodes to provide the required bandwidth, and bring the advantages described before as well. Anyway, we can review at that point.

Are you still planning to use these IPs?

10.3.0.8 (Anycast) - rgw.eqiad.anycast.wmnet
10.3.0.9 (Reserved) - rgw.codfw.anycast.wmnet

That seems ok. As discussed it would be nice to enable this for IPv6 too - so I have allocated a new range in Netbox for that purpose:

https://netbox.wikimedia.org/ipam/prefixes/1074/ip-addresses/

So, for instance the equivalent IPv6 addresses I would suggest are:

2a02:ec80:ff00:101::8 (Anycast) - rgw.eqiad.anycast.wmnet
2a02:ec80:ff00:101::9 (Reserved) - rgw.codfw.anycast.wmnet
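As a quick sanity check on the addressing, the stdlib ipaddress module confirms the suggested IPv6 VIP sits inside the newly allocated /64, and that the IPv4 VIP is in private space:

```python
import ipaddress

# Range allocated in Netbox for this purpose, per the link above.
prefix = ipaddress.ip_network("2a02:ec80:ff00:101::/64")
vip6 = ipaddress.ip_address("2a02:ec80:ff00:101::8")
vip4 = ipaddress.ip_address("10.3.0.8")

print(vip6 in prefix)   # True
print(vip4.is_private)  # True - 10.0.0.0/8 space
```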

I think the first step is to add the puppet anycast role to the required servers, and we can go through what we need to add. Once it's looking good on the Bird/server side we can set the "bgp" flag for those hosts to 'true' in Netbox and run Homer against the switches, which should cause the routes to be announced.
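For context, that first step usually boils down to a small hieradata addition for the hosts, roughly along these lines. This is an illustrative sketch only - the profile name, hiera key, and structure are assumptions and may differ from what is actually in operations/puppet:

```yaml
# Illustrative hieradata for the cephosd hosts (key names are assumptions,
# not verbatim config from operations/puppet).
profile::bird::advertise_vips:
  rgw.eqiad.anycast.wmnet:
    ensure: present
    address: 10.3.0.8
```

Bird on each host then announces the VIP over BGP to its top-of-rack switch, gated on a local healthcheck.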

Change #1070589 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Update prefix-lists for new private, global IPv6 ranges

https://gerrit.wikimedia.org/r/1070589

Change #1070592 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Add new global IPv6 private range to base firewall defs

https://gerrit.wikimedia.org/r/1070592

Change #1070589 merged by jenkins-bot:

[operations/homer/public@master] Update prefix-lists for new private, global IPv6 ranges

https://gerrit.wikimedia.org/r/1070589

In principle yes, that is definitely a possibility. As I see it, the main advantage of doing so, rather than going through our existing LVS, would be to not have a single LVS be a bottleneck in the traffic path, given there may be significant upload flows from the dse-k8s-worker nodes to the ceph ones. As discussed, there are however some downsides to going that route compared with using the LVS, mainly these elements:

Thanks for that clear and detailed explanation of the pros and cons of the solution.

But as long as you are aware of that and happy to proceed it is ok.

Yes please. I still expect the potential benefit of not having a single LVS server limiting the upload bandwidth to Ceph/S3 to be greater than the impact of the Anycast behaviour that you have identified.

We should probably review the setup when our new L4LB - Liberica - is available. It works active/active between the LBs, so it can potentially scale horizontally across multiple nodes to provide the required bandwidth, and bring the advantages described before as well. Anyway, we can review at that point.

Agreed, I would be keen to review whether Liberica could be an even better solution, in the long run.

Are you still planning to use these IPs?

10.3.0.8 (Anycast) - rgw.eqiad.anycast.wmnet
10.3.0.9 (Reserved) - rgw.codfw.anycast.wmnet

Yes, those IP addresses, but the DNS name is slightly different: I put in a .dpe. element to help identify the service owners and match the ceph configuration.

btullis@cephosd1001:~$ openssl s_client -connect localhost:443 <<<Q 2>&1 |grep subject
subject=CN = rgw.eqiad.dpe.anycast.wmnet

I have configured envoyproxy with this DNS name (rgw.eqiad.dpe.anycast.wmnet) in the TLS certificate, and it is listening on port 443 on all five realservers now.
The radosgw service is listening on port 80, so I think it's ready to go.

As discussed it would be nice to enable this for IPv6 too - so I have allocated a new range in Netbox for that purpose:
https://netbox.wikimedia.org/ipam/prefixes/1074/ip-addresses/

I'll add the extra DNS element to these suggested addresses as well, so they will become:

2a02:ec80:ff00:101::8 (Anycast) - rgw.eqiad.dpe.anycast.wmnet
2a02:ec80:ff00:101::9 (Reserved) - rgw.codfw.dpe.anycast.wmnet

Many thanks, yes that's perfect.

I think the first step is to add the puppet anycast role to the required servers, and we can go through what we need to add. Once it's looking good on the Bird/server side we can set the "bgp" flag for those hosts to 'true' in Netbox and run Homer against the switches, which should cause the routes to be announced.

Great! I will work on that puppet patch for the bird configuration now and tag you for a review.

Change #1070916 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Add include statement for new IPv6 ptr range 2a02:ec80:ff00:101::/64

https://gerrit.wikimedia.org/r/1070916

Change #1070916 merged by Cathal Mooney:

[operations/dns@master] Add include statement for new IPv6 ptr range 2a02:ec80:ff00:101::/64

https://gerrit.wikimedia.org/r/1070916

I have assigned the four IP addresses in Netbox.

image.png (226×1 px, 69 KB)

Looks good, thanks!

Change #1070949 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Enable IPv6 for the envoyproxy on DPE Ceph servers

https://gerrit.wikimedia.org/r/1070949

Change #1070950 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add the anycast VIP for radosgw to DPE Ceph servers

https://gerrit.wikimedia.org/r/1070950

Change #1070949 merged by Btullis:

[operations/puppet@production] Enable IPv6 for the envoyproxy on DPE Ceph servers

https://gerrit.wikimedia.org/r/1070949

Change #1070950 merged by Btullis:

[operations/puppet@production] Add the anycast VIP for radosgw to DPE Ceph servers

https://gerrit.wikimedia.org/r/1070950

The anycast VIPs have now been added to the five realservers by Bird.

btullis@cumin1002:~$ sudo cumin --no-progress A:cephosd 'ip a sh | egrep "(10.3.0.8|2a02:ec80:ff00:101::8)"'
5 hosts will be targeted:
cephosd[1001-1005].eqiad.wmnet
OK to proceed on 5 hosts? Enter the number of affected hosts to confirm or "q" to quit: 5
===== NODE GROUP =====
(5) cephosd[1001-1005].eqiad.wmnet
----- OUTPUT of 'ip a sh | egrep ...80:ff00:101::8)"' -----
    inet 10.3.0.8/32 scope global lo:anycast
    inet6 2a02:ec80:ff00:101::8/128 scope global 
================
100.0% (5/5) success ratio (>= 100.0% threshold) for command: 'ip a sh | egrep ...80:ff00:101::8)"'.
100.0% (5/5) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
btullis@cumin1002:~$

Puppet runs cleanly and Bird is active, so I think that we are ready to proceed to the next step, namely:

Once it's looking good on the Bird/server side we can set the "bgp" flag for those hosts to 'true' in Netbox, and run Homer against the switches which should cause the routes to be announced.

I'm happy to leave this until next week @cmooney, if you think that's best.

Change #1071617 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Manually define BGP neighbors for cephosd1*** Anycast BGP

https://gerrit.wikimedia.org/r/1071617

Change #1071617 merged by Cathal Mooney:

[operations/puppet@production] Manually define BGP neighbors for cephosd1*** Anycast BGP

https://gerrit.wikimedia.org/r/1071617

Change #1071633 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/software/homer/deploy@master] Add definition for cephosd hosts to map to Anycast BGP group

https://gerrit.wikimedia.org/r/1071633

Change #1071633 merged by Cathal Mooney:

[operations/software/homer/deploy@master] Add definition for cephosd hosts to map to Anycast BGP group

https://gerrit.wikimedia.org/r/1071633

Change #1070592 merged by Cathal Mooney:

[operations/puppet@production] Add new global IPv6 private range to base firewall defs

https://gerrit.wikimedia.org/r/1070592

I believe we can say that this is done. There is a small puppet change related to BFD support that @cmooney is still working to fix, but the rados gateways on all five of our cephosd100[1-5] servers are now available for port 443 traffic on 10.3.0.8 and 2a02:ec80:ff00:101::8. The DNS name of the service is rgw.eqiad.dpe.anycast.wmnet.

This is now up and running. Doing a test I can see flows are being sent via different top-of-rack switches (i.e. to different cephosd hosts):

cmooney@stat1008:~$ mtr -w -c40 --tcp --port 443 -b rgw.eqiad.dpe.anycast.wmnet
Start: 2024-09-12T09:59:14+0000
HOST: stat1008                                                       Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- ae1-1030.cr1-eqiad.wikimedia.org (2620:0:861:104:fe00::1)       0.0%    40    0.4   1.7   0.3  18.3   3.2
  2.|-- et-0-0-31-100.ssw1-e1-eqiad.wikimedia.org (2620:0:861:fe07::2)  0.0%    40    0.9   0.9   0.6   3.8   0.5
  3.|-- lo0-5000.lsw1-e3-eqiad.eqiad.wmnet (2620:0:861:11b::5)          0.0%    40    0.8   1.4   0.5  12.7   2.4
        lo0-5000.lsw1-e2-eqiad.eqiad.wmnet (2620:0:861:11b::4)        
        lo0-5000.lsw1-f1-eqiad.eqiad.wmnet (2620:0:861:11b::7)        
        lo0-5000.lsw1-f2-eqiad.eqiad.wmnet (2620:0:861:11b::8)        
  4.|-- rgw.eqiad.dpe.anycast.wmnet (2a02:ec80:ff00:101::8)             0.0%    40    0.4   0.4   0.2   0.5   0.1

Change #1080253 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Reduce the sensitivity of the anycast healthcheck for cephosd/radosgw

https://gerrit.wikimedia.org/r/1080253

Change #1080253 merged by Btullis:

[operations/puppet@production] Reduce the sensitivity of the anycast healthcheck for cephosd/radosgw

https://gerrit.wikimedia.org/r/1080253