
Provide a TCP MSS clamping mechanism for real servers
Closed, ResolvedPublic

Description

Real servers behind a load balancer that forwards traffic their way using some kind of encapsulation, such as IPIP, need to perform TCP MSS clamping to avoid fragmentation issues.
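To make the overhead concrete, here is a worked example; the 1500-byte path MTU and IPv4-in-IPv4 encapsulation are illustrative assumptions:

```shell
# Assumed: 1500-byte path MTU, IPv4-in-IPv4 encapsulation (20-byte outer IPv4 header).
# Without clamping, a client on a 1500-byte link advertises MSS = 1500 - 20 (IP) - 20 (TCP) = 1460,
# so a full-size segment plus the outer header exceeds the MTU and must be fragmented.
# The MSS that still fits after outer IP + inner IP + TCP headers:
echo $((1500 - 20 - 20 - 20))
```

With 20 bytes of IPIP overhead the advertised MSS has to drop accordingly; other encapsulations (e.g. an IPv6 outer header) shave off different amounts.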

This mechanism should be generic enough to run across the plethora of real server setups that WMF currently runs. For example, we have real servers (the CDN cluster) where netfilter can't be used.

Discarded alternatives:

Viable alternatives:

Remaining alternatives to explore:

  • ip rule + ip route
  • nftables

Details

  • Perform clamping only on the specified source port (repos/sre/tcp-mss-clamper!2, by vgutierrez, filter-by-port -> main)
  • Provide basic functionality (repos/sre/tcp-mss-clamper!1, by vgutierrez, basic-functionality -> main)

Event Timeline

Vgutierrez triaged this task as Medium priority.Nov 3 2023, 9:51 AM
Vgutierrez moved this task from Backlog to Traffic team actively servicing on the Traffic board.

I've explored the BPF_PROG_TYPE_SOCK_OPS alternative and detected the following caveats:

  1. The eBPF program needs to be loaded before the daemon calls listen(2); otherwise it can't set BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG early enough to capture the three-way handshake on an already listening socket.
  2. bpf_store_hdr_opt() refuses to write an MSS option because one is already there (it returns -EEXIST).
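Caveat 1 implies an ordering constraint on deployment: the sockops program has to be attached to the service's cgroup before the daemon starts. A hypothetical loader sketch (the object file, pin path, cgroup path, and unit name are all assumptions for illustration):

```shell
# Hypothetical paths and unit names; the point is the ordering:
# the sockops program must be attached before the daemon calls listen(2).
bpftool prog load mss_clamp.bpf.o /sys/fs/bpf/mss_clamp
bpftool cgroup attach /sys/fs/cgroup/system.slice/mydaemon.service sock_ops \
    pinned /sys/fs/bpf/mss_clamp
# only now may the daemon start and begin listening
systemctl start mydaemon.service
```

These commands require root and a mounted bpffs, so this is an ops sketch rather than something testable in isolation.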

This is the code of the eBPF program used during the experimentation:

/*
 * SPDX-License-Identifier: GPL-2.0
 * Copyright 2023-present Valentin Gutierrez
 * Copyright 2023-present Wikimedia Foundation, Inc.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct mss_option {
    __u8 kind;
    __u8 length;
    __be16 mss;
} __attribute__((__packed__));

SEC("sockops")
int bpf_sockops_cb(struct bpf_sock_ops *skops) {
    bpf_printk("I'm handling %u at the moment\n", skops->op);
    switch (skops->op) {
        case BPF_SOCK_OPS_TCP_LISTEN_CB:
            bpf_printk("I'm on BPF_SOCK_OPS_TCP_LISTEN_CB\n");
            bpf_sock_ops_cb_flags_set(skops,
                    skops->bpf_sock_ops_cb_flags | BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG);
            break;
        case BPF_SOCK_OPS_HDR_OPT_LEN_CB:
            /* reserve 4 bytes: kind + length + 16-bit MSS value */
            bpf_reserve_hdr_opt(skops, 4, 0);
            break;
        case BPF_SOCK_OPS_WRITE_HDR_OPT_CB:
            bpf_printk("I'm on BPF_SOCK_OPS_WRITE_HDR_OPT_CB with flags = 0x%03x\n", skops->skb_tcp_flags);
            // we only act on TCP SYN packets
            if (!(skops->skb_tcp_flags & 0x2)) {
                return 0;
            }
            bpf_printk("And it's a SYN packet\n");

            struct mss_option mss = {
                .kind = 2,      /* TCP option kind 2: Maximum Segment Size */
                .length = 4,
                .mss = bpf_htons(1442),
            };
            long ret = bpf_store_hdr_opt(skops, &mss, 4, 0);
            if (ret) {
                bpf_printk("this didn't work as expected: %d\n", ret);
            }
            /* clear the write-header flag so we stop receiving this callback */
            bpf_sock_ops_cb_flags_set(skops,
                    skops->bpf_sock_ops_cb_flags & ~BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG);
            break;
    }
    return 1;
}
char _license[] SEC("license") = "GPL";

Providing an alternative routing table with advmss set to the required MSS value can be hard on some C:lvs::realserver instances, especially those running k8s and/or bird, where routes are managed dynamically:

sudo -i cumin 'C:lvs::realserver' 'ip route |grep bird |wc -l || echo 0'
[...]
(16) kubernetes[1025-1026,1047-1056].eqiad.wmnet,ml-serve[1005-1008].eqiad.wmnet                                                                                                  
----- OUTPUT of 'ip route |grep b...|wc -l || echo 0' -----                                                                                                                       
20                                                                                                                                                                                
===== NODE GROUP =====                                                                                                                                                            
(60) aux-k8s-ctrl[1001-1002].eqiad.wmnet,aux-k8s-worker[1001-1002].eqiad.wmnet,dse-k8s-ctrl[1001-1002].eqiad.wmnet,kubemaster[1001-1002].eqiad.wmnet,kubernetes[1005-1024,1027-1046,1057-1058].eqiad.wmnet,kubestage[1003-1004].eqiad.wmnet,kubestagemaster[1001-1002].eqiad.wmnet,ml-serve[1001-1004].eqiad.wmnet,ml-serve-ctrl[1001-1002].eqiad.wmnet             
----- OUTPUT of 'ip route |grep b...|wc -l || echo 0' -----                                                                                                                       
58                                                                                                                                                                                
===== NODE GROUP =====                                                                                                                                                            
(71) kubemaster[2001-2002].codfw.wmnet,kubernetes[2005-2053,2055-2056].codfw.wmnet,kubestage[2001-2002].codfw.wmnet,kubestagemaster[2001-2002].codfw.wmnet,ml-serve[2001-2008].codfw.wmnet,ml-serve-ctrl[2001-2002].codfw.wmnet,ml-staging[2001-2002].codfw.wmnet,ml-staging-ctrl[2001-2002].codfw.wmnet                                                            
----- OUTPUT of 'ip route |grep b...|wc -l || echo 0' -----                                                                                                                       
67                                                                                                                                                                                
===== NODE GROUP =====                                                                                                                                                            
(794) aqs[2001-2012].codfw.wmnet,aqs[1010-1021].eqiad.wmnet,cloudelastic[1001-1006].wikimedia.org,cloudweb[1003-1004].wikimedia.org,cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1075-1090,1100-1114].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3066-3081].esams.wmnet,cp[4037-4051].ulsfo.wmnet,datahubsearch[1001-1003].eqiad.wmnet,dbproxy[1018-1019].eqiad.wmnet,druid[1004-1011].eqiad.wmnet,elastic[2037-2048,2050-2086].codfw.wmnet,elastic[1053-1102].eqiad.wmnet,ldap-replica[1003-1004,2005-2006].wikimedia.org,logstash[2023-2025,2030-2032].codfw.wmnet,logstash[1023-1025,1030-1032].eqiad.wmnet,lvs[2011-2014].codfw.wmnet,lvs[6001-6003].drmrs.wmnet,lvs[1017-1020].eqiad.wmnet,lvs[5004-5006].eqsin.wmnet,lvs[3008-3010].esams.wmnet,lvs[4008-4010].ulsfo.wmnet,maps[2005-2010].codfw.wmnet,maps[1005-1010].eqiad.wmnet,moss-fe2001.codfw.wmnet,moss-fe1001.eqiad.wmnet,ms-fe[2009-2014].codfw.wmnet,ms-fe[1009-1014].eqiad.wmnet,mw[2259-2279,2281-2339,2350-2451].codfw.wmnet,mw[1349-1496].eqiad.wmnet,mwdebug[2001-2002].codfw.wmnet,mwdebug[1001-1002].eqiad.wmnet,ncredir[2001-2002].codfw.wmnet,ncredir[6001-6002].drmrs.wmnet,ncredir[1001-1002].eqiad.wmnet,ncredir[5001-5002].eqsin.wmnet,ncredir[3003-3004].esams.wmnet,ncredir[4001-4002].ulsfo.wmnet,parse[2001-2020].codfw.wmnet,parse[1001-1024].eqiad.wmnet,prometheus[2005-2006].codfw.wmnet,prometheus[1005-1006].eqiad.wmnet,registry[2003-2004].codfw.wmnet,registry[1003-1004].eqiad.wmnet,restbase[2013-2027].codfw.wmnet,restbase[1019-1033].eqiad.wmnet,schema[2003-2004].codfw.wmnet,schema[1003-1004].eqiad.wmnet,thanos-fe[2001-2004].codfw.wmnet,thanos-fe[1001-1004].eqiad.wmnet,titan[2001-2002].codfw.wmnet,titan[1001-1002].eqiad.wmnet,wcqs[2001-2003].codfw.wmnet,wcqs[1001-1003].eqiad.wmnet,wdqs[2007-2025].codfw.wmnet,wdqs[1006-1008,1011-1016].eqiad.wmnet
----- OUTPUT of 'ip route |grep b...|wc -l || echo 0' -----                                                                                                                       
0

From those 794 real servers that don't have dynamic routes, 772 only have two routes in their main routing table (the default gateway plus the subnet their NIC is connected to), and those would probably benefit from an ip rule + ip route approach to get their MSS clamped.
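For those 772 simple hosts, the approach could look like the following sketch; the addresses, interface name, table number, and the 1440 value are illustrative assumptions, not actual WMF configuration:

```shell
# Clone the two routes of the main table into an alternative table (100 here)
# with advmss clamped, then steer traffic sourced from the (assumed) service IP
# through that table so only load-balanced traffic gets the reduced MSS.
ip route add 10.0.0.0/24 dev eno1 advmss 1440 table 100
ip route add default via 10.0.0.1 advmss 1440 table 100
ip rule add from 192.0.2.10/32 table 100
```

advmss caps the MSS the local stack advertises on connections using those routes, which is exactly the clamping needed here; the commands require root, so this is a sketch rather than a tested recipe.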

nftables or other specific measures should be considered for the other cases, but it looks like we don't have a generic solution that could be safely deployed globally across all the real servers that we currently run.
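For hosts where nftables is available, clamping could be a single rule that rewrites the MSS option on incoming SYNs so the local stack never sends oversized segments; the table and chain names and the 1440 value below are assumptions for illustration:

```shell
# Hypothetical table/chain names; requires root and nftables with TCP option
# mangling support.
nft add table inet clamp
nft add chain inet clamp input '{ type filter hook input priority mangle; }'
# rewrite the MSS option on incoming SYNs so the local stack honours the cap
nft add rule inet clamp input tcp flags syn tcp option maxseg size set 1440
```

This is the end-host variant of the classic router-side forward-chain clamp; it wouldn't help the netfilter-less CDN hosts mentioned above.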

@cmooney @ayounsi I'm aware that performing TCP MSS clamping at the network layer isn't a great solution from your point of view, but could you elaborate on the downsides of doing it at that layer?

Thanks for looking into that.
I left some comments on T348837#9253591, but the ip route option seems the cleanest to me.
As a general rule of thumb, having a middle-box transparently intercepting and modifying packets is a bad idea. It goes against the end-to-end principle, it makes noticing and troubleshooting issues harder (if not impossible), it risks vendor lock-in, and it doesn't offer the needed granularity (it can only be applied per router interface).

tcp-mss-clamper is already being used to perform MSS clamping on the ncredir and CDN upload clusters