
Allow jumbo frames between cloud hosts in production realm
Closed, Resolved · Public

Description

During the investigation of the Ceph issues for cloud hosts in the new WMCS racks E4 and F4, it was determined that they are configured for jumbo frames (9000 byte MTU) on both their production realm (Ceph public) and cloud realm (Ceph cluster) interfaces.

The benefit of this seems marginal, as the Ceph clients running on cloudvirt nodes are configured with a 1500 byte MTU, so communication with clients only occurs using regular-sized packets. That said, it seems likely that jumbo frames are needed for heartbeat messages:

https://tracker.ceph.com/issues/18438
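
For reference, the configured MTU is easy to confirm from the host side; a quick sketch (the NIC name below is only an example and differs per host):

# List every interface together with its configured MTU
ip link show | grep mtu
# Or check a single interface, e.g. the Ceph cluster NIC (name is an example)
ip link show dev enp175s0f0np0 | grep mtu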

Previously this worked fine, as all cloud hosts were on the same production vlan, and all our L2 Vlans are configured to support jumbo frames. The problem is that in the new racks the hosts are connected to separate subnets/vlans, and thus any L3 interfaces configured in the related vlans, and the L3 links between devices, also need to support jumbo frames.
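
In concrete terms that means raising both the physical MTU on the switch ports/bundles and the IP MTU on the irb units for those vlans. Purely as an illustration of the sort of Junos statements involved (interface names and values here are placeholders, not the actual change):

set interfaces ae0 mtu 9216
set interfaces irb unit 1108 family inet mtu 9000

The physical MTU needs to be somewhat larger than the IP MTU to leave room for the Ethernet and VLAN headers.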

Right now that is not the case. For instance, irb.1108 on cloudsw1-c8-eqiad is only configured at 1500, whereas irb.1120 on cloudsw1-e4 is at 9202, thus:

cmooney@cloudcephosd1025:~$ sudo traceroute -I -w 1 -m 3 10.64.20.58
traceroute to 10.64.20.58 (10.64.20.58), 3 hops max, 60 byte packets
 1  irb-1123.cloudsw1-e4-eqiad.eqiad.wmnet (10.64.148.1)  1.009 ms  0.990 ms  0.988 ms
 2  irb-1108.cloudsw1-c8-eqiad.eqiad.wmnet (10.64.147.0)  12.977 ms  12.974 ms  12.972 ms
 3  cloudcephosd1007.eqiad.wmnet (10.64.20.58)  0.116 ms  0.140 ms  0.138 ms
cmooney@cloudcephosd1025:~$ sudo traceroute -I -w 1 --mtu 10.64.20.58
traceroute to 10.64.20.58 (10.64.20.58), 3 hops max, 65000 byte packets
 1  irb-1123.cloudsw1-e4-eqiad.eqiad.wmnet (10.64.148.1)  2.403 ms F=9000  6.652 ms  8.751 ms
 2  * * *
 3  * * *

FWIW, no ICMP 'frag needed' packet is generated or received for the MTU mismatch on either side of the link between cloudsw1-e4 and cloudsw1-c8.
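
That means oversized packets are simply black-holed rather than triggering PMTUD. One way to spot this from a host (illustrative, and assuming the local interface is already at 9000) is tracepath, which probes the path MTU hop by hop without needing root:

# Probe the path MTU towards cloudcephosd1007; on a working jumbo path the
# final "Resume: pmtu" line should show 9000, while a silent black hole just
# leaves the larger probes unanswered
tracepath -n 10.64.20.58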

The good thing is that all these Vlans have their GWs on the cloudsw's, so it should be possible to make sure that all of them, plus all internal cloudsw links, are configured to allow 9000 byte IP packets, while keeping the cloudsw -> core router links at 1500. This will result in a similar situation to before, where jumbo frames can pass between cloud hosts, but once they try to reach an "outside" subnet (one not configured on a cloud host) they pass through a 1500 MTU link to the core routers, and need to be fragmented or have their size adjusted via path MTU discovery.
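
The relevant IP MTUs can then be double-checked on the switches themselves, e.g. from the Junos CLI on each cloudsw (illustrative only):

show interfaces irb.1108 | match MTU
show interfaces irb.1120 | match MTU

And similarly for the inter-switch links.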

I'll do some more checks / add patches to try and get this working and update the task.

Event Timeline

cmooney triaged this task as Medium priority. Aug 17 2022, 2:07 PM
cmooney created this task.

Ok so looking at this a bit closer, it seems the omission was just that the MTU wasn't set high on cloudsw1-c8 on its links to the new switches in racks E4 and F4.
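
For the record the fix was of this general shape, raising the interface MTU on the cloudsw1-c8 ports facing the new E4/F4 switches; the statement below is illustrative only (interface name and exact value are placeholders):

set interfaces et-0/0/53 mtu 9216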

I've adjusted that up now, matching the rest of the network, and I can now ping with 9000 byte IP packets between them:

cmooney@cloudcephosd1025:~$ sudo ping -c 4 -I 10.64.148.2 -M do -s 8972 -4 cloudcephosd1007
PING  (10.64.20.58) from 10.64.148.2 : 8972(9000) bytes of data.
8980 bytes from cloudcephosd1007.eqiad.wmnet (10.64.20.58): icmp_seq=1 ttl=62 time=0.197 ms
8980 bytes from cloudcephosd1007.eqiad.wmnet (10.64.20.58): icmp_seq=2 ttl=62 time=0.194 ms
8980 bytes from cloudcephosd1007.eqiad.wmnet (10.64.20.58): icmp_seq=3 ttl=62 time=0.192 ms
8980 bytes from cloudcephosd1007.eqiad.wmnet (10.64.20.58): icmp_seq=4 ttl=62 time=0.106 ms

---  ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 0.106/0.172/0.197/0.038 ms

Same goes for a host in rack F4, for instance cloudcephosd1034:

cmooney@cloudcephosd1034:~$ sudo ip link set dev enp175s0f0np0 mtu 9000
cmooney@cloudcephosd1034:~$ sudo ping -c 4 -I 10.64.149.6 -M do -s 8972 -4 cloudcephosd1007
PING  (10.64.20.58) from 10.64.149.6 : 8972(9000) bytes of data.
8980 bytes from cloudcephosd1007.eqiad.wmnet (10.64.20.58): icmp_seq=1 ttl=62 time=0.226 ms
8980 bytes from cloudcephosd1007.eqiad.wmnet (10.64.20.58): icmp_seq=2 ttl=62 time=0.267 ms
8980 bytes from cloudcephosd1007.eqiad.wmnet (10.64.20.58): icmp_seq=3 ttl=62 time=0.252 ms
8980 bytes from cloudcephosd1007.eqiad.wmnet (10.64.20.58): icmp_seq=4 ttl=62 time=0.230 ms

---  ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3027ms
rtt min/avg/max/mdev = 0.226/0.243/0.267/0.016 ms

@fnegri note I manually increased the local MTU on cloudcephosd1034 as shown above, similar to what you had done on 1025 earlier.
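
Note the ip link change above is not persistent across reboots, so the host config needs updating as well, which is presumably what the puppet patch below (824494) takes care of. For illustration only, the end state on the host amounts to an ifupdown stanza along these lines (interface name is an example, and the real file would be rendered by puppet rather than edited by hand):

# /etc/network/interfaces excerpt (illustrative only); keeps the jumbo MTU across reboots
iface enp175s0f0np0 inet static
    # address/gateway lines omitted
    mtu 9000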

I believe that should solve the problem. A note for the record: the routed links to the CR routers also support jumbo frames, in both the cloud vrf and the production realm, which actually matches the previous setup (where the gateway for the cloud-hosts Vlan on the CRs allowed jumbos). The restriction to 1500 is on our edge interfaces connecting to the public internet, so in terms of PMTUd the overall picture is the same as before the recent L3 redesign.
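
As a quick sanity check of that, probing something outside the cloud ranges from one of these hosts should come back with a path MTU of 1500, while cloud-internal destinations report 9000 (the destination below is a placeholder):

# Path MTU towards an external destination should be reported as 1500,
# reflecting the 1500 MTU edge links; internal cloud destinations report 9000
tracepath -n <some-external-destination>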

Ok gonna close this one as the cloud team have confirmed things are now working for them.

Apologies for the oversight!

Change 824494 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] Ceph OSD hosts: set mtu on both ifaces

https://gerrit.wikimedia.org/r/824494

Change 824494 merged by David Caro:

[operations/puppet@production] Ceph OSD hosts: set mtu on both ifaces

https://gerrit.wikimedia.org/r/824494

That seemed to do the trick, yes!
Thanks!