During the investigation of the ceph issues for cloud hosts in the new WMCS racks E4 and F4, it was determined that they are configured for jumbo frames (9000 byte MTU) on their production realm (ceph public) as well as cloud realm (ceph cluster) interfaces.
The benefit of this seems marginal, as the ceph clients running on cloudvirt nodes are configured with a 1500 byte MTU, so communication with clients only occurs using regular-sized packets. That said, it seems likely jumbos are needed for heartbeat messages:
https://tracker.ceph.com/issues/18438
Previously this worked fine, as all cloud hosts were on the same production vlan, and all our L2 Vlans are configured to support jumbo frames. The problem is that in the new racks the hosts are connected to separate subnets/vlans, and thus any L3 interfaces configured in the related vlans, and L3 links between devices, also need to support jumbo frames.
Right now that is not the case, for instance irb.1108 on cloudsw1-c8-eqiad is only configured at 1500, whereas the irb.1120 on cloudsw1-e4 is at 9202, thus:
cmooney@cloudcephosd1025:~$ sudo traceroute -I -w 1 -m 3 10.64.20.58
traceroute to 10.64.20.58 (10.64.20.58), 3 hops max, 60 byte packets
 1  irb-1123.cloudsw1-e4-eqiad.eqiad.wmnet (10.64.148.1)  1.009 ms  0.990 ms  0.988 ms
 2  irb-1108.cloudsw1-c8-eqiad.eqiad.wmnet (10.64.147.0)  12.977 ms  12.974 ms  12.972 ms
 3  cloudcephosd1007.eqiad.wmnet (10.64.20.58)  0.116 ms  0.140 ms  0.138 ms

cmooney@cloudcephosd1025:~$ sudo traceroute -I -w 1 --mtu 10.64.20.58
traceroute to 10.64.20.58 (10.64.20.58), 3 hops max, 65000 byte packets
 1  irb-1123.cloudsw1-e4-eqiad.eqiad.wmnet (10.64.148.1)  2.403 ms F=9000  6.652 ms  8.751 ms
 2  * * *
 3  * * *
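FWIW, no ICMP 'frag-needed' packet is generated or received, due to the MTU mismatch on either side between cloudsw1-e4 and cloudsw1-c8, so the oversized packets are simply lost and the sender never learns a smaller path MTU.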
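As a complement to the traceroute above, a full-size ping with the don't-fragment bit set is a quick way to check whether a given MTU survives end to end. The sketch below only builds the command line; it assumes IPv4 without IP options (20 byte header) plus the 8 byte ICMP echo header, and the Linux iputils `ping` flags `-M do`, `-s` and `-c`. The helper names are made up for illustration.

```python
import shlex

IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo request header


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one IPv4 packet of size `mtu`."""
    overhead = IPV4_HEADER + ICMP_HEADER
    if mtu < overhead:
        raise ValueError(f"MTU {mtu} is smaller than the {overhead} byte header overhead")
    return mtu - overhead


def ping_command(target: str, mtu: int, count: int = 3) -> str:
    """Build (not run) a don't-fragment ping that exactly fills `mtu` bytes."""
    payload = icmp_payload_for_mtu(mtu)
    return shlex.join(["ping", "-M", "do", "-s", str(payload), "-c", str(count), target])


if __name__ == "__main__":
    # Target taken from the traceroute above; MTU values are the two in play here.
    for mtu in (1500, 9000):
        print(ping_command("10.64.20.58", mtu))
```

If the 1500 byte variant gets replies but the 9000 byte one times out without any error, that would be consistent with the silent drop seen in the traceroute.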
The good thing is that all these Vlans have their GWs on the cloudsw's. So it should be possible to make sure that all of them, plus all internal cloudsw links, are configured to allow 9000 byte IP packets, while keeping the cloudsw -> core router links at 1500. This will result in a similar situation to before, where jumbo frames can pass between cloud hosts, but once they try to reach an "outside" subnet (not configured on a cloud host) they pass through a 1500 MTU link to core routers, and need to be fragmented or otherwise handled at that point.
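To make the reasoning concrete, here is a rough model of the idea: the usable MTU of a path is the smallest MTU of any L3 hop on it. The "current" values come from the interfaces mentioned above; the "planned" values and hop labels are assumptions for illustration only, and the configured numbers are treated simply as the maximum IP packet size, which glosses over any per-platform header accounting.

```python
from typing import List, Optional, Tuple

Hop = Tuple[str, int]  # (label, MTU in bytes)


def path_mtu(hops: List[Hop]) -> int:
    """The usable MTU of a path is the smallest MTU along it."""
    return min(mtu for _, mtu in hops)


def first_bottleneck(packet_size: int, hops: List[Hop]) -> Optional[str]:
    """Label of the first hop too small for the packet, or None if it fits."""
    for label, mtu in hops:
        if packet_size > mtu:
            return label
    return None


# Current state, using the two interface values quoted in this task.
CURRENT = [
    ("irb.1120 on cloudsw1-e4", 9202),
    ("irb.1108 on cloudsw1-c8-eqiad", 1500),
]

# Planned state (assumed values): every cloud vlan GW and internal
# cloudsw link allows 9000 byte IP packets.
PLANNED_INTERNAL = [
    ("cloud vlan GW, new rack", 9000),
    ("internal cloudsw link", 9000),
    ("cloud vlan GW, old rack", 9000),
]

# Traffic leaving for an "outside" subnet still crosses a 1500 link.
PLANNED_OUTSIDE = PLANNED_INTERNAL[:2] + [("cloudsw -> core router link", 1500)]

if __name__ == "__main__":
    for name, hops in [("current", CURRENT),
                       ("planned internal", PLANNED_INTERNAL),
                       ("planned outside", PLANNED_OUTSIDE)]:
        print(name, "path MTU:", path_mtu(hops),
              "| 9000 byte packet blocked at:", first_bottleneck(9000, hops))
```

In this model a 9000 byte packet is blocked at irb.1108 today, passes between cloud hosts once the internal hops are raised, and still hits the 1500 byte core router link when heading outside.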
I'll do some more checks / add patches to try and get this working and update the task.