
[ceph] Investigate if there's a way to degrade instead of failing when jumbo frames are being dropped in the network
Open, High, Public

Description

Investigate whether there's any way to allow Ceph to degrade service (rather than killing other OSDs) if jumbo frames begin to be dropped in the network.

Things to verify:

  • Is the don't-fragment (DF) bit being set for heartbeat traffic? Can that be configured?
    • How to verify: do a tcpdump of heartbeat traffic and check the flags on the packets
  • What size are the heartbeat packets? Can that be configured? (yes it can)
    • How to verify: do a tcpdump of heartbeat traffic and check their maximum size
  • Is the don't-fragment (DF) bit being set for regular traffic? Can that be configured?
    • How to verify: do a tcpdump of osd<->osd traffic and check the flags on the packets
  • Can Ceph regular traffic adapt to the discovered MTU of the network (as opposed to always using the max MTU of the interface)? If so, can that be configured?
    • How to verify: TBD
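A sketch of the tcpdump checks above (the interface name and the 6800-7300 port range — Ceph's default `ms_bind_port_min`/`ms_bind_port_max` — are assumptions to adjust for the cluster network):

```shell
# Verbose dump of OSD traffic; packets with the DF bit set print "DF"
# among the IP flags, and "length N" shows the packet sizes.
tcpdump -ni eth0 -v 'tcp portrange 6800-7300' -c 20

# Match only packets that already have the DF bit set
# (bit 0x40 of byte 6 of the IP header).
tcpdump -ni eth0 'ip[6] & 0x40 != 0 and tcp portrange 6800-7300' -c 20
```

These need root and live traffic on the capture host, so they are illustrative only.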

Other stuff

Current OSD heartbeat config options:

root@cloudcephosd1001:~# ceph config show-with-defaults osd.33
...
osd_heartbeat_grace                  20        default
osd_heartbeat_interval               6         default
osd_heartbeat_min_healthy_ratio     0.330000   default
osd_heartbeat_min_peers              10        default
osd_heartbeat_min_size               2000      default   <- this seems the most interesting
osd_heartbeat_stale                  600       default
osd_heartbeat_use_min_delay_socket   false     default
...
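The `osd_heartbeat_min_size` knob flagged above can be read and changed at runtime. A hedged sketch (the idea that lowering it keeps heartbeats out of jumbo-frame territory is an assumption to verify, not a confirmed fix):

```shell
# Read the current value from a running OSD.
ceph config get osd.33 osd_heartbeat_min_size

# Heartbeat pings are padded up to this size. If that padding is what
# pushes heartbeats into single jumbo frames on a jumbo-MTU link, then
# lowering it should keep them deliverable even when jumbo frames are
# being dropped -- assumption to verify.
ceph config set osd osd_heartbeat_min_size 0
```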

Related code/bugs/docs:

mon osd reporter subtree level is used to group the peers into the “subcluster” by their common ancestor type in the CRUSH map. By default, only two reports from different subtrees are required to report another Ceph OSD Daemon down. You can change the number of reporters from unique subtrees and the common ancestor type required to report a Ceph OSD Daemon down to a Ceph Monitor by adding mon osd min down reporters and mon osd reporter subtree level settings under the [mon] section of your Ceph configuration file, or by setting the value at runtime.
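Per the quoted doc, both settings can also be changed at runtime, e.g. (values purely illustrative):

```shell
# Require more independent reporters before an OSD is marked down,
# and group reporters by host (host is the default subtree level).
ceph config set mon mon_osd_min_down_reporters 3
ceph config set mon mon_osd_reporter_subtree_level host
```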

Event Timeline

dcaro triaged this task as High priority.
dcaro updated the task description.
dcaro removed dcaro as the assignee of this task. May 22 2025, 7:40 AM

I think there are essentially two potential situations here:

  1. A large packet tries to be sent over a link with a smaller MTU, correctly configured on both sides
    • In this case the router/switch/host trying to send the big packet knows it cannot be sent out the destination interface where the MTU is small
    • It will drop the big packet and send a "packet too big" ICMP back to the source
    • This is the Path MTU Discovery (PMTUD) mechanism, and the source should then retry with a smaller size
  2. A large packet is sent over a link with a mismatched MTU on each side
    • Side A is configured for jumbo frames, so it transmits the large packet; it doesn't know there will be a problem on the other side
    • Side B receives the big frame but drops it, as it exceeds its MTU
    • No ICMP is sent back to the source, as the OS on Side B never sees the packet
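Both scenarios can be probed from a host with DF-set pings (a sketch; `PEER` is a placeholder, and 8972 assumes a 9000-byte MTU minus 28 bytes of IP+ICMP headers):

```shell
# Jumbo-sized ping with the DF bit set. On a healthy jumbo path this
# succeeds; in scenario 1 the source gets "Frag needed" back; in
# scenario 2 it silently times out (the classic MTU black hole).
ping -M do -s 8972 -c 3 PEER

# Walk the path MTU down hop by hop.
tracepath PEER
```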

In our infrastructure, unfortunately, scenario 2 exists. We have *all* our network interfaces configured for jumbo frames across the network. Therefore a jumbo frame sent to a host with MTU=1500 will be transmitted to that host by our switch, and we get scenario 2. This causes no issues if everything is using jumbo frames, or if everything is using normal size. It's also generally fine with TCP flows as the hosts set an appropriate MSS and a sender won't exceed what the other side can receive.
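The MSS claim above can be spot-checked on established OSD connections (a sketch; the 6800-7300 port range is assumed, and a `sport` filter would be needed for the reverse direction):

```shell
# Show per-connection TCP info, including the negotiated MSS,
# for established connections to the assumed OSD port range.
ss -tin state established '( dport >= :6800 and dport <= :7300 )'
```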

The way to properly ensure this can never happen is to set the switch-side MTU to match the host's in all cases, but that adds an additional layer of complexity and coordination we currently do not have.