Investigate whether there is any way to let Ceph degrade service (rather than kill other OSDs) if jumbo frames start being dropped somewhere on the network.
Things to verify:
- Is the Don't Fragment (DF) flag set on heartbeat traffic? Can that be configured?
  - How to verify: run tcpdump on heartbeat traffic and check the flags on the packets
- What size are the heartbeat packets? Can that be configured? (Yes, it can; osd_heartbeat_min_size looks like the relevant option.)
  - How to verify: run tcpdump on heartbeat traffic and check the maximum packet size
- Is the Don't Fragment (DF) flag set on regular traffic? Can that be configured?
  - How to verify: run tcpdump on osd<->osd traffic and check the flags on the packets
- Can regular Ceph traffic adapt to the discovered MTU of the network path (as opposed to always using the maximum MTU of the interface)? If so, can that be configured?
  - How to verify: TBD
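The tcpdump checks above could be sketched roughly as follows. This is an untested sketch, not an agreed procedure: the function names are made up for illustration, the port range 6800-7300 is assumed to be the default OSD bind range and should be checked against the cluster, and the filter matches all OSD traffic in that range rather than heartbeats only. `ip[6] & 0x40 != 0` selects IPv4 packets with the DF bit set, and `ping -M do` (Linux iputils) sends probes that must not be fragmented.

```shell
#!/usr/bin/env bash
# Sketch only: interface, peer address and port range are placeholders.

# BPF filter matching IPv4 TCP packets with the Don't Fragment bit set.
# Byte 6 of the IPv4 header carries the flags; 0x40 is DF.
df_filter() {
    # $1: TCP port range used by the OSDs (assumed default: 6800-7300)
    printf 'ip[6] & 0x40 != 0 and tcp portrange %s' "$1"
}

# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Capture a few DF-flagged packets; -v prints the IP flags and length.
# Needs root.
capture_df_packets() {
    local iface="$1" ports="${2:-6800-7300}"
    tcpdump -ni "$iface" -c 20 -v "$(df_filter "$ports")"
}

# Probe whether jumbo-sized packets still reach a peer without fragmentation.
probe_path_mtu() {
    local peer="$1" mtu="${2:-9000}"
    ping -M do -c 3 -s "$(icmp_payload_for_mtu "$mtu")" "$peer"
}
```

For example, `capture_df_packets eth0` on an OSD host, then `probe_path_mtu` with the address of another OSD host; `eth0` is a placeholder interface name.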
Other stuff
Current OSD heartbeat config options:
root@cloudcephosd1001:~# ceph config show-with-defaults osd.33
...
osd_heartbeat_grace                 20        default
osd_heartbeat_interval              6         default
osd_heartbeat_min_healthy_ratio     0.330000  default
osd_heartbeat_min_peers             10        default
osd_heartbeat_min_size              2000      default   <- this seems the most interesting
osd_heartbeat_stale                 600       default
osd_heartbeat_use_min_delay_socket  false     default
...
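If osd_heartbeat_min_size turns out to be the knob to experiment with, reading and changing it at runtime might look like the sketch below. The helper names and the dry-run wrapper are hypothetical; the value passed in is up to the experiment and is not a recommendation. By default the helpers only print the `ceph config` command they would run.

```shell
#!/usr/bin/env bash
# Sketch only: prints the command unless RUN=1 is set in the environment.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Show the value currently in effect for one OSD (e.g. osd.33).
get_heartbeat_min_size() {
    run ceph config get "$1" osd_heartbeat_min_size
}

# Set a new minimum heartbeat size, in bytes, for all OSDs.
set_heartbeat_min_size() {
    run ceph config set osd osd_heartbeat_min_size "$1"
}
```

Calling `set_heartbeat_min_size 1000` prints the command for review; prefixing it with `RUN=1` would actually apply it.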
Related code/bugs/docs:
- https://tracker.ceph.com/issues/20087 (5 years old): where jumbo frame support was added to heartbeats
- https://github.com/ceph/ceph/blob/main/src/messages/MOSDPing.h#L139: definition of the heartbeat messages
- https://docs.ceph.com/en/latest/rados/configuration/mon-osd-interaction/#osds-report-down-osds: how OSDs flag other OSDs as down; the following snippet is interesting:
mon osd reporter subtree level is used to group the peers into the “subcluster” by their common ancestor type in CRUSH map. By default, only two reports from different subtree are required to report another Ceph OSD Daemon down. You can change the number of reporters from unique subtrees and the common ancestor type required to report a Ceph OSD Daemon down to a Ceph Monitor by adding an mon osd min down reporters and mon osd reporter subtree level settings under the [mon] section of your Ceph configuration file, or by setting the value at runtime.
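Setting those two options at runtime, as the snippet mentions, could look something like this. It is only a sketch: the helper names and the dry-run wrapper are hypothetical, and the values used in the usage line (3 reporters, `rack` as the subtree level) are placeholders for illustration, not a proposal.

```shell
#!/usr/bin/env bash
# Sketch only: prints the command unless RUN=1 is set in the environment.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Number of reporters from distinct subtrees needed to mark an OSD down.
set_min_down_reporters() {
    run ceph config set mon mon_osd_min_down_reporters "$1"
}

# CRUSH bucket type used to group reporters into subtrees.
set_reporter_subtree_level() {
    run ceph config set mon mon_osd_reporter_subtree_level "$1"
}
```

For example, `set_min_down_reporters 3` and `set_reporter_subtree_level rack` print the commands for review without changing anything.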