
[ceph] Investigate if there's a way to degrade instead of failing when jumbo frames are being dropped in the network
Open, High, Public

Description

Investigate whether there's any way to allow Ceph to degrade service (rather than killing other OSDs) if jumbo frames begin to be dropped in the network.

Things to verify:

  • Is the don't-fragment (DF) bit being set for heartbeat traffic? Can that be configured?
    • How to verify: do a tcpdump of heartbeat traffic and check the flags on the packets
  • What size are the heartbeat packets? Can that be configured? (yes it can)
    • How to verify: do a tcpdump of heartbeat traffic and check their maximum size
  • Is the don't-fragment (DF) bit being set for regular traffic? Can that be configured?
    • How to verify: do a tcpdump of osd<->osd traffic and check the flags on the packets
  • Can Ceph regular traffic adapt to the discovered MTU of the network (as opposed to always using the max MTU of the interface)? If so, can that be configured?
    • How to verify: TBD
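A sketch of the tcpdump checks above (the interface name and the 6800-7300 port range — Ceph's default `ms_bind_port_min`/`ms_bind_port_max` — are assumptions to adjust for the cluster network):

```shell
# Verbose dump of OSD traffic; packets with the DF bit set print "DF"
# among the IP flags, and "length N" shows the packet sizes.
tcpdump -ni eth0 -v 'tcp portrange 6800-7300' -c 20

# Match only packets that already have the DF bit set
# (bit 0x40 of byte 6 of the IP header).
tcpdump -ni eth0 'ip[6] & 0x40 != 0 and tcp portrange 6800-7300' -c 20
```

These need root and live traffic on the capture host, so they are illustrative only.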

Other stuff

Current OSD heartbeat config options:

root@cloudcephosd1001:~# ceph config show-with-defaults osd.33
...
osd_heartbeat_grace                  20        default
osd_heartbeat_interval               6         default
osd_heartbeat_min_healthy_ratio     0.330000   default
osd_heartbeat_min_peers              10        default
osd_heartbeat_min_size               2000      default   <- this seems the most interesting
osd_heartbeat_stale                  600       default
osd_heartbeat_use_min_delay_socket   false     default
...
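The `osd_heartbeat_min_size` knob flagged above can be read and changed at runtime. A hedged sketch (the idea that lowering it keeps heartbeats out of jumbo-frame territory is an assumption to verify, not a confirmed fix):

```shell
# Read the current value from a running OSD.
ceph config get osd.33 osd_heartbeat_min_size

# Heartbeat pings are padded up to this size. If that padding is what
# pushes heartbeats into single jumbo frames on a jumbo-MTU link, then
# lowering it should keep them deliverable even when jumbo frames are
# being dropped -- assumption to verify.
ceph config set osd osd_heartbeat_min_size 0
```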

Related code/bugs/docs:

mon osd reporter subtree level is used to group the peers into the “subcluster” by their common ancestor type in the CRUSH map. By default, only two reports from different subtrees are required to report another Ceph OSD Daemon down. You can change the number of reporters from unique subtrees and the common ancestor type required to report a Ceph OSD Daemon down to a Ceph Monitor by adding mon osd min down reporters and mon osd reporter subtree level settings under the [mon] section of your Ceph configuration file, or by setting the value at runtime.
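Per the quoted doc, both settings can also be changed at runtime, e.g. (values purely illustrative):

```shell
# Require more independent reporters before an OSD is marked down,
# and group reporters by host (host is the default subtree level).
ceph config set mon mon_osd_min_down_reporters 3
ceph config set mon mon_osd_reporter_subtree_level host
```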

Event Timeline

dcaro triaged this task as High priority.
dcaro updated the task description.
dcaro removed dcaro as the assignee of this task. May 22 2025, 7:40 AM

I think there are essentially two potential situations here:

  1. A large packet tries to be sent over a link with a smaller MTU, correctly configured on both sides
    • In this case the router/switch/host trying to send the big packet knows it cannot be sent out the destination interface where the MTU is small
    • It will drop the big packet and send a "packet too big" ICMP back to the source
    • This is the Path MTU Discovery (PMTUD) mechanism, and the source should then retry with a smaller size
  2. A large packet is sent over a link with a mismatched MTU on each side
    • Side A is configured for jumbo frames, so it transmits the large packet; it doesn't know there will be a problem on the other side
    • Side B receives the big frame but drops it, as it exceeds its MTU
    • No ICMP is sent back to the source, as the OS on Side B never sees the packet
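Both scenarios can be probed from a host with DF-set pings (a sketch; `PEER` is a placeholder, and 8972 assumes a 9000-byte MTU minus 28 bytes of IP+ICMP headers):

```shell
# Jumbo-sized ping with the DF bit set. On a healthy jumbo path this
# succeeds; in scenario 1 the source gets "Frag needed" back; in
# scenario 2 it silently times out (the classic MTU black hole).
ping -M do -s 8972 -c 3 PEER

# Walk the path MTU down hop by hop.
tracepath PEER
```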

In our infrastructure, unfortunately, scenario 2 exists. We have *all* our network interfaces configured for jumbo frames across the network. Therefore a jumbo frame sent to a host with MTU=1500 will be transmitted to that host by our switch, and we get scenario 2. This causes no issues if everything is using jumbo frames, or if everything is using normal size. It's also generally fine with TCP flows as the hosts set an appropriate MSS and a sender won't exceed what the other side can receive.
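The MSS claim above can be spot-checked on established OSD connections (a sketch; the 6800-7300 port range is assumed, and a `sport` filter would be needed for the reverse direction):

```shell
# Show per-connection TCP info, including the negotiated MSS,
# for established connections to the assumed OSD port range.
ss -tin state established '( dport >= :6800 and dport <= :7300 )'
```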

The way to properly ensure this can never happen is to set the switch-side MTU to match the host's in all cases, but that adds an additional layer of complexity and coordination we currently do not have.