In T319184: Move WMCS servers to 1 single NIC, we're trying to reduce our 10G port footprint. Ceph OSD servers currently use 2 ports.
Before we move the whole farm of servers to a single port, we need to assess and/or test the performance impact.
The Puppet code is mostly ready in https://gerrit.wikimedia.org/r/c/operations/puppet/+/856675/
Several concerns and ideas have been discussed so far, including:
- the main problem seems to be Ceph OSDs DDoSing themselves during initial OSD synchronization. The control/heartbeat traffic seems to get lost among the huge replication data flows. OSDs are therefore marked as down and the Ceph cluster becomes unreliable (or goes offline entirely).
- a potential solution is introducing some kind of QoS, to make sure the heartbeats always have room on the wire and are never lost.
- there seems to be some support in Ceph to facilitate this, via osd_heartbeat_use_min_delay_socket=true
- we also agreed on potentially consulting with an external expert on this subject for advice.
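
For reference, a minimal sketch of how the heartbeat option mentioned above could be enabled in ceph.conf. This assumes the option is set in the [osd] section and managed through the usual config file; in our case it would more likely be templated via Puppet. As far as I understand, the option only asks the kernel to prioritize and mark the heartbeat sockets, so it would presumably only help if the host queueing and the switches actually honor that marking. That part is untested here.

```ini
# Sketch only, not yet tested on our clusters.
[osd]
# Ask Ceph to prioritize the OSD heartbeat sockets so that heartbeats
# are less likely to be starved by replication traffic.
osd_heartbeat_use_min_delay_socket = true
```

Whether this needs an OSD restart to take effect is something we should verify before rolling it out.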