ceph: test and decide 1 network interface setup
Closed, Resolved, Public

Description

In T319184: Move WMCS servers to 1 single NIC we're trying to reduce our 10G port footprint. Ceph OSD servers currently use 2 ports each.

Before we move the whole fleet of servers to a single port, we need to test the performance impact and decide whether to go ahead.

The puppet code is mostly ready in https://gerrit.wikimedia.org/r/c/operations/puppet/+/856675/

Several concerns and ideas have been discussed so far, including:

  • the main problem seems to be Ceph OSDs DDoSing themselves during initial OSD synchronization. The control/heartbeat traffic seems to get lost among the huge replication data flows. OSDs are therefore marked as down and the Ceph cluster becomes unreliable (or goes offline outright).
  • a potential solution is introducing some kind of QoS, to make sure the heartbeats always have room on the wire and are never lost.
  • there seems to be some support in Ceph to facilitate this, by setting osd_heartbeat_use_min_delay_socket=true
  • we also agreed on potentially consulting an external expert on this subject for advice.
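
To make the QoS idea a bit more concrete, here is a minimal sketch of what marking a socket as "low delay" looks like at the OS level, so that queues along the path can tell that traffic apart from bulk replication data. This is not Ceph's code and it does not claim to reproduce what osd_heartbeat_use_min_delay_socket sets. The helper names and the TOS value 0x10 (IPTOS_LOWDELAY) are assumptions chosen for illustration only.

```python
import socket

# Assumption for illustration: the classic "low delay" TOS value from
# <netinet/ip.h>. Defined here because Python's socket module does not
# export this constant on every platform.
IPTOS_LOWDELAY = 0x10


def open_low_delay_socket() -> socket.socket:
    """Return a UDP socket whose outgoing packets carry the low-delay TOS value.

    Hypothetical helper: it only marks the packets. Whether they are actually
    prioritised depends on the host qdisc and the switches honouring the mark.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, IPTOS_LOWDELAY)
    return sock


def read_tos(sock: socket.socket) -> int:
    """Read back the TOS value currently configured on the socket."""
    return sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS)


if __name__ == "__main__":
    with open_low_delay_socket() as s:
        print(f"TOS on the socket: {read_tos(s):#x}")
```

Note that the marking alone reserves nothing: the heartbeats only get "room on the wire" if the host queueing discipline and the network gear are configured to treat marked packets preferentially, which is part of what we need to test.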

Event Timeline

aborrero triaged this task as High priority.
aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

I thought that we had already decided to test first and then, depending on the results, make the go/no-go decision for the implementation.

Ok. This task is to track that work. Please rename the title as you see fit.

dcaro renamed this task from "ceph: decide and/or test 1 network interface setup performance" to "ceph: test and decide 1 network interface setup". Dec 19 2022, 3:42 PM