
Cloudcephosd: migrate to single network uplink
Open, MediumPublic

Description

I am opening this task to specifically look at migrating the cloudcephosd* hosts to a single network uplink, from their current dual-link setup. The other cloud hosts have already been moved to a single uplink under T319184.

Current Setup

Cloudcephosd hosts connect to two separate networks. Following previous best practice from the Ceph project, they are configured with a public and a cluster network. The cloudcephosd hosts use their WMF production realm / primary network uplink for the 'public' Ceph connectivity, and have a second port connected to a cloud-storage1 vlan which they use for the 'cluster' connectivity.

Issue

The issue with the current setup is that the use of two ports increases the cost and complexity of the network infrastructure: a single top-of-rack switch has insufficient ports to serve a full rack of cloudcephosd hosts. They are the only hosts in our DCs with dual connections, which complicates our provisioning/automation; as things stand we need manual edits to get things working.

If we examine our ceph nodes we see that peak combined usage across each host's 2x10G links does not exceed 10Gb/s. Those peaks are only occasionally observed; typical combined usage across both ports is in the region of 1-2Gb/s. So in terms of bandwidth a single connection should suffice (and further, there is some scope to use 25G links where available).

Setup/migration

To move from dual network links to a single one we will add a new vlan sub-interface to each host's primary network uplink. This separate logical interface will connect the cluster/storage subnet on each host, instead of the second physical port.
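For illustration (not the exact puppet-generated config), such a vlan sub-interface in ifupdown terms could look like the sketch below. The interface name, vlan id and address are taken from the rack E4 values as an example:

```text
# vlan sub-interface on the primary uplink, replacing the second physical port
auto vlan1121
iface vlan1121 inet manual
    vlan-raw-device ens1f0np0
    up ip addr add 192.168.5.14/24 dev vlan1121
    mtu 9000
```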

The vlans used for storage are as follows (racks C8 & D5 share the same subnet/vlan due to the way things evolved historically):

Site    Rack   Vlan ID   Subnet           Vlan Name
eqiad   C8     1106      192.168.4.0/24   cloud-storage1-eqiad
eqiad   D5     1106      192.168.4.0/24   cloud-storage1-eqiad
eqiad   E4     1121      192.168.5.0/24   cloud-storage1-e4-eqiad
eqiad   F4     1122      192.168.6.0/24   cloud-storage1-f4-eqiad
codfw   B1     2106      192.168.4.0/24   cloud-storage1-b-codfw

The IPs configured for the storage network are (afaik) set up in puppet here, which also references the interface the storage IP goes on (currently the physical second link).

We need to discuss and work out the exact way to configure the new interface in puppet, and also how to introduce it gracefully and migrate from the current setup. At a very high level an approach like this might work:

  • Ensure the storage vlan is trunked to the primary interface of all the cloudcephosd hosts on the switch side (non-disruptive, netops can do it)
  • Create puppet patches to add the new vlan-subinterface for the appropriate vlan id as a child of the main physical
    • Similar to how the cloud-private is added on other hosts
    • Merely creating the interface - with no IPs on it - won't cause any existing traffic paths to change
  • Starting with the new hosts we can then change cluster network 'iface' in hiera from the physical second port to the new vlan interface
    • We also need to make sure the aggregate 192.168.0.0/16 route is present (it should be)
  • Once all cloudcephosd hosts have the 'iface' for the cluster network as the sub-int we can remove the second physical links
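The route check mentioned above can be sketched as a quick script. This is hypothetical and runs against a captured `ip r` excerpt (rack E4 values), since the real check needs a live host; on a host you would substitute `routes=$(ip r | grep '^192\.168')`:

```shell
# Captured storage routes after the hiera 'iface' flip; on a real host:
#   routes=$(ip r | grep '^192\.168')
routes='192.168.4.0/24 via 192.168.5.254 dev vlan1121
192.168.5.0/24 dev vlan1121 proto kernel scope link src 192.168.5.14
192.168.6.0/24 via 192.168.5.254 dev vlan1121'

# All storage routes should now use the vlan sub-interface, not the
# second physical NIC (ens1f1np1 in the rack E4 example).
if printf '%s\n' "$routes" | grep -q 'dev ens1f1np1'; then
    echo "FAIL: storage routes still on the second physical NIC"
else
    echo "OK: all storage routes on the vlan sub-interface"
fi
```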

Happy to discuss further. We can probably trial the setup/automation on the new high-density ceph hosts (T394333).

Procedures

Logical side

  • For each double-nic host:
    • Send a puppet change flipping single_iface: true for its configuration
    • Merge and apply the puppet change on the affected host
    • Verify ceph is happy and stays happy, e.g. ceph health on cloudcontrol1006
  • If the host fails to come back on the network after the puppet run:
    • Revert the single_iface: true puppet patch and merge
    • Use the host console to log in and restore connectivity by configuring the second interface
    • Run puppet on the host for clean up actions
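The "verify ceph is happy" step above could be sketched like this. The `ceph health` output is simulated with a captured string here (HEALTH_OK is a real Ceph state, but the surrounding script is illustrative); on cloudcontrol1006 you would replace the assignment with `health_output=$(ceph health)`:

```shell
# Simulated; on a real controller: health_output=$(ceph health)
health_output="HEALTH_OK"

case "$health_output" in
    HEALTH_OK)    echo "ceph is happy, move on to the next host" ;;
    HEALTH_WARN*) echo "warning state, wait for recovery/rebalance to settle" ;;
    *)            echo "unexpected state ($health_output), stop and investigate" ;;
esac
```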

Physical side

Assuming the above is complete for all hosts (eqiad and codfw) we can proceed with:

  • Disable extra ports in netbox, and deploy changes
  • Unplug extra network cables

Status

Logical side done

  • cloudcephosd1035
  • cloudcephosd1036
  • cloudcephosd1037
  • cloudcephosd1038
  • cloudcephosd1039
  • cloudcephosd1040
  • cloudcephosd1041
  • cloudcephosd1042
  • cloudcephosd1043
  • cloudcephosd1044
  • cloudcephosd1045
  • cloudcephosd1046
  • cloudcephosd1047
  • cloudcephosd1048
  • cloudcephosd1049
  • cloudcephosd1050
  • cloudcephosd1051
  • cloudcephosd1052
  • cloudcephosd2004-dev.codfw.wmnet
  • cloudcephosd2005-dev.codfw.wmnet
  • cloudcephosd2006-dev.codfw.wmnet
  • cloudcephosd2007-dev.codfw.wmnet

Event Timeline

cmooney triaged this task as Medium priority.
Restricted Application added a subscriber: Aklapper.
cmooney updated the task description.
Andrew changed the task status from Open to Stalled. Aug 4 2025, 3:05 PM

As discussed in today's meeting I believe all the cloudcephosd hosts have jumbo frames enabled on all their physical interfaces.

So there should be no problem creating a vlan sub-interface from the host's primary uplink with MTU=9000. I suspect the host I previously tried to test this on had been reimaged under the 'insetup' role, so its primary link only had a regular MTU.
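A quick pre-check for this can be sketched as follows. The interface name is a stand-in (on these hosts it would be e.g. ens1f0np0), since the check only meaningfully runs on the host itself; a vlan child's MTU cannot exceed its parent's:

```shell
# Substitute the primary uplink name, e.g. ens1f0np0 on cloudcephosd hosts
iface=lo
mtu=$(cat /sys/class/net/"$iface"/mtu 2>/dev/null || echo 0)
if [ "$mtu" -ge 9000 ]; then
    echo "$iface MTU $mtu: jumbo frames available for an MTU=9000 sub-interface"
else
    echo "$iface MTU $mtu: raise the parent MTU first (the 'insetup' symptom)"
fi
```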

In terms of how to proceed we can probably pick one of the new hosts and see if we can get it working on that alone? @taavi if you can change the switch port to 'trunk' mode and add the tagged vlan for that one host that would be great. Provided that works and we have a way forward to configure in puppet we can script up adding the tagged vlan to the ports connecting all the other cloudcephosd primary links.

We have successfully put cloudcephosd1050 and cloudcephosd1051 in service with a single NIC in T405478: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration. I haven't seen any problem whatsoever with those hosts, so I think we're ready to roll out this change to the rest of the fleet.

@taavi @Andrew @cmooney what do you think of the above? If everything works as expected, then I think we can also explore retrofitting the 10G hosts with extra 25G NICs, if that's doable.

@taavi @Andrew @cmooney what do you think of the above?

The plan sounds good. We need to audit and make sure all the primary links to the cloudcephosd* hosts are set to trunk mode, with the storage vlan as one of the tagged vlans (in Netbox, then a homer run if we change anything). This change can be made non-disruptively in advance, however, so it shouldn't be a blocker.

if everything works as expected then I think we can also explore retrofitting 10g hosts with the extra 25g nics if that's something doable

We need to bear in mind that only the switches in racks E4 and F4 support 25G. But in those racks this should be possible. There is a small complication in that port speeds have to be set in blocks of four, so we will need to juggle hosts to new ports as we move, but it shouldn't be an issue.

fgiunchedi changed the task status from Stalled to Open. Oct 29 2025, 7:49 AM

@taavi @Andrew @cmooney what do you think of the above?

The plan sounds good. We need to audit and make sure all the primary links to the cloudcephosd* hosts are set to trunk mode, with the storage vlan as one of the tagged vlans (in Netbox, then a homer run if we change anything). This change can be made non-disruptively in advance, however, so it shouldn't be a blocker.

SGTM, I have opened T409690 to track this work. In the meantime 1048 and 1049 do seem configured correctly to me, so I'll use those to test procedures (https://netbox.wikimedia.org/dcim/interfaces/42510/ and https://netbox.wikimedia.org/dcim/interfaces/42507/)

if everything works as expected then I think we can also explore retrofitting 10g hosts with the extra 25g nics if that's something doable

We need to bear in mind that only the switches in racks E4 and F4 support 25G. But in those racks this should be possible. There is a small complication in that port speeds have to be set in blocks of four, so we will need to juggle hosts to new ports as we move, but it shouldn't be an issue.

Good to know! thank you

Change #1203383 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] cloudcephosd: switch 1048 to single interface

https://gerrit.wikimedia.org/r/1203383

Change #1203384 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] cloudcephosd: switch 1049 to single interface

https://gerrit.wikimedia.org/r/1203384

OSD nodes up through 1034 are scheduled for decom in 2026. Unless there's an urgent port shortage, we should only reconfigure 1035 and above, to avoid sending DC ops on multiple visits to the older hosts.

Mentioned in SAL (#wikimedia-cloud) [2025-11-17T15:49:13Z] <godog> set ceph cluster noout/norebalance and move cloudcephosd1048 to single nic - T399180

Change #1203383 merged by Filippo Giunchedi:

[operations/puppet@production] cloudcephosd: switch 1048 to single interface

https://gerrit.wikimedia.org/r/1203383

Mentioned in SAL (#wikimedia-cloud) [2025-11-17T15:58:11Z] <godog> set ceph cluster back to out and rebalance - T399180

Change #1203384 merged by Filippo Giunchedi:

[operations/puppet@production] cloudcephosd: switch 1049 to single interface

https://gerrit.wikimedia.org/r/1203384

Mentioned in SAL (#wikimedia-cloud) [2025-11-18T10:00:18Z] <godog> switch cloudcephosd1049 to single nic - T399180

Change #1207739 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] cloudcephosd: move row C hosts to single NIC

https://gerrit.wikimedia.org/r/1207739

Change #1207740 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] cloudcephosd: move row D hosts to single NIC

https://gerrit.wikimedia.org/r/1207740

Change #1207741 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] cloudcephosd: move rack E4 hosts to single NIC

https://gerrit.wikimedia.org/r/1207741

Change #1207742 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] cloudcephosd: move rack F4 hosts to single NIC

https://gerrit.wikimedia.org/r/1207742

Change #1207743 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] cloudcephosd: move codfw hosts to single NIC

https://gerrit.wikimedia.org/r/1207743

Change #1207739 merged by Filippo Giunchedi:

[operations/puppet@production] cloudcephosd: move row C hosts to single NIC

https://gerrit.wikimedia.org/r/1207739

Change #1207740 merged by Filippo Giunchedi:

[operations/puppet@production] cloudcephosd: move row D hosts to single NIC

https://gerrit.wikimedia.org/r/1207740

Change #1207741 merged by Filippo Giunchedi:

[operations/puppet@production] cloudcephosd: move rack E4 hosts to single NIC

https://gerrit.wikimedia.org/r/1207741

Change #1207742 merged by Filippo Giunchedi:

[operations/puppet@production] cloudcephosd: move rack F4 hosts to single NIC

https://gerrit.wikimedia.org/r/1207742

Change #1207743 merged by Filippo Giunchedi:

[operations/puppet@production] cloudcephosd: move codfw hosts to single NIC

https://gerrit.wikimedia.org/r/1207743

The logical side on the hosts is done. Next up is deleting the interfaces from netbox and unplugging the network cables. I'll file subtasks.

I took a look at why cloudcephosd1052 still has its second NIC up; currently:

4: ens1f1np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether 04:32:01:57:55:61 brd ff:ff:ff:ff:ff:ff
    altname enp13s0f1np1
    inet 192.168.5.14/24 scope global ens1f1np1
       valid_lft forever preferred_lft forever
    inet6 fe80::632:1ff:fe57:5561/64 scope link
       valid_lft forever preferred_lft forever
7: vlan1121@ens1f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether 04:32:01:57:55:60 brd ff:ff:ff:ff:ff:ff
    inet 192.168.5.14/24 scope global vlan1121
       valid_lft forever preferred_lft forever
    inet6 fe80::632:1ff:fe57:5560/64 scope link
       valid_lft forever preferred_lft forever
root@cloudcephosd1052:~# ip r
default via 10.64.148.1 dev ens1f0np0 onlink 
10.64.148.0/24 dev ens1f0np0 proto kernel scope link src 10.64.148.31 
192.168.4.0/24 via 192.168.5.254 dev ens1f1np1 
192.168.5.0/24 dev ens1f1np1 proto kernel scope link src 192.168.5.14 
192.168.5.0/24 dev vlan1121 proto kernel scope link src 192.168.5.14 
192.168.6.0/24 via 192.168.5.254 dev ens1f1np1

Per puppet:

"cloudcephosd1052.eqiad.wmnet":
  public:
    addr: "10.64.148.31"
    iface: "ens1f0np0"
  cluster:
    addr: "192.168.5.14"
    prefix: "24"
    iface: "ens1f1np1"
    single_iface: true

And /etc/network/interfaces:

# The primary network interface
allow-hotplug ens1f0np0
iface ens1f0np0 inet static
        address 10.64.148.31/24
        gateway 10.64.148.1
        # dns-* options are implemented by the resolvconf package, if installed
        dns-nameservers 10.3.0.1
        dns-search eqiad.wmnet
        pre-up /sbin/ip token set ::10:64:148:31 dev ens1f0np0
        up ip addr add 2620:0:861:11c:10:64:148:31/64 dev ens1f0np0
   mtu 9000
allow-hotplug enp13s0f1np1
iface enp13s0f1np1 inet manual
   up ip addr add 192.168.5.14/24 dev enp13s0f1np1
   mtu 9000
   post-up ip route add 192.168.4.0/24 via 192.168.5.254 dev enp13s0f1np1
   post-up ip route add 192.168.6.0/24 via 192.168.5.254 dev enp13s0f1np1
allow-hotplug ens1f1np1
auto vlan1121
iface vlan1121 inet manual
   vlan-raw-device ens1f0np0
   up ip addr add 192.168.5.14/24 dev vlan1121
   mtu 9000
   post-up ip route add 192.168.4.0/24 via 192.168.5.254 dev vlan1121
   pre-down ip route del 192.168.4.0/24 via 192.168.5.254 dev vlan1121
   post-up ip route add 192.168.6.0/24 via 192.168.5.254 dev vlan1121
   pre-down ip route del 192.168.6.0/24 via 192.168.5.254 dev vlan1121
allow-hotplug vlan1121

It seems enp13s0f1np1 (an altname of ens1f1np1, per the output above) has stayed configured in ifupdown, though no interface appears under that name on the system. I think the easiest would be to:

  • Remove the spurious enp13s0f1np1 config, run puppet to verify no other changes will be applied
  • Make sure the ifupdown config matches e.g. cloudcephosd1050, modulo addresses
  • Reboot the host and verify addresses/interfaces come up as expected
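The first cleanup bullet could be sketched as follows. This is simulated on a temp copy of a trimmed-down /etc/network/interfaces (the real edit happens on the host); the stanza names match the config quoted above:

```shell
# Build a small sample config containing the stale enp13s0f1np1 stanza
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
allow-hotplug enp13s0f1np1
iface enp13s0f1np1 inet manual
   up ip addr add 192.168.5.14/24 dev enp13s0f1np1
   mtu 9000
allow-hotplug ens1f1np1
auto vlan1121
iface vlan1121 inet manual
   vlan-raw-device ens1f0np0
EOF

# Drop any stanza whose header line mentions enp13s0f1np1, including its
# indented continuation lines, and keep everything else.
cleaned=$(awk '/^(allow-hotplug|iface|auto)/{keep = ($0 !~ /enp13s0f1np1/)} keep' "$tmp")
printf '%s\n' "$cleaned"
rm -f "$tmp"
```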

I think the easiest would be to:

  • Remove the spurious enp13s0f1np1 config, run puppet to verify no other changes will be applied
  • Make sure the ifupdown config matches e.g. cloudcephosd1050, modulo addresses
  • Reboot the host and verify addresses/interfaces come up as expected

Yeah I think this should be ok.

Mentioned in SAL (#wikimedia-operations) [2025-12-16T13:01:39Z] <godog> fix network configuration and reboot cloudcephosd1052 - T399180

I think the easiest would be to:

  • Remove the spurious enp13s0f1np1 config, run puppet to verify no other changes will be applied
  • Make sure the ifupdown config matches e.g. cloudcephosd1050, modulo addresses
  • Reboot the host and verify addresses/interfaces come up as expected

Yeah I think this should be ok.

This is done, cloudcephosd1052 now runs on a single interface as well