
Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration
Closed, Resolved · Public

Description

Following up from today's wmcs/network sync: it was decided to test-run a single-NIC configuration for cloudcephosd. This is also a follow-up on T399180: Cloudcephosd: migrate to single network uplink. Note that both NICs on cloudcephosd hosts have had MTU 9000 since T315446: Allow jumbo frames between cloud hosts in production realm.

The end state, as far as this task is concerned, is as follows:

  • One 25G NIC connected on cloudcephosd1050 and cloudcephosd1051
  • On Linux the NIC will carry untagged traffic on the main network, like everything else
  • There will be a sub-interface on VLAN id XXX for the storage network
  • IP addressing remains unchanged
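
As a sketch, the resulting interfaces(5) stanzas on such a host could look roughly like this (using the VLAN id 1122 and the addressing that appear later in this task; this is illustrative, not the actual Puppet-generated config):

```
# Primary 25G NIC: untagged traffic on the main network
auto eno12399np0
iface eno12399np0 inet static
    address 10.64.149.32/24
    mtu 9000

# Tagged sub-interface for the storage network (VLAN 1122)
auto vlan1122
iface vlan1122 inet manual
    vlan-raw-device eno12399np0
    up ip addr add 192.168.6.13/24 dev vlan1122
    mtu 9000
```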

Event Timeline

Change #1191086 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Cloudcephosd1050: Configure ceph with a single nic

https://gerrit.wikimedia.org/r/1191086

I can confirm the switch is already set to accept tagged traffic for the storage VLAN on the ports connecting to these two hosts (I think it was set that way from before).

@fgiunchedi, 1050 and 1051 should already be fully puppetized with Ceph 18.x packages and mostly ready to go. The next step to get them in service is the 'wmcs.ceph.osd.bootstrap_and_add' cookbook.

It runs some tests ahead of time to make sure that versions and networking are correct; I don't know offhand whether the network tests will pass or fail with a single-NIC config, so you might wind up needing to patch the cookbook.

It will also take a day or two for the whole server to pool, so I would probably just ctrl-c out of it while it's waiting for the first group to rebalance; that way we can give the single-nic setup a good long trial with one drive before committing too much. You'll also need some extra flags since 1050 is different from the default OSD node... your full command will look something like this:

$ sudo cookbook wmcs.ceph.osd.bootstrap_and_add --osd-hostname cloudcephosd1050 --cluster-name eqiad1 --skip-reboot --expected-ceph-version 18 --os-hw-raid --expected-osd-drives 10 --batch-size 1

That will format and prepare all 10 drives, then depool all 10, and then pool 1 and wait for rebalancing. The near-eternal rebalancing phase will look like this:

Cluster still has (580031) misplaced objects, at the current 144 obj/s should take 1:06:47.708183 to finish, waiting 0:00:10 (timeout=8:00:00, elapsed=5:03:19.073537)...

but the time estimates will be extremely inaccurate. It's fine to ctrl-c out of that phase; we have another cookbook (wmcs.ceph.osd.undrain_node) which we can run at a later date to resume adding the drives one-by-one.
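
For reference, the ETA in that progress line is just the misplaced-object count divided by the current recovery rate. A quick back-of-the-envelope in Python, with the numbers from the log line above (the reported rate is rounded, so the result differs slightly from the cookbook's estimate):

```python
from datetime import timedelta

misplaced = 580_031   # misplaced objects, from the log line
rate = 144            # objects/s, as reported (rounded)

eta = timedelta(seconds=misplaced / rate)
print(eta)            # roughly 1:07:08, close to the cookbook's 1:06:47
```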

Oh, also: if you want to reimage either of them, everything currently needs to be Bookworm, and the Debian installer sometimes gets caught up during the partman phase. I've had good luck connecting to the console, escaping out of the partition phase in the installer, and trying again; it seems to always work the second time.

taavi triaged this task as Medium priority. Oct 1 2025, 2:15 PM

Change #1191086 merged by Filippo Giunchedi:

[operations/puppet@production] Cloudcephosd1050: Configure ceph with a single nic

https://gerrit.wikimedia.org/r/1191086

Change #1194933 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] cloudceph: fix single-nic vlan interface specification

https://gerrit.wikimedia.org/r/1194933

Change #1194933 merged by Filippo Giunchedi:

[operations/puppet@production] cloudceph: fix single-nic vlan interface specification

https://gerrit.wikimedia.org/r/1194933

Change #1194967 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] cloudceph: handle double -> single NIC transition

https://gerrit.wikimedia.org/r/1194967

cloudcephosd1050 is now running with a single NIC, and at least I can ping it fine on its 192.168 address. I'll start putting the host in service next week.
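
One quick way to check that jumbo frames survive the single-NIC move is to ping with a payload sized to exactly fill a 9000-byte MTU while forbidding fragmentation. A sketch, using the host's storage address from this task:

```shell
# Max ICMP payload for a 9000-byte MTU: 9000 - 20 (IPv4 hdr) - 8 (ICMP hdr)
payload=$((9000 - 20 - 8))
echo "$payload"   # 8972
# Then, from another cloud host on the storage network:
#   ping -c 3 -M do -s "$payload" 192.168.6.13
```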

Change #1195192 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] interface: only bring down existing tagged interfaces

https://gerrit.wikimedia.org/r/1195192

Change #1195193 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] interface: add pre_down_command define

https://gerrit.wikimedia.org/r/1195193

Change #1195194 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] interface: del route on interface down

https://gerrit.wikimedia.org/r/1195194

Change #1195192 merged by Filippo Giunchedi:

[operations/puppet@production] interface: only bring down existing tagged interfaces

https://gerrit.wikimedia.org/r/1195192

Change #1195193 merged by Filippo Giunchedi:

[operations/puppet@production] interface: add pre_down_command define

https://gerrit.wikimedia.org/r/1195193

Change #1195194 merged by Filippo Giunchedi:

[operations/puppet@production] interface: del route on interface down

https://gerrit.wikimedia.org/r/1195194

Change #1196372 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] wmcs: introduce cloud_storage_subnet variables

https://gerrit.wikimedia.org/r/1196372

Change #1196372 merged by Filippo Giunchedi:

[operations/puppet@production] wmcs: introduce cloud_storage_subnet variables

https://gerrit.wikimedia.org/r/1196372

Change #1194967 merged by Filippo Giunchedi:

[operations/puppet@production] cloudceph: handle double / single NIC transition

https://gerrit.wikimedia.org/r/1194967

Change #1196798 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: move cloudcephosd1051 to single NIC

https://gerrit.wikimedia.org/r/1196798

Change #1196798 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: move cloudcephosd1051 to single NIC

https://gerrit.wikimedia.org/r/1196798

For the record, these are the effects of moving a host from double to single NIC:

root@cloudcephosd1051:~# ip a l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute 
       valid_lft forever preferred_lft forever
2: eno8303: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c4:cb:e1:f7:3c:7a brd ff:ff:ff:ff:ff:ff
    altname enp2s0f0
3: ens1f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 6c:92:cf:a2:66:70 brd ff:ff:ff:ff:ff:ff
    altname enp10s0f0np0
4: eno8403: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c4:cb:e1:f7:3c:7b brd ff:ff:ff:ff:ff:ff
    altname enp2s0f1
5: ens1f1np1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 6c:92:cf:a2:66:71 brd ff:ff:ff:ff:ff:ff
    altname enp10s0f1np1
6: eno12399np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether 6c:92:cf:a5:80:d0 brd ff:ff:ff:ff:ff:ff
    altname enp31s0f0np0
    inet 10.64.149.32/24 brd 10.64.149.255 scope global eno12399np0
       valid_lft forever preferred_lft forever
    inet6 2620:0:861:11d:10:64:149:32/64 scope global 
       valid_lft 2591988sec preferred_lft 604788sec
    inet6 fe80::6e92:cfff:fea5:80d0/64 scope link 
       valid_lft forever preferred_lft forever
7: eno12409np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether 6c:92:cf:a5:80:d1 brd ff:ff:ff:ff:ff:ff
    altname enp31s0f1np1
    inet 192.168.6.13/24 scope global eno12409np1
       valid_lft forever preferred_lft forever
    inet6 fe80::6e92:cfff:fea5:80d1/64 scope link 
       valid_lft forever preferred_lft forever
8: idrac: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether d0:c1:b5:14:34:b0 brd ff:ff:ff:ff:ff:ff
root@cloudcephosd1051:~# run-puppet-agent 
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for cloudcephosd1051.eqiad.wmnet
Info: Applying configuration version '(34ea9ae410) Filippo Giunchedi - hieradata: move cloudcephosd1051 to single NIC'
Notice: /Stage[main]/Profile::Cloudceph::Osd/Exec[bring-down-extra-iface]/returns: executed successfully
Notice: Augeas[unconfigure-extra-nic](provider=augeas): 
--- /etc/network/interfaces	2025-10-15 07:45:24.196920067 +0000
+++ /etc/network/interfaces.augnew	2025-10-17 07:25:49.288771742 +0000
@@ -19,10 +19,3 @@
 	up ip addr add 2620:0:861:11d:10:64:149:32/64 dev eno12399np0
    mtu 9000
 allow-hotplug eno12409np1
-iface eno12409np1 inet manual
-   up ip addr add 192.168.6.13/24 dev eno12409np1
-   mtu 9000
-   post-up ip route add 192.168.4.0/24 via 192.168.6.254 dev eno12409np1
-   post-up ip route add 192.168.5.0/24 via 192.168.6.254 dev eno12409np1
-   pre-down ip route del 192.168.4.0/24 via 192.168.6.254 dev eno12409np1
-   pre-down ip route del 192.168.5.0/24 via 192.168.6.254 dev eno12409np1

Notice: /Stage[main]/Profile::Cloudceph::Osd/Augeas[unconfigure-extra-nic]/returns: executed successfully
Notice: Augeas[vlan1122](provider=augeas): 
--- /etc/network/interfaces	2025-10-17 07:25:49.316771569 +0000
+++ /etc/network/interfaces.augnew	2025-10-17 07:25:49.404771025 +0000
@@ -19,3 +19,6 @@
 	up ip addr add 2620:0:861:11d:10:64:149:32/64 dev eno12399np0
    mtu 9000
 allow-hotplug eno12409np1
+auto vlan1122
+iface vlan1122 inet manual
+   vlan-raw-device eno12399np0

Notice: /Stage[main]/Profile::Cloudceph::Osd/Interface::Tagged[vlan1122]/Augeas[vlan1122]/returns: executed successfully (corrective)
Info: /Stage[main]/Profile::Cloudceph::Osd/Interface::Tagged[vlan1122]/Augeas[vlan1122]: Scheduling refresh of Exec[/sbin/ifup vlan1122]
Notice: Augeas[vlan1122_manual](provider=augeas): 
--- /etc/network/interfaces	2025-10-17 07:25:49.432770852 +0000
+++ /etc/network/interfaces.augnew	2025-10-17 07:25:49.456770704 +0000
@@ -22,3 +22,4 @@
 auto vlan1122
 iface vlan1122 inet manual
    vlan-raw-device eno12399np0
+allow-hotplug vlan1122

Notice: /Stage[main]/Profile::Cloudceph::Osd/Interface::Manual[osd-cluster]/Augeas[vlan1122_manual]/returns: executed successfully
Notice: Augeas[vlan1122_192.168.6.13/24](provider=augeas): 
--- /etc/network/interfaces	2025-10-17 07:25:49.484770531 +0000
+++ /etc/network/interfaces.augnew	2025-10-17 07:25:49.508770383 +0000
@@ -22,4 +22,5 @@
 auto vlan1122
 iface vlan1122 inet manual
    vlan-raw-device eno12399np0
+   up ip addr add 192.168.6.13/24 dev vlan1122
 allow-hotplug vlan1122

Notice: /Stage[main]/Profile::Cloudceph::Osd/Interface::Ip[osd-cluster-ip]/Augeas[vlan1122_192.168.6.13/24]/returns: executed successfully
Notice: Augeas[vlan1122_osd-cluster-mtu](provider=augeas): 
--- /etc/network/interfaces	2025-10-17 07:25:49.536770210 +0000
+++ /etc/network/interfaces.augnew	2025-10-17 07:25:49.556770087 +0000
@@ -23,4 +23,5 @@
 iface vlan1122 inet manual
    vlan-raw-device eno12399np0
    up ip addr add 192.168.6.13/24 dev vlan1122
+   mtu 9000
 allow-hotplug vlan1122

Notice: /Stage[main]/Profile::Cloudceph::Osd/Interface::Setting[osd-cluster-mtu]/Augeas[vlan1122_osd-cluster-mtu]/returns: executed successfully
Info: Interface::Setting[osd-cluster-mtu]: Scheduling refresh of Exec[set-osd-cluster-mtu]
Notice: /Stage[main]/Profile::Cloudceph::Osd/Exec[set-osd-cluster-mtu]/returns: Cannot find device "vlan1122"
Error: /Stage[main]/Profile::Cloudceph::Osd/Exec[set-osd-cluster-mtu]: Failed to call refresh: '/usr/sbin/ip link set mtu 9000 vlan1122' returned 1 instead of one of [0]
Error: /Stage[main]/Profile::Cloudceph::Osd/Exec[set-osd-cluster-mtu]: '/usr/sbin/ip link set mtu 9000 vlan1122' returned 1 instead of one of [0]
Notice: /Stage[main]/Profile::Cloudceph::Osd/Interface::Tagged[vlan1122]/Exec[/sbin/ifup vlan1122]: Triggered 'refresh' from 1 event
Notice: /Stage[main]/Profile::Cloudceph::Osd/Interface::Route[route_to_192_168_4_0]/Exec[ip route add 192.168.4.0/24 via 192.168.6.254 dev vlan1122]/returns: executed successfully
Notice: /Stage[main]/Profile::Cloudceph::Osd/Interface::Route[route_to_192_168_5_0]/Exec[ip route add 192.168.5.0/24 via 192.168.6.254 dev vlan1122]/returns: executed successfully
Notice: Augeas[post-up_vlan1122_route_to_192_168_4_0_persist](provider=augeas): 
--- /etc/network/interfaces	2025-10-17 07:25:49.584769914 +0000
+++ /etc/network/interfaces.augnew	2025-10-17 07:25:55.256734869 +0000
@@ -24,4 +24,5 @@
    vlan-raw-device eno12399np0
    up ip addr add 192.168.6.13/24 dev vlan1122
    mtu 9000
+   post-up ip route add 192.168.4.0/24 via 192.168.6.254 dev vlan1122
 allow-hotplug vlan1122

Notice: /Stage[main]/Profile::Cloudceph::Osd/Interface::Route[route_to_192_168_4_0]/Interface::Post_up_command[route_to_192_168_4_0_persist]/Augeas[post-up_vlan1122_route_to_192_168_4_0_persist]/returns: executed successfully
Notice: Augeas[pre-down_vlan1122_route_to_192_168_4_0_persist](provider=augeas): 
--- /etc/network/interfaces	2025-10-17 07:25:55.284734696 +0000
+++ /etc/network/interfaces.augnew	2025-10-17 07:25:55.308734548 +0000
@@ -25,4 +25,5 @@
    up ip addr add 192.168.6.13/24 dev vlan1122
    mtu 9000
    post-up ip route add 192.168.4.0/24 via 192.168.6.254 dev vlan1122
+   pre-down ip route del 192.168.4.0/24 via 192.168.6.254 dev vlan1122
 allow-hotplug vlan1122

Notice: /Stage[main]/Profile::Cloudceph::Osd/Interface::Route[route_to_192_168_4_0]/Interface::Pre_down_command[route_to_192_168_4_0_persist]/Augeas[pre-down_vlan1122_route_to_192_168_4_0_persist]/returns: executed successfully
Notice: Augeas[post-up_vlan1122_route_to_192_168_5_0_persist](provider=augeas): 
--- /etc/network/interfaces	2025-10-17 07:25:55.336734375 +0000
+++ /etc/network/interfaces.augnew	2025-10-17 07:25:55.360734227 +0000
@@ -26,4 +26,5 @@
    mtu 9000
    post-up ip route add 192.168.4.0/24 via 192.168.6.254 dev vlan1122
    pre-down ip route del 192.168.4.0/24 via 192.168.6.254 dev vlan1122
+   post-up ip route add 192.168.5.0/24 via 192.168.6.254 dev vlan1122
 allow-hotplug vlan1122

Notice: /Stage[main]/Profile::Cloudceph::Osd/Interface::Route[route_to_192_168_5_0]/Interface::Post_up_command[route_to_192_168_5_0_persist]/Augeas[post-up_vlan1122_route_to_192_168_5_0_persist]/returns: executed successfully
Notice: Augeas[pre-down_vlan1122_route_to_192_168_5_0_persist](provider=augeas): 
--- /etc/network/interfaces	2025-10-17 07:25:55.388734054 +0000
+++ /etc/network/interfaces.augnew	2025-10-17 07:25:55.412733906 +0000
@@ -27,4 +27,5 @@
    post-up ip route add 192.168.4.0/24 via 192.168.6.254 dev vlan1122
    pre-down ip route del 192.168.4.0/24 via 192.168.6.254 dev vlan1122
    post-up ip route add 192.168.5.0/24 via 192.168.6.254 dev vlan1122
+   pre-down ip route del 192.168.5.0/24 via 192.168.6.254 dev vlan1122
 allow-hotplug vlan1122

Notice: /Stage[main]/Profile::Cloudceph::Osd/Interface::Route[route_to_192_168_5_0]/Interface::Pre_down_command[route_to_192_168_5_0_persist]/Augeas[pre-down_vlan1122_route_to_192_168_5_0_persist]/returns: executed successfully
Info: Class[Profile::Cloudceph::Osd]: Unscheduling all events on Class[Profile::Cloudceph::Osd]
Notice: Applied catalog in 10.50 seconds
root@cloudcephosd1051:~# run-puppet-agent 
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for cloudcephosd1051.eqiad.wmnet
Info: Applying configuration version '(34ea9ae410) Filippo Giunchedi - hieradata: move cloudcephosd1051 to single NIC'
Notice: Applied catalog in 10.62 seconds
root@cloudcephosd1051:~# ip a l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute 
       valid_lft forever preferred_lft forever
2: eno8303: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c4:cb:e1:f7:3c:7a brd ff:ff:ff:ff:ff:ff
    altname enp2s0f0
3: ens1f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 6c:92:cf:a2:66:70 brd ff:ff:ff:ff:ff:ff
    altname enp10s0f0np0
4: eno8403: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c4:cb:e1:f7:3c:7b brd ff:ff:ff:ff:ff:ff
    altname enp2s0f1
5: ens1f1np1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 6c:92:cf:a2:66:71 brd ff:ff:ff:ff:ff:ff
    altname enp10s0f1np1
6: eno12399np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether 6c:92:cf:a5:80:d0 brd ff:ff:ff:ff:ff:ff
    altname enp31s0f0np0
    inet 10.64.149.32/24 brd 10.64.149.255 scope global eno12399np0
       valid_lft forever preferred_lft forever
    inet6 2620:0:861:11d:10:64:149:32/64 scope global 
       valid_lft 2591981sec preferred_lft 604781sec
    inet6 fe80::6e92:cfff:fea5:80d0/64 scope link 
       valid_lft forever preferred_lft forever
7: eno12409np1: <BROADCAST,MULTICAST> mtu 9000 qdisc mq state DOWN group default qlen 1000
    link/ether 6c:92:cf:a5:80:d1 brd ff:ff:ff:ff:ff:ff
    altname enp31s0f1np1
8: idrac: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether d0:c1:b5:14:34:b0 brd ff:ff:ff:ff:ff:ff
9: vlan1122@eno12399np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether 6c:92:cf:a5:80:d0 brd ff:ff:ff:ff:ff:ff
    inet 192.168.6.13/24 scope global vlan1122
       valid_lft forever preferred_lft forever
    inet6 fe80::6e92:cfff:fea5:80d0/64 scope link 
       valid_lft forever preferred_lft forever

There's a harmless error/race when setting the MTU, though things eventually converge once the interface is brought up:

Notice: /Stage[main]/Profile::Cloudceph::Osd/Exec[set-osd-cluster-mtu]/returns: Cannot find device "vlan1122"
Error: /Stage[main]/Profile::Cloudceph::Osd/Exec[set-osd-cluster-mtu]: Failed to call refresh: '/usr/sbin/ip link set mtu 9000 vlan1122' returned 1 instead of one of [0]
Error: /Stage[main]/Profile::Cloudceph::Osd/Exec[set-osd-cluster-mtu]: '/usr/sbin/ip link set mtu 9000 vlan1122' returned 1 instead of one of [0]
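
The race happens because the MTU exec fires before `ifup` has created vlan1122. A sketch of how a fix along the lines of https://gerrit.wikimedia.org/r/1197245 ("set mtu only when interfaces exist") could guard the resource in Puppet (resource and parameter names here are assumptions, not the actual patch):

```
exec { 'set-osd-cluster-mtu':
  command => '/usr/sbin/ip link set mtu 9000 vlan1122',
  # only attempt the MTU change once the interface actually exists
  onlyif  => '/usr/sbin/ip link show vlan1122',
}
```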

Nice! I'm eager to see the results of adding it to the cluster: a single NIC might now be faster than before (according to my calculations xd), but it might also overload a switch faster, and mixing both kinds of traffic could surface new behaviors.

Mentioned in SAL (#wikimedia-operations) [2025-10-20T11:52:43Z] <godog> add cloudcephosd1051 to the cluster via wmcs.ceph.osd.bootstrap_and_add - T405478

Indeed, cloudcephosd1051 is now also single-NIC and is currently being added to the cluster with wmcs.ceph.osd.bootstrap_and_add (only one OSD so far). We can put progressively more OSDs in service over the next few days and observe.

Change #1197245 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] cloudceph: set mtu only when interfaces exist

https://gerrit.wikimedia.org/r/1197245

Mentioned in SAL (#wikimedia-cloud-feed) [2025-10-21T15:47:44Z] <filippo@cloudcumin1001> START - Cookbook wmcs.ceph.osd.undrain_node (T405478)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-10-21T23:48:09Z] <filippo@cloudcumin1001> END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) (T405478)

Change #1197245 merged by Filippo Giunchedi:

[operations/puppet@production] cloudceph: set mtu only when interfaces exist

https://gerrit.wikimedia.org/r/1197245

Mentioned in SAL (#wikimedia-cloud-feed) [2025-10-22T07:51:16Z] <filippo@cloudcumin1001> START - Cookbook wmcs.ceph.osd.undrain_node (T405478)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-10-22T15:51:36Z] <filippo@cloudcumin1001> END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) (T405478)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-10-23T07:08:05Z] <filippo@cloudcumin1001> START - Cookbook wmcs.ceph.osd.undrain_node (T405478)

Change #1198209 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[cloud/wmcs-cookbooks@main] ceph: bump timeout for drain/undrain

https://gerrit.wikimedia.org/r/1198209

Change #1198209 merged by Filippo Giunchedi:

[cloud/wmcs-cookbooks@main] ceph: bump timeout for drain/undrain

https://gerrit.wikimedia.org/r/1198209

Mentioned in SAL (#wikimedia-cloud-feed) [2025-10-23T15:08:25Z] <filippo@cloudcumin1001> END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) (T405478)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-10-23T16:02:53Z] <filippo@cloudcumin1001> START - Cookbook wmcs.ceph.osd.undrain_node (T405478)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-10-24T08:21:05Z] <filippo@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) (T405478)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-10-24T11:14:19Z] <filippo@cloudcumin1001> START - Cookbook wmcs.ceph.osd.undrain_node (T405478)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-10-25T18:54:33Z] <filippo@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) (T405478)

This is complete; both hosts are in service with full weight and a single NIC. I'll follow up with the rollout for the rest of the fleet.

-19          69.86284         -   70 TiB    20 TiB    20 TiB  532 MiB   54 GiB   50 TiB  28.29  0.99    -                  host cloudcephosd1050
 29    ssd    6.98628   1.00000  7.0 TiB   2.0 TiB   2.0 TiB   15 MiB  4.2 GiB  5.0 TiB  28.14  0.98  194      up              osd.29
 30    ssd    6.98628   1.00000  7.0 TiB   2.0 TiB   2.0 TiB  360 MiB  6.7 GiB  5.0 TiB  28.36  0.99  197      up              osd.30
 31    ssd    6.98628   1.00000  7.0 TiB   2.0 TiB   2.0 TiB   16 MiB  5.6 GiB  5.0 TiB  28.44  0.99  198      up              osd.31
 32    ssd    6.98628   1.00000  7.0 TiB   2.0 TiB   2.0 TiB   32 MiB  5.7 GiB  5.0 TiB  28.47  0.99  202      up              osd.32
 33    ssd    6.98628   1.00000  7.0 TiB   2.0 TiB   2.0 TiB   16 MiB  5.5 GiB  5.0 TiB  28.33  0.99  200      up              osd.33
 34    ssd    6.98628   1.00000  7.0 TiB   2.0 TiB   2.0 TiB   15 MiB  5.4 GiB  5.0 TiB  28.07  0.98  196      up              osd.34
 35    ssd    6.98628   1.00000  7.0 TiB   2.0 TiB   2.0 TiB   15 MiB  5.2 GiB  5.0 TiB  28.20  0.98  198      up              osd.35
 36    ssd    6.98628   1.00000  7.0 TiB   2.0 TiB   2.0 TiB   33 MiB  5.3 GiB  5.0 TiB  28.10  0.98  197      up              osd.36
 37    ssd    6.98628   1.00000  7.0 TiB   2.0 TiB   2.0 TiB   16 MiB  5.5 GiB  5.0 TiB  28.68  1.00  199      up              osd.37
 38    ssd    6.98630   1.00000  7.0 TiB   2.0 TiB   2.0 TiB   15 MiB  5.1 GiB  5.0 TiB  28.13  0.98  197      up              osd.38
-22          69.86284         -   70 TiB    20 TiB    20 TiB  159 MiB   52 GiB   50 TiB  28.55  1.00    -                  host cloudcephosd1051
 39    ssd    6.98628   1.00000  7.0 TiB   2.0 TiB   2.0 TiB   16 MiB  4.2 GiB  5.0 TiB  28.90  1.01  200      up              osd.39
 40    ssd    6.98628   1.00000  7.0 TiB   2.0 TiB   2.0 TiB   15 MiB  5.0 GiB  5.0 TiB  28.33  0.99  196      up              osd.40
 41    ssd    6.98628   1.00000  7.0 TiB   2.0 TiB   2.0 TiB   15 MiB  4.4 GiB  5.0 TiB  28.54  1.00  203      up              osd.41
 42    ssd    6.98628   1.00000  7.0 TiB   2.0 TiB   2.0 TiB   16 MiB  5.8 GiB  5.0 TiB  29.10  1.01  199      up              osd.42
 43    ssd    6.98628   1.00000  7.0 TiB   2.0 TiB   2.0 TiB   15 MiB  4.0 GiB  5.0 TiB  28.10  0.98  194      up              osd.43
 44    ssd    6.98628   1.00000  7.0 TiB   2.0 TiB   2.0 TiB   20 MiB  5.2 GiB  5.0 TiB  29.02  1.01  199      up              osd.44
 45    ssd    6.98628   1.00000  7.0 TiB   2.0 TiB   2.0 TiB   15 MiB  6.3 GiB  5.0 TiB  28.38  0.99  198      up              osd.45
 46    ssd    6.98628   1.00000  7.0 TiB   2.0 TiB   2.0 TiB   15 MiB  5.9 GiB  5.0 TiB  28.65  1.00  202      up              osd.46
 47    ssd    6.98628   1.00000  7.0 TiB   2.0 TiB   2.0 TiB   15 MiB  6.0 GiB  5.0 TiB  28.13  0.98  195      up              osd.47
 48    ssd    6.98630   1.00000  7.0 TiB   2.0 TiB   2.0 TiB   15 MiB  5.4 GiB  5.0 TiB  28.31  0.99  196      up              osd.48