
[cloudceph] test the new DELL hard drives throughput
Closed, Resolved · Public

Description

This means:

(copied from the comment below) Did some tests, and we are in the clear: the new hard drives are performant enough (at low level) to handle the current load we have in the cluster. See the results spreadsheet and summary table in that comment.

  • add the node back to the cluster
  • do some monitoring of the new drive compared to the other ones, just in case (see the monitoring sketch below)
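
A minimal sketch of what that comparative monitoring could look like, assuming standard Ceph and sysstat tooling (the task does not record the exact commands used):

# Per-OSD commit/apply latency; the OSD backed by the new drive should
# stay roughly in line with its peers on the same host.
ceph osd perf

# Extended per-device I/O stats on the host itself (sdc is the new drive).
# -d: devices only, -x: extended stats, -y: skip the since-boot summary.
iostat -dxy 5 sdc sda sdb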

Related Objects

Status      Assigned
Resolved    dcaro
Resolved    taavi
Resolved    dcaro

Event Timeline

dcaro triaged this task as High priority.

Mentioned in SAL (#wikimedia-cloud-feed) [2025-03-27T08:41:00Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.depool_and_destroy (T390134)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-03-27T08:41:07Z] <dcaro@cloudcumin1001> END (FAIL) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=99) (T390134)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-03-27T08:41:27Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.depool_and_destroy (T390134)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-03-27T11:59:57Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.depool_and_destroy (exit_code=0) (T390134)

Icinga downtime and Alertmanager silence (ID=2db5921e-9fd3-4768-9222-3e33bdad8325) set by dcaro@cumin1002 for 20 days, 0:00:00 on 1 host(s) and their services with reason: Installing a disk for testing

cloudcephosd1029.eqiad.wmnet
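
For reference, that kind of downtime is typically set with a Spicerack cookbook; a hypothetical invocation (the exact flags are an assumption, not taken from this task):

# Hypothetical; check "cookbook sre.hosts.downtime --help" for the real flags.
cookbook sre.hosts.downtime --days 20 --reason "Installing a disk for testing" 'cloudcephosd1029.eqiad.wmnet'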

@VRiley-WMF hi! cloudcephosd1029 is ready to get one disk replaced with the new Dell one :)

It's turned off and all, so just turn it on whenever you are finished and ping me here in the task.

Thanks!

While dcaro is on PTO, he's asked me to get the host back up and confirm that the drive appears to the OS. He'll do performance testing when he's back.

After a bit of monkeying with the RAID settings (to mark the new drive as non-RAID), I can now see the drive in lsblk:

sdc              8:32   0     7T  0 disk

The drive presents as 6.98T in the RAID controller's BIOS UI, so it seems that's what we're getting.
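
The two figures agree, by the way: lsblk rounds to whole units, and 6.98 TiB is what a 7.68 TB drive (decimal units, as marketed) comes out to in binary units. The conversion here was done from the controller's BIOS UI; on Dell hosts it can also be scripted through iDRAC's racadm, roughly like this (the disk FQDD below is hypothetical; the real one comes from the pdisks listing):

# List physical disks to find the new drive's FQDD
racadm storage get pdisks
# Convert it to non-RAID / pass-through (FQDD shown is made up)
racadm storage converttononraid:Disk.Bay.2:Enclosure.Internal.0-1:RAID.Integrated.1-1
# Queue the controller config job so the change is applied
racadm jobqueue create RAID.Integrated.1-1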

Thanks @Jclark-ctr! I'm reassigning this to David for performance testing next week.

Did some tests, and we are in the clear: the new hard drives are performant enough (at low level) to handle the current load we have in the cluster:

https://docs.google.com/spreadsheets/d/1KJeexHRXOR6W2gkkqgnIOIxBBXnuylz8pNdE_3iyslo/edit?gid=872250942#gid=872250942

[image: benchmark summary chart (81 KB)]

Metric            Measured improvement   Needed improvement   Difference (green = ok, red = not ok)
read iops         256.73%                183.33%              73.39%
read throughput   203.55%                183.33%              20.22%
write iops        377.22%                212.50%              164.72%
write throughput  277.84%                212.50%              65.34%
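
The Difference column is simply the measured improvement minus the needed one, e.g. 256.73% − 183.33% ≈ 73.39% for read iops (the spreadsheet rounds the underlying values). The task does not record the exact benchmark commands; a hypothetical fio invocation for the random-write leg might look like:

# Hypothetical fio run; these parameters are assumptions, not the ones
# behind the spreadsheet numbers. This destroys any data on /dev/sdc!
fio --name=randwrite --filename=/dev/sdc --direct=1 --ioengine=libaio \
    --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 \
    --runtime=60 --time_based --group_reporting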

Mentioned in SAL (#wikimedia-cloud-feed) [2025-04-24T17:25:40Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T390134)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-04-25T04:29:28Z] <dcaro@cloudcumin1001> END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) (T390134)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-04-26T07:47:53Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T390134)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-04-26T07:48:00Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0) (T390134)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-04-26T07:48:53Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.unset_cluster_maintenance (T390134)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-04-26T07:48:57Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.unset_cluster_maintenance (exit_code=0) (T390134)
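
With the OSD bootstrapped and cluster maintenance unset, the rejoin can be sanity-checked with the standard Ceph CLI:

ceph osd tree   # the new OSD should show as "up" under cloudcephosd1029
ceph -s         # cluster health plus backfill/rebalance progress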

dcaro updated the task description.