
Perform fake disk swap on ms-be2088 as test
Closed, Resolved · Public

Description

Hi folks!

In the parent task we are looking for a way to test the state of a new disk added via hot-spare replacement to a Supermicro Config J host. Since ms-be2088 is not in production yet (for Swift traffic), would it be possible to set up something like:

  • Remove a disk (no shutdown)
  • Re-insert the disk to check its state

If the above makes sense, feel free to do it anytime, or ping me beforehand so that, if I am around, I'll be able to quickly check.

Thanks!

Event Timeline

Restricted Application added a subscriber: Aklapper.
elukey triaged this task as Medium priority. · Jan 20 2025, 3:25 PM

@Jhancock.wm @Papaul do you think this test would be enough to "simulate" a usual hot swap for a broken disk? Namely, whether the controller sets the disk into a "foreign state" or similar after re-inserting it. If so we can proceed anytime, and I'll be able to test a couple of options on the OS side. Lemme know!

DCops removed a disk and re-inserted it as-is in its bay. I am going to list my attempts to clear its state (that is indeed showing an unhealthy status):

>>> pprint(r.request("get", "/redfish/v1/Chassis/HA-RAID.0.StorageEnclosure.1/Drives/Disk.Bay.7").json())
{'@odata.etag': '"6df54d3c36bc8bdf7a19584b9accc700"',
 '@odata.id': '/redfish/v1/Chassis/HA-RAID.0.StorageEnclosure.1/Drives/Disk.Bay.7',
 '@odata.type': '#Drive.v1_11_1.Drive',
 'Actions': {'#Drive.SecureErase': {'target': '/redfish/v1/Chassis/HA-RAID.0.StorageEnclosure.1/Drives/Disk.Bay.7/Actions/Drive.SecureErase'},
             'Oem': {'#SmcDrive.Indicate': {'@Redfish.ActionInfo': '/redfish/v1/Chassis/HA-RAID.0.StorageEnclosure.1/Drives/Disk.Bay.7/Oem/Supermicro/IndicateActionInfo',
                                            'target': '/redfish/v1/Chassis/HA-RAID.0.StorageEnclosure.1/Drives/Disk.Bay.7/Actions/Oem/SmcDrive.Indicate'}}},
 'BlockSizeBytes': 512,
 'CapableSpeedGbs': 12,
 'CapacityBytes': 8001020755968,
 'FailurePredicted': False,
 'Id': 'Disk.Bay.7',
 'IndicatorLED': 'Off',
 'Links': {'Volumes': []},
 'LocationIndicatorActive': False,
 'Manufacturer': 'ATA',
 'MediaType': 'HDD',
 'Model': 'HGST HUS728T8TAL',
 'Name': 'Disk.Bay.7',
 'Oem': {'Supermicro': {'@odata.type': '#SmcDriveExtensions.v1_0_0.Drive',
                        'MediaErrCount': 0,
                        'OtherErrCount': 0,
                        'SmartEventReceived': 0,
                        'Temperature': 37}},
 'ReadyToRemove': None,
 'Revision': 'W9U0W9U0',
 'SerialNumber': 'VY1U56RM',
 'Status': {'Health': 'Critical', 'State': 'Disabled'},
 'StatusIndicator': 'Fail'}

I tried with the following (as suggested by Supermicro's support) but the state didn't clear:

pprint(r.request("post", "/redfish/v1/Chassis/HA-RAID.0.StorageEnclosure.1/Drives/Disk.Bay.7/Actions/Drive.SecureErase", json={"EncryptionStatus": "Foreign"}).json())

The call returns a Task id that can be inspected via Redfish, but the task always ends up in state "Exception", without many other details. I also tried other `EncryptionStatus` values, but hit the same issue.
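
For reference, such a task can also be polled via plain Redfish; a minimal curl sketch (the BMC address, credentials, and task id are placeholders):

curl -sk -u "$BMC_USER:$BMC_PASS" \
    "https://<bmc-address>/redfish/v1/TaskService/Tasks/<task-id>" | python3 -m json.tool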

Moreover, I tried storcli:

elukey@ms-be2088:~$ sudo ./storcli64 /c0/e250/s7 show
CLI Version = 007.3103.0000.0000 Aug 22, 2024
Operating system = Linux 5.10.0-33-amd64
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.


Drive Information :
=================

----------------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model            Sp Type 
----------------------------------------------------------------------------
250:7    14 UBad  -  7.276 TB SATA HDD N   N  512B HGST HUS728T8TAL U  -    
----------------------------------------------------------------------------

EID=Enclosure Device ID|Slt=Slot No|DID=Device ID|DG=DriveGroup
DHS=Dedicated Hot Spare|UGood=Unconfigured Good|GHS=Global Hotspare
UBad=Unconfigured Bad|Sntze=Sanitize|Onln=Online|Offln=Offline|Intf=Interface
Med=Media Type|SED=Self Encryptive Drive|PI=PI Eligible
SeSz=Sector Size|Sp=Spun|U=Up|D=Down|T=Transition|F=Foreign
UGUnsp=UGood Unsupported|UGShld=UGood shielded|HSPShld=Hotspare shielded
CFShld=Configured shielded|Cpybck=CopyBack|CBShld=Copyback Shielded
UBUnsp=UBad Unsupported|Rbld=Rebuild


elukey@ms-be2088:~$ sudo ./storcli64 /c0/e250/s7 set jbod
CLI Version = 007.3103.0000.0000 Aug 22, 2024
Operating system = Linux 5.10.0-33-amd64
Controller = 0
Status = Failure
Description = Un-supported command

elukey@ms-be2088:~$ sudo ./storcli64 /c0/e250/s7 set good
CLI Version = 007.3103.0000.0000 Aug 22, 2024
Operating system = Linux 5.10.0-33-amd64
Controller = 0
Status = Failure
Description = Un-supported command

elukey@ms-be2088:~$ sudo ./storcli64 /c0/e250/s7 set good force
CLI Version = 007.3103.0000.0000 Aug 22, 2024
Operating system = Linux 5.10.0-33-amd64
Controller = 0
Status = Failure
Description = Un-supported command

elukey@ms-be2088:~$ sudo ./storcli64 /c0/e250/s7 set online
CLI Version = 007.3103.0000.0000 Aug 22, 2024
Operating system = Linux 5.10.0-33-amd64
Controller = 0
Status = Failure
Description = Un-supported command

This is a preliminary test but it doesn't look good :(

From megacli's perspective, the drive was Unconfigured (bad) and I was able to make it Good, but then, as expected, there is no real support for JBOD:

elukey@ms-be2088:~$ sudo megacli -pdmakejbod -PhysDrv[250:7] -a0
                                     
Adapter: 0: Failed to change PD state at EnclId-250 SlotId-7.

Even after setting the Good state, I cannot find a way via Redfish to make it JBOD again :(

I also tried via the BMC's web UI, which interestingly shows the disk in the Unconfigured (bad) state (and I can neither set it to Good nor do anything else):

Screenshot From 2025-01-24 14-35-47.png (BMC web UI showing the disk as Unconfigured (bad); 123 KB)

I followed up with Supermicro to share our results; let's see what they say.

Hi, try:

{
  "Target": "/redfish/v1/Systems/{SystemId}/Storage/{StorageId}/Actions/StorageController.ClearForeignConfiguration"
}
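
A minimal curl sketch of how that action would be invoked, if the controller exposed it (SystemId is 1 and StorageId is HA-RAID on this host, per the action list below; BMC address and credentials are placeholders):

curl -sk -u "$BMC_USER:$BMC_PASS" -X POST \
    -H "Content-Type: application/json" -d '{}' \
    "https://<bmc-address>/redfish/v1/Systems/1/Storage/HA-RAID/Actions/StorageController.ClearForeignConfiguration"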

@Neobeta61 Hi! Thanks :)

I tested it but the Action is not available afaics:

'Actions': {'Oem': {'#SmcHARAIDController.Save': {'@Redfish.ActionInfo': '/redfish/v1/Systems/1/Storage/HA-RAID/Oem/Supermicro/SaveActionInfo',
                                                  'target': '/redfish/v1/Systems/1/Storage/HA-RAID/Actions/Oem/SmcHARAIDController.Save'},
                    '#SmcStorage.ClearVolumes': {'@Redfish.ActionInfo': '/redfish/v1/Systems/1/Storage/HA-RAID/Oem/Supermicro/ClearVolumesActionInfo',
                                                 'target': '/redfish/v1/Systems/1/Storage/HA-RAID/Actions/Oem/SmcStorage.ClearVolumes'},
                    '#SmcStorage.CreateVolume': {'@Redfish.ActionInfo': '/redfish/v1/Systems/1/Storage/HA-RAID/Oem/Supermicro/CreateVolumeActionInfo',
                                                 'target': '/redfish/v1/Systems/1/Storage/HA-RAID/Actions/Oem/SmcStorage.CreateVolume'}}},

Thanks for checking. It seems like we are not there yet on the API commands from what I gather; I am checking internally if there is an alternative method.

Do you mind filing a support ticket on this issue for tracking? It will make the PM discussion easier.

@elukey it seems we got you the response you needed on the other tickets.

@Neobeta61 Hi! I just followed up on the email threads; I didn't get any response so far, so I tried to summarize my understanding, since I am a bit lost to be honest :) My impression is that hot swap cannot be performed with the S3908 without a reboot/BIOS change, no matter what we try. Is that correct? Did I miss something?

So I think I am understanding 2 issues here:
The card does not move into JBOD mode easily, and you are looking for guidance?
And the drives are importing with foreign configs that need to be cleared before being added.

For the first issue, if you could give me the current firmware level, I can determine if there are updates that need to be applied. The card hardware is identical between the RAID and JBOD firmware modes; it is only a firmware change that would push it to JBOD-only mode.

For the second issue, since these would be failed drives being replaced, could you clear/wipe the config before the drive is manually pulled?

> @Neobeta61 Hi! I just followed up on the email threads; I didn't get any response so far, so I tried to summarize my understanding, since I am a bit lost to be honest :) My impression is that hot swap cannot be performed with the S3908 without a reboot/BIOS change, no matter what we try. Is that correct? Did I miss something?

That is the process we have outlined. I am not saying it cannot be managed in other ways; we have a separate user with ZFS who handles it from the OS stack, but that is the process we have defined.

I've updated the Broadcom 3908's firmware on ms-be2088 as indicated by Supermicro, since the changelog shows some JBOD-related issues that were fixed.

elukey@ms-be2088:~$ sudo ./storcli64 /c0 download file=STG_AOC-S3908L-H8IR-3908-BRCM-UNUSED_20241217_52.31.0-5830_STDsp.rom
Download Completed.     
Flashing image to adapter...
CLI Version = 007.3103.0000.0000 Aug 22, 2024
Operating system = Linux 5.10.0-33-amd64
Controller = 0
Status = Success
Description = F/W Flash Completed. Please reboot the system for the changes to take effect

Current package version = 52.24.0-4766
New package version = 52.31.0-5830

And after a reboot:

elukey@ms-be2088:~$ sudo ./storcli64 /c0/e250/s7 show
CLI Version = 007.3103.0000.0000 Aug 22, 2024
Operating system = Linux 5.10.0-33-amd64
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.


Drive Information :
=================

--------------------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model                Sp Type 
--------------------------------------------------------------------------------
250:7    14 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
--------------------------------------------------------------------------------

EID=Enclosure Device ID|Slt=Slot No|DID=Device ID|DG=DriveGroup
DHS=Dedicated Hot Spare|UGood=Unconfigured Good|GHS=Global Hotspare
UBad=Unconfigured Bad|Sntze=Sanitize|Onln=Online|Offln=Offline|Intf=Interface
Med=Media Type|SED=Self Encryptive Drive|PI=PI Eligible
SeSz=Sector Size|Sp=Spun|U=Up|D=Down|T=Transition|F=Foreign
UGUnsp=UGood Unsupported|UGShld=UGood shielded|HSPShld=Hotspare shielded
CFShld=Configured shielded|Cpybck=CopyBack|CBShld=Copyback Shielded
UBUnsp=UBad Unsupported|Rbld=Rebuild

So I think that we should try to repeat the swap out/in to make sure that we solved the issue, but it looks good for the moment!

@elukey I pulled and then reinserted a disk. All yours.

Thanks a lot @Jhancock.wm!

This is what I see:

[Mon Feb 24 15:23:26 2025] sd 0:2:7:0: SCSI device is removed
[Mon Feb 24 15:23:26 2025] sd 0:2:7:0: [sdh] Synchronizing SCSI cache
[Mon Feb 24 15:23:26 2025] sd 0:2:7:0: [sdh] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Mon Feb 24 15:23:26 2025] XFS (sdh1): Unmounting Filesystem
[Mon Feb 24 15:23:26 2025] XFS (sdh1): log I/O error -5
[Mon Feb 24 15:23:26 2025] XFS (sdh1): xfs_do_force_shutdown(0x2) called from line 1211 of file fs/xfs/xfs_log.c. Return address = 000000006949579e
[Mon Feb 24 15:23:26 2025] XFS (sdh1): Log I/O Error Detected. Shutting down filesystem
[Mon Feb 24 15:23:26 2025] XFS (sdh1): Unable to update superblock counters. Freespace may not be correct on next mount.
[Mon Feb 24 15:23:26 2025] XFS (sdh1): Please unmount the filesystem and rectify the problem(s)

[Mon Feb 24 15:23:26 2025] megaraid_sas 0000:98:00.0: 18993 (793725806s/0x0001/FATAL) - JBOD 07 for PD 0e(e0xfa/s7) is now OFFLINE <======

And again:

elukey@ms-be2088:~$ sudo ./storcli64 /c0/e250/s7 show
CLI Version = 007.3103.0000.0000 Aug 22, 2024
Operating system = Linux 5.10.0-33-amd64
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.


Drive Information :
=================

----------------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model            Sp Type 
----------------------------------------------------------------------------
250:7    14 UBad  -  7.276 TB SATA HDD N   N  512B HGST HUS728T8TAL U  -    
----------------------------------------------------------------------------

A reboot fixes the wrong state, but it is not ideal for us :(

According to the MR Functional Spec, importing a foreign drive happens "at boot". But I used the restart command ('storcli64 /c0 restart'), which does an online controller reset, and in my test the foreign VD was imported.
So I would say a controller reset (either online, or offline such as when rebooting) is required to import a foreign config automatically.
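
In storcli terms, a minimal sketch of that flow (the /fall foreign-config subcommands are standard storcli; whether this firmware honours them was not verified here):

sudo ./storcli64 /c0/fall show    # list any foreign configuration seen by the controller
sudo ./storcli64 /c0 restart      # online controller reset; foreign configs import during init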

When we do JBOD disk swaps on our Dell systems, we typically just need to run sudo megacli -pdmakejbod -physdrv [32:7] -a0 or equivalent; sometimes, if the drive got flagged as foreign for some reason, we had to clear that first, but again that could be done with just sudo megacli -CfgForeign -Clear -a0.

Understood. Please keep in mind that the PERC controllers (as they come from Dell) have custom firmware and custom operations on top of the Broadcom controller operations.
We do not modify the custom firmware, so for our instances the Broadcom CLI subset is the same subset for us.

@Neobeta61 so correct me if I am wrong: you are saying that the restart of the controller, and not the reboot of the server, imported the foreign VD?

As tested in our lab, yes.

@Neobeta61 thank you. @elukey is it possible for us to pull another disk so we can follow @Neobeta61's testing process to import the foreign VD?

Thanks.

@Papaul I am currently waiting, since the host is being used by Jesse for another test. I quickly tried the solution outlined by @Neobeta61, but the version of storcli that I found on Broadcom's website doesn't support restart, only reset, which doesn't seem to work.

The version that I used is 007.3205.0000.0000 Oct 09, 2024. @Neobeta61, I only found that version under another controller's page (https://www.broadcom.com/products/storage/raid-controllers/megaraid-9560-16i); I didn't find anything for the 39XX yet. If you have a download link to use, could you share it? I am probably missing some website to check :)

Thanks!

@elukey when you get to doing that test, can you make sure the /srv/swift-storage partitions are all mounted before you do so, please? [mount -a should do it if not] I'm interested to see what (if any) impact doing a restart on the controller has to mounted filesystems :)

Did a quick test. The OS drive was not connected to the AOC-S39xx. I was able to run the command "storcli /c0 restart" as Administrator (root) on Red Hat 9.5.

TS and I believe that at this point it's a driver command subset issue:

image.png (screenshot of the driver versions in the tested environment; 607 KB)

I was able to run restart (the command is not visible in the help, but available) and the output was:

elukey@ms-be2088:~$ sudo ./storcli64 /c0 restart
CLI Version = 007.3205.0000.0000 Oct 09, 2024
Operating system = Linux 5.10.0-34-amd64
Controller = 0
Status = Success
Description = Controller Restart in progress, wait for few moments.

Detailed Status :
===============

---------------------------
Ctrl Property        Value 
---------------------------
   0 Adapter Restart     0 
---------------------------


elukey@ms-be2088:~$ sudo ./storcli64 /c0 show | grep UBad
250:5    32 UBad  -  7.276 TB SATA HDD N   N  512B HGST HUS728T8TAL     U  -    
250:7    14 UBad  -  7.276 TB SATA HDD N   N  512B HGST HUS728T8TAL     U  -    
250:11    3 UBad  -  7.276 TB SATA HDD N   N  512B HGST HUS728T8TAL     U  -    
UBad=Unconfigured Bad|Sntze=Sanitize|Onln=Online|Offln=Offline|Intf=Interface
UBUnsp=UBad Unsupported|Rbld=Rebuild

So it didn't restore the JBOD state to the disks that were previously pulled off and re-inserted (to simulate the hot swap).

@Neobeta61 what do you mean by "it's a driver command subset issue"? Something on our side, related to the OS?

I would recommend updating to the driver specs in my screenshot. I do not see the same issue on the same kernel with the same hardware.

It's not the same kernel, though - you've got 5.14.0-503.11.1.el9_5 from RHEL, and we have 5.10.234-1 from Debian...
@elukey I dunno if it's worth installing linux-image-6.1-amd64 on this system as a test? That would give us a newer kernel...

@Neobeta61 could you be clearer as to which drivers you think should be updated to which version(s), please? It looks like we have the same storcli version as you, and surely the question of what state the controller thinks the drives are in has very little to do with the overlying OS?

@MatthewVernon I cannot provide guidance on your kernel stack; I am not aware enough of your environment. But I will tell you that the environment I leveraged does work as expected.
Sorry I cannot provide more info. If you can send details, maybe over email, I can see about adjusting things when I have a moment.

> It's not the same kernel, though - you've got 5.14.0-503.11.1.el9_5 from RHEL, and we have 5.10.234-1 from Debian...
> @elukey I dunno if it's worth installing linux-image-6.1-amd64 on this system as a test? That would give us a newer kernel...

At this point I think it is worth a test. I just installed linux-image-6.1-amd64 on ms-be2088 and rebooted. I'll ask Jenn for another round of pull/push of a disk to see if the storcli restart works.
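
(For reference, the install step as a sketch; this assumes the 6.1 kernel comes from bullseye-backports, which is not stated above:)

sudo apt install -t bullseye-backports linux-image-6.1-amd64
sudo reboot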

All right something different happened, but I am not sure if it was the kernel or not.

I rebooted the host with the new 6.1 kernel, and I noticed that storcli showed the same three disks/slots as UBad (Unconfigured, Bad). I tried to restart the controller via storcli, but nothing changed.

Then I ran megacli like:

megacli -pdmakegood -PhysDrv[250:5] -a0
megacli -pdmakegood -PhysDrv[250:7] -a0
megacli -pdmakegood -PhysDrv[250:11] -a0

And I checked that storcli showed UGood as the new state. Then I restarted the controller and re-checked storcli's show output: all disks showed up as JBOD.

I am very confident that the new kernel didn't play a role, since the only relevant thing that changed was me setting the drives from UBad to UGood via megacli. So now I want to reboot the node to the previous kernel, and then re-do the experiment to confirm that the restart works.
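
For the record, the sequence that appeared to work, as a sketch (slot numbers are the ones from this test):

sudo megacli -pdmakegood -PhysDrv[250:5] -a0   # UBad -> UGood (repeat for s7 and s11)
sudo ./storcli64 /c0 restart                   # online controller reset
sudo ./storcli64 /c0/e250/s5 show              # should now report the drive as Onln / JBOD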

@Jhancock.wm Hi! I apologize in advance for requesting the same thing again, but could you do another pull/push of a random disk on ms-be2088 when you are in the DC?

@elukey another disk has been pulled! (all good. I have the easy part)

I think that we have something working!

Starting point:

PD LIST :
=======

--------------------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model                Sp Type 
--------------------------------------------------------------------------------
250:0    29 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:1    35 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:2    33 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:3    30 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:4    34 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:5    32 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:6    16 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:7    14 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:8    12 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:9     8 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:10    5 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 

250:11    3 UBad  -  7.276 TB SATA HDD N   N  512B HGST HUS728T8TAL     U  -                 <==========================
         
251:0    27 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:1    28 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:2    25 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:3    31 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:4    24 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:5    26 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:6     0 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:7     4 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:8    23 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:9     2 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:10   18 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:11    1 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
--------------------------------------------------------------------------------

Set good via storcli:

elukey@ms-be2088:~$ sudo ./storcli64 /c0/e250/s11 set good
CLI Version = 007.3205.0000.0000 Oct 09, 2024
Operating system = Linux 5.10.0-34-amd64
Controller = 0
Status = Success
Description = Set Drive Good Succeeded.

Status afterwards:

PD LIST :
=======

--------------------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model                Sp Type 
--------------------------------------------------------------------------------
250:0    29 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:1    35 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:2    33 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:3    30 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:4    34 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:5    32 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:6    16 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:7    14 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:8    12 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:9     8 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:10    5 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 

250:11    3 UGood -  7.276 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  -             <==========================

251:0    27 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:1    28 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:2    25 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:3    31 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:4    24 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:5    26 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:6     0 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:7     4 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:8    23 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:9     2 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:10   18 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:11    1 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD

Controller restart:

elukey@ms-be2088:~$ sudo ./storcli64 /c0 restart
CLI Version = 007.3205.0000.0000 Oct 09, 2024
Operating system = Linux 5.10.0-34-amd64
Controller = 0
Status = Success
Description = Controller Restart in progress, wait for few moments.

Detailed Status :
===============

---------------------------
Ctrl Property        Value 
---------------------------
   0 Adapter Restart     0 
---------------------------

Final status:

PD LIST :
=======

--------------------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model                Sp Type 
--------------------------------------------------------------------------------
250:0    29 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:1    35 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:2    33 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:3    30 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:4    34 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:5    32 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:6    16 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:7    14 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:8    12 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:9     8 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
250:10    5 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD

250:11    3 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD     <==========================

251:0    27 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:1    28 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:2    25 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:3    31 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:4    24 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:5    26 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:6     0 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:7     4 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:8    23 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:9     2 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:10   18 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD 
251:11    1 Onln  -  7.277 TB SATA HDD N   N  512B HGST HUS728T8TALE6L4 U  JBOD

@Neobeta61 I think that we now have something working, thanks a lot!

@MatthewVernon I tried to mount -a all the partitions before running the experiment, as requested, and even before starting, an error was raised (as expected):

mount: /srv/swift-storage/objects1: mount(2) system call failed: Structure needs cleaning

From dmesg it seems that /dev/sda1 is not good, and that xfs_repair needs to be re-run. I tried, but it finds neither the primary nor the secondary XFS superblock; I am not sure if the partition was overwritten or similar while we were doing these tests. The rest of the partitions look good though.
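
For reference, the repair attempt looked roughly like this (a sketch; xfs_repair scans for backup superblocks on its own when the primary is unreadable):

sudo umount /srv/swift-storage/objects1   # if it is still (partially) mounted
sudo xfs_repair /dev/sda1                 # fails here: neither primary nor secondary superblock found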

Please check ms-be2088 and let me know; we now have to decide if we want to keep this controller or not :)

OK, so the disk pulled was sdl:

Mar 10 16:45:30 ms-be2088 kernel: [267287.723999] megaraid_sas 0000:98:00.0: scanning for scsi0...
Mar 10 16:45:30 ms-be2088 kernel: [267287.724308] sd 0:2:11:0: SCSI device is removed
Mar 10 16:45:30 ms-be2088 kernel: [267287.728697] sd 0:2:11:0: [sdl] Synchronizing SCSI cache
Mar 10 16:45:30 ms-be2088 kernel: [267287.729386] sd 0:2:11:0: [sdl] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Mar 10 16:45:30 ms-be2088 kernel: [267287.738951] XFS (sdl1): Unmounting Filesystem
Mar 10 16:45:30 ms-be2088 kernel: [267287.739160] XFS (sdl1): log I/O error -5
Mar 10 16:45:30 ms-be2088 kernel: [267287.743691] XFS (sdl1): xfs_do_force_shutdown(0x2) called from line 1211 of file fs/xfs/xfs_log.c. Return address = 000000009fa2df08
Mar 10 16:45:30 ms-be2088 kernel: [267287.743692] XFS (sdl1): Log I/O Error Detected. Shutting down filesystem
Mar 10 16:45:30 ms-be2088 kernel: [267287.743993] XFS (sdl1): Unable to update superblock counters. Freespace may not be correct on next mount.
Mar 10 16:45:30 ms-be2088 kernel: [267287.751334] XFS (sdl1): Please unmount the filesystem and rectify the problem(s)
Mar 10 16:45:30 ms-be2088 kernel: [267287.769355] megaraid_sas 0000:98:00.0: 32371 (794940327s/0x0001/FATAL) - JBOD 0b for PD 03(e0xfa/s11) is now OFFLINE

So the bust filesystem (sda1) is unrelated (presumably damaged by some previous operation). You can see the reset operation in kern.log, after which the kernel re-finds sdl, which looks good.

Mar 11 09:38:34 ms-be2088 kernel: [328070.232093] megaraid_sas 0000:98:00.0: FW provided supportMaxExtLDs: 1    max_lds: 240
Mar 11 09:38:34 ms-be2088 kernel: [328070.232099] megaraid_sas 0000:98:00.0: controller type    : MR(8192MB)
Mar 11 09:38:34 ms-be2088 kernel: [328070.232100] megaraid_sas 0000:98:00.0: Online Controller Reset(OCR)       : Enabled
Mar 11 09:38:34 ms-be2088 kernel: [328070.232101] megaraid_sas 0000:98:00.0: Secure JBOD support        : Yes
Mar 11 09:38:34 ms-be2088 kernel: [328070.232102] megaraid_sas 0000:98:00.0: NVMe passthru support      : Yes
Mar 11 09:38:34 ms-be2088 kernel: [328070.232104] megaraid_sas 0000:98:00.0: FW provided TM TaskAbort/Reset timeout     : 6 secs/60 secs
Mar 11 09:38:34 ms-be2088 kernel: [328070.232105] megaraid_sas 0000:98:00.0: JBOD sequence map support  : Yes
Mar 11 09:38:34 ms-be2088 kernel: [328070.232106] megaraid_sas 0000:98:00.0: PCI Lane Margining support : Yes
Mar 11 09:38:34 ms-be2088 kernel: [328070.841424] megaraid_sas 0000:98:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
Mar 11 09:38:34 ms-be2088 kernel: [328070.841491] megaraid_sas 0000:98:00.0: FW in FAULT state Fault code:0x10000 subcode:0x0 func:megasas_wait_for_outstanding_fusion
Mar 11 09:38:34 ms-be2088 kernel: [328070.854614] megaraid_sas 0000:98:00.0: resetting fusion adapter scsi0.
Mar 11 09:38:34 ms-be2088 kernel: [328070.855709] megaraid_sas 0000:98:00.0: Outstanding fastpath IOs: 0
Mar 11 09:38:39 ms-be2088 kernel: [328075.239710] megaraid_sas 0000:98:00.0: waiting for controller reset to finish
Mar 11 09:38:39 ms-be2088 kernel: [328075.921334] megaraid_sas 0000:98:00.0: Waiting for FW to come to ready state
Mar 11 09:38:44 ms-be2088 kernel: [328080.345267] megaraid_sas 0000:98:00.0: waiting for controller reset to finish
Mar 11 09:38:49 ms-be2088 kernel: [328085.465176] megaraid_sas 0000:98:00.0: waiting for controller reset to finish
Mar 11 09:38:50 ms-be2088 kernel: [328086.729154] megaraid_sas 0000:98:00.0: FW now in Ready state
Mar 11 09:38:50 ms-be2088 kernel: [328086.729161] megaraid_sas 0000:98:00.0: FW now in Ready state
Mar 11 09:38:50 ms-be2088 kernel: [328086.730302] megaraid_sas 0000:98:00.0: Current firmware supports maximum commands: 5101    LDIO threshold: 0
Mar 11 09:38:50 ms-be2088 kernel: [328086.730307] megaraid_sas 0000:98:00.0: Performance mode :Balanced
Mar 11 09:38:50 ms-be2088 kernel: [328086.730309] megaraid_sas 0000:98:00.0: FW supports sync cache     : Yes
Mar 11 09:38:50 ms-be2088 kernel: [328086.730318] megaraid_sas 0000:98:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
Mar 11 09:38:50 ms-be2088 kernel: [328087.093145] megaraid_sas 0000:98:00.0: FW supports atomic descriptor      : Yes
Mar 11 09:38:51 ms-be2088 kernel: [328087.989131] megaraid_sas 0000:98:00.0: FW provided supportMaxExtLDs: 1    max_lds: 240
Mar 11 09:38:51 ms-be2088 kernel: [328087.989136] megaraid_sas 0000:98:00.0: controller type    : MR(8192MB)
Mar 11 09:38:51 ms-be2088 kernel: [328087.989139] megaraid_sas 0000:98:00.0: Online Controller Reset(OCR)       : Enabled
Mar 11 09:38:51 ms-be2088 kernel: [328087.989141] megaraid_sas 0000:98:00.0: Secure JBOD support        : Yes
Mar 11 09:38:51 ms-be2088 kernel: [328087.989144] megaraid_sas 0000:98:00.0: NVMe passthru support      : Yes
Mar 11 09:38:51 ms-be2088 kernel: [328087.989147] megaraid_sas 0000:98:00.0: FW provided TM TaskAbort/Reset timeout     : 6 secs/60 secs
Mar 11 09:38:51 ms-be2088 kernel: [328087.989149] megaraid_sas 0000:98:00.0: JBOD sequence map support  : Yes
Mar 11 09:38:51 ms-be2088 kernel: [328087.989151] megaraid_sas 0000:98:00.0: PCI Lane Margining support : Yes
Mar 11 09:38:51 ms-be2088 kernel: [328088.045182] megaraid_sas 0000:98:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
Mar 11 09:38:51 ms-be2088 kernel: [328088.048022] megaraid_sas 0000:98:00.0: Adapter is OPERATIONAL for scsi:0
Mar 11 09:38:51 ms-be2088 kernel: [328088.048108] megaraid_sas 0000:98:00.0: Snap dump wait time        : 15
Mar 11 09:38:51 ms-be2088 kernel: [328088.048110] megaraid_sas 0000:98:00.0: Reset successful for scsi0.
Mar 11 09:38:51 ms-be2088 kernel: [328088.061760] scsi 0:2:11:0: Direct-Access     ATA      HGST HUS728T8TAL W9U0 PQ: 0 ANSI: 6
Mar 11 09:38:51 ms-be2088 kernel: [328088.064261] sd 0:2:11:0: Attached scsi generic sg13 type 0
Mar 11 09:38:51 ms-be2088 kernel: [328088.064820] sd 0:2:11:0: [sdl] 15628053168 512-byte logical blocks: (8.00 TB/7.28 TiB)
Mar 11 09:38:51 ms-be2088 kernel: [328088.064826] sd 0:2:11:0: [sdl] 4096-byte physical blocks
Mar 11 09:38:51 ms-be2088 kernel: [328088.070620] sd 0:2:11:0: [sdl] Write Protect is off
Mar 11 09:38:51 ms-be2088 kernel: [328088.070625] sd 0:2:11:0: [sdl] Mode Sense: 6b 00 10 08
Mar 11 09:38:51 ms-be2088 kernel: [328088.071127] sd 0:2:11:0: [sdl] Write cache: enabled, read cache: enabled, supports DPO and FUA
Mar 11 09:38:51 ms-be2088 kernel: [328088.072336] sdl: detected capacity change from 0 to 8001563222016
Mar 11 09:38:51 ms-be2088 kernel: [328088.074102] megaraid_sas 0000:98:00.0: scanning for scsi0...
Mar 11 09:38:51 ms-be2088 kernel: [328088.106852] sdl: detected capacity change from 0 to 8001563222016
Mar 11 09:38:51 ms-be2088 kernel: [328088.156914]  sdl: sdl1
Mar 11 09:38:52 ms-be2088 kernel: [328088.193919] sdl: detected capacity change from 0 to 8001563222016
Mar 11 09:38:52 ms-be2088 kernel: [328088.193929] sd 0:2:11:0: [sdl] Attached SCSI disk
Mar 11 09:42:37 ms-be2088 kernel: [328313.705144] XFS (sdl1): Mounting V5 Filesystem
Mar 11 09:42:37 ms-be2088 kernel: [328313.872511] XFS (sdl1): Ending clean mount
Mar 11 09:42:37 ms-be2088 kernel: [328313.881009] xfs filesystem being mounted at /srv/swift-storage/objects11 supports timestamps until 2038-01-19 (0x7fffffff)

I guess one thing to do might be to run some I/O workload during a reset and check if there's an interruption, but this is looking workable now.

> I guess one thing to do might be to run some I/O workload during a reset and check if there's an interruption, but this is looking workable now.

@MatthewVernon Do you have time to make the tests that you think are valuable and report back a more definitive answer? I'd need something along the lines of "yep, fine to keep the controller" or "no, I'd prefer to test/work on another one", etc. In the meantime I am going to work with Moritz to figure out if the storcli package can be imported into our apt :)

Thanks for the assistance and conversations, guys.
Good luck with everything else!

Opened T388628 to verify if we can use/import storcli in our apt repo.

Mentioned in SAL (#wikimedia-operations) [2025-03-12T11:11:26Z] <Emperor> fio testing on ms-be2088 while resetting controller T384003

Mentioned in SAL (#wikimedia-operations) [2025-03-12T11:50:27Z] <Emperor> fio testing on ms-be2088 24 disks at once T384003

I/O definitely pauses during a controller reset (for ~20s). Going to try stressing the disks harder to see if the system can cope with this I/O pause under heavier load.

Mentioned in SAL (#wikimedia-operations) [2025-03-12T13:01:54Z] <Emperor> fio testing on ms-be2088 24 disks at once whilst resetting the controller T384003

The system is stable, but all I/O to the disks is paused for ~18s during a controller reset.

I tested with this (snappy!) command line, which is basically doing r/w tests on a file on each swift HDD:

sudo fio --filename=/srv/swift-storage/objects0/fiotestfile:/srv/swift-storage/objects1/fiotestfile:/srv/swift-storage/objects2/fiotestfile:/srv/swift-storage/objects3/fiotestfile:/srv/swift-storage/objects4/fiotestfile:/srv/swift-storage/objects5/fiotestfile:/srv/swift-storage/objects6/fiotestfile:/srv/swift-storage/objects7/fiotestfile:/srv/swift-storage/objects8/fiotestfile:/srv/swift-storage/objects9/fiotestfile:/srv/swift-storage/objects10/fiotestfile:/srv/swift-storage/objects11/fiotestfile:/srv/swift-storage/objects12/fiotestfile:/srv/swift-storage/objects13/fiotestfile:/srv/swift-storage/objects14/fiotestfile:/srv/swift-storage/objects15/fiotestfile:/srv/swift-storage/objects16/fiotestfile:/srv/swift-storage/objects17/fiotestfile:/srv/swift-storage/objects18/fiotestfile:/srv/swift-storage/objects19/fiotestfile:/srv/swift-storage/objects20/fiotestfile:/srv/swift-storage/objects21/fiotestfile:/srv/swift-storage/objects22/fiotestfile:/srv/swift-storage/objects23/fiotestfile --size=500GB --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=360 --numjobs=48 --time_based --group_reporting --name=throughput-test-job --eta-newline=1

If you run this, performance figures look like:

throughput-test-job: (groupid=0, jobs=48): err= 0: pid=1115711: Wed Mar 12 12:31:04 2025                                                                       
  read: IOPS=5714, BW=357MiB/s (375MB/s)(42.1GiB/120627msec)                                                                                                   
    slat (usec): min=3, max=1041.0k, avg=4028.13, stdev=26471.99                                                                                               
    clat (usec): min=337, max=2195.7k, avg=284324.13, stdev=174860.76                                                                                          
     lat (usec): min=354, max=2195.7k, avg=288352.66, stdev=179449.24                                                                                          
    clat percentiles (msec):                                                                                                                                   
     |  1.00th=[    6],  5.00th=[   25], 10.00th=[   79], 20.00th=[  144],                                                                                     
     | 30.00th=[  188], 40.00th=[  226], 50.00th=[  264], 60.00th=[  305],                                                                                     
     | 70.00th=[  351], 80.00th=[  405], 90.00th=[  502], 95.00th=[  592],                                                                                     
     | 99.00th=[  844], 99.50th=[  961], 99.90th=[ 1167], 99.95th=[ 1250],                                                                                     
     | 99.99th=[ 1469]                                                                                                                                         
   bw (  KiB/s): min=47741, max=1524354, per=100.00%, avg=367994.34, stdev=4041.43, samples=11437
   iops        : min=  705, max=23794, avg=5728.30, stdev=63.21, samples=11437
  write: IOPS=5723, BW=358MiB/s (375MB/s)(42.1GiB/120627msec); 0 zone resets
    slat (usec): min=4, max=972453, avg=4092.36, stdev=26747.07
    clat (usec): min=344, max=3595.5k, avg=244244.63, stdev=297470.72
     lat (usec): min=366, max=3884.6k, avg=248337.38, stdev=311402.65
    clat percentiles (usec):
     |  1.00th=[    478],  5.00th=[    947], 10.00th=[   2089],
     | 20.00th=[  72877], 30.00th=[ 122160], 40.00th=[ 158335],
     | 50.00th=[ 191890], 60.00th=[ 227541], 70.00th=[ 267387],
     | 80.00th=[ 320865], 90.00th=[ 429917], 95.00th=[ 633340],
     | 99.00th=[1803551], 99.50th=[2231370], 99.90th=[2768241],
     | 99.95th=[2936013], 99.99th=[3271558]
   bw (  KiB/s): min=35525, max=1553217, per=100.00%, avg=368824.88, stdev=4266.91, samples=11434
   iops        : min=  517, max=24247, avg=5741.14, stdev=66.74, samples=11434
  lat (usec)   : 500=0.73%, 750=1.23%, 1000=0.69%
  lat (msec)   : 2=2.25%, 4=1.60%, 10=1.50%, 20=1.36%, 50=2.86%
  lat (msec)   : 100=6.47%, 250=37.39%, 500=35.23%, 750=5.84%, 1000=1.41%
  lat (msec)   : 2000=1.08%, >=2000=0.37%
  cpu          : usr=0.23%, sys=0.67%, ctx=194384, majf=0, minf=15245
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.8%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% 
     issued rwts: total=689362,690352,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=357MiB/s (375MB/s), 357MiB/s-357MiB/s (375MB/s-375MB/s), io=42.1GiB (45.2GB), run=120627-120627msec
  WRITE: bw=358MiB/s (375MB/s), 358MiB/s-358MiB/s (375MB/s-375MB/s), io=42.1GiB (45.2GB), run=120627-120627msec

If you run this whilst, in another window, running the following (i.e. restarting the controller every 60s):

while sleep 60 ; do sudo ~elukey/storcli64 /c0 restart ; done

then in the running output you see an 18s freeze each time the reset happens, and the aggregate output looks like:

throughput-test-job: (groupid=0, jobs=48): err= 0: pid=1120666: Wed Mar 12 13:07:30 2025                                                                       
  read: IOPS=4743, BW=296MiB/s (311MB/s)(104GiB/360582msec)                                                                                                    
    slat (usec): min=3, max=19197k, avg=4942.93, stdev=161645.01                                                                                               
    clat (usec): min=322, max=19847k, avg=344537.21, stdev=1290637.04                                                                                          
     lat (usec): min=338, max=20160k, avg=349480.55, stdev=1301106.18                                                                                          
    clat percentiles (msec):                                                                                                                                   
     |  1.00th=[    6],  5.00th=[   21], 10.00th=[   55], 20.00th=[  117],                                                                                     
     | 30.00th=[  161], 40.00th=[  199], 50.00th=[  236], 60.00th=[  275],                                                                                     
     | 70.00th=[  317], 80.00th=[  376], 90.00th=[  468], 95.00th=[  567],                                                                                     
     | 99.00th=[  902], 99.50th=[ 1418], 99.90th=[17113], 99.95th=[17113],                                                                                     
     | 99.99th=[17113]                                                                                                                                         
   bw (  KiB/s): min=34963, max=1857746, per=100.00%, avg=406232.41, stdev=4431.98, samples=25686                                                              
   iops        : min=  503, max=29004, avg=6324.94, stdev=69.28, samples=25686                                                                                 
  write: IOPS=4743, BW=296MiB/s (311MB/s)(104GiB/360582msec); 0 zone resets                                                                                    
    slat (usec): min=4, max=19120k, avg=4763.73, stdev=148427.63                                                                                               
    clat (usec): min=25, max=21203k, avg=293194.13, stdev=1210024.27                                                                                           
     lat (usec): min=371, max=21311k, avg=297958.27, stdev=1222731.86                                                                                          
    clat percentiles (usec):                                                                                                                                   
     |  1.00th=[     482],  5.00th=[     799], 10.00th=[    1565],                                                                                             
     | 20.00th=[   45876], 30.00th=[   98042], 40.00th=[  135267],                                                                                             
     | 50.00th=[  168821], 60.00th=[  204473], 70.00th=[  242222],
     | 80.00th=[  295699], 90.00th=[  413139], 95.00th=[  633340],
     | 99.00th=[ 1904215], 99.50th=[ 2432697], 99.90th=[17112761],
     | 99.95th=[17112761], 99.99th=[17112761]
   bw (  KiB/s): min=30523, max=1867217, per=100.00%, avg=406499.69, stdev=4676.45, samples=25674
   iops        : min=  433, max=29151, avg=6328.71, stdev=73.11, samples=25674
  lat (usec)   : 50=0.01%, 500=0.71%, 750=1.64%, 1000=0.86%
  lat (msec)   : 2=2.64%, 4=1.65%, 10=1.67%, 20=1.76%, 50=4.04%
  lat (msec)   : 100=8.69%, 250=39.03%, 500=29.77%, 750=4.65%, 1000=1.14%
  lat (msec)   : 2000=1.06%, >=2000=0.68%
  cpu          : usr=0.21%, sys=0.59%, ctx=610846, majf=0, minf=29467
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% 
     issued rwts: total=1710411,1710477,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=296MiB/s (311MB/s), 296MiB/s-296MiB/s (311MB/s-311MB/s), io=104GiB (112GB), run=360582-360582msec
  WRITE: bw=296MiB/s (311MB/s), 296MiB/s-296MiB/s (311MB/s-311MB/s), io=104GiB (112GB), run=360582-360582msec

So you can see that the upper quantiles of I/O times are much, much slower (because all I/O is stopped for 18s every time the controller restarts).

@MatthewVernon thanks a lot for the detailed tests. Now I think we need to decide if this is acceptable behavior for the sporadic disk-replacement tasks, or if we want to test a different controller with pure passthrough (that hopefully doesn't need a restart like this, etc.). What is your preference? I already have a quote for the new controller; if you want to pursue this road we can order one for ms-be2088 and re-do the tests.

Mentioned in SAL (#wikimedia-operations) [2025-03-12T15:17:06Z] <Emperor> storcli64 /c0 restart on ms-be1090 T384003

Because I love playing with 🔥, I tried a controller reset on a live swift host (ms-be1090) to see how swift coped. As you'd expect, all I/O pauses for about 18s, but it does then seem to resume without issue, and I didn't see any bump in front-end errors.

So, I think we could live with this controller, although it's obviously not ideal.

@elukey I think given we've already got a quote for a controller that we think? know? will avoid this problem and let us do proper hot-swaps, it would be well worth actually testing that. If that does work, I think we'd then want to look at the cost of retro-fitting the existing nodes and/or updating Config J to include a controller that lets us actually hot-swap disks. Seem sensible?

wiki_willy mentioned this in Unknown Object (Task). · Mar 12 2025, 11:20 PM

@Papaul @Jhancock.wm is it worth performing another swap test like in T388684 to see if the controller does its job without a restart?

I unfortunately cannot find a spare 8 TB drive. So we'd either need to try it with a 4 TB or source a disk.

> I unfortunately cannot find a spare 8 TB drive. So we'd either need to try it with a 4 TB or source a disk.

I think it is fine to do it with a 4TB; the important bit is to see if the new disk is properly recognized, and hopefully it shouldn't matter how big it is. What do you think, @MatthewVernon?

Yeah, I doubt the size of disk is critical here (as long as we end up with an 8T disk back in when we're done testing :) )

We can try this today then; I have plenty of 4TB disks on hand that we can try.

RobH mentioned this in Unknown Object (Task). · Apr 4 2025, 5:30 PM
RobH mentioned this in Unknown Object (Task).

@Jhancock.wm sorry I was afk! Please do it anytime; the host is not serving prod traffic.

@elukey all good! Yesterday was rack-unpacking day and I did almost nothing else =#

I replaced a random drive with a 4TB one.

Thanks Jenn!

I see that the new disk is listed as "Good" this time (as opposed to "Bad"), but I think we'll still need the controller restart:

250:4    36 UGood F  3.638 TB SATA HDD N   N  512B TOSHIBA MG04ACA400NY U  -   

elukey@ms-be2088:~$ sudo ./storcli64 /c0/e250/s4 set jbod
CLI Version = 007.3205.0000.0000 Oct 09, 2024
Operating system = Linux 5.10.0-34-amd64
Controller = 0
Status = Failure
Description = Set Drive JBOD Failed.

Detailed Status :
===============

-------------------------------------------------------------------------
Drive       Status  ErrCd ErrMsg                                         
-------------------------------------------------------------------------
/c0/e250/s4 Failure    50 device state doesn't support requested command 
-------------------------------------------------------------------------

But then I tried the following:

elukey@ms-be2088:~$ sudo ./storcli64 /c0/e250/s4 show initialization
CLI Version = 007.3205.0000.0000 Oct 09, 2024
Operating system = Linux 5.10.0-34-amd64
Controller = 0
Status = Success
Description = Show Drive Initialization Status Succeeded.


------------------------------------------------------
Drive-ID    Progress% Status      Estimated Time Left 
------------------------------------------------------
/c0/e250/s4         0 In progress 0 Seconds           
------------------------------------------------------

elukey@ms-be2088:~$ sudo ./storcli64 /c0/e250/s4 show initialization
CLI Version = 007.3205.0000.0000 Oct 09, 2024
Operating system = Linux 5.10.0-34-amd64
Controller = 0
Status = Success
Description = Show Drive Initialization Status Succeeded.


------------------------------------------------------
Drive-ID    Progress% Status      Estimated Time Left 
------------------------------------------------------
/c0/e250/s4         0 In progress 0 Seconds           
------------------------------------------------------
[..]

At some point the controller reported 22 hours or so to complete, so I issued the stop initialization command, since I thought it was a dead end. Then I retried set jbod and...

elukey@ms-be2088:~$ sudo ./storcli64 /c0/e250/s4 stop initialization
CLI Version = 007.3205.0000.0000 Oct 09, 2024
Operating system = Linux 5.10.0-34-amd64
Controller = 0
Status = Success
Description = Stop Drive Initialization Succeeded.

elukey@ms-be2088:~$ sudo ./storcli64 /c0/e250/s4 set jbod
CLI Version = 007.3205.0000.0000 Oct 09, 2024
Operating system = Linux 5.10.0-34-amd64
Controller = 0
Status = Success
Description = Set Drive JBOD Succeeded.

And indeed:

 4 250:4    36 Onln  SATA HDD 3.638 TB 512B TOSHIBA MG04ACA400NY ATA      C0.0 & C0.1 

[Wed Apr  9 07:34:38 2025] scsi 0:2:4:0: Direct-Access     ATA      TOSHIBA MG04ACA4 FK5D PQ: 0 ANSI: 6
[Wed Apr  9 07:34:38 2025] sd 0:2:4:0: [sdd] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
[Wed Apr  9 07:34:39 2025] sd 0:2:4:0: [sdd] Write Protect is off
[Wed Apr  9 07:34:39 2025] sd 0:2:4:0: [sdd] Mode Sense: 6b 00 10 08
[Wed Apr  9 07:34:39 2025] sd 0:2:4:0: [sdd] Write cache: disabled, read cache: enabled, supports DPO and FUA
[Wed Apr  9 07:34:39 2025] sdd: detected capacity change from 0 to 4000787030016
[Wed Apr  9 07:34:39 2025] sd 0:2:4:0: Attached scsi generic sg6 type 0
[Wed Apr  9 07:34:39 2025] sdd: detected capacity change from 0 to 4000787030016
[Wed Apr  9 07:34:39 2025] sdd: detected capacity change from 0 to 4000787030016
[Wed Apr  9 07:34:39 2025] sd 0:2:4:0: [sdd] Attached SCSI disk

@Jhancock.wm could you please restore the old disk, so I can run the same test?

I can confirm that running start initialization and stopping it right afterwards makes set jbod work, without a controller restart. Why? I have no idea..
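
So the full flow for a freshly inserted replacement drive, as a sketch (slot from this test; no controller restart needed):

sudo ./storcli64 /c0/e250/s4 start initialization   # kick off a drive initialization...
sudo ./storcli64 /c0/e250/s4 stop initialization    # ...and abandon it right away
sudo ./storcli64 /c0/e250/s4 set jbod               # now succeeds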

MoritzMuehlenhoff closed subtask Restricted Task as Resolved. · Apr 29 2025, 1:05 PM

@elukey do we still need this ticket open for testing?

elukey claimed this task.