Testing Out Hard Drive on Swift Server
Closed, Resolved | Public | Request

Description

This task is to test the Seagate IronWolf 8TB NAS Internal Hard Drive HDD – 3.5 Inch SATA 6Gb/s 7200 RPM 256MB Cache, ordered from Amazon via T328083, to see whether it works properly on one of the swift servers. Please ping @MatthewVernon to coordinate which server he wants to test it on. If everything works well, please let @RobH know so that additional spares can be ordered to keep onsite at both eqiad and codfw.

Thanks,
Willy

Related Objects

Status | Subtype | Assigned | Task
Resolved | Request | Jclark-ctr |

Event Timeline

wiki_willy renamed this task from hw troubleshooting: <type of hardware failure> for <fqdn of server> to Testing Out Hard Drive on Swift Server. Feb 9 2023, 5:06 PM
wiki_willy created this task.
wiki_willy mentioned this in Unknown Object (Task). Feb 9 2023, 5:08 PM
wiki_willy added a parent task: Unknown Object (Task).

@MatthewVernon Can you advise when, and on which server, you would like to run the test?

Hi, sorry I've been on leave, and now we're approaching the switchover. Can we do it after that, say Thursday 9th March, at whatever is the earliest comfortable time of day for you?

@MatthewVernon Will you be available for the swap tomorrow?

Yes, please. I've unmounted a drive in ms-be1066 and turned on the locator light:
sudo megacli -PDLocate -PhysDrv [32:15] -a0

So please go ahead.

Mentioned in SAL (#wikimedia-operations) [2023-03-09T14:08:36Z] <Emperor> testing disk-swap in ms-be1066 T329305

Something has gone a bit awry, the kernel reports problems with two other drives instead:

Mar  9 14:13:57 ms-be1066 kernel: [11683056.185701] sd 0:2:4:0: [sdf] tag#699 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=4s
Mar  9 14:14:00 ms-be1066 kernel: [11683059.173114] sd 0:2:25:0: [sdz] tag#897 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s

Looking at these drives:

sdz is bus info: scsi@0:2.25.0
Target Id: 25 is Enclosure Device ID: 32 Slot Number: 23
sdf is still absent but scsi@0:2.17.0 is missing
Target Id: 17 missing, as is Slot Number: 2

The drive I wanted swapping:

sdc is bus info: scsi@0:2.0.0
Target Id: 0 is still Enclosure Device ID: 32 Slot Number: 15

I don't know how obvious the slot numbers are in the hardware; is it plausible that Slot Number 2 was the one you removed (and that Slot Number 23 got jogged in the process)?
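
For anyone repeating this kind of check, the manual step above (reading the SCSI address out of each failed-I/O kernel line and comparing it with the "bus info" string) can be sketched in Python. This is only an illustrative sketch, not tooling used in this task: the names FAILED_RE and parse_failed_io are hypothetical, the regular expression assumes kernel lines shaped like the two quoted above, and the bus-info string is assumed to follow the scsi@host:channel.target.lun notation seen in this comment. It does not query the RAID controller, so mapping a target id to an enclosure slot still needs the controller's own output.

```python
import re

# Assumption: failed-I/O kernel lines look like the two quoted in this task, e.g.
#   "... kernel: [..] sd 0:2:25:0: [sdz] tag#897 FAILED Result: hostbyte=DID_NO_CONNECT ..."
FAILED_RE = re.compile(
    r"sd (?P<host>\d+):(?P<channel>\d+):(?P<target>\d+):(?P<lun>\d+): "
    r"\[(?P<dev>sd[a-z]+)\] .*FAILED Result: hostbyte=(?P<hostbyte>\w+)"
)


def parse_failed_io(line):
    """Return details of a failed-I/O kernel line, or None if the line does not match."""
    m = FAILED_RE.search(line)
    if m is None:
        return None
    return {
        "dev": m.group("dev"),            # block device name, e.g. "sdz"
        "target": int(m.group("target")),  # SCSI target id from host:channel:target:lun
        "hostbyte": m.group("hostbyte"),   # e.g. "DID_NO_CONNECT"
        # Same address, rewritten in the scsi@host:channel.target.lun notation
        "bus_info": "scsi@{host}:{channel}.{target}.{lun}".format(**m.groupdict()),
    }


if __name__ == "__main__":
    sample = (
        "Mar  9 14:14:00 ms-be1066 kernel: [11683059.173114] sd 0:2:25:0: [sdz] "
        "tag#897 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s"
    )
    info = parse_failed_io(sample)
    print(info["dev"], info["bus_info"], info["hostbyte"])
    # prints: sdz scsi@0:2.25.0 DID_NO_CONNECT
```

Feeding a whole log through parse_failed_io and collecting the distinct "dev" values gives the list of affected drives to compare against the controller's target-id and slot listing.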

Slot 2 is right by the handle, so possibly.

Replaced the drive in slot 15 with the test drive.

Can you check that the drives in slots 23 and 2 are seated properly, please? The kernel still can't see them.

[After a reboot, the drive in slot 2 was in a "Foreign" state; clearing that made it possible to reintroduce it with sudo megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0, and the filesystem recovered OK.]

The swapped-in drive seems OK initially; I'll get swift to start using it shortly.

Change 896124 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: bring ms-be1066 sdr1 back into service

https://gerrit.wikimedia.org/r/896124

Change 896124 merged by MVernon:

[operations/puppet@production] swift: bring ms-be1066 sdr1 back into service

https://gerrit.wikimedia.org/r/896124

Hi @MatthewVernon & @Jclark-ctr - if this sample drive looks good, let me know and we'll work on ordering a bunch more to keep them onsite as spares.

Thanks,
Willy

No complaints from me, thanks; the drive is now 25% loaded and behaving fine.

Thanks for confirming @MatthewVernon. ++@RobH to order spares for both eqiad and codfw

RobH mentioned this in Unknown Object (Task). Mar 14 2023, 12:12 PM
RobH mentioned this in Unknown Object (Task).