
ms-be2012.codfw.wmnet: slot=12 dev=sdm failed
Closed, Resolved · Public

Description

slot=12 dev=sdm has been reported failed, please replace.

/var/log/kern.log

May 27 09:31:18 ms-be2012 kernel: [6038057.279151] XFS (sdm3): xfs_log_force: error 5 returned.
May 27 09:31:48 ms-be2012 kernel: [6038087.354624] XFS (sdm3): xfs_log_force: error 5 returned.
May 27 09:32:19 ms-be2012 kernel: [6038117.434086] XFS (sdm3): xfs_log_force: error 5 returned.
May 27 09:32:49 ms-be2012 kernel: [6038147.513599] XFS (sdm3): xfs_log_force: error 5 returned.
May 27 09:33:19 ms-be2012 kernel: [6038177.593056] XFS (sdm3): xfs_log_force: error 5 returned.
May 27 09:33:49 ms-be2012 kernel: [6038207.672507] XFS (sdm3): xfs_log_force: error 5 returned.
May 27 09:34:19 ms-be2012 kernel: [6038237.751934] XFS (sdm3): xfs_log_force: error 5 returned.
May 27 09:34:49 ms-be2012 kernel: [6038267.831400] XFS (sdm3): xfs_log_force: error 5 returned.
May 27 09:35:19 ms-be2012 kernel: [6038297.910876] XFS (sdm3): xfs_log_force: error 5 returned.
May 27 09:35:49 ms-be2012 kernel: [6038327.994331] XFS (sdm3): xfs_log_force: error 5 returned.
May 27 09:36:19 ms-be2012 kernel: [6038358.069800] XFS (sdm3): xfs_log_force: error 5 returned.
May 27 09:36:49 ms-be2012 kernel: [6038388.149282] XFS (sdm3): xfs_log_force: error 5 returned.
May 27 09:37:19 ms-be2012 kernel: [6038418.228730] XFS (sdm3): xfs_log_force: error 5 returned.
May 27 09:37:49 ms-be2012 kernel: [6038448.308216] XFS (sdm3): xfs_log_force: error 5 returned.
May 27 09:38:20 ms-be2012 kernel: [6038478.387675] XFS (sdm3): xfs_log_force: error 5 returned.
May 27 09:38:50 ms-be2012 kernel: [6038508.467108] XFS (sdm3): xfs_log_force: error 5 returned.
May 27 09:39:20 ms-be2012 kernel: [6038538.546561] XFS (sdm3): xfs_log_force: error 5 returned.
May 27 09:39:50 ms-be2012 kernel: [6038568.629731] XFS (sdm3): xfs_log_force: error 5 returned.
May 27 09:40:20 ms-be2012 kernel: [6038598.705486] XFS (sdm3): xfs_log_force: error 5 returned.
May 27 09:40:50 ms-be2012 kernel: [6038628.784981] XFS (sdm3): xfs_log_force: error 5 returned.
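
For readers unfamiliar with the message: the "error 5" in these lines is errno 5, which is EIO (I/O error) on Linux. A minimal sketch of pulling the device and error name out of such a line follows; the regex and the function name `xfs_error_name` are illustrative only and are not part of any tooling used on this host.

```python
import errno
import re

# One line copied from the kern.log excerpt above.
LINE = ("May 27 09:31:18 ms-be2012 kernel: [6038057.279151] "
        "XFS (sdm3): xfs_log_force: error 5 returned.")

def xfs_error_name(line):
    """Return (device, errno number, errno name) for an xfs_log_force line.

    Hypothetical helper: the pattern only matches the message shape seen
    in this task and returns None for anything else.
    """
    m = re.search(r"XFS \((\w+)\): xfs_log_force: error (\d+) returned", line)
    if m is None:
        return None
    dev, code = m.group(1), int(m.group(2))
    return dev, code, errno.errorcode.get(code, "UNKNOWN")

print(xfs_error_name(LINE))
```

The repeated EIO is consistent with the filesystem on sdm3 having been shut down after the underlying device stopped responding.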

smartctl

megacli

Enclosure Device ID: 32
Slot Number: 12
Enclosure position: 1
Device Id: 12
WWN: 5001517387e43576
Sequence Number: 7
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 149.049 GB [0x12a19eb0 Sectors]
Non Coerced Size: 148.549 GB [0x12919eb0 Sectors]
Coerced Size: 148.5 GB [0x12900000 Sectors]
Sector Size:  0
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: 0362
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b36789abea
Connected Port Number: 0(path0) 
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: Foreign 
Foreign Secure: Drive is not secured by a foreign lock key
Device Speed: 3.0Gb/s 
Link Speed: 3.0Gb/s 
Media Type: Solid State Device
Drive:  Not Certified
Drive Temperature : N/A
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : N/A
Port-0 :
Port status: Active
Port's Linkspeed: 3.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No




Exit Code: 0x00
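
As a sanity check on the disk model, the hex sector counts above can be converted to bytes. This is a rough sketch that assumes 512-byte sectors (megacli reports "Sector Size: 0" here, so the sector size is an assumption, not something read from the output); the helper name is made up for illustration.

```python
SECTOR_BYTES = 512  # assumption: not reported by megacli for this drive

def sectors_to_bytes(hex_sectors, sector_bytes=SECTOR_BYTES):
    """Convert a hex sector count as printed by megacli into bytes."""
    return int(hex_sectors, 16) * sector_bytes

raw = sectors_to_bytes("0x12a19eb0")      # "Raw Size" line
coerced = sectors_to_bytes("0x12900000")  # "Coerced Size" line

# Decimal gigabytes, the unit used on drive labels.
print(raw, round(raw / 10**9, 2))
# Binary gigabytes (GiB), the unit megacli appears to print as "GB".
print(coerced / 2**30)
```

Under that assumption the raw size comes out at about 160 GB in decimal units, i.e. a 160GB-class drive, and the coerced size at 148.5 in binary units, matching the "Coerced Size" line.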

Event Timeline

fgiunchedi updated the task description.
fgiunchedi added a project: ops-codfw.
fgiunchedi subscribed.
Restricted Application added subscribers: Zppix, Southparkfan, Aklapper.

More errors from the time of the failure (May 26):

May 26 15:32:00 ms-be2012 kernel: [5973299.941810] sd 0:2:12:0: [sdm]  
May 26 15:32:00 ms-be2012 kernel: [5973299.941834] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May 26 15:32:00 ms-be2012 kernel: [5973299.941838] sd 0:2:12:0: [sdm] CDB: 
May 26 15:32:00 ms-be2012 kernel: [5973299.941840] Write(10): 2a 00 06 8b a0 b0 00 00 10 00
May 26 15:32:00 ms-be2012 kernel: [5973299.941852] blk_update_request: 1219 callbacks suppressed
May 26 15:32:00 ms-be2012 kernel: [5973299.941856] end_request: I/O error, dev sdm, sector 109813936
May 26 15:32:00 ms-be2012 kernel: [5973299.948509] sd 0:2:12:0: [sdm]  
May 26 15:32:00 ms-be2012 kernel: [5973299.948517] md/raid1:md0: Disk failure on sdm1, disabling device.
May 26 15:32:00 ms-be2012 kernel: [5973299.948517] md/raid1:md0: Operation continuing on 1 devices.
May 26 15:32:00 ms-be2012 kernel: [5973299.962049] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May 26 15:32:00 ms-be2012 kernel: [5973299.962061] sd 0:2:12:0: [sdm] CDB: 
May 26 15:32:00 ms-be2012 kernel: [5973299.962065] Write(10): 2a 00 06 8b 9d 10 00 00 18 00
May 26 15:32:00 ms-be2012 kernel: [5973299.962087] end_request: I/O error, dev sdm, sector 109813008
May 26 15:32:00 ms-be2012 kernel: [5973299.968943] sd 0:2:12:0: [sdm]  
May 26 15:32:00 ms-be2012 kernel: [5973299.968946] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May 26 15:32:00 ms-be2012 kernel: [5973299.968949] sd 0:2:12:0: [sdm] CDB: 
May 26 15:32:00 ms-be2012 kernel: [5973299.968950] Write(10): 2a 00 03 7d 01 30 00 00 68 00
May 26 15:32:00 ms-be2012 kernel: [5973299.968957] end_request: I/O error, dev sdm, sector 58523952
May 26 15:32:02 ms-be2012 kernel: [5973301.921397] RAID1 conf printout:
May 26 15:32:02 ms-be2012 kernel: [5973301.921404]  --- wd:1 rd:2
May 26 15:32:02 ms-be2012 kernel: [5973301.921407]  disk 0, wo:1, o:0, dev:sdm1
May 26 15:32:02 ms-be2012 kernel: [5973301.921410]  disk 1, wo:0, o:1, dev:sdn1
May 26 15:32:02 ms-be2012 kernel: [5973301.929520] sd 0:2:12:0: [sdm]  
May 26 15:32:02 ms-be2012 kernel: [5973301.929523] sd 0:2:12:0: [sdm]  
May 26 15:32:02 ms-be2012 kernel: [5973301.929527] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May 26 15:32:02 ms-be2012 kernel: [5973301.929530] sd 0:2:12:0: [sdm] CDB: 
May 26 15:32:02 ms-be2012 kernel: [5973301.929535] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May 26 15:32:02 ms-be2012 kernel: [5973301.929538] sd 0:2:12:0: [sdm] CDB: 
May 26 15:32:02 ms-be2012 kernel: [5973301.929531] Read(10):
May 26 15:32:02 ms-be2012 kernel: [5973301.929541] Read(10): 28 00 07 64 a5 50 00 00 08 00
May 26 15:32:02 ms-be2012 kernel: [5973301.929557] end_request: I/O error, dev sdm, sector 124036432
May 26 15:32:02 ms-be2012 kernel: [5973301.936177]  28 00 0a 28 8e 80 00 00 08 00
May 26 15:32:02 ms-be2012 kernel: [5973301.936186] end_request: I/O error, dev sdm, sector 170430080
May 26 15:32:02 ms-be2012 kernel: [5973301.941202] sd 0:2:12:0: [sdm]  
May 26 15:32:02 ms-be2012 kernel: [5973301.941204] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May 26 15:32:02 ms-be2012 kernel: [5973301.941205] sd 0:2:12:0: [sdm] CDB: 
May 26 15:32:02 ms-be2012 kernel: [5973301.941209] Read(10): 28 00 07 64 a5 50 00 00 08 00
May 26 15:32:02 ms-be2012 kernel: [5973301.941211] end_request: I/O error, dev sdm, sector 124036432
May 26 15:32:02 ms-be2012 kernel: [5973301.944222] scanning ...
May 26 15:32:02 ms-be2012 kernel: [5973301.958239] RAID1 conf printout:
May 26 15:32:02 ms-be2012 kernel: [5973301.958244]  --- wd:1 rd:2
May 26 15:32:02 ms-be2012 kernel: [5973301.958248]  disk 1, wo:0, o:1, dev:sdn1
May 26 15:32:03 ms-be2012 kernel: [5973302.923148] sd 0:2:12:0: [sdm]  
May 26 15:32:03 ms-be2012 kernel: [5973302.923154] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May 26 15:32:03 ms-be2012 kernel: [5973302.923157] sd 0:2:12:0: [sdm] CDB: 
May 26 15:32:03 ms-be2012 kernel: [5973302.923158] Read(10): 28 00 0a 56 e3 18 00 00 18 00
May 26 15:32:03 ms-be2012 kernel: [5973302.923167] end_request: I/O error, dev sdm, sector 173466392
May 26 15:32:03 ms-be2012 kernel: [5973302.929835] sd 0:2:12:0: [sdm]  
May 26 15:32:03 ms-be2012 kernel: [5973302.929837] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May 26 15:32:03 ms-be2012 kernel: [5973302.929839] sd 0:2:12:0: [sdm] CDB: 
May 26 15:32:03 ms-be2012 kernel: [5973302.929840] Read(10): 28 00 0a 28 8e 80 00 00 08 00
May 26 15:32:03 ms-be2012 kernel: [5973302.929846] end_request: I/O error, dev sdm, sector 170430080
May 26 15:32:03 ms-be2012 kernel: [5973302.936673] sd 0:2:12:0: [sdm]  
May 26 15:32:03 ms-be2012 kernel: [5973302.936674] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May 26 15:32:03 ms-be2012 kernel: [5973302.936676] sd 0:2:12:0: [sdm] CDB: 
May 26 15:32:03 ms-be2012 kernel: [5973302.936677] Read(10): 28 00 0a 56 e3 18 00 00 08 00
May 26 15:32:03 ms-be2012 kernel: [5973302.936683] end_request: I/O error, dev sdm, sector 173466392
May 26 15:32:03 ms-be2012 kernel: [5973302.966607] quiet_error: 1242 callbacks suppressed
May 26 15:32:03 ms-be2012 kernel: [5973302.966614] Buffer I/O error on device sdm3, logical block 12030320
May 26 15:32:03 ms-be2012 kernel: [5973302.973853] lost page write due to I/O error on sdm3
May 26 15:32:03 ms-be2012 kernel: [5973302.973864] Buffer I/O error on device sdm3, logical block 6019719
May 26 15:32:03 ms-be2012 kernel: [5973302.980995] lost page write due to I/O error on sdm3
May 26 15:32:03 ms-be2012 kernel: [5973302.981450] XFS (sdm3): metadata I/O error: block 0x5bc2b64 ("xlog_iodone") error 5 numblks 64
May 26 15:32:03 ms-be2012 kernel: [5973302.991304] XFS (sdm3): xfs_do_force_shutdown(0x2) called from line 1170 of file /build/linux-03BQvT/linux-3.13.0/fs/xfs/xfs_log.c.  Return address = 0xffffffffa03cb911
May 26 15:32:03 ms-be2012 kernel: [5973302.991366] XFS (sdm3): Log I/O Error Detected.  Shutting down filesystem
May 26 15:32:03 ms-be2012 kernel: [5973302.999158] XFS (sdm3): Please umount the filesystem and rectify the problem(s)
May 26 15:32:03 ms-be2012 kernel: [5973303.007589] XFS (sdm3): xfs_log_force: error 5 returned.
May 26 15:32:04 ms-be2012 kernel: [5973304.194652] sd 0:2:12:0: [sdm] Synchronizing SCSI cache
May 26 15:32:04 ms-be2012 kernel: [5973304.199944] md/raid1:md1: Disk failure on sdm2, disabling device.
May 26 15:32:04 ms-be2012 kernel: [5973304.199944] md/raid1:md1: Operation continuing on 1 devices.
May 26 15:32:04 ms-be2012 kernel: [5973304.200176] md: unbind<sdm1>
May 26 15:32:04 ms-be2012 kernel: [5973304.213635] RAID1 conf printout:
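
The CDB bytes in the log above can be cross-checked against the sector numbers in the following `end_request` lines. The sketch below decodes a 10-byte Read(10)/Write(10) CDB using the standard SCSI layout (opcode in byte 0, big-endian LBA in bytes 2-5, transfer length in bytes 7-8); the function name `decode_rw10` is illustrative only.

```python
def decode_rw10(cdb_hex):
    """Decode a SCSI Read(10)/Write(10) CDB given as space-separated hex.

    Returns (operation, starting LBA, number of blocks).
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    opcodes = {0x28: "Read(10)", 0x2A: "Write(10)"}
    op = opcodes.get(cdb[0])
    if op is None:
        raise ValueError("not a Read(10)/Write(10) opcode")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return op, lba, blocks

# CDBs copied from the log excerpt above.
print(decode_rw10("2a 00 06 8b a0 b0 00 00 10 00"))
print(decode_rw10("28 00 07 64 a5 50 00 00 08 00"))
```

The decoded LBAs (109813936 and 124036432) match the sectors reported in the corresponding `end_request: I/O error` lines, so the failing requests are spread across the device rather than confined to one region.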
fgiunchedi added a subscriber: Papaul.

@Papaul also note that this is an SSD, not a spinning disk as is usual with Swift failures.

Papaul triaged this task as Medium priority. May 27 2016, 2:48 PM
Papaul mentioned this in Unknown Object (Task).
Papaul set Security to None.

We have 16 300GB Intel 320 series SSDs on the shelf. Since the older 320 series is NOT used in any new systems and is kept only for spare replacements, I'd suggest we simply swap out the defective 160GB for a 300GB.

Since the replacement disk is larger and is also an SSD (high speed), there is no actual detriment to the upgrade, other than the cost of the 300GB disk, which is already paid for and sitting on the shelf as a spare for systems like this.

I'd also suggest we not bother adding to the spare count for the 300GB 320 series.

I just synced up with @mark about this via IRC. He is aware that I've advised we use the 300GB spare rather than order a new 160GB for replacement.

Also chatted with @Papaul via IRC; he is aware of this update (to use the 300GB SSD).

The defective SSD cannot be wiped, and degaussing won't erase an SSD's storage medium. Please label the disk as defective with permanent marker. We'll need to collect such disks and then pay for destruction. Alternatively, if @Papaul knows anyone with a drill press, we can buy a nice drill bit for it and have him destroy the disks by putting multiple holes through them.

Disk replacement complete

Mentioned in SAL [2016-06-13T15:05:37Z] <godog> reboot ms-be2012 to fix disk ordering T136395

replaced sdm, raid rebuilt