
Broken disk on analytics1056
Closed, Resolved (Public)

Description

One disk seems to have failed after the last reboot for rack A3 maintenance:

Enclosure Device ID: 32
Slot Number: 5
Drive's position: DiskGroup: 6, Span: 0, Arm: 0
Enclosure position: 1
Device Id: 5
WWN: 500003961b703be0
Sequence Number: 2
Media Error Count: 68         <===================================
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Non Coerced Size: 3.637 TB [0x1d1b0beb0 Sectors]
Coerced Size: 3.637 TB [0x1d1b00000 Sectors]
Sector Size:  512
Logical Sector Size:  512
Physical Sector Size:  512
Firmware state: Online, Spun Up
Device Firmware Level: FL1H
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b31234abc5
Connected Port Number: 0(path0)
Inquiry Data:            25M1K89EFTOSHIBA MG03ACA400                          FL1H
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :32C (89.60 F)
PI Eligibility:  No
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : N/A
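
The counter marked with the arrow above is the one that stands out in this report. As a minimal sketch (not the tooling used on this host), the snippet below parses `Key: Value` lines from such a drive report and lists the nonzero error counters. The `REPORT` excerpt is copied from the output above; the function names `parse_pd_report` and `error_counters` are hypothetical.

```python
# Excerpt of the physical-drive report quoted above (hard-coded for illustration).
REPORT = """\
Enclosure Device ID: 32
Slot Number: 5
Media Error Count: 68
Other Error Count: 0
Predictive Failure Count: 0
Firmware state: Online, Spun Up
"""


def parse_pd_report(text):
    """Return a dict of 'Key: Value' pairs, splitting on the first colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def error_counters(fields):
    """Return the '... Count' fields whose integer value is greater than zero."""
    return {
        key: int(value)
        for key, value in fields.items()
        if key.endswith("Count") and value.isdigit() and int(value) > 0
    }


fields = parse_pd_report(REPORT)
print(error_counters(fields))  # {'Media Error Count': 68}
```

Note that the firmware state is still "Online, Spun Up" and the predictive failure count is 0, so in this case the media error counter is the only field in the report that points at the problem.
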
[   66.778672] megaraid_sas 0000:03:00.0: 10088 (601057004s/0x0001/FATAL) - VD bad block table on VD 06/6 is full; unable to log block 9a1 (on PD 05(e0x20/s5) at 9a1)
[   71.906483] megaraid_sas 0000:03:00.0: 10091 (601057010s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 05(e0x20/s5) at 9a1
[   71.920728] megaraid_sas 0000:03:00.0: 10092 (601057010s/0x0001/FATAL) - VD bad block table on VD 06/6 is full; unable to log block 9a1 (on PD 05(e0x20/s5) at 9a1)
[   77.130387] megaraid_sas 0000:03:00.0: 10095 (601057015s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 05(e0x20/s5) at 9a1
[   77.144628] megaraid_sas 0000:03:00.0: 10096 (601057015s/0x0001/FATAL) - VD bad block table on VD 06/6 is full; unable to log block 9a1 (on PD 05(e0x20/s5) at 9a1)
[   82.362999] megaraid_sas 0000:03:00.0: 10099 (601057020s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 05(e0x20/s5) at 9a1
[   82.377253] megaraid_sas 0000:03:00.0: 10100 (601057020s/0x0001/FATAL) - VD bad block table on VD 06/6 is full; unable to log block 9a1 (on PD 05(e0x20/s5) at 9a1)
[   87.595489] megaraid_sas 0000:03:00.0: 10103 (601057025s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 05(e0x20/s5) at 9a1
[   87.609757] megaraid_sas 0000:03:00.0: 10104 (601057025s/0x0001/FATAL) - VD bad block table on VD 06/6 is full; unable to log block 9a1 (on PD 05(e0x20/s5) at 9a1)
[   92.869591] megaraid_sas 0000:03:00.0: 10107 (601057031s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 05(e0x20/s5) at 9a1
[   92.883824] megaraid_sas 0000:03:00.0: 10108 (601057031s/0x0001/FATAL) - VD bad block table on VD 06/6 is full; unable to log block 9a1 (on PD 05(e0x20/s5) at 9a1)
[   98.077223] megaraid_sas 0000:03:00.0: 10111 (601057036s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 05(e0x20/s5) at 9a1
[   98.083585] sd 0:2:6:0: [sdg] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   98.083589] sd 0:2:6:0: [sdg] tag#0 Sense Key : Medium Error [current]
[   98.083592] sd 0:2:6:0: [sdg] tag#0 Add. Sense: No additional sense information
[   98.083596] sd 0:2:6:0: [sdg] tag#0 CDB: Read(16) 88 00 00 00 00 00 00 00 09 a0 00 00 00 08 00 00
[   98.083599] blk_update_request: I/O error, dev sdg, sector 2464
[   98.083601] Buffer I/O error on dev sdg1, logical block 52, async page read
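
The kernel and controller messages above describe the same location in different notations. As a rough sketch, the snippet below decodes the `Read(16)` CDB from the log and checks that block `9a1` reported by `megaraid_sas` falls inside the failed read. The CDB field offsets follow the standard READ(16) layout; the partition start of 2048 sectors and the 4096-byte page size are assumptions that are not stated in the log, and `decode_read16` is a hypothetical helper name.

```python
# CDB bytes copied from the "CDB: Read(16)" kernel log line above.
CDB_HEX = "88 00 00 00 00 00 00 00 09 a0 00 00 00 08 00 00"


def decode_read16(cdb_hex):
    """Decode a SCSI READ(16) CDB into (start LBA, number of blocks)."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: logical block address
    length = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length
    return lba, length


lba, length = decode_read16(CDB_HEX)
bad_block = 0x9A1  # block reported by megaraid_sas ("at 9a1")

print(lba, length)                      # 2464 8
print(lba <= bad_block < lba + length)  # True

# ASSUMPTIONS (not in the log): sdg1 starts at sector 2048, pages are 4096 bytes.
PART_START = 2048
SECTOR_SIZE = 512   # matches "Sector Size: 512" in the drive report
PAGE_SIZE = 4096
print((lba - PART_START) * SECTOR_SIZE // PAGE_SIZE)  # 52
```

Under those assumptions the decoded start sector (2464) matches the `blk_update_request` line, and the computed page index matches "logical block 52" on `sdg1`.
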

Event Timeline

elukey triaged this task as Medium priority. (Jan 17 2019, 4:22 PM)
elukey created this task.

Change 485051 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove /var/lib/hadoop/g from analytics1056's Hadoop conf

https://gerrit.wikimedia.org/r/485051

Change 485051 merged by Elukey:
[operations/puppet@production] Remove /var/lib/hadoop/g from analytics1056's Hadoop conf

https://gerrit.wikimedia.org/r/485051

The disk has been replaced, but the controller shows it as Unconfigured(good).

Change 486434 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove hiera host overrides for analytics1056 after disk swap

https://gerrit.wikimedia.org/r/486434

Change 486434 merged by Elukey:
[operations/puppet@production] Remove hiera host overrides for analytics1056 after disk swap

https://gerrit.wikimedia.org/r/486434

Mentioned in SAL (#wikimedia-operations) [2019-01-25T07:51:40Z] <elukey> restart yarn/hdfs daemons on analytics1056 to pick up new disk settings - T214057