sda failure in hydrogen.wikimedia.org
Closed, Resolved, Public

Description

While reimaging hydrogen.w.o with Debian stretch, the kernel logged the following read error on sda:

[ 1436.845879] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[ 1436.852329] ata1.00: BMDMA stat 0x24
[ 1436.855915] ata1.00: failed command: READ DMA EXT
[ 1436.860663] ata1.00: cmd 25/00:00:00:09:a8/00:04:17:00:00/e0 tag 0 dma 524288 in
                        res 51/40:e8:18:09:a8/40:03:17:00:00/e0 Emask 0x9 (media error)
[ 1436.875802] ata1.00: status: { DRDY ERR }
[ 1436.879811] ata1.00: error: { UNC }
[ 1436.962250] ata1.00: configured for UDMA/133
[ 1436.962419] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 1436.962421] sd 0:0:0:0: [sda] tag#0 Sense Key : Medium Error [current]
[ 1436.962423] sd 0:0:0:0: [sda] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
[ 1436.962426] sd 0:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 17 a8 09 00 00 04 00 00
[ 1436.962428] blk_update_request: I/O error, dev sda, sector 396888344
[ 1436.968833] ata1: EH complete
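
As a cross-check on the log above, the sketch below decodes the READ(10) CDB and the `res` taskfile line to confirm that the reported failing sector falls inside the requested read. The function names are hypothetical, and the field order assumed for the `res` line (status/error:nsect:lbal:lbam:lbah/hob fields/device) is my reading of the libata message layout, not something stated in this task.

```python
def decode_read10(cdb_hex: str) -> tuple[int, int]:
    """Return (start_lba, sector_count) from a SCSI READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    start_lba = int.from_bytes(cdb[2:6], "big")   # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")       # bytes 7-8: transfer length
    return start_lba, count


def decode_res_lba48(res: str) -> int:
    """Return the 48-bit LBA from a libata 'res' string (assumed field order)."""
    _status, low, high, _device = res.split("/")
    # low:  error, nsect, lbal, lbam, lbah
    # high: hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah
    _err, _nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    _hf, _hn, hob_lbal, hob_lbam, hob_lbah = (int(x, 16) for x in high.split(":"))
    return (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
            | lbah << 16 | lbam << 8 | lbal)


start, count = decode_read10("28 00 17 a8 09 00 00 04 00 00")
failed = decode_res_lba48("51/40:e8:18:09:a8/40:03:17:00:00/e0")
print(start, count, failed, failed - start)
# → 396888320 1024 396888344 24
```

Under these assumptions the read covers 1024 sectors (524288 bytes, matching the `dma 524288` in the log) starting at LBA 396888320, and the unreadable sector 396888344 sits 24 sectors into that range.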

Event Timeline

Vgutierrez triaged this task as Medium priority. Apr 16 2018, 1:20 PM
Vgutierrez created this task.

SMART info about sda:

root@hydrogen:~# smartctl -a /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F3 RE
Device Model:     SAMSUNG HE502HJ
Serial Number:    S2B6J90ZC12922
LU WWN Device Id: 5 0024e9 20426a30e
Add. Product Id:  DELL(tm)
Firmware Version: 1AJ30001
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Mon Apr 16 13:18:33 2018 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                ( 4800) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  80) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       770
  2 Throughput_Performance  0x0026   056   056   000    Old_age   Always       -       4390
  3 Spin_Up_Time            0x0023   083   083   025    Pre-fail  Always       -       5311
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       20
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       35510
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       22
 13 Read_Soft_Error_Rate    0x003a   100   100   000    Old_age   Always       -       0
191 G-Sense_Error_Rate      0x0022   252   252   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       22
194 Temperature_Celsius     0x0002   064   064   000    Old_age   Always       -       27 (Min/Max 16/32)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       16
198 Offline_Uncorrectable   0x0030   252   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       511
240 Head_Flying_Hours       0x0032   100   100   000    Old_age   Always       -       35510
241 Total_LBAs_Written      0x0032   097   094   000    Old_age   Always       -       4249882
242 Total_LBAs_Read         0x0032   098   095   000    Old_age   Always       -       3905359
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       1

SMART Error Log Version: 1
ATA Error Count: 10 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 10 occurred at disk power-on lifetime: 35509 hours (1479 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 4f c1 08 80 e6  Error: UNC 79 sectors at LBA = 0x068008c1 = 109054145

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 08 08 00 e0 08      11:55:49.666  READ DMA
  25 00 08 08 90 1a e0 08      11:55:49.666  READ DMA EXT
  27 00 00 00 00 00 e0 08      11:55:49.666  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 08      11:55:49.666  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08      11:55:49.666  SET FEATURES [Set transfer mode]

Error 9 occurred at disk power-on lifetime: 35509 hours (1479 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 4f c1 08 80 e6  Error: UNC 79 sectors at LBA = 0x068008c1 = 109054145

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 08 08 00 e0 08      11:55:49.666  READ DMA
  ea 00 00 08 08 00 e0 08      11:55:49.663  FLUSH CACHE EXT
  ca 00 01 08 08 00 e0 08      11:55:49.663  WRITE DMA
  ea 00 00 12 08 00 e0 08      11:55:49.663  FLUSH CACHE EXT
  ea 00 00 12 08 00 e0 08      11:55:49.663  FLUSH CACHE EXT

Error 8 occurred at disk power-on lifetime: 23580 hours (982 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 30 f1 9c e4  Error: UNC 8 sectors at LBA = 0x049cf130 = 77394224

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 30 f1 9c e4 08      00:00:04.285  READ DMA
  c8 00 08 28 f1 9c e4 08      00:00:04.285  READ DMA
  ca 00 08 28 f1 9c e4 08      00:00:04.285  WRITE DMA
  27 00 00 00 00 00 e0 08      00:00:04.285  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 08      00:00:04.285  IDENTIFY DEVICE

Error 7 occurred at disk power-on lifetime: 23580 hours (982 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 07 29 f1 9c e4  Error: UNC 7 sectors at LBA = 0x049cf129 = 77394217

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 28 f1 9c e4 08      00:00:04.284  READ DMA
  ca 00 10 e0 39 04 e0 08      00:00:04.284  WRITE DMA
  ca 00 08 98 3a 04 e0 08      00:00:04.284  WRITE DMA
  c8 00 08 20 f1 9c e4 08      00:00:04.284  READ DMA
  35 00 08 b0 ee 71 e0 08      00:00:04.284  WRITE DMA EXT

Error 6 occurred at disk power-on lifetime: 23580 hours (982 days + 12 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 30 71 9b e4  Error: UNC 8 sectors at LBA = 0x049b7130 = 77295920

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 30 71 9b e4 08      00:00:04.280  READ DMA
  c8 00 08 28 71 9b e4 08      00:00:04.280  READ DMA
  ca 00 08 28 71 9b e4 08      00:00:04.280  WRITE DMA
  27 00 00 00 00 00 e0 08      00:00:04.280  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 08      00:00:04.280  IDENTIFY DEVICE

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         2         -
# 2  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Completed [00% left] (0-65535)
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
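
The output above reports an overall health result of PASSED even though Current_Pending_Sector has a raw value of 16, so it can help to look at individual attributes rather than the summary line. The following is a minimal sketch of such a check; the function name and the choice of which attributes to watch are my own assumptions, and it only parses the plain-text attribute table in the column layout shown here.

```python
# Attributes whose nonzero raw value commonly points at media problems
# (this selection is an assumption, not a smartmontools rule).
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def nonzero_watched(smartctl_output: str) -> dict[str, int]:
    """Return watched SMART attributes whose RAW_VALUE is greater than zero."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW...
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            raw = int(fields[9])
            if raw > 0:
                found[fields[1]] = raw
    return found


SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       16
198 Offline_Uncorrectable   0x0030   252   100   000    Old_age   Offline      -       0
"""

print(nonzero_watched(SAMPLE))
# → {'Current_Pending_Sector': 16}
```

Run against the three rows copied from this disk, the check flags only the 16 pending sectors, which is consistent with the unrecovered read errors in the kernel log.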

This server's warranty expired in 2014, so it should be replaced rather than repaired. @faidon, please comment.

Yup, a replacement is underway as part of T189317 :)