
Degraded RAID on cp5010
Closed, ResolvedPublic

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host cp5010. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_md
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sda1[0]
      9756672 blocks super 1.2 [2/1] [U_]
      
unused devices: <none>
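In the mdstat output above, `[2/1]` means the array is configured for two members but only one is active, and `[U_]` marks the second slot as missing. A minimal sketch of how such a degradation check might work (this is an assumption for illustration, not the actual `get_raid_status_md` plugin; `check_md_degraded` is a hypothetical helper that accepts a file path so it can be tested against sample output):

```shell
# Hypothetical check, not the actual get_raid_status_md plugin: flag any
# md array whose member bitmap (e.g. "[U_]") contains "_" for a missing disk.
check_md_degraded() {
  local src="${1:-/proc/mdstat}"   # file argument defaults to /proc/mdstat
  if grep -Eq '\[[0-9]+/[0-9]+\] \[[U_]*_[U_]*\]' "$src"; then
    echo "CRITICAL: degraded md array detected"
    return 2
  fi
  echo "OK: all md arrays have all members"
}
```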

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jan 21 2019, 1:42 AM

kern.log is reporting multiple failures on /dev/sdb3 as well:

Jan 21 06:41:47 cp5010 kernel: [7490330.204759] EXT4-fs error (device sdb3) in ext4_reserve_inode_write:5448: IO failure
Jan 21 06:41:48 cp5010 kernel: [7490331.199323] EXT4-fs error (device sdb3) in ext4_reserve_inode_write:5448: IO failure
Jan 21 06:41:49 cp5010 kernel: [7490332.212809] EXT4-fs error (device sdb3) in ext4_reserve_inode_write:5448: IO failure
vgutierrez@cp5010:~$ grep "IO failure" /var/log/kern.log |wc -l
15916
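Beyond counting the ext4 errors, the distinct sectors that failed can be pulled out of the same log. A small sketch (an assumption for illustration; `failed_sectors` is a hypothetical helper, not a tool used in this task):

```shell
# Hypothetical helper: list the distinct failing sectors that
# blk_update_request reported for a given device in a kern.log file.
failed_sectors() {
  local dev="$1" log="${2:-/var/log/kern.log}"
  grep "I/O error, dev ${dev}," "$log" \
    | sed -n 's/.*sector \([0-9][0-9]*\).*/\1/p' \
    | sort -n | uniq
}
```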

Initial failure at 01:39:

vgutierrez@cp5010:~$ grep sdb /var/log/kern.log |grep -v "__ext4_get_inode_loc" |grep -v "IO failure"
Jan 21 01:39:17 cp5010 kernel: [7472180.491194] blk_update_request: I/O error, dev sdb, sector 2056
Jan 21 01:39:17 cp5010 kernel: [7472180.498009] md/raid1:md0: Disk failure on sdb1, disabling device.
Jan 21 01:39:17 cp5010 kernel: [7472180.517585] blk_update_request: I/O error, dev sdb, sector 19557568
Jan 21 01:39:17 cp5010 kernel: [7472180.524786] Buffer I/O error on dev sdb3, logical block 2968, lost async page write
Jan 21 01:39:17 cp5010 kernel: [7472180.539578] blk_update_request: I/O error, dev sdb, sector 1376643888
Jan 21 01:39:17 cp5010 kernel: [7472180.546977] EXT4-fs warning (device sdb3): ext4_end_bio:314: I/O error -5 writing to inode 12 (offset 0 size 0 starting block 172080489)
Jan 21 01:39:17 cp5010 kernel: [7472180.546980] Buffer I/O error on device sdb3, logical block 169638758
Jan 21 01:39:17 cp5010 kernel: [7472180.554269] Buffer I/O error on device sdb3, logical block 169638759
Jan 21 01:39:17 cp5010 kernel: [7472180.561567] Buffer I/O error on device sdb3, logical block 169638760
Jan 21 01:39:17 cp5010 kernel: [7472180.574918] blk_update_request: I/O error, dev sdb, sector 1377903848
Jan 21 01:39:17 cp5010 kernel: [7472180.574938] blk_update_request: I/O error, dev sdb, sector 1385932840
Jan 21 01:39:17 cp5010 kernel: [7472180.574941] EXT4-fs warning (device sdb3): ext4_end_bio:314: I/O error -5 writing to inode 12 (offset 0 size 0 starting block 173241606)
Jan 21 01:39:17 cp5010 kernel: [7472180.574943] Buffer I/O error on device sdb3, logical block 170799877
Jan 21 01:39:17 cp5010 kernel: [7472180.574948] blk_update_request: I/O error, dev sdb, sector 1387288288
Jan 21 01:39:17 cp5010 kernel: [7472180.574949] EXT4-fs warning (device sdb3): ext4_end_bio:314: I/O error -5 writing to inode 12 (offset 0 size 0 starting block 173411037)
Jan 21 01:39:17 cp5010 kernel: [7472180.574950] Buffer I/O error on device sdb3, logical block 170969308
Jan 21 01:39:17 cp5010 kernel: [7472180.574953] blk_update_request: I/O error, dev sdb, sector 1387358600
Jan 21 01:39:17 cp5010 kernel: [7472180.574954] EXT4-fs warning (device sdb3): ext4_end_bio:314: I/O error -5 writing to inode 12 (offset 0 size 0 starting block 173419826)
Jan 21 01:39:17 cp5010 kernel: [7472180.574955] Buffer I/O error on device sdb3, logical block 170978097
Jan 21 01:39:17 cp5010 kernel: [7472180.575001] blk_update_request: I/O error, dev sdb, sector 1387450928
Jan 21 01:39:17 cp5010 kernel: [7472180.575004] EXT4-fs warning (device sdb3): ext4_end_bio:314: I/O error -5 writing to inode 12 (offset 0 size 0 starting block 173431367)
Jan 21 01:39:17 cp5010 kernel: [7472180.575005] Buffer I/O error on device sdb3, logical block 170989638
Jan 21 01:39:17 cp5010 kernel: [7472180.575009] blk_update_request: I/O error, dev sdb, sector 1387705752
Jan 21 01:39:17 cp5010 kernel: [7472180.575011] EXT4-fs warning (device sdb3): ext4_end_bio:314: I/O error -5 writing to inode 12 (offset 0 size 0 starting block 173463220)
Jan 21 01:39:17 cp5010 kernel: [7472180.575012] Buffer I/O error on device sdb3, logical block 171021491
Jan 21 01:39:17 cp5010 kernel: [7472180.575014] blk_update_request: I/O error, dev sdb, sector 1387727912
Jan 21 01:39:17 cp5010 kernel: [7472180.575016] EXT4-fs warning (device sdb3): ext4_end_bio:314: I/O error -5 writing to inode 12 (offset 0 size 0 starting block 173465992)
Jan 21 01:39:17 cp5010 kernel: [7472180.575017] Buffer I/O error on device sdb3, logical block 171024261
Jan 21 01:39:17 cp5010 kernel: [7472180.575018] Buffer I/O error on device sdb3, logical block 171024262
Jan 21 01:39:17 cp5010 kernel: [7472180.575071] EXT4-fs warning (device sdb3): ext4_end_bio:314: I/O error -5 writing to inode 12 (offset 0 size 0 starting block 173563570)
Jan 21 01:39:17 cp5010 kernel: [7472180.575078] EXT4-fs warning (device sdb3): ext4_end_bio:314: I/O error -5 writing to inode 12 (offset 0 size 0 starting block 173588828)
Jan 21 01:39:17 cp5010 kernel: [7472180.575083] EXT4-fs warning (device sdb3): ext4_end_bio:314: I/O error -5 writing to inode 12 (offset 0 size 0 starting block 173646893)
Jan 21 01:39:17 cp5010 kernel: [7472180.581918] Buffer I/O error on dev sdb3, logical block 2968, lost async page write
Jan 21 01:39:20 cp5010 kernel: [7472183.994814] Buffer I/O error on dev sdb3, logical block 190472252, lost async page write
Jan 21 01:39:21 cp5010 kernel: [7472184.092088] EXT4-fs (sdb3): Delayed block allocation failed for inode 12 at logical offset 12485094 with max blocks 739 with error 5
Jan 21 01:39:21 cp5010 kernel: [7472184.105848] EXT4-fs (sdb3): This should not happen!! Data will be lost
Jan 21 01:39:21 cp5010 kernel: [7472184.116381] Buffer I/O error on dev sdb3, logical block 2968, lost async page write
Jan 21 01:39:21 cp5010 kernel: [7472184.125134] Buffer I/O error on dev sdb3, logical block 190472252, lost async page write
Jan 21 01:39:21 cp5010 kernel: [7472184.134545] EXT4-fs (sdb3): Delayed block allocation failed for inode 12 at logical offset 12485833 with max blocks 9 with error 5
Jan 21 01:39:21 cp5010 kernel: [7472184.138930]  disk 1, wo:1, o:0, dev:sdb1
Jan 21 01:39:21 cp5010 kernel: [7472184.148144] EXT4-fs (sdb3): This should not happen!! Data will be lost
Jan 21 01:39:21 cp5010 kernel: [7472184.163052] sd 1:0:0:0: [sdb] Stopping disk
Jan 21 01:39:21 cp5010 kernel: [7472184.163065] sd 1:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK

I've depooled the node until the disk can be replaced.

Vgutierrez moved this task from Triage to Hardware on the Traffic board. · Jan 21 2019, 7:00 AM
jijiki triaged this task as Medium priority. · Feb 7 2019, 12:03 PM

Since @ayounsi is going to the eqsin datacenter later this month, maybe we could join efforts and replace sdb.
^^ @RobH

RobH added a comment. · Feb 7 2019, 8:33 PM

Ok, I opened a support request with Dell to ship a replacement SSD to eqsin:

Confirmed: Request 986142470 was successfully submitted.

RobH added a comment. · Feb 7 2019, 9:05 PM

Oh, just adding the output from troubleshooting on the system. The system should show TWO SSDs but only sees one now:

robh@cp5010:~$ cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sda1[0]
      9756672 blocks super 1.2 [2/1] [U_]
      
unused devices: <none>
robh@cp5010:~$ mdadm --detail /dev/md0
-bash: mdadm: command not found
robh@cp5010:~$ sudo mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Wed Aug 15 06:58:30 2018
     Raid Level : raid1
     Array Size : 9756672 (9.30 GiB 9.99 GB)
  Used Dev Size : 9756672 (9.30 GiB 9.99 GB)
   Raid Devices : 2
  Total Devices : 1
    Persistence : Superblock is persistent

    Update Time : Thu Feb  7 20:28:43 2019
          State : clean, degraded 
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           Name : cp5010:0  (local to host cp5010)
           UUID : 8bc57fb1:5c18ef46:f4dccfe5:60b6f17c
         Events : 1197351

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       -       0        0        1      removed
robh@cp5010:~$ sudo ipmi-sel
ID  | Date        | Time     | Name             | Type                     | Event
1   | Oct-31-2017 | 13:33:49 | SEL              | Event Logging Disabled   | Log Area Reset/Cleared
2   | Dec-05-2017 | 16:56:51 | PS Redundancy    | Power Supply             | Fully Redundant
3   | Dec-06-2017 | 16:09:10 | Status           | Power Supply             | Power Supply input lost (AC/DC)
4   | Dec-06-2017 | 16:09:15 | Status           | Power Supply             | Power Supply input lost (AC/DC)
5   | Oct-12-2018 | 14:10:57 | PS Redundancy    | Power Supply             | Redundancy Lost
6   | Oct-12-2018 | 14:10:57 | Status           | Power Supply             | Power Supply input lost (AC/DC)
7   | Oct-12-2018 | 20:38:17 | Status           | Power Supply             | Power Supply input lost (AC/DC)
8   | Oct-12-2018 | 20:38:22 | PS Redundancy    | Power Supply             | Fully Redundant
9   | Oct-14-2018 | 14:13:02 | Status           | Power Supply             | Power Supply input lost (AC/DC)
10  | Oct-14-2018 | 14:13:03 | PS Redundancy    | Power Supply             | Redundancy Lost
11  | Oct-14-2018 | 19:05:22 | Status           | Power Supply             | Power Supply input lost (AC/DC)
12  | Oct-14-2018 | 19:05:28 | PS Redundancy    | Power Supply             | Fully Redundant
robh@cp5010:~$ sudo lshw -class disk
  *-disk                    
       description: ATA Disk
       product: INTEL SSDSC2BA80
       physical id: 0.0.0
       bus info: scsi@0:0.0.0
       logical name: /dev/sda
       version: DL2D
       serial: BTHV731106FS800OGN
       size: 745GiB (800GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=59be4a65
robh@cp5010:~$ exit

That's right, the kernel shut down sdb due to the errors; that's why it's not even listed in lshw.

Here is the log line:

Jan 21 01:39:21 cp5010 kernel: [7472184.163052] sd 1:0:0:0: [sdb] Stopping disk
RobH assigned this task to ayounsi. · Feb 11 2019, 9:42 PM

Ok, a support case was opened with Dell and a replacement SSD has been dispatched. Details below:

  • Dell case SFDC 21867874
  • DPS Tracking for SSD: 91913423457
  • EQ SG3 inbound shipment ticket: 1-185493092237

Please swap out the failed sdb and replace it with the new disk; the old disk should also include a return tag to ship the defective SSD back to Dell.
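After the physical swap, the new disk still has to be brought back into the md array. A sketch of the usual software-side steps for a two-disk md RAID1 like md0 (device names `/dev/sda`/`/dev/sdb` are assumed from the lshw output above; the exact name of the replacement disk should be verified first, and note the later SAL entries show a reboot was needed for the device name):

```shell
# Copy the partition table from the surviving disk to the replacement
# (sda shows a DOS partition table per lshw, so sfdisk applies).
sudo sfdisk -d /dev/sda | sudo sfdisk /dev/sdb

# Add the new partition back into the degraded array; md starts resyncing.
sudo mdadm --manage /dev/md0 --add /dev/sdb1

# Watch the rebuild progress until [U_] becomes [UU].
cat /proc/mdstat
```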

RobH added a comment. · Feb 19 2019, 10:51 PM

Ok, there have been multiple rounds of back and forth on this via email with both Dell SG and DHL. We've advised DHL of the SG3 inbound shipment ticket to refer to when attempting to deliver this package. They then replied with some BS about needing a local number, so I gave them Arzhel's GVoice/cell so they can contact him directly. I advised them that SG3 shipping/receiving accepts deliveries 24/7 and gave them all the info!

Mentioned in SAL (#wikimedia-operations) [2019-02-21T03:21:09Z] <XioNoX> replace cp5010 disk 1 - T214274

Mentioned in SAL (#wikimedia-operations) [2019-02-21T05:34:21Z] <bblack> rebooting cp5010 for device name on swapped disk (depooled) - T214274

Mentioned in SAL (#wikimedia-operations) [2019-02-21T05:41:58Z] <bblack> removing cp5010 downtimes from icinga - T214274

Mentioned in SAL (#wikimedia-operations) [2019-02-21T05:46:44Z] <bblack> repooling cp5010 - T214274

BBlack closed this task as Resolved. · Feb 21 2019, 5:48 AM
BBlack added a subscriber: BBlack.

Seems to be working fine after replacement!

Return shipment ticket 1-185737841426 opened with Equinix; DHL should pick up the defective disk in the next few days.