
Degraded RAID on elastic2048
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host elastic2048. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sda1[0](F) sdb1[1]
      29279232 blocks super 1.2 [2/1] [_U]
      
md1 : active raid0 sda2[0] sdb2[1]
      3066771456 blocks super 1.2 512k chunks
      
unused devices: <none>
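For reference (not part of the original alert): in the mdstat output above, the (F) flag marks sda1 as faulty, and [2/1] [_U] means only one of md0's two raid1 members is still up. A minimal sketch of how the degraded state can be cross-checked on the host (commands are generic md tooling, not necessarily what was run here):

# Count of missing members (0 = healthy, >0 = degraded)
cat /sys/block/md0/md/degraded

# State summary and failed-device count only
sudo mdadm --detail /dev/md0 | grep -E 'State :|Failed Devices'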

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2019-04-03T20:40:39Z] <gehel> excluding elastic2048 from cluster and depooling - T220038

Node is depooled and excluded from the cluster. @Papaul if you have a spare, feel free to do what needs doing. Ping me when done and I'll reimage.
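For context, the actual depool/exclude step is done with Wikimedia tooling; a rough equivalent expressed against the plain Elasticsearch cluster-settings API (host, port and node-name pattern are illustrative) would look like:

# Exclude the node from shard allocation so shards drain off it
curl -X PUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "elastic2048*"
  }
}'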

Please note this is an in-warranty system, and thus doesn't use onsite spares. @Papaul will need to open a Dell dispatch for a replacement part.

Papaul triaged this task as Medium priority.

@Gehel the iDRAC is not showing any failed disk. Can you pull up the log from the OS showing the failed disk?

Thanks.

gehel@elastic2048:~$ cat /proc/mdstat 
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sda1[0](F) sdb1[1]
      29279232 blocks super 1.2 [2/1] [_U]
      
md1 : active raid0 sda2[0] sdb2[1]
      3066771456 blocks super 1.2 512k chunks
      
unused devices: <none>

If I understand that correctly, sda1 is marked as failed.
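Another way to cross-check which physical disk is failing from the OS side, since the iDRAC shows nothing, is SMART data. A minimal sketch, assuming smartmontools is installed on the host (not necessarily what was run here):

# Quick pass/fail health summary for the suspect disk
sudo smartctl -H /dev/sda

# Full SMART attributes and device error log, to compare against the kernel I/O errors
sudo smartctl -a /dev/sda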

From syslog:

Apr  4 06:25:04 elastic2048 kernel: [10370894.621036] sd 2:0:0:0: [sda] tag#24 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Apr  4 06:25:04 elastic2048 kernel: [10370894.621042] sd 2:0:0:0: [sda] tag#24 CDB: Read(10) 28 00 94 84 40 20 00 00 08 00
Apr  4 06:25:04 elastic2048 kernel: [10370894.621045] blk_update_request: I/O error, dev sda, sector 2491695136
Apr  4 06:25:04 elastic2048 kernel: [10370894.627858] sd 2:0:0:0: [sda] tag#25 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Apr  4 06:25:04 elastic2048 kernel: [10370894.627861] sd 2:0:0:0: [sda] tag#25 CDB: Read(10) 28 00 94 84 41 c0 00 00 08 00
Apr  4 06:25:04 elastic2048 kernel: [10370894.627863] blk_update_request: I/O error, dev sda, sector 2491695552
Apr  4 06:25:04 elastic2048 kernel: [10370894.635890] sd 2:0:0:0: [sda] tag#26 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Apr  4 06:25:04 elastic2048 kernel: [10370894.635894] sd 2:0:0:0: [sda] tag#26 CDB: Read(10) 28 00 94 84 40 20 00 00 08 00
Apr  4 06:25:04 elastic2048 kernel: [10370894.635896] blk_update_request: I/O error, dev sda, sector 2491695136
Apr  4 06:25:04 elastic2048 kernel: [10370894.642889] sd 2:0:0:0: [sda] tag#27 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Apr  4 06:25:04 elastic2048 kernel: [10370894.642894] sd 2:0:0:0: [sda] tag#27 CDB: Read(10) 28 00 94 84 41 c0 00 00 08 00
Apr  4 06:25:04 elastic2048 kernel: [10370894.642897] blk_update_request: I/O error, dev sda, sector 2491695552
Apr  4 06:25:04 elastic2048 kernel: [10370894.651185] sd 2:0:0:0: [sda] tag#28 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Apr  4 06:25:04 elastic2048 kernel: [10370894.651190] sd 2:0:0:0: [sda] tag#28 CDB: Read(10) 28 00 94 84 40 20 00 00 08 00
Apr  4 06:25:04 elastic2048 kernel: [10370894.651193] blk_update_request: I/O error, dev sda, sector 2491695136
Apr  4 06:25:04 elastic2048 kernel: [10370894.658003] sd 2:0:0:0: [sda] tag#29 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Apr  4 06:25:04 elastic2048 kernel: [10370894.658007] sd 2:0:0:0: [sda] tag#29 CDB: Read(10) 28 00 94 84 41 c0 00 00 08 00
Apr  4 06:25:04 elastic2048 kernel: [10370894.658009] blk_update_request: I/O error, dev sda, sector 2491695552
Apr  4 06:25:04 elastic2048 kernel: [10370894.665899] sd 2:0:0:0: [sda] tag#30 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Apr  4 06:25:04 elastic2048 kernel: [10370894.665903] sd 2:0:0:0: [sda] tag#30 CDB: Read(10) 28 00 94 84 40 20 00 00 08 00
Apr  4 06:25:04 elastic2048 kernel: [10370894.665905] blk_update_request: I/O error, dev sda, sector 2491695136
Apr  4 06:25:04 elastic2048 kernel: [10370894.672871] sd 2:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Apr  4 06:25:04 elastic2048 kernel: [10370894.672876] sd 2:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 94 84 41 c0 00 00 08 00
Apr  4 06:25:04 elastic2048 kernel: [10370894.672879] blk_update_request: I/O error, dev sda, sector 2491695552
Apr  4 06:25:04 elastic2048 kernel: [10370894.680588] sd 2:0:0:0: [sda] tag#1 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Apr  4 06:25:04 elastic2048 kernel: [10370894.680592] sd 2:0:0:0: [sda] tag#1 CDB: Read(10) 28 00 94 a5 1a 38 00 00 08 00
Apr  4 06:25:04 elastic2048 kernel: [10370894.680595] blk_update_request: I/O error, dev sda, sector 2493848120
Apr  4 06:25:04 elastic2048 kernel: [10370894.687432] sd 2:0:0:0: [sda] tag#2 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Apr  4 06:25:04 elastic2048 kernel: [10370894.687438] sd 2:0:0:0: [sda] tag#2 CDB: Read(10) 28 00 94 84 47 a0 00 00 08 00
Apr  4 06:25:04 elastic2048 kernel: [10370894.687441] blk_update_request: I/O error, dev sda, sector 2491697056
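The lines above can be pulled out of syslog with a simple filter; for example (log path and pattern illustrative):

# Show the most recent kernel-reported read failures on sda
grep 'blk_update_request: I/O error, dev sda' /var/log/syslog | tail -n 20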

Also:

robh@elastic2048:~$ sudo mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Tue Dec  4 16:44:41 2018
     Raid Level : raid1
     Array Size : 29279232 (27.92 GiB 29.98 GB)
  Used Dev Size : 29279232 (27.92 GiB 29.98 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Thu Apr  4 15:38:40 2019
          State : clean, degraded 
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           Name : elastic2048:0  (local to host elastic2048)
           UUID : 7240fb38:5138f83f:47f5fc11:22087110
         Events : 35997

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8       17        1      active sync   /dev/sdb1

       0       8        1        -      faulty   /dev/sda1
robh@elastic2048:~$ sudo mdadm --detail /dev/md1
/dev/md1:
        Version : 1.2
  Creation Time : Tue Dec  4 16:44:41 2018
     Raid Level : raid0
     Array Size : 3066771456 (2924.70 GiB 3140.37 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Tue Dec  4 16:44:41 2018
          State : clean 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 512K

           Name : elastic2048:1  (local to host elastic2048)
           UUID : e88c6d62:ee21b88a:3cfa7ad1:5307a8c2
         Events : 0

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2
robh@elastic2048:~$

/dev/sda is bad and needs replacement.
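Before the drive is pulled, the faulty member can be dropped from md0 (md1 is RAID0, so it has no redundancy to manage, and the host gets reimaged after the swap anyway). A minimal sketch of the usual mdadm steps, not necessarily what was run on this host:

# Mark the member as failed (already the case here) and remove it from md0
sudo mdadm --manage /dev/md0 --fail /dev/sda1
sudo mdadm --manage /dev/md0 --remove /dev/sda1

# After the replacement disk is partitioned, add the new member and watch md0 rebuild
sudo mdadm --manage /dev/md0 --add /dev/sda1
watch cat /proc/mdstat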

Create Dispatch: Success
You have successfully submitted request SR988826339.

Your dispatch request has been successfully created and will be reviewed by our team. You can monitor its progress on your Dell EMC TechDirect dashboard.

Dear Papaul,

Your dispatch shipped on 4/4/2019 3:26 PM

@Gehel Disk replaced. Let me know how the reimage goes before I send back the bad disk. Thanks

Return information

Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts:

['elastic2048.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201904091011_gehel_8457.log.

Completed auto-reimage of hosts:

['elastic2048.codfw.wmnet']

Of which those FAILED:

['elastic2048.codfw.wmnet']

Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts:

['elastic2048.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201904091300_gehel_13678.log.

Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts:

['elastic2048.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201904091411_gehel_27700.log.

Completed auto-reimage of hosts:

['elastic2048.codfw.wmnet']

Of which those FAILED:

['elastic2048.codfw.wmnet']

Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts:

['elastic2048.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201904091419_gehel_30419.log.

Completed auto-reimage of hosts:

['elastic2048.codfw.wmnet']

and were ALL successful.

The reimage was problematic: first a Puppet failure, then the server not booting over PXE. Manually booting into PXE (F12) finally fixed the issue.

Reimage is complete; the server is unbanned and pooled.
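For completeness, the inverse of the exclusion step sketched earlier, again expressed as the raw Elasticsearch API (actual pooling is done with Wikimedia tooling; host and port are illustrative):

# Clear the allocation exclusion so shards can move back onto the node
curl -X PUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "cluster.routing.allocation.exclude._name": null
  }
}'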

This is done, we can close it.