
Degraded RAID on aqs1015
Closed, ResolvedPublic

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host aqs1015. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] 
md2 : active raid10 sde2[0] sdf2[2](F) sdg2[1] sdh2[3]
      3701655552 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U]
      bitmap: 3/28 pages [12KB], 65536KB chunk

md1 : active raid10 sda2[0] sdc2[2] sdb2[1] sdd2[3]
      3701655552 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 2/28 pages [8KB], 65536KB chunk

md0 : active raid10 sda1[0] sdc1[2] sdd1[3] sdb1[1]
      48791552 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      
unused devices: <none>
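The failed member can be read straight off the `(F)` flag in the snapshot above. A minimal sketch (the sample line below is copied from this task; on a live host you would grep `/proc/mdstat` directly):

```shell
# Extract members flagged (F), i.e. failed, from an mdstat-style line.
# Sample line copied from the snapshot above; on the host itself:
#   grep '(F)' /proc/mdstat
line='md2 : active raid10 sde2[0] sdf2[2](F) sdg2[1] sdh2[3]'
failed=$(printf '%s\n' "$line" | grep -o '[a-z0-9]*\[[0-9]*\](F)' | cut -d'[' -f1)
printf '%s\n' "$failed"   # sdf2
```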

Event Timeline

Jclark-ctr added subscribers: Eevans, Jclark-ctr.

@Eevans This server is out of warranty. We have used drives from recently decommissioned servers; please advise when, and if, you would like to replace it.

Yes, please. This is /dev/sdf (scsi@8:0.0.0), physical ID 2 (on the second controller):

*-disk:2
      description: SCSI Disk
      physical id: 2
      bus info: scsi@8:0.0.0
      logical name: /dev/sdf
      size: 1788GiB (1920GB)
      configuration: logicalsectorsize=512 sectorsize=4096
Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] 
md2 : active raid10 sde2[0] sdf2[2](F) sdg2[1] sdh2[3]
      3701655552 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U]
      bitmap: 7/28 pages [28KB], 65536KB chunk

md1 : active raid10 sda2[0] sdc2[2] sdb2[1] sdd2[3]
      3701655552 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      [=====>...............]  check = 27.0% (1000523776/3701655552) finish=2031.8min speed=22156K/sec
      bitmap: 3/28 pages [12KB], 65536KB chunk

md0 : active raid10 sda1[0] sdc1[2] sdd1[3] sdb1[1]
      48791552 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      
unused devices: <none>

Oddly, the iDRAC seems to think that everything is Good (clearly it is not): no errors, and no logs since Mon 16 Nov 2020 20:24:55(!!)


You can take it down at your convenience, though if you give me a heads up when you're ready, I can set a downtime.

Thanks.

iDRAC hardware inventory:
SerialNumber KN09N7919I0709R2U
Slot 6

sdf     KN09N7919I0709R2U
├─sdf1
└─sdf2
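Matching the serial number from the iDRAC inventory to a Linux block device can be done with `lsblk -dno NAME,SERIAL`. A small sketch filtering output captured in this task (on the host you would pipe `lsblk` itself):

```shell
# Map an iDRAC-reported serial number to its block device.
# Sample output captured from this task; on the host itself:
#   lsblk -dno NAME,SERIAL
serials='sda KN09N7919I0709R2S
sdf KN09N7919I0709R2U'
dev=$(printf '%s\n' "$serials" | awk -v s='KN09N7919I0709R2U' '$2 == s { print $1 }')
printf '%s\n' "$dev"   # sdf
```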

@Eevans Removed a failed drive and inserted a replacement drive from a decommissioned server. It appears that md126 and md127 were automatically assembled from existing data on the SSD. A reboot is expected to remap /dev/sdi as /dev/sdf.

Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4]
md126 : inactive sdi[1] sdf[0]
      3749445632 blocks super external:/md127/0

md127 : inactive sdi[0](S)
      651608 blocks super external:ddf

md2 : active raid10 sde2[0] sdg2[1] sdh2[3]
      3701655552 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U]
      bitmap: 9/28 pages [36KB], 65536KB chunk

md1 : active raid10 sda2[0] sdc2[2] sdb2[1] sdd2[3]
      3701655552 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 2/28 pages [8KB], 65536KB chunk

md0 : active raid10 sda1[0] sdc1[2] sdd1[3] sdb1[1]
      48791552 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]

unused devices: <none>

Updated iDRAC firmware: version 4.40.00.00 -> 7.00.00.181

Mentioned in SAL (#wikimedia-operations) [2025-04-18T14:45:44Z] <urandom> rebooting aqs1015.eqiad.wmnet (drive detection/ordering) — T391903

Host rebooted by eevans@cumin1002 with reason: None

Thanks @Jclark-ctr

We're still down one drive though, I'm afraid: scsi@8:0.0.0 (physical ID 2 on the second controller).

I've rebooted the host via the cookbook (and there was indeed some device reordering), and stopped md126 & md127 (nothing else).
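Stopping the stray arrays described above would look roughly like this. A sketch only, to be run as root; the `--zero-superblock` step (clearing the leftover DDF metadata so the arrays don't reassemble at the next boot) goes beyond what was reported done here, and the target device name is an assumption after the post-reboot reordering:

```shell
# Stop the foreign arrays auto-assembled from the replacement drive's
# leftover DDF metadata.
mdadm --stop /dev/md126
mdadm --stop /dev/md127
# Optionally clear the stale metadata so they do not reassemble on the
# next boot. DESTRUCTIVE: double-check the device name first; /dev/sdg
# (the blank replacement after reordering) is an assumption here.
mdadm --zero-superblock /dev/sdg
```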

eevans@aqs1015:~$ sudo lshw -class disk
  *-disk:0                  
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 0
       bus info: scsi@2:0.0.0
       logical name: /dev/sda
       version: DD01
       serial: KN09N7919I0709R2S
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=2a9ac52b
  *-disk:1
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 1
       bus info: scsi@3:0.0.0
       logical name: /dev/sdb
       version: DD01
       serial: KN09N7919I0709R21
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=9abc9d9f
  *-disk:2
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 2
       bus info: scsi@4:0.0.0
       logical name: /dev/sdc
       version: DD01
       serial: KN09N7919I0709R29
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=90b35958
  *-disk:3
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 3
       bus info: scsi@5:0.0.0
       logical name: /dev/sde
       version: DD01
       serial: KN09N7919I0709R1Z
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=2a35a2bf
  *-disk:0
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 0
       bus info: scsi@6:0.0.0
       logical name: /dev/sdd
       version: DD01
       serial: KN09N7919I0709R1W
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=5a17cea6
  *-disk:1
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 1
       bus info: scsi@7:0.0.0
       logical name: /dev/sdf
       version: DD01
       serial: KN09N7919I0709R2W
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=670685e1
  *-disk:3
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 3
       bus info: scsi@9:0.0.0
       logical name: /dev/sdh
       version: DD01
       serial: KN09N7919I0709R2T
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=4b5c60e9
eevans@aqs1015:~$
eevans@aqs1015:~$ sudo mdadm --detail /dev/md2 
/dev/md2:
           Version : 1.2
     Creation Time : Tue Mar  9 14:20:07 2021
        Raid Level : raid10
        Array Size : 3701655552 (3530.17 GiB 3790.50 GB)
     Used Dev Size : 1850827776 (1765.09 GiB 1895.25 GB)
      Raid Devices : 4
     Total Devices : 3
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Fri Apr 18 14:59:13 2025
             State : clean, degraded 
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 0
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : aqs1015:2  (local to host aqs1015)
              UUID : 470004b4:88cd9ca4:241833fa:6606b9fb
            Events : 665743

    Number   Major   Minor   RaidDevice State
       0       8       50        0      active sync set-A   /dev/sdd2
       1       8       82        1      active sync set-B   /dev/sdf2
       -       0        0        2      removed
       3       8      114        3      active sync set-B   /dev/sdh2
eevans@aqs1015:~$

Let me know what you would like to do; I can remove the drive and you can reboot.

The server shows 8 drives:

NAME    MAJ:MIN RM  SIZE RO TYPE   MOUNTPOINT
sda       8:0    0  1.7T  0 disk
├─sda1    8:1    0 23.3G  0 part
│ └─md0   9:0    0 46.5G  0 raid10 /
└─sda2    8:2    0  1.7T  0 part
  └─md1   9:1    0  3.4T  0 raid10 /srv/cassandra-a
sdb       8:16   0  1.7T  0 disk
├─sdb1    8:17   0 23.3G  0 part
│ └─md0   9:0    0 46.5G  0 raid10 /
└─sdb2    8:18   0  1.7T  0 part
  └─md1   9:1    0  3.4T  0 raid10 /srv/cassandra-a
sdc       8:32   0  1.7T  0 disk
├─sdc1    8:33   0 23.3G  0 part
│ └─md0   9:0    0 46.5G  0 raid10 /
└─sdc2    8:34   0  1.7T  0 part
  └─md1   9:1    0  3.4T  0 raid10 /srv/cassandra-a
sdd       8:48   0  1.7T  0 disk
├─sdd1    8:49   0 23.3G  0 part
└─sdd2    8:50   0  1.7T  0 part
  └─md2   9:2    0  3.4T  0 raid10 /srv/cassandra-b
sde       8:64   0  1.7T  0 disk
├─sde1    8:65   0 23.3G  0 part
│ └─md0   9:0    0 46.5G  0 raid10 /
└─sde2    8:66   0  1.7T  0 part
  └─md1   9:1    0  3.4T  0 raid10 /srv/cassandra-a
sdf       8:80   0  1.7T  0 disk
├─sdf1    8:81   0 23.3G  0 part
└─sdf2    8:82   0  1.7T  0 part
  └─md2   9:2    0  3.4T  0 raid10 /srv/cassandra-b
sdg       8:96   0  1.7T  0 disk
sdh       8:112  0  1.7T  0 disk
├─sdh1    8:113  0 23.3G  0 part
└─sdh2    8:114  0  1.7T  0 part
  └─md2   9:2    0  3.4T  0 raid10 /srv/cassandra-b

> Let me know what you would like to do; I can remove the drive and you can reboot.
>
> The server shows 8 drives:
>
> [ ... ]

That's weird; I wonder why lsblk shows it, but lshw (still) does not?

At any rate, I've now partitioned and added it to the array, and it is rebuilding.
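For the record, "partitioned and added" would be roughly the following. A sketch under assumptions: /dev/sdg is the blank replacement and /dev/sdd is a healthy md2 member; verify both before running, since `sfdisk` overwrites the target's partition table:

```shell
# Clone the partition layout from a healthy md2 member onto the
# replacement drive (destructive on /dev/sdg), then add the new
# partition to the array and watch the resync.
sfdisk -d /dev/sdd | sfdisk /dev/sdg
mdadm /dev/md2 --add /dev/sdg2
cat /proc/mdstat
```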

Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4]
md2 : active raid10 sdg2[4] sdd2[0] sdh2[3] sdf2[1]
      3701655552 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 3/28 pages [12KB], 65536KB chunk

md1 : active raid10 sda2[0] sde2[3] sdc2[2] sdb2[1]
      3701655552 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 4/28 pages [16KB], 65536KB chunk

md0 : active raid10 sda1[0] sdc1[2] sde1[3] sdb1[1]
      48791552 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]

unused devices: <none>