Degraded RAID on aqs1013
Open, Needs Triage, Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host aqs1013. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] 
md2 : active raid10 sde2[4](F) sdh2[3] sdg2[2] sdf2[1]
      3701655552 blocks super 1.2 512K chunks 2 near-copies [4/3] [_UUU]
      bitmap: 24/28 pages [96KB], 65536KB chunk

md1 : active raid10 sda2[0] sdd2[3] sdb2[1] sdc2[2]
      3701655552 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 4/28 pages [16KB], 65536KB chunk

md0 : active raid10 sda1[0] sdd1[3] sdb1[1] sdc1[2]
      48791552 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      
unused devices: <none>

Event Timeline

Warranty expired 19 NOV 2023. Will look to see what drives we have available at the data center.

@Eevans Hey, looks like the same drive as in T354499 has failed again; let me know if I can replace it again.

Sure, go ahead.

P.S. I think this is the 4th time; are we just really unlucky, or is there some underlying factor at work?

Eevans moved this task from Backlog to Next on the Cassandra board.

Hey @Jclark-ctr, any update on this?

@Eevans Hey, sorry about missing the update on availability; I did just swap the drive now. When you recreate md2, what commands are you running?

Host rebooted by eevans@cumin1002 with reason: None

Here is a transcript of everything done (for posterity's sake):

eevans@aqs1013:~$ sudo sgdisk -R /dev/sde /dev/sdg
Warning: Partition table header claims that the size of partition table
entries is 0 bytes, but this program  supports only 128-byte entries.
Adjusting accordingly, but partition table may be garbage.
Warning: Partition table header claims that the size of partition table
entries is 0 bytes, but this program  supports only 128-byte entries.
Adjusting accordingly, but partition table may be garbage.

***************************************************************
Found invalid GPT and valid MBR; converting MBR to GPT format
in memory. 
***************************************************************

The operation has completed successfully.
eevans@aqs1013:~$ sudo sgdisk -G /dev/sde
The operation has completed successfully.
eevans@aqs1013:~$ sudo sgdisk -p /dev/sdg
Warning: Partition table header claims that the size of partition table
entries is 0 bytes, but this program  supports only 128-byte entries.
Adjusting accordingly, but partition table may be garbage.
Warning: Partition table header claims that the size of partition table
entries is 0 bytes, but this program  supports only 128-byte entries.
Adjusting accordingly, but partition table may be garbage.

***************************************************************
Found invalid GPT and valid MBR; converting MBR to GPT format
in memory. 
***************************************************************

Disk /dev/sdg: 3750748848 sectors, 1.7 TiB
Model: HFS1T9G32FEH-BA1
Sector size (logical/physical): 512/4096 bytes
Disk identifier (GUID): D3A4DD8F-DF2B-4851-8597-32C788463AC5
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 3750748814
Partitions will be aligned on 2048-sector boundaries
Total free space is 2669 sectors (1.3 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048        48828415   23.3 GiB    FD00  Linux RAID
   2        48828416      3750748159   1.7 TiB     FD00  Linux RAID
eevans@aqs1013:~$ sudo sgdisk -p /dev/sde
Disk /dev/sde: 3750748848 sectors, 1.7 TiB
Model: MZ7LH1T9HMLT0D3 
Sector size (logical/physical): 512/4096 bytes
Disk identifier (GUID): 057FC9FF-294C-4B14-A3EC-A624AF839077
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 3750748814
Partitions will be aligned on 2048-sector boundaries
Total free space is 2669 sectors (1.3 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048        48828415   23.3 GiB    FD00  Linux RAID
   2        48828416      3750748159   1.7 TiB     FD00  Linux RAID
eevans@aqs1013:~$ sudo mdadm --manage /dev/md2 --add /dev/sde2
mdadm: added /dev/sde2
eevans@aqs1013:~$ cat /proc/mdstat 
Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] 
md2 : active raid10 sde2[4] sdg2[2] sdf2[1] sdh2[3]
      3701655552 blocks super 1.2 512K chunks 2 near-copies [4/3] [_UUU]
      [>....................]  recovery =  0.0% (628864/1850827776) finish=1520.0min speed=20285K/sec
      bitmap: 28/28 pages [112KB], 65536KB chunk

md0 : active raid10 sda1[0] sdc1[2] sdb1[1] sdd1[3]
      48791552 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      
md1 : active raid10 sda2[0] sdb2[1] sdc2[2] sdd2[3]
      3701655552 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 0/28 pages [0KB], 65536KB chunk

unused devices: <none>
eevans@aqs1013:~$ sudo mdadm --detail /dev/md2
/dev/md2:
           Version : 1.2
     Creation Time : Tue Mar  9 12:51:54 2021
        Raid Level : raid10
        Array Size : 3701655552 (3530.17 GiB 3790.50 GB)
     Used Dev Size : 1850827776 (1765.09 GiB 1895.25 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Tue Apr 23 23:13:15 2024
             State : clean, degraded, recovering 
    Active Devices : 3
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 1

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

    Rebuild Status : 0% complete

              Name : aqs1013:2  (local to host aqs1013)
              UUID : d71d53a0:5b6c3965:9ac81a5c:a4aa04c9
            Events : 3498017

    Number   Major   Minor   RaidDevice State
       4       8       66        0      spare rebuilding   /dev/sde2
       1       8       82        1      active sync set-B   /dev/sdf2
       2       8       98        2      active sync set-A   /dev/sdg2
       3       8      114        3      active sync set-B   /dev/sdh2
eevans@aqs1013:~$
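
In short, the recovery boils down to three steps (condensed from the transcript above; device names as in this case, where sdg is a healthy array member and sde is the replacement):

# Replicate the healthy member's partition table onto the new disk.
# Note sgdisk's argument order: the -R argument is the destination,
# the trailing device is the source being copied.
sudo sgdisk -R /dev/sde /dev/sdg

# Randomize the disk and partition GUIDs on the copy so they don't collide with the source
sudo sgdisk -G /dev/sde

# Re-add the data partition to the degraded array; md then rebuilds onto it
sudo mdadm --manage /dev/md2 --add /dev/sde2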

Ok, the rebuild is complete.

eevans@nyx:~$ ssh aqs1013.eqiad.wmnet -- sudo mdadm --detail /dev/md2
/dev/md2:
           Version : 1.2
     Creation Time : Tue Mar  9 12:51:54 2021
        Raid Level : raid10
        Array Size : 3701655552 (3530.17 GiB 3790.50 GB)
     Used Dev Size : 1850827776 (1765.09 GiB 1895.25 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Thu Apr 25 14:25:21 2024
             State : clean 
    Active Devices : 4
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : aqs1013:2  (local to host aqs1013)
              UUID : d71d53a0:5b6c3965:9ac81a5c:a4aa04c9
            Events : 3546682

    Number   Major   Minor   RaidDevice State
       4       8       66        0      active sync set-A   /dev/sde2
       1       8       82        1      active sync set-B   /dev/sdf2
       2       8       98        2      active sync set-A   /dev/sdg2
       3       8      114        3      active sync set-B   /dev/sdh2
eevans@nyx:~$

Host rebooted by eevans@cumin1002 with reason: None

@Volans We have replaced this drive 4 times now and it continues to fail. We no longer suspect a drive issue; it may be a process issue with how the mdadm RAID 10 is recreated. We are also having the same issue with aqs1014. Do you have any input, or would you be able to assist, or know who might be the best person to help with this issue?

@Jclark-ctr what do you mean by "process issues"? If mdadm shows the RAID OK after the rebuild, I don't see problems there.

Have we already tried to exclude other kinds of problems? Such as:

  1. Upgrading the firmware, to see if it's a software issue (see the sketch after this list)
  2. Trying a different disk bay (might require rebuilding the RAID from scratch)
  3. Faulty internal cabling or motherboard
  4. Power supply issues (if the voltage is not correct, that might explain the failures)
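
For points 1 and 3, a few non-destructive checks might help narrow things down (just a sketch, assuming smartmontools is available on the host; these weren't run as part of this task):

# Drive model and firmware revision, to compare against the vendor's latest
sudo smartctl -i /dev/sdg

# Overall SMART health plus the error and self-test logs
sudo smartctl -H -l error -l selftest /dev/sdg

# SATA Phy event counters; growing CRC/comm errors tend to point at cabling
# or the backplane rather than the drive itself
sudo smartctl -l sataphy /dev/sdg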

@Volans @Eevans Same results between two different servers; a total of 7 SSDs have been swapped.
The rebuild completes and then the drive fails 2-3 days later.
iDRAC shows no errors.
Only mdstat shows a failed drive.
There are no other available disk bays in the server to test.

dmesg has this when it fails: sd 9:0:0:0: [sdg] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
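
As an aside, that cache state can be confirmed directly on the drive (illustrative only, assuming hdparm is installed; not something that was run here):

# Query the drive's write-cache setting (reports on/off)
sudo hdparm -W /dev/sdg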

Maybe a little drastic as an option, but could we try to reimage one of those 2 servers and wait a few days?
That would surely wipe clean any manual procedure that was carried out on the host since the first disk swap. If it happens again, it is probably unrelated to anything done on the host and more likely points to some hardware issue or a more general software problem.

@Jclark-ctr and I discussed the same, and I guess it may come to that. It is pretty drastic. We typically reimage in a way that will preserve the data (the contents of this md). Doing a complete non-data-preserving reimage means decommissioning that host (transferring off all of its data to other nodes), and then bootstrapping (transferring it back). I'm almost more worried that it will fix it (the answer can't be to reimage on every SSD failure). :)

T364422: Reimage aqs1013

The machine has been reimaged and the instances bootstrapped. 🤞

That didn't take long:

/dev/md2:
           Version : 1.2
     Creation Time : Thu May  9 14:23:21 2024
        Raid Level : raid10
        Array Size : 3701655552 (3530.17 GiB 3790.50 GB)
     Used Dev Size : 1850827776 (1765.09 GiB 1895.25 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Mon May 13 13:11:11 2024
             State : clean, degraded 
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 1
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : aqs1013:2  (local to host aqs1013)
              UUID : e4211b55:aa148c4e:28650e76:bb057ffb
            Events : 262387

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8       82        1      active sync set-B   /dev/sdf2
       2       8      114        2      active sync set-A   /dev/sdh2
       3       8       98        3      active sync set-B   /dev/sdg2

       0       8       50        -      faulty   /dev/sdd2
eevans@aqs1013:~$ sudo lshw -class disk
  *-disk:0                  
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 0
       bus info: scsi@2:0.0.0
       logical name: /dev/sda
       version: DD01
       serial: KN09N7919I0709R2F
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=280cfc8d
  *-disk:1
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 1
       bus info: scsi@3:0.0.0
       logical name: /dev/sdb
       version: DD01
       serial: KN09N7919I0709R2C
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=868c5b47
  *-disk:2
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 2
       bus info: scsi@4:0.0.0
       logical name: /dev/sdc
       version: DD01
       serial: KN09N7919I0709R42
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=1edf19f8
  *-disk:3
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 3
       bus info: scsi@5:0.0.0
       logical name: /dev/sde
       version: DD01
       serial: KN09N7919I0709R2L
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=a8d8ff05
  *-disk:0
       description: SCSI Disk
       physical id: 0
       bus info: scsi@6:0.0.0
       logical name: /dev/sdd
       size: 1788GiB (1920GB)
       configuration: logicalsectorsize=512 sectorsize=4096
  *-disk:1
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 1
       bus info: scsi@7:0.0.0
       logical name: /dev/sdf
       version: DD01
       serial: KN09N7919I0709R46
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=d287332a
  *-disk:2
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 2
       bus info: scsi@8:0.0.0
       logical name: /dev/sdg
       version: DD01
       serial: KN09N7919I0709R44
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=63d0f241
  *-disk:3
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 3
       bus info: scsi@9:0.0.0
       logical name: /dev/sdh
       version: DD01
       serial: KN09N7919I0709R43
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=d4b5ee4a
eevans@aqs1013:~$

(sata:1, disk:0)

dmesg
[ ... ]
[338641.858168] scsi_io_completion_action: 3 callbacks suppressed
[338641.858173] sd 6:0:0:0: [sdd] tag#6 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[338641.858176] sd 6:0:0:0: [sdd] tag#6 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[338641.858177] print_req_error: 3 callbacks suppressed
[338641.858179] blk_update_request: I/O error, dev sdd, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[338641.868116] buffer_io_error: 3 callbacks suppressed
[338641.868117] Buffer I/O error on dev sdd, logical block 0, async page read
[338641.875075] sd 6:0:0:0: [sdd] tag#11 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[338641.875081] sd 6:0:0:0: [sdd] tag#11 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[338641.875086] blk_update_request: I/O error, dev sdd, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[338641.885018] Buffer I/O error on dev sdd, logical block 0, async page read
[338641.891939] sd 6:0:0:0: [sdd] tag#15 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[338641.891942] sd 6:0:0:0: [sdd] tag#15 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[338641.891945] blk_update_request: I/O error, dev sdd, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[338641.901867] Buffer I/O error on dev sdd, logical block 0, async page read
[338641.908768] sd 6:0:0:0: [sdd] tag#12 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[338641.908771] sd 6:0:0:0: [sdd] tag#12 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[338641.908774] blk_update_request: I/O error, dev sdd, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[338641.918685] Buffer I/O error on dev sdd, logical block 0, async page read
[338641.925589] sd 6:0:0:0: [sdd] tag#16 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[338641.925591] sd 6:0:0:0: [sdd] tag#16 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[338641.925593] blk_update_request: I/O error, dev sdd, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[338641.935518] Buffer I/O error on dev sdd, logical block 0, async page read
[338641.942548] sd 6:0:0:0: [sdd] tag#15 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[338641.942555] sd 6:0:0:0: [sdd] tag#15 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[338641.942562] blk_update_request: I/O error, dev sdd, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[338641.952490] Buffer I/O error on dev sdd, logical block 0, async page read
[338641.959430] sd 6:0:0:0: [sdd] tag#24 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[338641.959435] sd 6:0:0:0: [sdd] tag#24 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[338641.959438] blk_update_request: I/O error, dev sdd, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[338641.969359] Buffer I/O error on dev sdd, logical block 0, async page read
[338641.976258] sd 6:0:0:0: [sdd] tag#17 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[338641.976264] sd 6:0:0:0: [sdd] tag#17 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[338641.976266] blk_update_request: I/O error, dev sdd, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[338641.986174] Buffer I/O error on dev sdd, logical block 0, async page read
[338641.993081] sd 6:0:0:0: [sdd] tag#25 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[338641.993082] sd 6:0:0:0: [sdd] tag#25 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[338641.993084] blk_update_request: I/O error, dev sdd, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[338642.002995] Buffer I/O error on dev sdd, logical block 0, async page read
[338642.009937] sd 6:0:0:0: [sdd] tag#13 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[338642.009944] sd 6:0:0:0: [sdd] tag#13 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[338642.009949] blk_update_request: I/O error, dev sdd, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[338642.019874] Buffer I/O error on dev sdd, logical block 0, async page read

The failed device (sdd) was replaced; this time we're using sfdisk to copy the partition table.

The first run complained of a 'ddf_raid_member' signature remaining on the device, and recommended using --wipe:

root@aqs1013:~# sfdisk -d /dev/sdf | sfdisk /dev/sdd
Checking that no-one is using this disk right now ... OK

The device contains 'ddf_raid_member' signature and it may remain on the device. It is recommended to wipe the device with wipefs(8) or sfdisk --wipe, in order to avoid possible collisions.

Disk /dev/sdd: 1.75 TiB, 1920383410176 bytes, 3750748848 sectors
Disk model: MZ7KH1T9HAJR0D3 
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Created a new DOS disklabel with disk identifier 0xd287332a.
The device contains 'ddf_raid_member' signature and it may remain on the device. It is recommended to wipe the device with wipefs(8) or sfdisk --wipe, in order to avoid possible collisions.

/dev/sdd1: Created a new partition 1 of type 'Linux raid autodetect' and of size 23.3 GiB.
/dev/sdd2: Created a new partition 2 of type 'Linux raid autodetect' and of size 1.7 TiB.
/dev/sdd3: Done.

New situation:
Disklabel type: dos
Disk identifier: 0xd287332a

Device     Boot    Start        End    Sectors  Size Id Type
/dev/sdd1           2048   48828415   48826368 23.3G fd Linux raid autodetect
/dev/sdd2       48828416 3750748159 3701919744  1.7T fd Linux raid autodetect

The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
root@aqs1013:~#

So I did:

root@aqs1013:~# sfdisk -d /dev/sdf | sfdisk --wipe always /dev/sdd
Checking that no-one is using this disk right now ... OK

The device contains 'ddf_raid_member' signature and it will be removed by a write command. See sfdisk(8) man page and --wipe option for more details.

Disk /dev/sdd: 1.75 TiB, 1920383410176 bytes, 3750748848 sectors
Disk model: MZ7KH1T9HAJR0D3 
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0xd287332a

Old situation:

Device     Boot    Start        End    Sectors  Size Id Type
/dev/sdd1           2048   48828415   48826368 23.3G fd Linux raid autodetect
/dev/sdd2       48828416 3750748159 3701919744  1.7T fd Linux raid autodetect

>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Created a new DOS disklabel with disk identifier 0xd287332a.
The device contains 'ddf_raid_member' signature and it will be removed by a write command. See sfdisk(8) man page and --wipe option for more details.

/dev/sdd1: Created a new partition 1 of type 'Linux raid autodetect' and of size 23.3 GiB.
/dev/sdd2: Created a new partition 2 of type 'Linux raid autodetect' and of size 1.7 TiB.
/dev/sdd3: Done.

New situation:
Disklabel type: dos
Disk identifier: 0xd287332a

Device     Boot    Start        End    Sectors  Size Id Type
/dev/sdd1           2048   48828415   48826368 23.3G fd Linux raid autodetect
/dev/sdd2       48828416 3750748159 3701919744  1.7T fd Linux raid autodetect

The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
root@aqs1013:~#

Afterward:

root@aqs1013:~# sfdisk -d /dev/sdf
label: dos
label-id: 0xd287332a
device: /dev/sdf
unit: sectors
sector-size: 512

/dev/sdf1 : start=        2048, size=    48826368, type=fd
/dev/sdf2 : start=    48828416, size=  3701919744, type=fd
root@aqs1013:~# sfdisk -d /dev/sdd
label: dos
label-id: 0xd287332a
device: /dev/sdd
unit: sectors
sector-size: 512

/dev/sdd1 : start=        2048, size=    48826368, type=fd
/dev/sdd2 : start=    48828416, size=  3701919744, type=fd
root@aqs1013:~#
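
(The re-add itself isn't captured in the transcript above; presumably it was the same step as in the earlier recovery, i.e. something like the following.)

# Assumed, mirroring the earlier sde2 recovery: re-add the replacement's data
# partition to the degraded array so md starts rebuilding onto it
sudo mdadm --manage /dev/md2 --add /dev/sdd2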

And the array is rebuilding:

eevans@aqs1013:~$ sudo mdadm --detail /dev/md2 
/dev/md2:
           Version : 1.2
     Creation Time : Thu May  9 14:23:21 2024
        Raid Level : raid10
        Array Size : 3701655552 (3530.17 GiB 3790.50 GB)
     Used Dev Size : 1850827776 (1765.09 GiB 1895.25 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Mon May 13 19:32:12 2024
             State : active, degraded, recovering 
    Active Devices : 3
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 1

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

    Rebuild Status : 0% complete

              Name : aqs1013:2  (local to host aqs1013)
              UUID : e4211b55:aa148c4e:28650e76:bb057ffb
            Events : 277477

    Number   Major   Minor   RaidDevice State
       4       8       50        0      spare rebuilding   /dev/sdd2
       1       8       82        1      active sync set-B   /dev/sdf2
       2       8      114        2      active sync set-A   /dev/sdh2
       3       8       98        3      active sync set-B   /dev/sdg2
eevans@aqs1013:~$

🤞

The array has rebuilt, but I could swear I hear it ticking...

eevans@aqs1013:~$ sudo mdadm --detail /dev/md2 
/dev/md2:
           Version : 1.2
     Creation Time : Thu May  9 14:23:21 2024
        Raid Level : raid10
        Array Size : 3701655552 (3530.17 GiB 3790.50 GB)
     Used Dev Size : 1850827776 (1765.09 GiB 1895.25 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Tue May 14 20:57:06 2024
             State : clean 
    Active Devices : 4
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : aqs1013:2  (local to host aqs1013)
              UUID : e4211b55:aa148c4e:28650e76:bb057ffb
            Events : 343971

    Number   Major   Minor   RaidDevice State
       4       8       50        0      active sync set-A   /dev/sdd2
       1       8       82        1      active sync set-B   /dev/sdf2
       2       8      114        2      active sync set-A   /dev/sdh2
       3       8       98        3      active sync set-B   /dev/sdg2
eevans@aqs1013:~$

💥

dmesg
[ ... ]

[898421.304851] md: super_written gets error=-5
[898421.309130] md/raid10:md2: Disk failure on sdd2, disabling device.
                md/raid10:md2: Operation continuing on 3 devices.
[898421.321628] md/raid10:md2: sdf2: redirecting sector 2358221760 to another mirror
[898421.331297] md/raid10:md2: sdf2: redirecting sector 7027993248 to another mirror
[898421.339043] md/raid10:md2: sdf2: redirecting sector 7027993280 to another mirror
[898421.346785] md/raid10:md2: sdf2: redirecting sector 2358221792 to another mirror
[898421.354914] md/raid10:md2: sdf2: redirecting sector 7027993312 to another mirror
[898421.364310] md/raid10:md2: sdf2: redirecting sector 2358221648 to another mirror
[898421.372021] md/raid10:md2: sdf2: redirecting sector 2996306928 to another mirror
[898421.381084] md/raid10:md2: sdf2: redirecting sector 2996306592 to another mirror
[898421.388829] md/raid10:md2: sdf2: redirecting sector 2996308224 to another mirror
[899529.356235] scsi_io_completion_action: 115 callbacks suppressed
[899529.356240] sd 6:0:0:0: [sdd] tag#19 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[899529.356243] sd 6:0:0:0: [sdd] tag#19 CDB: Read(10) 28 00 02 e9 0f 80 00 00 08 00
[899529.356245] print_req_error: 115 callbacks suppressed
[899529.356246] blk_update_request: I/O error, dev sdd, sector 48828288 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[899529.367184] sd 6:0:0:0: [sdd] tag#1 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[899529.367190] sd 6:0:0:0: [sdd] tag#1 CDB: Read(10) 28 00 02 e9 0f 80 00 00 08 00
[899529.367201] blk_update_request: I/O error, dev sdd, sector 48828288 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[899529.377750] Buffer I/O error on dev sdd1, logical block 6103280, async page read
[899529.385682] sd 6:0:0:0: [sdd] tag#20 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[899529.385690] sd 6:0:0:0: [sdd] tag#20 CDB: Read(10) 28 00 df 8f df 80 00 00 08 00
[899529.385696] blk_update_request: I/O error, dev sdd, sector 3750748032 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[899529.396842] sd 6:0:0:0: [sdd] tag#18 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[899529.396850] sd 6:0:0:0: [sdd] tag#18 CDB: Read(10) 28 00 df 8f df 80 00 00 08 00
[899529.396855] blk_update_request: I/O error, dev sdd, sector 3750748032 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[899529.407575] Buffer I/O error on dev sdd2, logical block 462739952, async page read
[900011.705834] sd 6:0:0:0: [sdd] tag#0 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[900011.705840] sd 6:0:0:0: [sdd] tag#0 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
[900680.719961] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
[900680.719975] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
[900680.719982] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
[900680.750020] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
[900680.750034] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
[900680.750040] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
[901317.860695] sd 6:0:0:0: [sdd] tag#31 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[901317.860699] sd 6:0:0:0: [sdd] tag#31 CDB: Read(10) 28 00 02 e9 0f 80 00 00 08 00
[901317.860701] blk_update_request: I/O error, dev sdd, sector 48828288 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[901317.871613] sd 6:0:0:0: [sdd] tag#23 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[901317.871616] sd 6:0:0:0: [sdd] tag#23 CDB: Read(10) 28 00 02 e9 0f 80 00 00 08 00
[901317.871619] blk_update_request: I/O error, dev sdd, sector 48828288 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[901317.882154] Buffer I/O error on dev sdd1, logical block 6103280, async page read
[901317.889942] sd 6:0:0:0: [sdd] tag#7 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[901317.889947] sd 6:0:0:0: [sdd] tag#7 CDB: Read(10) 28 00 df 8f df 80 00 00 08 00
[901317.889951] blk_update_request: I/O error, dev sdd, sector 3750748032 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[901317.901066] sd 6:0:0:0: [sdd] tag#0 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[901317.901074] sd 6:0:0:0: [sdd] tag#0 CDB: Read(10) 28 00 df 8f df 80 00 00 08 00
[901317.901081] blk_update_request: I/O error, dev sdd, sector 3750748032 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[901317.911807] Buffer I/O error on dev sdd2, logical block 462739952, async page read
[901811.696680] sd 6:0:0:0: [sdd] tag#13 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[901811.696686] sd 6:0:0:0: [sdd] tag#13 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
eevans@aqs1013:~$ sudo mdadm --detail /dev/md2 
/dev/md2:
           Version : 1.2
     Creation Time : Thu May  9 14:23:21 2024
        Raid Level : raid10
        Array Size : 3701655552 (3530.17 GiB 3790.50 GB)
     Used Dev Size : 1850827776 (1765.09 GiB 1895.25 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Mon May 20 01:53:26 2024
             State : clean, degraded 
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 1
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : aqs1013:2  (local to host aqs1013)
              UUID : e4211b55:aa148c4e:28650e76:bb057ffb
            Events : 349556

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8       82        1      active sync set-B   /dev/sdf2
       2       8      114        2      active sync set-A   /dev/sdh2
       3       8       98        3      active sync set-B   /dev/sdg2

       4       8       50        -      faulty   /dev/sdd2
eevans@aqs1013:~$