
Degraded RAID on aqs1014
Open, Needs Triage, Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host aqs1014. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] 
md2 : active raid10 sde2[0] sdh2[3] sdf2[2] sdg2[1](F)
      3701655552 blocks super 1.2 512K chunks 2 near-copies [4/3] [U_UU]
      bitmap: 5/28 pages [20KB], 65536KB chunk

md1 : active raid10 sda2[0] sdb2[1] sdc2[2] sdd2[3]
      3701655552 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 5/28 pages [20KB], 65536KB chunk

md0 : active raid10 sdc1[2] sda1[0] sdb1[1] sdd1[3]
      48791552 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      
unused devices: <none>
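
(Not part of the handler output.) A minimal sketch of how the failed member could be confirmed and its identity gathered before scheduling the swap; device names follow the mdstat above, where sdg2 is flagged (F):

# Confirm the array state and which member is failed
cat /proc/mdstat
sudo mdadm --detail /dev/md2

# Gather identity and SMART health for the suspect drive
sudo lsblk -o NAME,MODEL,SERIAL,SIZE /dev/sdg
sudo smartctl -i -H /dev/sdg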

Event Timeline

Eevans moved this task from Backlog to Next on the Cassandra board.
Eevans added subscribers: Jclark-ctr, Eevans.

Hey @Jclark-ctr: I hope it's OK to assign this one to you as well.

@Eevans this one is also out of warranty; let me know if I am able to swap the drive and I can take care of it in the morning.

eevans@aqs1014:~$ sudo mdadm --remove /dev/md2 /dev/sdg2
mdadm: hot removed /dev/sdg2 from /dev/md2
eevans@aqs1014:~$ sudo mdadm --detail /dev/md2
/dev/md2:
           Version : 1.2
     Creation Time : Tue Mar  9 14:18:06 2021
        Raid Level : raid10
        Array Size : 3701655552 (3530.17 GiB 3790.50 GB)
     Used Dev Size : 1850827776 (1765.09 GiB 1895.25 GB)
      Raid Devices : 4
     Total Devices : 3
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Tue Apr 23 23:01:08 2024
             State : clean, degraded 
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 0
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : aqs1014:2  (local to host aqs1014)
              UUID : 3477f980:543f02f4:52810833:b9cc0dfe
            Events : 640603

    Number   Major   Minor   RaidDevice State
       0       8       66        0      active sync set-A   /dev/sde2
       -       0        0        1      removed
       2       8       82        2      active sync set-A   /dev/sdf2
       3       8      114        3      active sync set-B   /dev/sdh2
eevans@aqs1014:~$

Good to go.
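
For reference, the usual sequence when the kernel has not already flagged a member is to mark it faulty before hot-removing it; here sdg2 was already failed (the (F) in mdstat), so --remove alone sufficed. A minimal sketch:

# Mark the member faulty (only needed if it is still active), then hot-remove it
sudo mdadm --manage /dev/md2 --fail /dev/sdg2
sudo mdadm --manage /dev/md2 --remove /dev/sdg2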

@Eevans Replaced drive

Did something happen to sdf during the swap?

[Apr24 15:05] ata9: SATA link down (SStatus 0 SControl 300)
[  +5.551028] ata9: SATA link down (SStatus 0 SControl 300)
[  +0.000010] ata9: limiting SATA link speed to <unknown>
[  +2.340884] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 3F0)
[  +0.000220] ata9.00: model number mismatch 'HFS1T9G32FEH-BA10A' != 'MZ7KH1T9HAJR0D3'
[  +0.000004] ata9.00: revalidation failed (errno=-19)
[  +0.005259] ata9.00: disabled
[  +0.006505] sd 8:0:0:0: rejecting I/O to offline device
[  +0.005506] blk_update_request: I/O error, dev sdf, sector 2971071048 op 0x0:(READ) flags 0x0 phys_seg 55 prio class 0
[  +0.010982] raid10_end_read_request: 12 callbacks suppressed
[  +0.000003] md/raid10:md2: sdf2: rescheduling sector 5843957320
[  +0.006191] blk_update_request: I/O error, dev sdf, sector 2971072072 op 0x0:(READ) flags 0x0 phys_seg 55 prio class 0
[  +0.010958] md/raid10:md2: sdf2: rescheduling sector 5843959368
[  +0.006213] blk_update_request: I/O error, dev sdf, sector 2971080704 op 0x0:(READ) flags 0x0 phys_seg 40 prio class 0
[  +0.010971] md/raid10:md2: sdf2: rescheduling sector 5843977216
[  +0.006189] blk_update_request: I/O error, dev sdf, sector 139514880 op 0x0:(READ) flags 0x0 phys_seg 41 prio class 0
[  +0.010867] md/raid10:md2: sdf2: rescheduling sector 180845568
[  +0.006104] blk_update_request: I/O error, dev sdf, sector 142063176 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  +0.010778] md/raid10:md2: sdf2: rescheduling sector 185941576
[  +0.006099] blk_update_request: I/O error, dev sdf, sector 142063368 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  +0.010773] md/raid10:md2: sdf2: rescheduling sector 185941768
[  +0.006094] blk_update_request: I/O error, dev sdf, sector 142063424 op 0x0:(READ) flags 0x0 phys_seg 3 prio class 0
[  +0.010778] md/raid10:md2: sdf2: rescheduling sector 185941824
[  +0.006096] blk_update_request: I/O error, dev sdf, sector 142063528 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  +0.010790] md/raid10:md2: sdf2: rescheduling sector 185941928
[  +0.006105] blk_update_request: I/O error, dev sdf, sector 142063600 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  +0.010777] md/raid10:md2: sdf2: rescheduling sector 185942000
[  +0.006101] blk_update_request: I/O error, dev sdf, sector 142080472 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  +0.010797] md/raid10:md2: sdf2: rescheduling sector 185976280
[  +0.006102] md: super_written gets error=-5
[  +0.004463] md/raid10:md2: Disk failure on sdf2, disabling device.
              md/raid10:md2: Operation continuing on 2 devices.

[ ... ]

[  +5.062114] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  +0.000296] ata9.00: failed to read native max address (err_mask=0x1)
[  +0.000003] ata9.00: HPA support seems broken, skipping HPA handling
[  +0.001336] ata9.00: ATA-11: MZ7KH1T9HAJR0D3,     HF56, max UDMA/133
[  +0.000006] ata9.00: 3750748848 sectors, multi 16: LBA48 NCQ (depth 32), AA
[  +0.001612] ata9.00: configured for UDMA/133
[  +0.000030] ata9.00: detaching (SCSI 8:0:0:0)
[  +0.001297] sd 8:0:0:0: [sdf] Stopping disk
[  +5.221705] scsi 8:0:0:0: Direct-Access     ATA      MZ7KH1T9HAJR0D3  HF56 PQ: 0 ANSI: 5
[  +0.000299] sd 8:0:0:0: Attached scsi generic sg6 type 0
[  +0.000307] sd 8:0:0:0: [sdf] 3750748848 512-byte logical blocks: (1.92 TB/1.75 TiB)
[  +0.000005] sd 8:0:0:0: [sdf] 4096-byte physical blocks
[  +0.000050] sd 8:0:0:0: [sdf] Write Protect is off
[  +0.000007] sd 8:0:0:0: [sdf] Mode Sense: 00 3a 00 00
[  +0.000043] sd 8:0:0:0: [sdf] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  +0.078376] sd 8:0:0:0: [sdf] Attached SCSI disk
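
A minimal sketch of how one might confirm what the kernel now sees in that bay after the swap (output not shown; ata9/sdf are taken from the log above):

# Check the device identity now answering as sdf
sudo lsblk -o NAME,MODEL,SERIAL,SIZE /dev/sdf
sudo smartctl -i /dev/sdf

# Review the most recent kernel messages for that port/device
sudo dmesg | grep -iE 'ata9|sdf' | tail -n 50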

Also, afterward, errors for sdg continue...

[ ... ]

[  +0.010098] Buffer I/O error on dev sdg, logical block 0, async page read
[  +0.007129] sd 7:0:0:0: [sdg] tag#4 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[  +0.000007] sd 7:0:0:0: [sdg] tag#4 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[  +0.000005] blk_update_request: I/O error, dev sdg, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  +0.010104] Buffer I/O error on dev sdg, logical block 0, async page read
[  +0.007171] sd 7:0:0:0: [sdg] tag#27 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[  +0.000007] sd 7:0:0:0: [sdg] tag#27 CDB: Read(10) 28 00 00 00 00 80 00 00 08 00
[  +0.000005] blk_update_request: I/O error, dev sdg, sector 128 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[  +0.010729] Buffer I/O error on dev sdg, logical block 16, async page read
[  +0.007255] Buffer I/O error on dev sdg, logical block 0, async page read

... and the output of lshw doesn't look right (the product field is missing):

*-disk:1
     description: SCSI Disk
     physical id: 1
     bus info: scsi@7:0.0.0
     logical name: /dev/sdg
     size: 1788GiB (1920GB)
     configuration: logicalsectorsize=512 sectorsize=4096
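
The missing product field suggests the SCSI layer is holding stale state for sdg. A hedged sketch of forcing a re-read, assuming host7 from the scsi@7:0.0.0 bus info above (this was not run here):

# Ask the kernel to rescan the device
echo 1 | sudo tee /sys/block/sdg/device/rescan

# If that does not help, drop the stale device and rescan the controller port
echo 1 | sudo tee /sys/block/sdg/device/delete
echo "- - -" | sudo tee /sys/class/scsi_host/host7/scan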

The array is in a precarious position at the moment:

eevans@aqs1014:~$ sudo mdadm --detail /dev/md2
/dev/md2:
           Version : 1.2
     Creation Time : Tue Mar  9 14:18:06 2021
        Raid Level : raid10
        Array Size : 3701655552 (3530.17 GiB 3790.50 GB)
     Used Dev Size : 1850827776 (1765.09 GiB 1895.25 GB)
      Raid Devices : 4
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Wed Apr 24 15:22:36 2024
             State : clean, degraded 
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : aqs1014:2  (local to host aqs1014)
              UUID : 3477f980:543f02f4:52810833:b9cc0dfe
            Events : 675094

    Number   Major   Minor   RaidDevice State
       0       8       66        0      active sync set-A   /dev/sde2
       -       0        0        1      removed
       -       0        0        2      removed
       3       8      114        3      active sync set-B   /dev/sdh2
eevans@aqs1014:~$

Looking at lshw.log and the inventory in iDRAC, it looks like all the drives are in order except sdf and sdg, which are swapped in slots. After sdf rebuilds I can swap sdg.

Looking at lshw.log and the inventory in iDRAC, it looks like all the drives are in order except sdf and sdh, which are swapped in slots. After sdf rebuilds I can swap sdh.

For a point of clarification: sdg (not sdh), yes?

Having some trouble adding sdf2 back into the array: mdadm: Cannot open /dev/sdf2: Device or resource busy :/

eevans@aqs1014:~$ sudo sgdisk -R /dev/sdf /dev/sde
Warning: Partition table header claims that the size of partition table
entries is 0 bytes, but this program  supports only 128-byte entries.
Adjusting accordingly, but partition table may be garbage.
Warning: Partition table header claims that the size of partition table
entries is 0 bytes, but this program  supports only 128-byte entries.
Adjusting accordingly, but partition table may be garbage.

***************************************************************
Found invalid GPT and valid MBR; converting MBR to GPT format
in memory. 
***************************************************************

The operation has completed successfully.
eevans@aqs1014:~$  sudo sgdisk -G /dev/sdf
The operation has completed successfully.
eevans@aqs1014:~$ sudo sgdisk -p /dev/sde
Warning: Partition table header claims that the size of partition table
entries is 0 bytes, but this program  supports only 128-byte entries.
Adjusting accordingly, but partition table may be garbage.
Warning: Partition table header claims that the size of partition table
entries is 0 bytes, but this program  supports only 128-byte entries.
Adjusting accordingly, but partition table may be garbage.

***************************************************************
Found invalid GPT and valid MBR; converting MBR to GPT format
in memory. 
***************************************************************

Disk /dev/sde: 3750748848 sectors, 1.7 TiB
Model: HFS1T9G32FEH-BA1
Sector size (logical/physical): 512/4096 bytes
Disk identifier (GUID): C9A2BDE8-5AEF-4E79-8C0F-58FD53645C5B
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 3750748814
Partitions will be aligned on 2048-sector boundaries
Total free space is 2669 sectors (1.3 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048        48828415   23.3 GiB    FD00  Linux RAID
   2        48828416      3750748159   1.7 TiB     FD00  Linux RAID
eevans@aqs1014:~$ sudo sgdisk -p /dev/sdf
Disk /dev/sdf: 3750748848 sectors, 1.7 TiB
Model: MZ7KH1T9HAJR0D3 
Sector size (logical/physical): 512/4096 bytes
Disk identifier (GUID): EBCFBF01-C7CF-4A9C-938B-FCC2AA588F1C
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 3750748814
Partitions will be aligned on 2048-sector boundaries
Total free space is 2669 sectors (1.3 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048        48828415   23.3 GiB    FD00  Linux RAID
   2        48828416      3750748159   1.7 TiB     FD00  Linux RAID
eevans@aqs1014:~$  sudo mdadm --manage /dev/md2 --add /dev/sdf2
mdadm: Cannot open /dev/sdf2: Device or resource busy
eevans@aqs1014:~$
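
A "Device or resource busy" on --add usually means something is still claiming the partition, e.g. a stray auto-assembled array from a stale superblock, or the kernel holding the old partition table. A minimal diagnostic sketch (md127 below is a hypothetical stray array name):

# See whether sdf2 was auto-assembled elsewhere or is otherwise claimed
cat /proc/mdstat
lsblk /dev/sdf
sudo mdadm --examine /dev/sdf2

# If a stray array claimed it, stop that array and wipe the stale superblock
sudo mdadm --stop /dev/md127
sudo mdadm --zero-superblock /dev/sdf2

# Re-read the partition table in case the kernel still has the old one
sudo partprobe /dev/sdf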

2:23 PM <jclark-ctr> i am swapping sdf again
2:24 PM <jclark-ctr> swapped with one that was just erased

Ok, the newly erased device was detected as sdi. It has been added, and is rebuilding:

eevans@aqs1014:~$ sudo mdadm --detail /dev/md2
/dev/md2:
           Version : 1.2
     Creation Time : Tue Mar  9 14:18:06 2021
        Raid Level : raid10
        Array Size : 3701655552 (3530.17 GiB 3790.50 GB)
     Used Dev Size : 1850827776 (1765.09 GiB 1895.25 GB)
      Raid Devices : 4
     Total Devices : 3
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Wed Apr 24 19:30:35 2024
             State : clean, degraded, recovering 
    Active Devices : 2
   Working Devices : 3
    Failed Devices : 0
     Spare Devices : 1

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

    Rebuild Status : 0% complete

              Name : aqs1014:2  (local to host aqs1014)
              UUID : 3477f980:543f02f4:52810833:b9cc0dfe
            Events : 682496

    Number   Major   Minor   RaidDevice State
       0       8       66        0      active sync set-A   /dev/sde2
       4       8      130        1      spare rebuilding   /dev/sdi2
       -       0        0        2      removed
       3       8      114        3      active sync set-B   /dev/sdh2
eevans@aqs1014:~$
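
A quick sketch of how the rebuild can be monitored while it runs (either view works):

# Watch recovery progress
watch -n 60 cat /proc/mdstat
sudo mdadm --detail /dev/md2 | grep -E 'Rebuild Status|State'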

The first device is done rebuilding:

eevans@aqs1014:~$ sudo mdadm --detail /dev/md2
/dev/md2:
           Version : 1.2
     Creation Time : Tue Mar  9 14:18:06 2021
        Raid Level : raid10
        Array Size : 3701655552 (3530.17 GiB 3790.50 GB)
     Used Dev Size : 1850827776 (1765.09 GiB 1895.25 GB)
      Raid Devices : 4
     Total Devices : 3
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Fri Apr 26 14:01:01 2024
             State : clean, degraded 
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 0
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : aqs1014:2  (local to host aqs1014)
              UUID : 3477f980:543f02f4:52810833:b9cc0dfe
            Events : 777031

    Number   Major   Minor   RaidDevice State
       0       8       66        0      active sync set-A   /dev/sde2
       4       8      130        1      active sync set-B   /dev/sdi2
       -       0        0        2      removed
       3       8      114        3      active sync set-B   /dev/sdh2
eevans@aqs1014:~$

Ok, so to summarize what has happened so far:

We set out to replace what was sdg, but the wrong device was pulled by accident (sdf was pulled). When that drive was reinstalled (wiped first, I think?), it came back as sdi. I added it to md2 and waited for the rebuild to complete (see comment above). Afterward I rebooted the machine, and that device was reordered to sdf. That device has now failed as well (see attached dmesg log).

eevans@aqs1014:~$ sudo mdadm --detail /dev/md2
/dev/md2:
           Version : 1.2
     Creation Time : Tue Mar  9 14:18:06 2021
        Raid Level : raid10
        Array Size : 3701655552 (3530.17 GiB 3790.50 GB)
     Used Dev Size : 1850827776 (1765.09 GiB 1895.25 GB)
      Raid Devices : 4
     Total Devices : 3
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Mon Apr 29 18:43:29 2024
             State : clean, degraded 
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 1
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : aqs1014:2  (local to host aqs1014)
              UUID : 3477f980:543f02f4:52810833:b9cc0dfe
            Events : 936697

    Number   Major   Minor   RaidDevice State
       0       8       66        0      active sync set-A   /dev/sde2
       -       0        0        1      removed
       -       0        0        2      removed
       3       8       98        3      active sync set-B   /dev/sdg2

       4       8       82        -      faulty   /dev/sdf2
eevans@aqs1014:~$
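
The faulty sdf2 shown above would normally be hot-removed from the array before the drive is physically pulled; a minimal sketch (not a transcript):

# Drop the faulty member ahead of the physical swap
sudo mdadm --manage /dev/md2 --remove /dev/sdf2
# or, equivalently, remove all failed members
sudo mdadm --manage /dev/md2 --remove failed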

Ok, sdf has been replaced again; here is a transcript of what was done to add it back to the array:

eevans@aqs1014:~$ sudo lshw -class disk
  *-disk:0                  
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 0
       bus info: scsi@2:0.0.0
       logical name: /dev/sda
       version: DD01
       serial: KN09N7919I0709R2J
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=61f09f08
  *-disk:1
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 1
       bus info: scsi@3:0.0.0
       logical name: /dev/sdb
       version: DD01
       serial: KN09N7919I0709R2N
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=39e57945
  *-disk:2
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 2
       bus info: scsi@4:0.0.0
       logical name: /dev/sdc
       version: DD01
       serial: KN09N7919I0709R2Z
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=5246d65a
  *-disk:3
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 3
       bus info: scsi@5:0.0.0
       logical name: /dev/sdd
       version: DD01
       serial: KN09N7919I0709R2G
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=0b5feb5a
  *-disk:0
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 0
       bus info: scsi@6:0.0.0
       logical name: /dev/sde
       version: DD01
       serial: KN09N7919I0709R31
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=9937c259
  *-disk:1
       description: ATA Disk
       product: MZ7KH1T9HAJR0D3
       physical id: 1
       bus info: scsi@8:0.0.0
       logical name: /dev/sdf
       version: HF56
       serial: S4KVNA0MB04873
       size: 1788GiB (1920GB)
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096
  *-disk:2
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 0.0.0
       bus info: scsi@9:0.0.0
       logical name: /dev/sdg
       version: DD01
       serial: KN09N7919I0709R2K
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=5de9bd3a
eevans@aqs1014:~$ rm dmesg.log lshw.log 
eevans@aqs1014:~$ sudo lshw -class disk > lshw.out
eevans@aqs1014:~$ # Begin drive replacement
eevans@aqs1014:~$ sudo sgdisk -R /dev/sdf /dev/sde
Warning: Partition table header claims that the size of partition table
entries is 0 bytes, but this program  supports only 128-byte entries.
Adjusting accordingly, but partition table may be garbage.
Warning: Partition table header claims that the size of partition table
entries is 0 bytes, but this program  supports only 128-byte entries.
Adjusting accordingly, but partition table may be garbage.

***************************************************************
Found invalid GPT and valid MBR; converting MBR to GPT format
in memory. 
***************************************************************

The operation has completed successfully.
eevans@aqs1014:~$ sudo sgdisk -G /dev/sdf
The operation has completed successfully.
eevans@aqs1014:~$ sudo sgdisk -p /dev/sde
Warning: Partition table header claims that the size of partition table
entries is 0 bytes, but this program  supports only 128-byte entries.
Adjusting accordingly, but partition table may be garbage.
Warning: Partition table header claims that the size of partition table
entries is 0 bytes, but this program  supports only 128-byte entries.
Adjusting accordingly, but partition table may be garbage.

***************************************************************
Found invalid GPT and valid MBR; converting MBR to GPT format
in memory. 
***************************************************************

Disk /dev/sde: 3750748848 sectors, 1.7 TiB
Model: HFS1T9G32FEH-BA1
Sector size (logical/physical): 512/4096 bytes
Disk identifier (GUID): 88C30307-B20D-4729-B01F-4DB5F9290F80
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 3750748814
Partitions will be aligned on 2048-sector boundaries
Total free space is 2669 sectors (1.3 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048        48828415   23.3 GiB    FD00  Linux RAID
   2        48828416      3750748159   1.7 TiB     FD00  Linux RAID
eevans@aqs1014:~$ sudo sgdisk -p /dev/sdf
Disk /dev/sdf: 3750748848 sectors, 1.7 TiB
Model: MZ7KH1T9HAJR0D3 
Sector size (logical/physical): 512/4096 bytes
Disk identifier (GUID): 6E2D437D-26AC-42C7-A8E8-1E93A531E670
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 3750748814
Partitions will be aligned on 2048-sector boundaries
Total free space is 2669 sectors (1.3 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048        48828415   23.3 GiB    FD00  Linux RAID
   2        48828416      3750748159   1.7 TiB     FD00  Linux RAID
eevans@aqs1014:~$ sudo mdadm --manage /dev/md2 --add /dev/sdf2
mdadm: added /dev/sdf2
eevans@aqs1014:~$ sudo mdadm --detail /dev/md2
/dev/md2:
           Version : 1.2
     Creation Time : Tue Mar  9 14:18:06 2021
        Raid Level : raid10
        Array Size : 3701655552 (3530.17 GiB 3790.50 GB)
     Used Dev Size : 1850827776 (1765.09 GiB 1895.25 GB)
      Raid Devices : 4
     Total Devices : 3
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Mon Apr 29 19:23:59 2024
             State : clean, degraded, recovering 
    Active Devices : 2
   Working Devices : 3
    Failed Devices : 0
     Spare Devices : 1

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

    Rebuild Status : 0% complete

              Name : aqs1014:2  (local to host aqs1014)
              UUID : 3477f980:543f02f4:52810833:b9cc0dfe
            Events : 937972

    Number   Major   Minor   RaidDevice State
       0       8       66        0      active sync set-A   /dev/sde2
       4       8       82        1      spare rebuilding   /dev/sdf2
       -       0        0        2      removed
       3       8       98        3      active sync set-B   /dev/sdg2
eevans@aqs1014:~$ # End
eevans@aqs1014:~$
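
For readability, a commented recap of the add-back procedure used in the transcript above (same commands, no new output):

sudo sgdisk -R /dev/sdf /dev/sde               # replicate sde's partition table onto the new sdf
sudo sgdisk -G /dev/sdf                        # randomize the disk and partition GUIDs on the copy
sudo mdadm --manage /dev/md2 --add /dev/sdf2   # add the RAID partition back; recovery starts
cat /proc/mdstat                               # confirm the rebuild is running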

The rebuild is complete:

eevans@aqs1014:~$ sudo mdadm --detail /dev/md2
/dev/md2:
           Version : 1.2
     Creation Time : Tue Mar  9 14:18:06 2021
        Raid Level : raid10
        Array Size : 3701655552 (3530.17 GiB 3790.50 GB)
     Used Dev Size : 1850827776 (1765.09 GiB 1895.25 GB)
      Raid Devices : 4
     Total Devices : 3
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Tue Apr 30 21:23:36 2024
             State : active, degraded 
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 0
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : aqs1014:2  (local to host aqs1014)
              UUID : 3477f980:543f02f4:52810833:b9cc0dfe
            Events : 996128

    Number   Major   Minor   RaidDevice State
       0       8       66        0      active sync set-A   /dev/sde2
       4       8       82        1      active sync set-B   /dev/sdf2
       -       0        0        2      removed
       3       8       98        3      active sync set-B   /dev/sdg2
eevans@aqs1014:~$

@Jclark-ctr should we wait a few days to see if it sticks, or move on with replacing the final SSD?
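
If we do wait, a minimal sketch of checks that could run in the meantime to see whether the new member sticks (device names as in the detail output above):

# Periodic health checks on the current md2 members
for d in sde sdf sdg; do sudo smartctl -H /dev/$d; done
sudo mdadm --detail /dev/md2 | grep -E 'State|Events'
cat /proc/mdstat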