
Degraded RAID on centrallog1002
Closed, ResolvedPublic

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host centrallog1002. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 7, Working: 7, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] 
md1 : active raid10 sdh1[4] sdg1[2](F) sdf1[1] sde1[0]
      3750481920 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U]
      bitmap: 4/28 pages [16KB], 65536KB chunk

md0 : active raid10 sdb2[0] sda2[1] sdd2[3] sdc2[2]
      1874534400 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 14/14 pages [56KB], 65536KB chunk

unused devices: <none>
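
(Beyond the /proc/mdstat snapshot above, a more detailed view of the degraded array can be pulled with mdadm; a minimal sketch using the device names from the output above, not part of the auto-generated report:)

# show member states, failure counts and the clean/degraded status of md1
sudo mdadm --detail /dev/md1
# per-member metadata for the suspect device, if it still responds at all
sudo mdadm --examine /dev/sdg1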

Event Timeline

dmesg

[21683262.744660] sd 8:0:0:0: [sdg] tag#5 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[21683262.744665] sd 8:0:0:0: [sdg] tag#5 CDB: Read(10) 28 00 00 00 00 00 00 01 00 00
[21683262.744668] blk_update_request: I/O error, dev sdg, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
[21683262.755288] sd 8:0:0:0: [sdg] tag#6 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[21683262.755291] sd 8:0:0:0: [sdg] tag#6 CDB: Read(10) 28 00 00 00 08 00 00 01 00 00
[21683262.755295] blk_update_request: I/O error, dev sdg, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
[21684788.327466] sd 8:0:0:0: [sdg] tag#8 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[21684788.327474] sd 8:0:0:0: [sdg] tag#8 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
[21685080.344452] sd 8:0:0:0: [sdg] tag#12 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[21685080.344457] sd 8:0:0:0: [sdg] tag#12 CDB: Read(10) 28 00 00 00 00 00 00 01 00 00
[21685080.344460] blk_update_request: I/O error, dev sdg, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
[21685080.354839] sd 8:0:0:0: [sdg] tag#7 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[21685080.354844] sd 8:0:0:0: [sdg] tag#7 CDB: Read(10) 28 00 00 00 00 00 00 01 00 00
[21685080.354848] blk_update_request: I/O error, dev sdg, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
[21685080.365543] sd 8:0:0:0: [sdg] tag#21 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[21685080.365552] sd 8:0:0:0: [sdg] tag#21 CDB: Read(10) 28 00 00 00 08 00 00 01 00 00
[21685080.365559] blk_update_request: I/O error, dev sdg, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
[21685799.365665] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
[21685799.365687] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
[21685799.365699] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
[21685799.397610] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
[21685799.397624] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
[21685799.397630] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
[21686588.297640] sd 8:0:0:0: [sdg] tag#1 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[21686588.297644] sd 8:0:0:0: [sdg] tag#1 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
[21686834.550389] sd 8:0:0:0: [sdg] tag#20 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[21686834.550395] sd 8:0:0:0: [sdg] tag#20 CDB: Read(10) 28 00 00 00 00 00 00 01 00 00
[21686834.550399] blk_update_request: I/O error, dev sdg, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
[21686834.561006] sd 8:0:0:0: [sdg] tag#9 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[21686834.561016] sd 8:0:0:0: [sdg] tag#9 CDB: Read(10) 28 00 00 00 00 00 00 01 00 00
[21686834.561024] blk_update_request: I/O error, dev sdg, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
[21686834.571513] sd 8:0:0:0: [sdg] tag#2 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[21686834.571517] sd 8:0:0:0: [sdg] tag#2 CDB: Read(10) 28 00 00 00 08 00 00 01 00 00
[21686834.571523] blk_update_request: I/O error, dev sdg, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
[21688388.267690] sd 8:0:0:0: [sdg] tag#3 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[21688388.267695] sd 8:0:0:0: [sdg] tag#3 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
[21688675.785417] sd 8:0:0:0: [sdg] tag#14 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[21688675.785424] sd 8:0:0:0: [sdg] tag#14 CDB: Read(10) 28 00 00 00 00 00 00 01 00 00
[21688675.785429] blk_update_request: I/O error, dev sdg, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
[21688675.795873] sd 8:0:0:0: [sdg] tag#14 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[21688675.795881] sd 8:0:0:0: [sdg] tag#14 CDB: Read(10) 28 00 00 00 00 00 00 01 00 00
[21688675.795887] blk_update_request: I/O error, dev sdg, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0
[21688675.806451] sd 8:0:0:0: [sdg] tag#19 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[21688675.806461] sd 8:0:0:0: [sdg] tag#19 CDB: Read(10) 28 00 00 00 08 00 00 01 00 00
[21688675.806468] blk_update_request: I/O error, dev sdg, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 0

@Jclark-ctr @VRiley-WMF please coordinate with me or other members of o11y to swap this drive; I believe this is one of the new drives we added recently (a 1.92TB SSD).
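
(For reference, before the physical swap the failed member would normally be marked failed and removed from md1; a sketch of that sequence, illustrative only and not something that has been run yet:)

# mark the member as failed (mdstat already flags it with (F)) and drop it from md1
sudo mdadm --manage /dev/md1 --fail /dev/sdg1 --remove /dev/sdg1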

lshw as requested

centrallog1002:~$ sudo lshw -class disk
  *-disk:0                  
       description: ATA Disk
       product: SSDSC2KG960G8R
       physical id: 0
       bus info: scsi@2:0.0.0
       logical name: /dev/sdb
       version: DL69
       serial: PHYG151001WS960CGN
       size: 894GiB (960GB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=9b7a25f3-2ef5-4b09-9791-7be17b800f2f logicalsectorsize=512 sectorsize=4096
  *-disk:1
       description: ATA Disk
       product: SSDSC2KG960G8R
       physical id: 1
       bus info: scsi@3:0.0.0
       logical name: /dev/sda
       version: DL69
       serial: PHYG1510020S960CGN
       size: 894GiB (960GB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=36cda0cd-60d6-49e6-a867-eceebd56a98a logicalsectorsize=512 sectorsize=4096
  *-disk:2
       description: ATA Disk
       product: SSDSC2KG960G8R
       physical id: 2
       bus info: scsi@4:0.0.0
       logical name: /dev/sdc
       version: DL69
       serial: PHYG1510016J960CGN
       size: 894GiB (960GB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=d06d937b-fcc2-4245-a97e-d9a1a685ea1f logicalsectorsize=512 sectorsize=4096
  *-disk:3
       description: ATA Disk
       product: SSDSC2KG960G8R
       physical id: 3
       bus info: scsi@5:0.0.0
       logical name: /dev/sdd
       version: DL69
       serial: PHYG151001VF960CGN
       size: 894GiB (960GB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=0ffc1f26-34a8-4296-af7a-ba740f302435 logicalsectorsize=512 sectorsize=4096
  *-disk:0
       description: ATA Disk
       product: MZ7KH1T9HAJR0D3
       physical id: 0
       bus info: scsi@6:0.0.0
       logical name: /dev/sde
       version: HF56
       serial: S4KVNA0MB04215
       size: 1788GiB (1920GB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=993046f7-44da-b14b-afde-14af2848c3db logicalsectorsize=512 sectorsize=4096
  *-disk:1
       description: ATA Disk
       product: MZ7KH1T9HAJR0D3
       physical id: 1
       bus info: scsi@7:0.0.0
       logical name: /dev/sdf
       version: HF56
       serial: S4KVNA0MB04226
       size: 1788GiB (1920GB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=993046f7-44da-b14b-afde-14af2848c3db logicalsectorsize=512 sectorsize=4096
  *-disk:2
       description: SCSI Disk
       physical id: 2
       bus info: scsi@8:0.0.0
       logical name: /dev/sdg
       size: 1788GiB (1920GB)
       configuration: logicalsectorsize=512 sectorsize=4096
  *-disk:3
       description: ATA Disk
       product: MZ7KH1T9HAJR0D3
       physical id: 3
       bus info: scsi@9:0.0.0
       logical name: /dev/sdh
       version: HF56
       serial: S4KVNA0MB04228
       size: 1788GiB (1920GB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=993046f7-44da-b14b-afde-14af2848c3db logicalsectorsize=512 sectorsize=4096
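
(Since the failed disk no longer reports a model or serial above, one way to narrow down the physical bay is to list the identities of the surviving disks and work by elimination; a rough sketch, assuming smartctl is installed on the host:)

# print model and serial for each disk; the failed member will just return errors
for d in /dev/sd{a..h}; do
  echo "== $d =="
  sudo smartctl -i "$d" | grep -E 'Device Model|Serial Number'
done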

Hello DC-Ops team, I can be your o11y point of contact for this task as it's easier for us to coordinate timezone wise. Cheers.

Hey @andrea.denisse, we will schedule it once we have figured out a solution for wiping these drives, as this seems to be a problem that has been cropping up recently.

@andrea.denisse We have been having a few issues with software RAIDs; we are trying to pinpoint which slot these drives are in. iDRAC is not listing the drives. I will message you for assistance.

Hello, here's the output of ls -la /dev/disk/by-path/ as requested:

total 0
drwxr-xr-x 2 root root 840 Mar 28 15:46 .
drwxr-xr-x 6 root root 120 Aug 22  2023 ..
lrwxrwxrwx 1 root root   9 Aug 22  2023 pci-0000:00:11.5-ata-3 -> ../../sdb
lrwxrwxrwx 1 root root   9 Aug 22  2023 pci-0000:00:11.5-ata-3.0 -> ../../sdb
lrwxrwxrwx 1 root root  10 Aug 22  2023 pci-0000:00:11.5-ata-3.0-part1 -> ../../sdb1
lrwxrwxrwx 1 root root  10 Aug 22  2023 pci-0000:00:11.5-ata-3.0-part2 -> ../../sdb2
lrwxrwxrwx 1 root root  10 Aug 22  2023 pci-0000:00:11.5-ata-3-part1 -> ../../sdb1
lrwxrwxrwx 1 root root  10 Aug 22  2023 pci-0000:00:11.5-ata-3-part2 -> ../../sdb2
lrwxrwxrwx 1 root root   9 Aug 22  2023 pci-0000:00:11.5-ata-4 -> ../../sda
lrwxrwxrwx 1 root root   9 Aug 22  2023 pci-0000:00:11.5-ata-4.0 -> ../../sda
lrwxrwxrwx 1 root root  10 Aug 22  2023 pci-0000:00:11.5-ata-4.0-part1 -> ../../sda1
lrwxrwxrwx 1 root root  10 Aug 22  2023 pci-0000:00:11.5-ata-4.0-part2 -> ../../sda2
lrwxrwxrwx 1 root root  10 Aug 22  2023 pci-0000:00:11.5-ata-4-part1 -> ../../sda1
lrwxrwxrwx 1 root root  10 Aug 22  2023 pci-0000:00:11.5-ata-4-part2 -> ../../sda2
lrwxrwxrwx 1 root root   9 Aug 22  2023 pci-0000:00:11.5-ata-5 -> ../../sdc
lrwxrwxrwx 1 root root   9 Aug 22  2023 pci-0000:00:11.5-ata-5.0 -> ../../sdc
lrwxrwxrwx 1 root root  10 Aug 22  2023 pci-0000:00:11.5-ata-5.0-part1 -> ../../sdc1
lrwxrwxrwx 1 root root  10 Aug 22  2023 pci-0000:00:11.5-ata-5.0-part2 -> ../../sdc2
lrwxrwxrwx 1 root root  10 Aug 22  2023 pci-0000:00:11.5-ata-5-part1 -> ../../sdc1
lrwxrwxrwx 1 root root  10 Aug 22  2023 pci-0000:00:11.5-ata-5-part2 -> ../../sdc2
lrwxrwxrwx 1 root root   9 Aug 22  2023 pci-0000:00:11.5-ata-6 -> ../../sdd
lrwxrwxrwx 1 root root   9 Aug 22  2023 pci-0000:00:11.5-ata-6.0 -> ../../sdd
lrwxrwxrwx 1 root root  10 Aug 22  2023 pci-0000:00:11.5-ata-6.0-part1 -> ../../sdd1
lrwxrwxrwx 1 root root  10 Aug 22  2023 pci-0000:00:11.5-ata-6.0-part2 -> ../../sdd2
lrwxrwxrwx 1 root root  10 Aug 22  2023 pci-0000:00:11.5-ata-6-part1 -> ../../sdd1
lrwxrwxrwx 1 root root  10 Aug 22  2023 pci-0000:00:11.5-ata-6-part2 -> ../../sdd2
lrwxrwxrwx 1 root root   9 Mar 20 14:35 pci-0000:00:17.0-ata-1 -> ../../sde
lrwxrwxrwx 1 root root   9 Mar 20 14:35 pci-0000:00:17.0-ata-1.0 -> ../../sde
lrwxrwxrwx 1 root root  10 Mar 20 14:37 pci-0000:00:17.0-ata-1.0-part1 -> ../../sde1
lrwxrwxrwx 1 root root  10 Mar 20 14:37 pci-0000:00:17.0-ata-1-part1 -> ../../sde1
lrwxrwxrwx 1 root root   9 Mar 20 14:36 pci-0000:00:17.0-ata-2 -> ../../sdf
lrwxrwxrwx 1 root root   9 Mar 20 14:36 pci-0000:00:17.0-ata-2.0 -> ../../sdf
lrwxrwxrwx 1 root root  10 Mar 20 14:37 pci-0000:00:17.0-ata-2.0-part1 -> ../../sdf1
lrwxrwxrwx 1 root root  10 Mar 20 14:37 pci-0000:00:17.0-ata-2-part1 -> ../../sdf1
lrwxrwxrwx 1 root root   9 Mar 20 14:36 pci-0000:00:17.0-ata-3 -> ../../sdg
lrwxrwxrwx 1 root root   9 Mar 20 14:36 pci-0000:00:17.0-ata-3.0 -> ../../sdg
lrwxrwxrwx 1 root root  10 Mar 20 14:37 pci-0000:00:17.0-ata-3.0-part1 -> ../../sdg1
lrwxrwxrwx 1 root root  10 Mar 20 14:37 pci-0000:00:17.0-ata-3-part1 -> ../../sdg1
lrwxrwxrwx 1 root root   9 Mar 28 15:46 pci-0000:00:17.0-ata-4 -> ../../sdh
lrwxrwxrwx 1 root root   9 Mar 28 15:46 pci-0000:00:17.0-ata-4.0 -> ../../sdh
lrwxrwxrwx 1 root root  10 Mar 28 15:47 pci-0000:00:17.0-ata-4.0-part1 -> ../../sdh1
lrwxrwxrwx 1 root root  10 Mar 28 15:47 pci-0000:00:17.0-ata-4-part1 -> ../../sdh1
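
(The listing above already narrows things down: sdg hangs off pci-0000:00:17.0-ata-3, i.e. the third port of the controller that hosts the four 1.92TB drives. If a compact view helps, a small sketch to map each ATA port to its block device:)

# map each base ata port link to the block device it points at
for p in /dev/disk/by-path/pci-*-ata-?; do
  echo "$(basename "$p") -> $(readlink -f "$p")"
done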

@Jclark-ctr @VRiley-WMF when the task was auto-generated, it showed that disk sdg1 had failed; see the (F) marker in the line from the task description below:

md1 : active raid10 sdh1[4] sdg1[2](F) sdf1[1] sde1[0]

Today when running

cat /proc/mdstat

I get md1 : active raid10 sdh1[4] sdf1[1] sde1[0], so disk sdg1 is now missing. From the lshw output I can tell that this server is configured with two blocks of 4 disks:
the first block is 4x 960GB
the second block is 4x 1.92TB
The first block will be disks 0 to 3 and the second block will be disks 4 to 7. Another piece of information the lshw output gives us is that the missing (bad) disk is a 1.92TB disk, see below:

*-disk:2
   description: SCSI Disk
   physical id: 2
   bus info: scsi@8:0.0.0
   **logical name: /dev/sdg**
   **size: 1788GiB (1920GB)**
   configuration: logicalsectorsize=512 sectorsize=4096

Based on the output above and the one below

md1 : active raid10 sdh1[4] sdf1[1] sde1[0]
      3750481920 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U]

you can tell that the bad disk is disk 3 of the second block. Counting from 0 to 7, this will be the disk in slot 6.
Note:
If you also look carefully at the lshw output for the bad disk, the description is set to SCSI Disk and there is no serial number information for it, whereas for the other, working disks lshw gives you all of that information.

BAD DISK

*-disk:2
       description: SCSI Disk
       physical id: 2
       bus info: scsi@8:0.0.0
       logical name: /dev/sdg
       size: 1788GiB (1920GB)
       configuration: logicalsectorsize=512 sectorsize=4096

GOOD DISK

*-disk:3
     description: ATA Disk
     product: MZ7KH1T9HAJR0D3
     physical id: 3
     bus info: scsi@9:0.0.0
     logical name: /dev/sdh
     version: HF56
     serial: S4KVNA0MB04228
     size: 1788GiB (1920GB)
     capabilities: gpt-1.00 partitioned partitioned:gpt
     configuration: ansiversion=5 guid=993046f7-44da-b14b-afde-14af2848c3db logicalsectorsize=512 sectorsize=4096

Let me know if you have any questions
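
(A quicker way to see the same thing without reading through lshw, if useful; just a sketch:)

# the failed drive is the one with no model/serial left
lsblk -d -o NAME,SIZE,MODEL,SERIAL /dev/sd[a-h]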

Output of the requested commands:

denisse@centrallog1002:~$ sudo sgdisk -R=/dev/sdg /dev/sdh
The operation has completed successfully.
denisse@centrallog1002:~$  sudo sgdisk -G /dev/sdg
The operation has completed successfully.
denisse@centrallog1002:~$  sudo sgdisk -p /dev/sdh
Disk /dev/sdh: 3750748848 sectors, 1.7 TiB
Model: MZ7KH1T9HAJR0D3
Sector size (logical/physical): 512/4096 bytes
Disk identifier (GUID): 993046F7-44DA-B14B-AFDE-14AF2848C3DB
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 2048, last usable sector is 3750748814
Partitions will be aligned on 2048-sector boundaries
Total free space is 0 sectors (0 bytes)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048      3750748814   1.7 TiB     FD00
denisse@centrallog1002:~$ sudo sgdisk -p /dev/sdg
Disk /dev/sdg: 3750748848 sectors, 1.7 TiB
Model: MZ7KH1T9HAJR0D3
Sector size (logical/physical): 512/4096 bytes
Disk identifier (GUID): 1038ED6E-498E-4AB5-9BB3-4A340AE0BF5B
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 2048, last usable sector is 3750748814
Partitions will be aligned on 2048-sector boundaries
Total free space is 0 sectors (0 bytes)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048      3750748814   1.7 TiB     FD00
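
(After replicating the partition table, a quick sanity check that the kernel sees the new sdg1 before adding anything to the array; a sketch, assuming partprobe is available, since this step isn't shown above:)

# re-read the partition table and confirm sdg1 exists
sudo partprobe /dev/sdg
lsblk /dev/sdg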

Mentioned in SAL (#wikimedia-operations) [2024-05-03T17:13:47Z] <denisse> Run sudo mdadm --add /dev/md1 /dev/sdg on centrallog1002 - T363660

denisse@centrallog1002:~$ sudo mdadm --add /dev/md1 /dev/sdg
mdadm: added /dev/sdg
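
(The rebuild can be followed while it runs, e.g.:)

# watch rebuild progress until md1 reports [UUUU] again
watch -n 60 cat /proc/mdstat
sudo mdadm --detail /dev/md1 | grep -E 'State|Rebuild Status'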

The resync finished.

sudo cat /proc/mdstat                                                      centrallog1002: Fri May  3 22:00:07 2024

Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4]
md1 : active raid10 sdg[5] sdh1[4] sdf1[1] sde1[0]
      3750481920 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 4/28 pages [16KB], 65536KB chunk

md0 : active raid10 sdb2[0] sda2[1] sdd2[3] sdc2[2]
      1874534400 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 3/14 pages [12KB], 65536KB chunk

unused devices: <none>

Thanks to @VRiley-WMF and @Jclark-ctr for their help debugging and troubleshooting this issue, it was a hard one! ❤

Thank you all for looking into this!

Generally LGTM, the only thing I would have done differently is to copy the partitioning from an existing disk and then add the first partition to the raid rather than the whole disk (i.e. sdg1 vs sdg), for symmetry with what we usually do.

In other words:

# copy partition table from a working disk into sdg
sfdisk -d /dev/sdh | sfdisk /dev/sdg
# add the first partition (not the whole disk) to the raid
mdadm --add /dev/md1 /dev/sdg1

@fgiunchedi Good to know, thank you. Do you think we should do the syncing again to the new drive?

Good question. I think we're good as-is since the hardware will get refreshed eventually, and if we have to swap the drive earlier than that we can repartition as needed.

Okay, I'll leave them like they are and I'll add documentation on Wikitech on how to sync them properly. Thank you!
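
(For the Wikitech write-up, the end-to-end replacement sequence would look roughly like the following, combining the steps discussed above; a sketch rather than the exact commands run on this host:)

# 1. copy the partition table from a healthy member onto the replacement disk
sudo sfdisk -d /dev/sdh | sudo sfdisk /dev/sdg
# 2. randomize the new disk's GUIDs so they don't collide with the source disk
sudo sgdisk -G /dev/sdg
# 3. add the first partition (not the whole disk) back into the array
sudo mdadm --add /dev/md1 /dev/sdg1
# 4. wait for the resync to finish
cat /proc/mdstat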

I'm not sure exactly what happened, but while working today on {T366555} the md1 raid on centrallog1002 wouldn't come up cleanly. I've assembled it with three disks and then put back the fourth, also correcting the sdg vs sdg1 mismatch in the process.
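
(For the record, the shape of that recovery would be roughly the following; a sketch under the assumption that the whole-disk member needed to be repartitioned, not a transcript of the exact commands:)

# assemble md1 from the three healthy members, starting it even though it is degraded
sudo mdadm --assemble --run /dev/md1 /dev/sde1 /dev/sdf1 /dev/sdh1
# wipe the old whole-disk metadata, repartition to match the others, then re-add the partition
sudo mdadm --zero-superblock /dev/sdg
sudo sfdisk -d /dev/sdh | sudo sfdisk /dev/sdg
sudo mdadm --add /dev/md1 /dev/sdg1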