Degraded RAID on ms-be2028
Closed, Resolved · Public

Authored By

	ops-monitoring-bot
	Apr 4 2021, 3:55 AM

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (hpssacli) was detected on host ms-be2028. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: Slot 3: Failed: 1I:1:2 - OK: 1I:1:1, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Battery/Capacitor: OK
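The check line above packs the controller slot, the failed and healthy physical drives, and the component states into one string. A minimal sketch of splitting it apart, assuming the fields are always separated by " - " and each field has the form "Name: value" (the function name `parse_raid_status` is hypothetical and not part of the actual Icinga plugin):

```python
import re

def parse_raid_status(line):
    """Split a RAID check line into severity, slot, drive lists, and components."""
    m = re.match(r"(?P<severity>\w+): Slot (?P<slot>\d+): (?P<rest>.*)", line)
    if m is None:
        raise ValueError("unrecognised status line")
    result = {
        "severity": m["severity"],
        "slot": int(m["slot"]),
        "failed": [],
        "ok": [],
        "components": {},
    }
    for field in m["rest"].split(" - "):
        # Drive IDs such as 1I:1:2 contain ":" but never ": ", so this is safe.
        key, _, value = field.partition(": ")
        if key == "Failed":
            result["failed"] = [d.strip() for d in value.split(",")]
        elif key == "OK":
            result["ok"] = [d.strip() for d in value.split(",")]
        else:
            result["components"][key] = value
    return result

status = parse_raid_status(
    "CRITICAL: Slot 3: Failed: 1I:1:2 - OK: 1I:1:1, 1I:1:3 - Controller: OK"
    " - Cache: Temporarily Disabled - Battery/Capacitor: OK"
)
print(status["failed"], status["components"]["Cache"])
```

The sample line in the sketch is shortened; the real line lists 13 healthy drives.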

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-hpssacli

Smart Array P840 in Slot 3

   array A

      Logical Drive: 1
         Size: 447.1 GB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Disabled
         Disk Name: /dev/sda 
         Mount Points: /srv/swift-storage/sda4 297.2 GB Partition Number 5, /srv/swift-storage/sda3 93.1 GB Partition Number 4
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: HPE SSD Smart Path

   array B

      Logical Drive: 2
         Size: 447.1 GB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Disabled
         Disk Name: /dev/sdb 
         Mount Points: /srv/swift-storage/sdb4 297.2 GB Partition Number 5, /srv/swift-storage/sdb3 93.1 GB Partition Number 4
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: HPE SSD Smart Path

   array C

      Logical Drive: 3
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdc 
         Mount Points: /srv/swift-storage/sdc1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array D

      Logical Drive: 4
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdd 
         Mount Points: /srv/swift-storage/sdd1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array E

      Logical Drive: 5
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sde 
         Mount Points: /srv/swift-storage/sde1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array F

      Logical Drive: 6
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdf 
         Mount Points: /srv/swift-storage/sdf1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array G

      Logical Drive: 7
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdg 
         Mount Points: /srv/swift-storage/sdg1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array H

      Logical Drive: 8
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: Failed
         MultiDomain Status: OK
         Caching:  Enabled
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array I

      Logical Drive: 9
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdh 
         Mount Points: /srv/swift-storage/sdi1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array J

      Logical Drive: 10
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdi 
         Mount Points: /srv/swift-storage/sdj1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array K

      Logical Drive: 11
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdj 
         Mount Points: /srv/swift-storage/sdk1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array L

      Logical Drive: 12
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdk 
         Mount Points: /srv/swift-storage/sdl1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array M

      Logical Drive: 13
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdl 
         Mount Points: /srv/swift-storage/sdm1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array N

      Logical Drive: 14
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdm 
         Mount Points: /srv/swift-storage/sdn1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache
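With fourteen single-disk arrays in the report, the one failed logical drive is easy to miss by eye. The following is a rough sketch of scanning output in the format shown above for any logical drive whose status is not OK. It assumes the layout stays as printed here ("array X", then "Logical Drive: N", then a "Status:" line); the helper name `failed_logical_drives` is made up for illustration.

```python
def failed_logical_drives(report):
    """Return (array, logical drive number, status) for every non-OK logical drive."""
    failed = []
    array = drive = None
    for raw in report.splitlines():
        line = raw.strip()
        if line.startswith("array "):
            array = line.split()[1]
        elif line.startswith("Logical Drive:"):
            drive = int(line.split(":", 1)[1])
        elif line.startswith("Status:"):
            # Only the plain "Status:" line is matched; "MultiDomain Status:"
            # and "OS Status:" start with a different prefix and are skipped.
            status = line.split(":", 1)[1].strip()
            if status != "OK":
                failed.append((array, drive, status))
    return failed

sample = """
   array G

      Logical Drive: 7
         Status: OK
         MultiDomain Status: OK
         OS Status: LOCKED

   array H

      Logical Drive: 8
         Status: Failed
         MultiDomain Status: OK
"""
print(failed_logical_drives(sample))
```

In this report the failed logical drive (array H) has no "Disk Name" or "Mount Points" line, and the device names of the following arrays are shifted by one letter relative to their mount points (for example /dev/sdh with /srv/swift-storage/sdi1).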

Related Objects

Mentioned In: T312595: ms-be2028 on stretch

Event Timeline

ops-monitoring-bot created this task. Apr 4 2021, 3:55 AM

Restricted Application added a subscriber: Aklapper. Apr 4 2021, 3:55 AM

jijiki triaged this task as High priority. Apr 5 2021, 10:02 AM

jijiki added a project: DBA.

jijiki added a subscriber: fgiunchedi.

@jijiki As far as I know we don't own the ms-be hosts yet, so I'm not sure the DBA tag is appropriate. I will leave that up to @LSobanski.

@Marostegui @LSobanski I will leave it to you :)

We don't. Removing the DBA tag (but staying subscribed).

@Papaul please replace the failed 4TB disk; the LED should be blinking. Thank you!

@fgiunchedi disk replaced

RhinosF1 subscribed. Apr 6 2021, 3:02 PM

Thank you @Papaul!

@Papaul I'm running into trouble with the disk that I haven't seen before (XFS crashes after a while, log below). Can we try another spare disk, just to rule out the disk itself being faulty (or just plain old)? Thank you!

[ 2176.368643] INFO: task xfs_db:23429 blocked for more than 120 seconds.
[ 2176.397800]       Not tainted 4.9.0-15-amd64 #1 Debian 4.9.258-1
[ 2176.424891] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2176.460097] xfs_db          D    0 23429      1 0x00000000
[ 2176.460103]  0000000000000086 ffff88830d291400 ffff887244e3c400 ffff888307d55100
[ 2176.460106]  ffff8884ff098a00 ffff88737a53c100 ffffb25628c23d00 ffffffffa3a1c169
[ 2176.460109]  ffffffffa3639ad0 00ff8884f4c862c0 ffff8884ff098a00 ffff8884f4c865c8
[ 2176.460113] Call Trace:
[ 2176.460123]  [<ffffffffa3a1c169>] ? __schedule+0x239/0x6f0
[ 2176.460128]  [<ffffffffa3639ad0>] ? wb_queue_work+0xc0/0xd0
[ 2176.460130]  [<ffffffffa3a1c652>] ? schedule+0x32/0x80
[ 2176.460132]  [<ffffffffa3639b3e>] ? wb_wait_for_completion+0x5e/0x90
[ 2176.460137]  [<ffffffffa34be070>] ? prepare_to_wait_event+0xf0/0xf0
[ 2176.460139]  [<ffffffffa363bcc8>] ? __writeback_inodes_sb_nr+0x98/0xe0
[ 2176.460144]  [<ffffffffa3642266>] ? __sync_filesystem+0x46/0x50
[ 2176.460146]  [<ffffffffa36424f6>] ? sync_filesystem+0x26/0x50
[ 2176.460149]  [<ffffffffa364a5ff>] ? fsync_bdev+0x1f/0x50
[ 2176.460154]  [<ffffffffa3718713>] ? blkdev_ioctl+0x5a3/0x960
[ 2176.460157]  [<ffffffffa364a019>] ? block_ioctl+0x39/0x40
[ 2176.460161]  [<ffffffffa3621832>] ? do_vfs_ioctl+0xa2/0x620
[ 2176.460165]  [<ffffffffa361280f>] ? SYSC_newfstat+0x2f/0x50
[ 2176.460167]  [<ffffffffa3621e24>] ? SyS_ioctl+0x74/0x80
[ 2176.460172]  [<ffffffffa3403b7d>] ? do_syscall_64+0x8d/0x100
[ 2176.460177]  [<ffffffffa3a2104e>] ? entry_SYSCALL_64_after_swapgs+0x58/0xc6
[ 2916.356849] perf: interrupt took too long (6150 > 6147), lowering kernel.perf_event_max_sample_rate to 32500
[ 6251.572127] XFS (sdh1): Metadata corruption detected at xfs_attr3_leaf_write_verify+0xe8/0x100 [xfs], xfs_attr3_leaf block 0x1b634abf8
[ 6251.626761] XFS (sdh1): Unmount and run xfs_repair
[ 6251.648340] XFS (sdh1): First 64 bytes of corrupted metadata buffer:
[ 6251.676916] ffff886e6fcfc000: 00 00 00 00 00 00 00 00 3b ee 00 00 00 00 00 00  ........;.......
[ 6251.716164] ffff886e6fcfc010: 00 00 00 01 b6 34 ab f8 00 00 00 00 00 00 00 00  .....4..........
[ 6251.755531] ffff886e6fcfc020: 80 10 2e 59 e4 3b 43 cc 88 eb 60 cc f9 4b e0 5e  ...Y.;C...`..K.^
[ 6251.794865] ffff886e6fcfc030: 00 00 00 01 e1 91 12 1a 00 00 00 00 10 00 00 00  ................
[ 6251.834257] XFS (sdh1): xfs_do_force_shutdown(0x8) called from line 1375 of file /build/linux-oA5nb9/linux-4.9.258/fs/xfs/xfs_buf.c.  Return address = 0xffffffffc0d466aa
[ 6252.058676] XFS (sdh1): Corruption of in-memory data detected.  Shutting down filesystem
[ 6252.087329] XFS (sdh1): writeback error on sector 29492736
[ 6252.087373] XFS (sdh1): writeback error on sector 759804640
[ 6252.087383] XFS (sdh1): writeback error on sector 3201247008
[ 6252.087389] XFS (sdh1): writeback error on sector 4182614792
[ 6252.087393] XFS (sdh1): writeback error on sector 4672487088
[ 6252.087427] XFS (sdh1): writeback error on sector 4914266096
[ 6252.087467] XFS (sdh1): writeback error on sector 5157651848
[ 6252.087471] XFS (sdh1): writeback error on sector 5406440952
[ 6252.087569] XFS (sdh1): writeback error on sector 2265886248
[ 6252.087573] XFS (sdh1): writeback error on sector 2959119840
[ 6252.349662] XFS (sdh1): Please umount the filesystem and rectify the problem(s)
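To compare the replacement disk against another spare, it can help to reduce a log like the one above to a few numbers per device. This is only a sketch, assuming dmesg-style lines of the form "[ timestamp] XFS (device): message" as seen here; the name `summarize_xfs_errors` is hypothetical.

```python
import re
from collections import Counter

XFS_LINE = re.compile(r"\[\s*(?P<ts>\d+\.\d+)\] XFS \((?P<dev>[^)]+)\): (?P<msg>.*)")

def summarize_xfs_errors(log_text):
    """Return (first corruption timestamp per device, writeback error count per device)."""
    first_corruption = {}
    writeback_errors = Counter()
    for line in log_text.splitlines():
        m = XFS_LINE.match(line.strip())
        if m is None:
            continue  # not an XFS message (e.g. hung-task trace, perf notice)
        dev, msg = m["dev"], m["msg"]
        if "corruption" in msg.lower():
            # Keep only the earliest corruption report for each device.
            first_corruption.setdefault(dev, float(m["ts"]))
        if msg.startswith("writeback error on sector"):
            writeback_errors[dev] += 1
    return first_corruption, writeback_errors

log_sample = """\
[ 2916.356849] perf: interrupt took too long (6150 > 6147), lowering kernel.perf_event_max_sample_rate to 32500
[ 6251.572127] XFS (sdh1): Metadata corruption detected at xfs_attr3_leaf_write_verify+0xe8/0x100 [xfs], xfs_attr3_leaf block 0x1b634abf8
[ 6252.087329] XFS (sdh1): writeback error on sector 29492736
[ 6252.087373] XFS (sdh1): writeback error on sector 759804640
"""
first, counts = summarize_xfs_errors(log_sample)
print(first, dict(counts))
```

Run against the full log above, this would report the first corruption on sdh1 at about 6251 seconds of uptime, followed by ten writeback errors.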

Marostegui unsubscribed. Apr 8 2021, 10:21 AM

fgiunchedi merged a task: T279644: Degraded RAID on ms-be2028. Apr 8 2021, 11:04 AM

Disk replaced

fgiunchedi added a project: User-fgiunchedi. Apr 13 2021, 2:11 PM

fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board. Apr 13 2021, 2:21 PM

Thank you @Papaul, all good

Dzahn mentioned this in T312595: ms-be2028 on stretch. Jul 7 2022, 9:18 PM
