
Degraded RAID on cloudvirt1018
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host cloudvirt1018. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_megacli
Failed to execute '['/usr/lib/nagios/plugins/check_nrpe', '-4', '-H', 'cloudvirt1018', '-c', 'get_raid_status_megacli']': RETCODE: 2
STDOUT:
CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.

STDERR:
None

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2019-02-13T12:32:17Z] <arturo> T216030 T216004 rebooting cloudvirt1018

After a reboot, I got this in the console:

13:43 <arturo> The following VDs have missing disks: 000
13:43 <arturo> If you proceed (or load the configuration utility), these VDs
13:43 <arturo> There are offline or missing virtual drives with preserved cache.
13:43 <arturo> Please check the cables and ensure that all drives are present.
13:43 <arturo> Press any key to enter the configuration utility.

There are apparently 2 failed disks?

A ticket has been submitted with Dell to replace both SSDs.

You have successfully submitted request SR986376069.

Volans added a subscriber: Volans. Feb 13 2019, 1:22 PM

I've merged the other auto-generated tasks into this one. FYI, as you can see on some of the others (where NRPE didn't time out), the RAID was rebuilding.

The server is back online:

14:43 <gtirloni> slot 2: failed / slot 3: offline. I forced slot 2 offline and then online. I then forced slot 3 online. Rebooted the server and the RAID reported the VD was fine.

/var/lib/nova/instances took some damage today:

root@cloudvirt1018:~# xfs_repair /dev/mapper/tank-data 
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
root@cloudvirt1018:~# mount /dev/mapper/tank-data
[  149.654569] XFS (dm-0): Internal error XFS_WANT_CORRUPTED_GOTO at line 1714 of file /build/linu]
[  149.674517] XFS (dm-0): Internal error xfs_trans_cancel at line 1005 of file /build/linux-4.9-e]
[  149.693065] XFS (dm-0): Corruption of in-memory data detected.  Shutting down filesystem
[  149.702101] XFS (dm-0): Please umount the filesystem and rectify the problem(s)
[  149.710268] XFS (dm-0): Failed to recover intents
mount: mount /dev/mapper/tank-data on /var/lib/nova/instances failed: Structure needs cleaning
root@cloudvirt1018:~# umount /dev/mapper/tank-data 
umount: /dev/mapper/tank-data: not mounted
root@cloudvirt1018:~# xfs_repair /dev/mapper/tank-data 
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
root@cloudvirt1018:~# xfs_repair -L /dev/mapper/tank-data 
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
agf_freeblks 168848958, counted 168848846 in ag 5
agf_longest 165198592, counted 165198480 in ag 5
agf_freeblks 208718165, counted 208718168 in ag 2
agf_freeblks 146988817, counted 146988816 in ag 1
sb_icount 192, counted 640
sb_ifree 183, counted 244
sb_fdblocks 1927369198, counted 1296528935
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
correcting nblocks for inode 218, was 2434648 - counted 2435592
correcting nextents for inode 218, was 1837 - counted 1838
correcting nblocks for inode 241, was 4665219 - counted 4665075
correcting nblocks for inode 246, was 4545731 - counted 4545747
correcting nextents for inode 246, was 51 - counted 52
        - agno = 1
correcting nblocks for inode 4294967508, was 5161779 - counted 5161811
correcting nextents for inode 4294967508, was 246 - counted 247
correcting nblocks for inode 4298385009, was 65 - counted 513
correcting nextents for inode 4298385009, was 10 - counted 13
        - agno = 2
data fork in ino 8589934808 claims free block 600726796
data fork in ino 8589934810 claims free block 601097573
data fork in ino 8590387776 claims free block 601128261
correcting nblocks for inode 8590387776, was 1847669 - counted 1869397
correcting nextents for inode 8590387776, was 1113 - counted 1114
correcting nblocks for inode 8590387786, was 4442544 - counted 4443200
correcting nextents for inode 8590387786, was 11711 - counted 11712
        - agno = 3
data fork in regular inode 12884902219 claims used block 805338394
correcting nextents for inode 12884902219
bad data fork in inode 12884902219
cleared inode 12884902219
data fork in regular inode 12884902225 claims used block 805343092
correcting nextents for inode 12884902225
bad data fork in inode 12884902225
cleared inode 12884902225
correcting nblocks for inode 12884902228, was 3807452 - counted 3807468
correcting nextents for inode 12884902228, was 2657 - counted 2658
data fork in regular inode 12884902247 claims used block 805338214
correcting nextents for inode 12884902247
bad data fork in inode 12884902247
cleared inode 12884902247
data fork in regular inode 12884902248 claims used block 805342570
correcting nextents for inode 12884902248
bad data fork in inode 12884902248
cleared inode 12884902248
        - agno = 4
        - agno = 5
data fork in ino 21474836820 claims free block 601103301
data fork in regular inode 21474836832 claims used block 394004869
correcting nextents for inode 21474836832
bad data fork in inode 21474836832
cleared inode 21474836832
        - agno = 6
bad fwd (right) sibling pointer (saw 1920834781 should be NULLDFSBNO)
	in inode 25769804101 (data fork) bmap btree block 1660378748
bad data fork in inode 25769804101
cleared inode 25769804101
data fork in ino 25769804102 claims free block 1610612793
data fork in ino 25769804102 claims free block 1610612817
data fork in ino 25769804102 claims free block 1610612878
data fork in ino 25769804102 claims free block 1610613009
data fork in ino 25769804102 claims free block 1610613010
data fork in ino 25769804102 claims free block 1610636040
data fork in ino 25769804102 claims free block 1610636041
data fork in ino 25769804102 claims free block 1635944425
data fork in ino 25769804102 claims free block 1635944426
data fork in ino 25769804102 claims free block 1636273329
data fork in ino 25769804102 claims free block 1636273330
data fork in ino 25769804103 claims free block 1610612859
data fork in ino 25769804103 claims free block 1610612860
data fork in ino 25769804103 claims free block 1610613171
data fork in ino 25769804103 claims free block 1610613172
data fork in ino 25769804103 claims free block 1610613213
data fork in ino 25769804103 claims free block 1610613214
data fork in ino 25769804103 claims free block 1610772190
data fork in ino 25769804103 claims free block 1610772191
data fork in ino 25769804107 claims free block 1610612877
data fork in ino 25769804107 claims free block 1610613627
data fork in ino 25769804107 claims free block 1610613628
data fork in ino 25769804107 claims free block 1635934456
data fork in ino 25769804107 claims free block 1635934457
data fork in regular inode 25769804112 claims used block 1610612766
correcting nextents for inode 25769804112
bad data fork in inode 25769804112
cleared inode 25769804112
correcting nblocks for inode 25769804122, was 4415139 - counted 4415171
data fork in regular inode 25769804124 claims used block 1610612750
correcting nextents for inode 25769804124
bad data fork in inode 25769804124
cleared inode 25769804124
data fork in regular inode 25769804130 claims used block 1610613162
correcting nextents for inode 25769804130
bad data fork in inode 25769804130
cleared inode 25769804130
data fork in regular inode 25769804132 claims used block 1610612751
correcting nextents for inode 25769804132
bad data fork in inode 25769804132
cleared inode 25769804132
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
entry "disk" in shortform directory 25769804096 references free inode 25769804101
junking entry "disk" in directory inode 25769804096
corrected i8 count in directory 25769804096, was 4, now 3
        - agno = 7
data fork in ino 25769804102 claims dup extent, off - 0, start - 1610612762, cnt 5
correcting nextents for inode 25769804102
bad data fork in inode 25769804102
cleared inode 25769804102
data fork in ino 25769804103 claims dup extent, off - 0, start - 1610612750, cnt 2
correcting nextents for inode 25769804103
bad data fork in inode 25769804103
cleared inode 25769804103
entry "libvirt.xml" in shortform directory 25769804109 references free inode 25769804112
junking entry "libvirt.xml" in directory inode 25769804109
corrected i8 count in directory 25769804109, was 4, now 3
entry "libvirt.xml" in shortform directory 25769804116 references free inode 25769804124
junking entry "libvirt.xml" in directory inode 25769804116
corrected i8 count in directory 25769804116, was 4, now 3
entry "console.log.2" at block 0 offset 256 in directory inode 12884902218 references free inode 19
	clearing inode number in entry at offset 256...
entry "console.log" at block 0 offset 1888 in directory inode 12884902218 references free inode 125
	clearing inode number in entry at offset 1888...
entry "console.log.1" at block 0 offset 568 in directory inode 12884902246 references free inode 17
	clearing inode number in entry at offset 568...
entry "console.log.0" at block 0 offset 592 in directory inode 12884902246 references free inode 18
	clearing inode number in entry at offset 592...
entry "disk" in shortform directory 21474836827 references free inode 21474836832
junking entry "disk" in directory inode 21474836827
corrected i8 count in directory 21474836827, was 4, now 3
data fork in ino 4298385009 claims dup extent, off - 0, start - 805338214, cnt 2
entry "console.log" in shortform directory 25769804129 references free inode 25769804130
correcting nextents for inode 4298385009
junking entry "console.log" in directory inode 25769804129
bad data fork in inode 4298385009
entry "libvirt.xml" in shortform directory 25769804129 references free inode 25769804132
cleared inode 4298385009
junking entry "libvirt.xml" in directory inode 25769804129
corrected i8 count in directory 25769804129, was 4, now 2
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
entry "console.log" in directory inode 4298385001 points to free inode 4298385009
bad hash table for directory inode 4298385001 (no data entry): rebuilding
rebuilding directory inode 4298385001
bad hash table for directory inode 12884902218 (no data entry): rebuilding
rebuilding directory inode 12884902218
bad hash table for directory inode 12884902246 (no data entry): rebuilding
rebuilding directory inode 12884902246
entry "console.log.0" in directory inode 25769804098 points to free inode 25769804102
entry "console.log" in directory inode 25769804098 points to free inode 25769804103
bad hash table for directory inode 25769804098 (no data entry): rebuilding
rebuilding directory inode 25769804098
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Metadata corruption detected at block 0x180000040/0x1000
libxfs_writebufr: write verifer failed on bno 0x180000040/0x1000
Metadata corruption detected at block 0x180000038/0x1000
libxfs_writebufr: write verifer failed on bno 0x180000038/0x1000
Metadata corruption detected at block 0x801a16f0/0x1000
libxfs_writebufr: write verifer failed on bno 0x801a16f0/0x1000
Metadata corruption detected at block 0x300000060/0x1000
libxfs_writebufr: write verifer failed on bno 0x300000060/0x1000
Metadata corruption detected at block 0x300000060/0x1000
libxfs_writebufr: write verifer failed on bno 0x300000060/0x1000
Metadata corruption detected at block 0x180000040/0x1000
libxfs_writebufr: write verifer failed on bno 0x180000040/0x1000
Metadata corruption detected at block 0x180000038/0x1000
libxfs_writebufr: write verifer failed on bno 0x180000038/0x1000
Metadata corruption detected at block 0x801a16f0/0x1000
libxfs_writebufr: write verifer failed on bno 0x801a16f0/0x1000
done

It's mounted and VMs are starting off of it again. It's not yet clear whether there was actual data loss.
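One way to gauge the damage after the fact is to grep the saved xfs_repair output for the two messages that imply data loss: cleared inodes and junked directory entries. This is a sketch; `repair.log` is a hypothetical path, not a file that actually exists on this host.

```shell
# Sketch: list likely-damaged files from saved xfs_repair output.
# "cleared inode N" means the inode's contents are gone; "junking entry"
# means a directory entry pointing at such an inode was removed.
grep -oE 'cleared inode [0-9]+' repair.log | sort -u
grep -oE 'junking entry "[^"]+" in directory inode [0-9]+' repair.log
```

On this host the junked entries were `disk`, `libvirt.xml`, and `console.log*` files, i.e. VM images and their metadata.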

GTirloni added a subscriber: RobH. Feb 13 2019, 7:15 PM

@RobH @Cmjohnson

Slots 2 & 3 were part of the outage today. Even though they show as online, could we replace them? They are likely part of the same RAID pair, so we'll need to replace them one at a time.

I'm not sure what kind of questions Dell asks before agreeing to replace a failed disk. These two could be behaving badly under load and/or have various SMART metrics in a pre-fail state.

colewhite triaged this task as Normal priority. Feb 13 2019, 8:31 PM

Looking at the RAID controller firmware logs, it seems we have consistent issues with all disks (which could point to a faulty controller, cable, or enclosure). What do you think?

SAS devices with resets / invalid PD sense (cX is slot):

     76 500056b3f5bd95c6
     76 500056b3f5bd95c1
     72 500056b3f5bd95c0
     68 500056b3f5bd95c8
     66 500056b3f5bd95c2
     65 500056b3f5bd95c4
     60 500056b3f5bd95c5
     56 500056b3f5bd95c7
     52 500056b3f5bd95c3
     46 500056b3f5bd95c9
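A per-device count like the one above can be produced with a standard uniq/sort pipeline. This is a hypothetical sketch of how it might have been generated, not the actual command used: `/tmp/perclog.txt` and the WWN pattern are assumptions.

```shell
# Sketch: count controller-log events per SAS address (WWN), most frequent
# first, assuming the firmware log was dumped to /tmp/perclog.txt and each
# relevant line contains the device's SAS WWN.
grep -oE '500056b3f5bd95c[0-9a-f]' /tmp/perclog.txt | sort | uniq -c | sort -rn
```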
RobH added a subscriber: Andrew. Edited Feb 13 2019, 8:56 PM

Update from IRC chat:

Dell support site for this system's firmware updates: https://www.dell.com/support/home/us/en/04/product-support/servicetag/31s8kh2/drivers

I'd recommend we update the firmware on the RAID controller for this one, as there have been 3 urgent updates since the currently installed firmware, and @GTirloni's comment is quite valid: it seems unlikely that we have that many bad SSDs. The alternative is a bad backplane or RAID controller.

cloudvirt1018 firmware details:
https://www.dell.com/support/home/us/en/04/drivers/driversdetails?driverId=T9YX9&osCode=W12R2&productCode=poweredge-r630
bios installed: 2.4.3
bios newest: 2.9.1

https://www.dell.com/support/home/us/en/04/drivers/driversdetails?driverId=1HY5M&osCode=W12R2&productCode=poweredge-r630
idrac installed: 2.40.40.40
idrac newest: 2.61.60.60

https://www.dell.com/support/home/us/en/04/drivers/driversdetails?driverId=F675Y&osCode=W12R2&productCode=poweredge-r630
perc installed: 25.5.0.0018 (3 urgent patches old)
perc newest: 25.5.5.0005
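The installed-vs-newest gap above can be checked mechanically with GNU `sort -V` (natural version sort). The version strings are copied from the comment; the helper function and its name are illustrative.

```shell
# Sketch: report components whose installed version sorts before the newest.
# check NAME INSTALLED NEWEST
check() {
  if [ "$2" != "$3" ] && [ "$(printf '%s\n%s\n' "$2" "$3" | sort -V | head -n1)" = "$2" ]; then
    echo "$1: $2 -> $3 (update available)"
  fi
}
check bios  2.4.3       2.9.1
check idrac 2.40.40.40  2.61.60.60
check perc  25.5.0.0018 25.5.5.0005
```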

@Andrew notes this will likely take until tomorrow to drain from use and allow troubleshooting. Once we update the firmware, it would be interesting to see if the failures drop off.

Edit addition: I'm happy to apply firmware updates when needed.

This host is now fully drained, so the dcops folks can do whatever, whenever.

@GTirloni The disks in slots 2 and 3 have been replaced.

Return shipping info
USPS 9202 3946 5301 2441 0207 95
FEDEX 9611918 2393026 77770812

All good, thank you!

GTirloni closed this task as Resolved. Feb 20 2019, 7:01 PM