
Spontaneous reboot of ms-be2045
Open, Medium, Public

Description

15:51 - <icinga-wm> PROBLEM - Host ms-be2045 is DOWN: PING CRITICAL - Packet loss =100%

System rebooted, but the disks did not come back up correctly. racadm getsel reports two issues:
Record: 2
Date/Time: 09/13/2021 15:47:31
Source: system
Severity: Critical
Description: A fatal error was detected on a component at bus 59 device 0 function 0.

Record: 4
Date/Time: 09/13/2021 15:47:31
Source: system
Severity: Critical
Description: A fatal error was detected on a component at bus 58 device 2 function 0.

From lspci (which shows bus numbers in hex: bus 59 = 0x3b, bus 58 = 0x3a), I think these correspond to:
3b:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02)
and
3a:02.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port 1C (rev 04)
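
The mapping from the SEL records to the lspci entries above is a decimal-to-hex conversion of the bus, device and function numbers. A minimal sketch of that conversion follows, assuming the SEL description always uses the "bus N device N function N" wording seen in this task; the helper names are hypothetical and not part of any Dell or pciutils tooling.

```python
import re

# Matches the wording seen in the SEL records in this task (assumed stable).
SEL_BDF = re.compile(r"bus (\d+) device (\d+) function (\d+)")


def sel_to_lspci(bus: int, device: int, function: int) -> str:
    """Format decimal bus/device/function as an lspci-style bb:dd.f address."""
    if not (0 <= bus <= 0xFF and 0 <= device <= 0x1F and 0 <= function <= 0x7):
        raise ValueError("bus/device/function outside the valid PCI range")
    return f"{bus:02x}:{device:02x}.{function:x}"


def sel_description_to_lspci(description: str) -> str:
    """Extract the decimal numbers from a SEL description and convert them."""
    match = SEL_BDF.search(description)
    if match is None:
        raise ValueError("no bus/device/function found in description")
    bus, device, function = (int(g) for g in match.groups())
    return sel_to_lspci(bus, device, function)


if __name__ == "__main__":
    for text in (
        "A fatal error was detected on a component at bus 59 device 0 function 0.",
        "A fatal error was detected on a component at bus 58 device 2 function 0.",
    ):
        print(sel_description_to_lspci(text))
```

The resulting addresses can then be looked up in the lspci output, as was done by hand above.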

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2021-09-13T15:20:19Z] <Emperor> rebooting ms-be2045 to see if that brings the disk back properly T290881

On reboot, the disks came back, but many of the filesystems are unhappy:
mvernon@ms-be2045:~$ sudo dmesg | grep 'Shutting down filesystem'
[ 18.244602] XFS (sda3): Corruption of in-memory data detected. Shutting down filesystem
[ 18.724649] XFS (sdf1): Corruption of in-memory data detected. Shutting down filesystem
[ 19.448076] XFS (sdg1): Corruption of in-memory data detected. Shutting down filesystem
[ 20.610420] XFS (sdj1): I/O Error Detected. Shutting down filesystem
[ 20.745769] XFS (sdn1): I/O Error Detected. Shutting down filesystem
[ 20.938081] XFS (sdi1): I/O Error Detected. Shutting down filesystem
[ 23.719222] XFS (sdh1): I/O Error Detected. Shutting down filesystem
[ 24.802161] XFS (sde1): I/O Error Detected. Shutting down filesystem
[ 30.091057] XFS (sdm1): I/O Error Detected. Shutting down filesystem
[ 31.761276] XFS (sdc1): I/O Error Detected. Shutting down filesystem
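
To see at a glance which devices shut down for which reason, the dmesg lines above can be grouped by the message that precedes "Shutting down filesystem". This is only a sketch: the regular expression is written against the exact message format shown in this task, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Pattern based on the dmesg lines quoted in this task (assumed format).
XFS_SHUTDOWN = re.compile(
    r"XFS \((?P<dev>\w+)\): (?P<reason>.+?)\. Shutting down filesystem"
)


def group_xfs_shutdowns(dmesg_text: str) -> dict:
    """Map each shutdown reason to the list of devices that reported it."""
    groups = defaultdict(list)
    for match in XFS_SHUTDOWN.finditer(dmesg_text):
        groups[match.group("reason")].append(match.group("dev"))
    return dict(groups)


# A few of the lines from the output above, used as sample input.
SAMPLE = """\
[ 18.244602] XFS (sda3): Corruption of in-memory data detected. Shutting down filesystem
[ 18.724649] XFS (sdf1): Corruption of in-memory data detected. Shutting down filesystem
[ 20.610420] XFS (sdj1): I/O Error Detected. Shutting down filesystem
[ 31.761276] XFS (sdc1): I/O Error Detected. Shutting down filesystem
"""

if __name__ == "__main__":
    for reason, devices in group_xfs_shutdowns(SAMPLE).items():
        print(f"{reason}: {', '.join(devices)}")
```

Applied to the full output above, this would put sda3, sdf1 and sdg1 under the in-memory corruption message and the remaining seven devices under the I/O error message.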

MatthewVernon added a project: ops-codfw.

Hi @Papaul, this system seems to have had one or more hardware faults and is (just) still within its warranty. Could you get the hardware checked out, please? Thanks :)

Mentioned in SAL (#wikimedia-operations) [2021-09-14T08:05:07Z] <godog> wipe non-os partitions from ms-be2045 - T290881

Mentioned in SAL (#wikimedia-operations) [2021-09-14T08:25:04Z] <godog> poweroff ms-be2045 and set it as failed in netbox - T290881

Change 720917 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/software/swift-ring@master] codfw-prod: remove host ms-be2045

https://gerrit.wikimedia.org/r/720917

Change 720917 merged by MVernon:

[operations/software/swift-ring@master] codfw-prod: remove host ms-be2045

https://gerrit.wikimedia.org/r/720917

Mentioned in SAL (#wikimedia-operations) [2021-09-14T09:09:10Z] <Emperor> swift rebalance to remove h/w faulty host ms-be2045 T290881

Mentioned in SAL (#wikimedia-operations) [2021-09-15T06:57:54Z] <elukey> shutdown ms-be2045 (again) after seeing T290881

Papaul triaged this task as Medium priority. Mon, Sep 27, 3:06 AM