Spontaneous reboot of ms-be2045
Closed, Resolved (Public)

Description

15:51 - <icinga-wm> PROBLEM - Host ms-be2045 is DOWN: PING CRITICAL - Packet loss =100%

The system rebooted, but the disks did not come back up correctly. racadm getsel reports two issues:
Record: 2
Date/Time: 09/13/2021 15:47:31
Source: system
Severity: Critical
Description: A fatal error was detected on a component at bus 59 device 0 function 0.

Record: 4
Date/Time: 09/13/2021 15:47:31
Source: system
Severity: Critical
Description: A fatal error was detected on a component at bus 58 device 2 function 0.

From lspci, I think these correspond to:
3b:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02)
and
3a:02.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port 1C (rev 04)
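
The mapping above can be checked with a short sketch. It assumes the SEL records give bus, device and function as decimal numbers, while lspci prints them in hexadecimal as `bus:device.function`; the helper name `sel_to_lspci` is made up for illustration.

```python
def sel_to_lspci(bus: int, device: int, function: int) -> str:
    """Format decimal bus/device/function numbers as an lspci-style address."""
    # lspci shows a two-digit hex bus, a two-digit hex device and a one-digit function.
    return f"{bus:02x}:{device:02x}.{function:x}"


# The two locations reported by the SEL (assumed to be decimal).
for bus, device, function in [(59, 0, 0), (58, 2, 0)]:
    print(sel_to_lspci(bus, device, function))
```

Under that assumption, bus 59 device 0 function 0 becomes 3b:00.0 (the RAID controller) and bus 58 device 2 function 0 becomes 3a:02.0 (the PCI bridge).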

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2021-09-13T15:20:19Z] <Emperor> rebooting ms-be2045 to see if that brings the disk back properly T290881

On reboot, the disks came back, but many of the filesystems are unhappy:
mvernon@ms-be2045:~$ sudo dmesg | grep 'Shutting down filesystem'
[ 18.244602] XFS (sda3): Corruption of in-memory data detected. Shutting down filesystem
[ 18.724649] XFS (sdf1): Corruption of in-memory data detected. Shutting down filesystem
[ 19.448076] XFS (sdg1): Corruption of in-memory data detected. Shutting down filesystem
[ 20.610420] XFS (sdj1): I/O Error Detected. Shutting down filesystem
[ 20.745769] XFS (sdn1): I/O Error Detected. Shutting down filesystem
[ 20.938081] XFS (sdi1): I/O Error Detected. Shutting down filesystem
[ 23.719222] XFS (sdh1): I/O Error Detected. Shutting down filesystem
[ 24.802161] XFS (sde1): I/O Error Detected. Shutting down filesystem
[ 30.091057] XFS (sdm1): I/O Error Detected. Shutting down filesystem
[ 31.761276] XFS (sdc1): I/O Error Detected. Shutting down filesystem
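
To see at a glance which devices failed for which reason, the dmesg output above could be grouped with a small sketch like the following. The regular expression and the function name `group_xfs_shutdowns` are assumptions made for illustration, based only on the line format shown in this comment.

```python
import re
from collections import defaultdict

# Matches lines such as:
# [ 20.610420] XFS (sdj1): I/O Error Detected. Shutting down filesystem
SHUTDOWN_RE = re.compile(
    r"XFS \((?P<dev>\w+)\): (?P<reason>.+?)\.\s+Shutting down filesystem"
)


def group_xfs_shutdowns(dmesg_text: str) -> dict:
    """Return a mapping of shutdown reason to the list of affected devices."""
    groups = defaultdict(list)
    for line in dmesg_text.splitlines():
        match = SHUTDOWN_RE.search(line)
        if match:
            groups[match.group("reason")].append(match.group("dev"))
    return dict(groups)


sample = """\
[ 18.244602] XFS (sda3): Corruption of in-memory data detected. Shutting down filesystem
[ 20.610420] XFS (sdj1): I/O Error Detected. Shutting down filesystem
[ 20.745769] XFS (sdn1): I/O Error Detected. Shutting down filesystem
"""

for reason, devices in group_xfs_shutdowns(sample).items():
    print(f"{reason}: {', '.join(devices)}")
```

In practice the input would be the full output of `sudo dmesg` rather than the three sample lines.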

MatthewVernon added a project: ops-codfw.

Hi @Papaul, this system seems to have had one or more hardware faults and is (just) still within its warranty. Could you get the hardware checked out, please? Thanks :)

Mentioned in SAL (#wikimedia-operations) [2021-09-14T08:05:07Z] <godog> wipe non-os partitions from ms-be2045 - T290881

Mentioned in SAL (#wikimedia-operations) [2021-09-14T08:25:04Z] <godog> poweroff ms-be2045 and set it as failed in netbox - T290881

Change 720917 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/software/swift-ring@master] codfw-prod: remove host ms-be2045

https://gerrit.wikimedia.org/r/720917

Change 720917 merged by MVernon:

[operations/software/swift-ring@master] codfw-prod: remove host ms-be2045

https://gerrit.wikimedia.org/r/720917

Mentioned in SAL (#wikimedia-operations) [2021-09-14T09:09:10Z] <Emperor> swift rebalance to remove h/w faulty host ms-be2045 T290881

Mentioned in SAL (#wikimedia-operations) [2021-09-15T06:57:54Z] <elukey> shutdown ms-be2045 (again) after seeing T290881

Papaul triaged this task as Medium priority. Sep 27 2021, 3:06 AM

Created a case with Dell. Case details and Dell's response are below:
case#: 123351649 Tag: R740XD: Crashes | ProSupport: NBD | - Linux -

I see 1 error here.

BIOS is on 1.5.4,
New BIOS is 2.12.2 : https://dl.dell.com/FOLDER07551855M/4/BIOS_4CRD2_WN64_2.12.2.EXE

iDRAC is on 3.21.21.21
Current is 5.00.10.00

Raid Backplane FW is 2.25
Current is 2.52 : https://dl.dell.com/FOLDER06636966M/1/Firmware_60K1J_WN64_2.52_A00.EXE

PERC (this is what appeared to crash)
is on 25.5.5.0005
Current is 25.5.9.0001 : https://dl.dell.com/FOLDER07217671M/1/SAS-RAID_Firmware_700GG_WN64_25.5.9.0001_A17.EXE

I would suggest that these all be updated to current. Then boot up the server and pull a TSR while it's on and running.

We will then want to monitor it to see if it's stable on this new firmware.

Thanks,

Andrew Clausen
Technical Support Engineer | Linux and Virtualization
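
As a cross-check of the firmware comparison in Dell's response, here is a minimal sketch that flags components whose installed version is older than the current one. The version strings are copied from the response above; the helper names `parse_version` and `outdated` are made up for illustration, and the sketch assumes each version is a dot-separated list of integers.

```python
def parse_version(version: str) -> tuple:
    """Split a dotted version string into a tuple of integers for comparison."""
    return tuple(int(part) for part in version.split("."))


# component: (installed, current), as listed in Dell's response
FIRMWARE = {
    "BIOS": ("1.5.4", "2.12.2"),
    "iDRAC": ("3.21.21.21", "5.00.10.00"),
    "RAID backplane": ("2.25", "2.52"),
    "PERC": ("25.5.5.0005", "25.5.9.0001"),
}


def outdated(firmware: dict) -> list:
    """Return the components whose installed version is older than the current one."""
    return [
        name
        for name, (installed, current) in firmware.items()
        if parse_version(installed) < parse_version(current)
    ]


print(outdated(FIRMWARE))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating "2.9" as newer than "2.12".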

All firmware has been upgraded on the server.

@Papaul is there anything needed from us at this time? Thank you!

@fgiunchedi no, nothing needed. I just left the task open to monitor the server. It looks like there is no issue so far, so I will update Dell and let them close the case.

Thank you.

I checked the server today and all is looking good. Closing this task.

Mentioned in SAL (#wikimedia-operations) [2021-10-07T07:57:14Z] <Emperor> re-enabling puppet on ms-be2045 after hw work T290881

Cookbook cookbooks.sre.experimental.reimage was started by mvernon@cumin2002 for host ms-be2045.codfw.wmnet

Cookbook cookbooks.sre.experimental.reimage started by mvernon@cumin2002 for host ms-be2045.codfw.wmnet completed:

  • ms-be2045 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/wmf-auto-reimage/202110070848_mvernon_1763896_ms-be2045.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed

Hi @Papaul, we reimaged this host today to try to bring it back into service. After about half an hour of uptime it dropped off the network, and from the management console it looks like the network hardware has failed.

paste at https://phabricator.wikimedia.org/P17433

So I think there's still a h/w problem on this host. Would you mind getting Dell to have another look, please?
Thanks!

@fgiunchedi I will re-open the case with Dell. Thanks

The first crash was caused by the PERC (this is what appeared to crash), and it showed up in the iDRAC log. The second crash, after the reimage, involves the network card, as @fgiunchedi mentioned in his comment, but this crash does not show up in the log. I upgraded the NIC firmware from version 20.84 to 21.40.25.31.

Mentioned in SAL (#wikimedia-operations) [2021-10-08T09:39:10Z] <Emperor> installing stress on ms-be2045 given recent h/w issues T290881

MatthewVernon added a subscriber: Papaul.

@Papaul the system was stable over the weekend, so I'll take this ticket and start restoring this system to the Swift rings. Thanks!

Change 730000 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/software/swift-ring@master] codfw-prod: start re-adding weight to ms-be2045

https://gerrit.wikimedia.org/r/730000

Change 730000 merged by MVernon:

[operations/software/swift-ring@master] codfw-prod: start re-adding weight to ms-be2045

https://gerrit.wikimedia.org/r/730000

Mentioned in SAL (#wikimedia-operations) [2021-10-11T14:36:30Z] <Emperor> start restoring weight to ms-be2045 T290881

Change 730442 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/software/swift-ring@master] codfw-prod: more weight to ms-be2045

https://gerrit.wikimedia.org/r/730442

Change 730442 merged by MVernon:

[operations/software/swift-ring@master] codfw-prod: more weight to ms-be2045

https://gerrit.wikimedia.org/r/730442

Change 730710 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/software/swift-ring@master] codfw-prod: more weight to ms-be2045

https://gerrit.wikimedia.org/r/730710

Change 730710 merged by MVernon:

[operations/software/swift-ring@master] codfw-prod: more weight to ms-be2045

https://gerrit.wikimedia.org/r/730710

Change 730976 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/software/swift-ring@master] codfw-prod: final weight to ms-be2045

https://gerrit.wikimedia.org/r/730976

Change 730976 merged by MVernon:

[operations/software/swift-ring@master] codfw-prod: final weight to ms-be2045

https://gerrit.wikimedia.org/r/730976

Full weight restored, so closing this (again ;-) )