
ganeti1019 is down
Closed, Resolved · Public

Description

Failed to come up after a reboot, SEL shows:

------------------------------------------------------------------------------
Record:      14
Date/Time:   06/10/2024 14:43:06
Source:      system
Severity:    Ok
Description: A problem was detected during Power-On Self-Test (POST).
-------------------------------------------------------------------------------
Record:      15
Date/Time:   06/10/2024 14:43:06
Source:      system
Severity:    Ok
Description: An unsupported event occurred.
-------------------------------------------------------------------------------
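For triage it can help to turn a SEL dump like the one above into structured records. A minimal sketch, assuming the record layout shown in this task (real `racadm getsel` output may differ slightly):

```python
# Hypothetical parser for the SEL excerpt above; the field layout is
# assumed from this task, not from any iDRAC documentation.
SEL_TEXT = """\
Record:      14
Date/Time:   06/10/2024 14:43:06
Source:      system
Severity:    Ok
Description: A problem was detected during Power-On Self-Test (POST).
-------------------------------------------------------------------------------
Record:      15
Date/Time:   06/10/2024 14:43:06
Source:      system
Severity:    Ok
Description: An unsupported event occurred.
"""

def parse_sel(text):
    """Split a SEL dump on the dashed separators and parse key/value lines."""
    records = []
    for chunk in text.split("-" * 10):
        fields = {}
        for line in chunk.strip().splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields:
            records.append(fields)
    return records

records = parse_sel(SEL_TEXT)
print(len(records))                  # 2
print(records[1]["Description"])     # An unsupported event occurred.
```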

VMs are currently being migrated off the node, when that is done, I'll reassign to DC ops for checking on the hardware.

Event Timeline

All VMs moved off the server. DC ops, can you please have a look? Not sure what "unsupported event" means; I've never seen that before. The server can be powered off for analysis at any time.
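Emptying a Ganeti node before hardware work can be sketched as below, wrapping the standard `gnt-node` CLI. This is a hedged sketch: it must run on the cluster master, the exact flags vary between Ganeti versions, and only the node name is taken from this task.

```python
import shutil
import subprocess

NODE = "ganeti1019.eqiad.wmnet"

def drain_node(node):
    """Drain and empty a Ganeti node; no-op where gnt-node is unavailable."""
    if shutil.which("gnt-node") is None:
        print("gnt-node not installed here; run on the Ganeti master")
        return False
    # Stop the scheduler from placing new instances on the node.
    subprocess.run(["gnt-node", "modify", "--drained=yes", node], check=True)
    # Live-migrate primary instances to their secondary nodes.
    subprocess.run(["gnt-node", "migrate", "-f", node], check=True)
    # Move secondary replicas elsewhere too, so the node is fully empty.
    subprocess.run(["gnt-node", "evacuate", "-s", node], check=True)
    return True

drain_node(NODE)
```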

This server is out of warranty. I'll check decom servers to see if we have any suitable DIMMs.

The system memory has encountered an uncorrectable multi-bit memory error in the non-execution path of a memory device at location DIMM_B1.

DIMM B1

BankLabel: B
CacheSize: Information Not Available
CurrentOperatingSpeed: 2400 MHz
DeviceDescription: DIMM B1
DeviceType: Memory
FQDD: DIMM.Socket.B1
InstanceID: DIMM.Socket.B1
LastSystemInventoryTime: 2024-06-10T19:54:17
LastUpdateTime: 2022-02-02T00:40:40
ManufactureDate: Mon Sep 10 07:00:00 2018 UTC
Manufacturer: Micron Technology
MemoryTechnology: DRAM
MemoryType: DDR-4
Model: DDR4 DIMM
NonVolatileSize: Information Not Available
PartNumber: 36ASF4G72PZ-2G6E1
PrimaryStatus: Ok
Rank: Double Rank
RemainingRatedWriteEndurance: Information Not Available
SerialNumber: 1E5C734E
Size: 32768 MB
Speed: 2666 MHz
SystemEraseCapability: Not Supported
VolatileSize: 32768 MB
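An inventory like the one above can be pulled from the iDRAC with `racadm hwinventory <FQDD>`. A small hedged wrapper, assuming local `racadm` is available (output formatting differs between iDRAC generations; only the FQDD comes from the inventory above):

```python
import shutil
import subprocess

FQDD = "DIMM.Socket.B1"

def dimm_inventory(fqdd):
    """Return racadm's hardware-inventory text for one DIMM, or None."""
    if shutil.which("racadm") is None:
        # Not on the host / no local racadm; query the remote iDRAC instead.
        return None
    out = subprocess.run(["racadm", "hwinventory", fqdd],
                         capture_output=True, text=True, check=True)
    return out.stdout

print(dimm_inventory(FQDD))
```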

@MoritzMuehlenhoff Can I take the server down to replace the DIMM?

Yes, please! All VMs have been moved off the host.

@MoritzMuehlenhoff Replaced the DIMM.

Also updated the iDRAC firmware and BIOS.

T367075 was also auto-generated for the degraded RAID:

CRITICAL: State: degraded, Active: 10, Working: 10, Failed: 0, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active raid5 sdd1[2] sdb1[1] sdc1[3] sda1[0]
      1456128 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]

md1 : active raid5 sdb2[1] sda2[0] sdc2[3]
      117086208 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UU_U]

md2 : active (auto-read-only) raid5 sda3[0] sdb3[1] sdc3[3]
      2225184768 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UU_U]
      bitmap: 0/6 pages [0KB], 65536KB chunk

unused devices: <none>
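The degraded state is visible in the `[n/m]` member counts above: md1 and md2 report `[4/3]`, i.e. one missing member each. A hedged sketch of the decision a check like `get-raid-status-md` has to make, parsing the /proc/mdstat excerpt from this task (the parsing is keyed to the layout shown here):

```python
import re

MDSTAT = """\
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active raid5 sdd1[2] sdb1[1] sdc1[3] sda1[0]
      1456128 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
md1 : active raid5 sdb2[1] sda2[0] sdc2[3]
      117086208 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UU_U]
md2 : active (auto-read-only) raid5 sda3[0] sdb3[1] sdc3[3]
      2225184768 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UU_U]
unused devices: <none>
"""

def degraded_arrays(mdstat_text):
    """Return md devices whose [n/m] member count shows missing disks."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        m = re.match(r"^(md\d+) :", line)
        if m:
            current = m.group(1)
            continue
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if current and counts:
            total, active = map(int, counts.groups())
            if active < total:
                degraded.append(current)
            current = None
    return degraded

print(degraded_arrays(MDSTAT))   # → ['md1', 'md2']
```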

@MoritzMuehlenhoff After replacing the failed drive it looked like it might boot, but it still fails. It might need to be reimaged; I do not have root access, so I am unable to proceed past this:

You are in emergency mode. After logging in, type "journalctl -xb" to view system logs, "systemctl reboot" to reboot, "systemctl default" or "exit" to boot into default mode.
Give root password for maintenance
(or press Control-D to continue):

Thanks for the fixes! I'll just reimage; the system had no VMs on it any more anyway.

Mentioned in SAL (#wikimedia-operations) [2024-06-12T06:55:29Z] <moritzm> remove ganeti1019 from eqiad cluster T367071

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1019.eqiad.wmnet with OS bullseye

I've attempted a reimage, but it shows a failed drive again. Something more fundamental is probably broken in the server, but let's not spend more time trying to fix it. The refresh hosts should arrive within the next few weeks, so I'll go ahead and decom ganeti1019 after all.