
mw1280 crashed logging correctable memory errors
Closed, Resolved, Public

Description

mw1280 crashed and its mgmt console was stuck, so I had to powercycle it. The racadm getsel output shows a large number of OEM errors, ending with "Correctable memory error rate exceeded for DIMM_B1."

It is worth noting that this host has a history of problems: T218006, T195734

Event Timeline

wiki_willy subscribed.

Looks like it's out of warranty. We can purchase a replacement DIMM, but will need the correct specs to place the order.

Thanks
Willy

@wiki_willy According to Dell support, based on the service tag:
Part number: PR5D1
DIMM,32GB,2133,2RX4,8G,DDR4,R 2

wiki_willy added a subtask: Unknown Object (Task). Feb 19 2020, 9:23 PM

Created T245670 to have a replacement DIMM ordered and delivered.

Thanks,
Willy

The server went down with the following error today:

Record:      216
Date/Time:   02/20/2020 11:16:09
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.

I've depooled it until the DIMM module has arrived/been swapped.
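For triage, a record like the one above can be scanned programmatically. The following is a minimal sketch, assuming the SEL output consists of "Key: value" lines in the same layout as the record quoted above; the function names parse_sel and failing_dimms are hypothetical helpers, not part of racadm or any existing tooling.

```python
import re
from typing import Dict, List

# Sample input copied from the SEL record quoted in this task.
SAMPLE = """\
Record:      216
Date/Time:   02/20/2020 11:16:09
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
"""

# Assumption: the failing slot is named as DIMM_<slot> in the description.
DIMM_RE = re.compile(r"Correctable memory error rate exceeded for (DIMM_\w+)")


def parse_sel(text: str) -> List[Dict[str, str]]:
    """Split 'Key: value' lines into records; a 'Record' key starts a new one."""
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank or non key/value lines
        key = key.strip()
        if key == "Record" and current:
            records.append(current)
            current = {}
        current[key] = value.strip()
    if current:
        records.append(current)
    return records


def failing_dimms(records: List[Dict[str, str]]) -> List[str]:
    """Return DIMM slots named in correctable-error-rate records, in order seen."""
    found: List[str] = []
    for rec in records:
        match = DIMM_RE.search(rec.get("Description", ""))
        if match and match.group(1) not in found:
            found.append(match.group(1))
    return found


if __name__ == "__main__":
    print(failing_dimms(parse_sel(SAMPLE)))
```

Splitting on the first colon only keeps the Date/Time value intact, and de-duplicating the slot names makes the result readable when the same DIMM is logged many times.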

The host has been down for a week, so it was removed from PuppetDB and the Netbox report caught it.
I updated Netbox, setting its state to Failed. Please follow https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Active_-%3E_Failed

Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Mar 17 2020, 4:29 PM

The replacement DIMM has arrived.

@Jclark-ctr The server is depooled, you can do the replacement any time.

Dzahn triaged this task as Medium priority. Mar 17 2020, 4:33 PM

Replaced the failed DIMM; the host is booting now.

Dzahn added a subscriber: Jclark-ctr.

Thanks @Jclark-ctr! I can reach it via SSH now. I'll take over and get it back into production, if you are done.

Mentioned in SAL (#wikimedia-operations) [2020-03-17T16:46:53Z] <mutante> mw1280 back after long downtime due to broken RAM, added back into puppet (T240187)

After the Puppet runs, the host was added back to Icinga.

then: CRITICAL: 944 mismatched wikiversions

After a long scap pull, everything is green now:

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=mw1280&scroll=220

I will pool it again.

17:18 <+logmsgbot> !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1280.eqiad.wmnet