Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Dzahn | T240187 mw1280 crashed logging correctable memory errors
 | | | Unknown Object (Task)
Event Timeline
Looks like it's out of warranty. We can purchase a replacement DIMM, but will need the correct specs to place the order.
Thanks
Willy
@wiki_willy According to Dell support, based on the service tag:
Part number: PR5D1
DIMM,32GB,2133,2RX4,8G,DDR4,R 2
The server went down with the following error today:
Record: 216 Date/Time: 02/20/2020 11:16:09 Source: system Severity: Critical Description: Correctable memory error rate exceeded for DIMM_B1.
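For anyone triaging similar hosts, the record above can be picked apart mechanically to find the affected slot. The following is only a minimal sketch: it assumes the single-line layout quoted in this task (`Record: … Date/Time: … Source: … Severity: … Description: …`) and a `DIMM_<letter><number>` slot name. The function name `parse_record` is hypothetical and not part of any Dell or Wikimedia tooling.

```python
import re

# Matches the one-line record layout quoted in this task (an assumption;
# other firmware versions may format records differently).
RECORD_RE = re.compile(
    r"Record:\s*(?P<record>\d+)\s+"
    r"Date/Time:\s*(?P<timestamp>\S+ \S+)\s+"
    r"Source:\s*(?P<source>\S+)\s+"
    r"Severity:\s*(?P<severity>\S+)\s+"
    r"Description:\s*(?P<description>.+)"
)

# Slot names such as DIMM_B1; the naming pattern is assumed from the log above.
DIMM_RE = re.compile(r"\bDIMM_[A-Z]\d+\b")


def parse_record(line):
    """Return the record fields as a dict, or None if the line does not match."""
    match = RECORD_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["record"] = int(fields["record"])
    dimm = DIMM_RE.search(fields["description"])
    fields["dimm"] = dimm.group(0) if dimm else None
    return fields


if __name__ == "__main__":
    sample = (
        "Record: 216 Date/Time: 02/20/2020 11:16:09 Source: system "
        "Severity: Critical Description: Correctable memory error rate "
        "exceeded for DIMM_B1."
    )
    print(parse_record(sample))
```

The `dimm` field is what matters when ordering or swapping a module; the other fields are kept so that records can be sorted or filtered by severity.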
I've depooled it until the DIMM module has arrived/been swapped.
The host has been down for a week, so it has been removed from PuppetDB and the Netbox report caught it.
Updated Netbox, setting its state to Failed. Please follow https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Active_-%3E_Failed
Thanks @Jclark-ctr! I can reach it via SSH now. If you are done, I'll take it from here to get it back into production.
Mentioned in SAL (#wikimedia-operations) [2020-03-17T16:46:53Z] <mutante> mw1280 back after long downtime due to broken RAM, added back into puppet (T240187)
After the puppet runs, the host was added back to Icinga.
Then: CRITICAL: 944 mismatched wikiversions
After a long scap pull, it is all green now:
https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=mw1280&scroll=220
will pool again
17:18 <+logmsgbot> !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1280.eqiad.wmnet
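When auditing pool and depool events across many hosts, log lines like the one above can be split into the action and the selector. This is a rough sketch that assumes the `conftool action : <key>=<value>; selector: <k>=<v>[,<k>=<v>…]` layout seen in this message. The helper `parse_conftool_log` is hypothetical and is not part of conftool itself.

```python
import re

# Layout assumed from the single log line quoted in this task.
LOG_RE = re.compile(
    r"conftool action\s*:\s*(?P<action>[^;]+);\s*selector:\s*(?P<selector>.+)$"
)


def parse_conftool_log(line):
    """Split a conftool log message into action, value and selector parts."""
    match = LOG_RE.search(line)
    if match is None:
        return None
    # "set/pooled=yes" -> key "set/pooled", value "yes"
    key, _, value = match.group("action").strip().partition("=")
    # Selectors are assumed to be comma-separated key=value pairs.
    selector = dict(
        pair.split("=", 1)
        for pair in match.group("selector").strip().split(",")
    )
    return {"action": key, "value": value, "selector": selector}


if __name__ == "__main__":
    line = (
        "17:18 <+logmsgbot> !log dzahn@cumin1001 conftool action : "
        "set/pooled=yes; selector: name=mw1280.eqiad.wmnet"
    )
    print(parse_conftool_log(line))
```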