mw1041 has hardware issues
Closed, Duplicate · Public

Description

mw1041 has hardware issues and has shut down twice today. Inspecting the logs from before the last crash:

Nov 19 13:38:13 mw1041 kernel: [597479.852774] CMCI storm detected: switching to poll mode
Nov 19 13:38:43 mw1041 kernel: [597509.855137] CMCI storm subsided: switching to interrupt mode
Nov 19 13:39:15 mw1041 kernel: [597541.540792] mce_notify_irq: 20 callbacks suppressed
Nov 19 13:39:15 mw1041 kernel: [597541.540797] mce: [Hardware Error]: Machine check events logged
Nov 19 13:39:19 mw1041 kernel: [597546.235803] mce: [Hardware Error]: Machine check events logged

These messages had been recurring for days.
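
To check how long the machine-check messages have been recurring, the kernel log lines can be tallied per day. This is a minimal sketch, assuming syslog-formatted kernel lines like the ones quoted above; the function name `mce_events_per_day` and the `sample` list are illustrative and not part of any existing tool.

```python
import re
from collections import Counter

# Syslog prefix ("Nov 19 13:39:15 mw1041 kernel: ...") followed by the
# "[Hardware Error]" tag that the kernel's mce code emits.
MCE_LINE = re.compile(
    r"^(\w{3}\s+\d+) \d{2}:\d{2}:\d{2} \S+ kernel: .*\[Hardware Error\]"
)

def mce_events_per_day(lines):
    """Count '[Hardware Error]' kernel lines per syslog day, e.g. 'Nov 19'."""
    counts = Counter()
    for line in lines:
        match = MCE_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# The five lines quoted in this task, used as sample input.
sample = [
    "Nov 19 13:38:13 mw1041 kernel: [597479.852774] CMCI storm detected: switching to poll mode",
    "Nov 19 13:38:43 mw1041 kernel: [597509.855137] CMCI storm subsided: switching to interrupt mode",
    "Nov 19 13:39:15 mw1041 kernel: [597541.540792] mce_notify_irq: 20 callbacks suppressed",
    "Nov 19 13:39:15 mw1041 kernel: [597541.540797] mce: [Hardware Error]: Machine check events logged",
    "Nov 19 13:39:19 mw1041 kernel: [597546.235803] mce: [Hardware Error]: Machine check events logged",
]

if __name__ == "__main__":
    for day, count in mce_events_per_day(sample).items():
        print(day, count)
```

In practice the input would be the lines of the host's kernel log file rather than the hard-coded sample. Note that the "callbacks suppressed" line means the kernel rate-limits these messages, so the tally is a lower bound.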

Looking inside mcelog we see:

TIME 1448029050 Fri Nov 20 14:17:30 2015
MCG status:
MCi status:
Error overflow
Corrected error
Error enabled
MCi_ADDR register valid
MCA: Instruction CACHE Level-0 Instruction-Fetch Error
STATUS d400010000040150 MCGSTATUS 0
MCGCAP 1c09 APICID 10 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 44
Hardware event. This is not a software error.
MCE 0
CPU 7 BANK 2 
ADDR 304c5b0

The errors are always for CPU 7, bank 2, which makes me guess we have damaged RAM.
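
For reference, the STATUS value in the mcelog record can be decoded by hand. The sketch below assumes the architectural IA32_MCi_STATUS bit layout and the cache-hierarchy compound error code format (`000F 0001 RRRR TTLL`) from the Intel SDM; the helper names `decode_status` and `decode_cache_error` are made up for illustration and this is not how mcelog itself is implemented.

```python
# Architectural flag bits of IA32_MCi_STATUS (bit number -> meaning).
FLAGS = {
    63: "VAL (register valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error enabled)",
    59: "MISCV (MCi_MISC register valid)",
    58: "ADDRV (MCi_ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

# Field values for cache-hierarchy error codes: 000F 0001 RRRR TTLL.
REQUEST = {0: "generic error", 1: "generic read", 2: "generic write",
           3: "data read", 4: "data write", 5: "instruction fetch",
           6: "prefetch", 7: "eviction", 8: "snoop"}
TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}

def decode_status(status):
    """Split a 64-bit MCi_STATUS value into flags and the MCA error code."""
    return {
        "flags": [name for bit, name in FLAGS.items() if (status >> bit) & 1],
        "corrected": not (status >> 61) & 1,   # UC bit clear means corrected
        "mca_error_code": status & 0xFFFF,     # bits 15:0
    }

def decode_cache_error(code):
    """Decode a cache-hierarchy error code, or return None for other types."""
    if code & 0xEF00 != 0x0100:                # bit 12 (F) is ignored
        return None
    return {
        "request": REQUEST.get((code >> 4) & 0xF, "reserved"),
        "transaction": TRANSACTION.get((code >> 2) & 0x3, "reserved"),
        "level": LEVEL[code & 0x3],
    }

if __name__ == "__main__":
    status = decode_status(0xD400010000040150)  # STATUS from the record above
    print(status["flags"], "corrected:", status["corrected"])
    print(decode_cache_error(status["mca_error_code"]))
```

Run against `d400010000040150`, this yields the same picture mcelog printed: valid, overflow, enabled, address valid, corrected, and an instruction-fetch error in the level-0 instruction cache.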

As suggested, this server is way out of warranty, so we might consider decommissioning it.

Event Timeline

Joe created this task. (Nov 20 2015, 3:51 PM)
Joe raised the priority of this task to Needs Triage.
Joe updated the task description.
Joe added projects: Operations, ops-eqiad.
Joe added a subscriber: Joe.
Restricted Application added subscribers: StudiesWorld, Aklapper. (Nov 20 2015, 3:51 PM)
Joe updated the task description. (Nov 20 2015, 3:58 PM)
Joe set Security to None.

Change 254417 had a related patch set uploaded (by Giuseppe Lavagetto):
mediawiki: decommission mw1041

https://gerrit.wikimedia.org/r/254417

Change 254417 merged by Giuseppe Lavagetto:
mediawiki: decommission mw1041

https://gerrit.wikimedia.org/r/254417