mw2251 hardware error
Closed, Resolved, Public

Description

The box rebooted on its own today, Thu Nov 16, at 17:51 UTC, and kernel.log is full of entries like these:

Nov 16 10:37:49 mw2251 kernel: [499457.998946] INFO: NMI handler (ghes_notify_nmi) took too long to run: 44.987 msecs
Nov 16 10:37:49 mw2251 kernel: [499457.999282] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Nov 16 10:37:49 mw2251 kernel: [499457.999285] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
Nov 16 10:37:49 mw2251 kernel: [499457.999286] {1}[Hardware Error]: event severity: corrected
Nov 16 10:37:49 mw2251 kernel: [499457.999288] {1}[Hardware Error]:  Error 0, type: corrected
Nov 16 10:37:49 mw2251 kernel: [499457.999289] {1}[Hardware Error]:  fru_text: A1
Nov 16 10:37:49 mw2251 kernel: [499457.999291] {1}[Hardware Error]:   section_type: memory error
Nov 16 10:37:49 mw2251 kernel: [499457.999292] {1}[Hardware Error]:   error_status: 0x0000000000000400
Nov 16 10:37:49 mw2251 kernel: [499457.999294] {1}[Hardware Error]:   physical_address: 0x00000007a360b880
Nov 16 10:37:49 mw2251 kernel: [499457.999297] {1}[Hardware Error]:   node: 0 card: 0 module: 0 rank: 1 bank: 2 row: 49593 column: 392 
Nov 16 10:37:49 mw2251 kernel: [499457.999299] {1}[Hardware Error]:   error_type: 2, single-bit ECC
Nov 16 10:37:49 mw2251 kernel: [499457.999320] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Nov 16 10:37:49 mw2251 kernel: [499457.999322] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
Nov 16 10:37:49 mw2251 kernel: [499457.999324] EDAC sbridge MC0: TSC 1500a8ba5a3e31 
Nov 16 10:37:49 mw2251 kernel: [499457.999325] EDAC sbridge MC0: ADDR 7a360b880 
Nov 16 10:37:49 mw2251 kernel: [499457.999326] EDAC sbridge MC0: MISC 0 
Nov 16 10:37:49 mw2251 kernel: [499457.999329] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1510828669 SOCKET 0 APIC 0
Nov 16 10:37:49 mw2251 kernel: [499457.999354] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x7a360b offset:0x880 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:1)
Nov 16 10:41:32 mw2251 kernel: [499681.397550] mce: [Hardware Error]: Machine check events logged
Nov 16 11:04:45 mw2251 kernel: [501074.431830] INFO: NMI handler (ghes_notify_nmi) took too long to run: 45.328 msecs
Nov 16 11:04:45 mw2251 kernel: [501074.431840] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 55.431 msecs
Nov 16 11:04:45 mw2251 kernel: [501074.431852] INFO: NMI handler (ghes_notify_nmi) took too long to run: 45.354 msecs
Nov 16 11:04:45 mw2251 kernel: [501074.431856] perf: interrupt took too long (438446 > 6188), lowering kernel.perf_event_max_sample_rate to 250
Nov 16 11:04:45 mw2251 kernel: [501074.432068] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Nov 16 11:04:45 mw2251 kernel: [501074.432071] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
Nov 16 11:04:45 mw2251 kernel: [501074.432073] {2}[Hardware Error]: event severity: corrected
Nov 16 11:04:45 mw2251 kernel: [501074.432075] {2}[Hardware Error]:  Error 0, type: corrected
Nov 16 11:04:45 mw2251 kernel: [501074.432076] {2}[Hardware Error]:  fru_text: A1
Nov 16 11:04:45 mw2251 kernel: [501074.432077] {2}[Hardware Error]:   section_type: memory error
Nov 16 11:04:45 mw2251 kernel: [501074.432079] {2}[Hardware Error]:   error_status: 0x0000000000000400
Nov 16 11:04:45 mw2251 kernel: [501074.432080] {2}[Hardware Error]:   physical_address: 0x00000007a3633a00
Nov 16 11:04:45 mw2251 kernel: [501074.432084] {2}[Hardware Error]:   node: 0 card: 0 module: 0 rank: 1 bank: 2 row: 49598 column: 416 
Nov 16 11:04:45 mw2251 kernel: [501074.432085] {2}[Hardware Error]:   error_type: 2, single-bit ECC
Nov 16 11:04:45 mw2251 kernel: [501074.432106] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Nov 16 11:04:45 mw2251 kernel: [501074.432120] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
Nov 16 11:04:45 mw2251 kernel: [501074.432133] EDAC sbridge MC0: TSC 1503e3abb74ec0 
Nov 16 11:04:45 mw2251 kernel: [501074.432135] EDAC sbridge MC0: ADDR 7a3633a00 
Nov 16 11:04:45 mw2251 kernel: [501074.432136] EDAC sbridge MC0: MISC 0 
Nov 16 11:04:45 mw2251 kernel: [501074.432139] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1510830285 SOCKET 0 APIC 0
Nov 16 11:04:45 mw2251 kernel: [501074.432162] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x7a3633 offset:0xa00 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:1)
Nov 16 11:05:01 mw2251 kernel: [501090.550739] mce: [Hardware Error]: Machine check events logged
Nov 16 12:14:45 mw2251 kernel: [505274.659227] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Nov 16 12:14:45 mw2251 kernel: [505274.659230] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
Nov 16 12:14:45 mw2251 kernel: [505274.659232] {3}[Hardware Error]: event severity: corrected
Nov 16 12:14:45 mw2251 kernel: [505274.659234] {3}[Hardware Error]:  Error 0, type: corrected
Nov 16 12:14:45 mw2251 kernel: [505274.659235] {3}[Hardware Error]:  fru_text: A1
Nov 16 12:14:45 mw2251 kernel: [505274.659236] {3}[Hardware Error]:   section_type: memory error
Nov 16 12:14:45 mw2251 kernel: [505274.659238] {3}[Hardware Error]:   error_status: 0x0000000000000400
Nov 16 12:14:45 mw2251 kernel: [505274.659239] {3}[Hardware Error]:   physical_address: 0x00000007a3617000
Nov 16 12:14:45 mw2251 kernel: [505274.659243] {3}[Hardware Error]:   node: 0 card: 0 module: 0 rank: 1 bank: 2 row: 49594 column: 768 
Nov 16 12:14:45 mw2251 kernel: [505274.659244] {3}[Hardware Error]:   error_type: 2, single-bit ECC
Nov 16 12:14:45 mw2251 kernel: [505274.659267] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Nov 16 12:14:45 mw2251 kernel: [505274.659269] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
Nov 16 12:14:45 mw2251 kernel: [505274.659271] EDAC sbridge MC0: TSC 150c4871b2809f 
Nov 16 12:14:45 mw2251 kernel: [505274.659273] EDAC sbridge MC0: ADDR 7a3617000 
Nov 16 12:14:45 mw2251 kernel: [505274.659275] EDAC sbridge MC0: MISC 0 
Nov 16 12:14:45 mw2251 kernel: [505274.659277] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1510834485 SOCKET 0 APIC 0
Nov 16 12:14:45 mw2251 kernel: [505274.659304] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x7a3617 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:1)
Nov 16 12:14:55 mw2251 kernel: [505285.239392] mce: [Hardware Error]: Machine check events logged
Nov 16 15:32:27 mw2251 kernel: [517138.399939] INFO: NMI handler (ghes_notify_nmi) took too long to run: 145.608 msecs
Nov 16 15:32:27 mw2251 kernel: [517138.490625] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Nov 16 15:32:27 mw2251 kernel: [517138.490627] {4}[Hardware Error]: It has been corrected by h/w and requires no further action
Nov 16 15:32:27 mw2251 kernel: [517138.490629] {4}[Hardware Error]: event severity: corrected
Nov 16 15:32:27 mw2251 kernel: [517138.490631] {4}[Hardware Error]:  Error 0, type: corrected
Nov 16 15:32:27 mw2251 kernel: [517138.490632] {4}[Hardware Error]:  fru_text: A1
Nov 16 15:32:27 mw2251 kernel: [517138.490633] {4}[Hardware Error]:   section_type: memory error
Nov 16 15:32:27 mw2251 kernel: [517138.490635] {4}[Hardware Error]:   error_status: 0x0000000000000400
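
To see at a glance whether every corrected error points at the same DIMM, the EDAC summary lines in the log above can be tallied. The following is a minimal sketch, not an existing tool: the function name `summarize_corrected_errors` and the regular expression are illustrative assumptions based only on the line format shown in this task, and the address reconstruction assumes 4 KiB pages (which matches the `ADDR` values logged above).

```python
import re
from collections import Counter

# Matches the EDAC summary line format seen in this task's kernel.log, e.g.
# "EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0
#  (channel:0 slot:0 page:0x7a360b offset:0x880 ...)"
# The pattern is an assumption derived from those lines only.
EDAC_CE = re.compile(
    r"EDAC MC\d+: \d+ CE .*? on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+) "
    r"page:(?P<page>0x[0-9a-fA-F]+) offset:(?P<offset>0x[0-9a-fA-F]+)"
)

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def summarize_corrected_errors(lines):
    """Return (Counter of CE events per DIMM label, list of physical addresses)."""
    counts = Counter()
    addresses = []
    for line in lines:
        match = EDAC_CE.search(line)
        if match is None:
            continue  # not an EDAC corrected-error summary line
        counts[match.group("label")] += 1
        page = int(match.group("page"), 16)
        offset = int(match.group("offset"), 16)
        # Rebuild the physical address from page number and in-page offset.
        addresses.append((page << PAGE_SHIFT) | offset)
    return counts, addresses


# Two summary lines copied from the log above, plus one line that must be ignored.
SAMPLE = [
    "Nov 16 10:37:49 mw2251 kernel: [499457.999324] EDAC sbridge MC0: ADDR 7a360b880 ",
    "Nov 16 10:37:49 mw2251 kernel: [499457.999354] EDAC MC0: 0 CE memory read error on "
    "CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x7a360b offset:0x880 grain:32 "
    "syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:1)",
    "Nov 16 11:04:45 mw2251 kernel: [501074.432162] EDAC MC0: 0 CE memory read error on "
    "CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x7a3633 offset:0xa00 grain:32 "
    "syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:1)",
]

if __name__ == "__main__":
    counts, addresses = summarize_corrected_errors(SAMPLE)
    for label, n in counts.items():
        print(f"{label}: {n} corrected error(s)")
    print([hex(a) for a in addresses])
```

In practice the function could be fed the lines of kernel.log directly, for example `summarize_corrected_errors(open("kernel.log"))` (the file path here is hypothetical).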

Event Timeline

@Papaul: This system has been depooled (by Alex), so it can be powered down (via OS commands) at any time for you to troubleshoot. Please reboot it into the Dell ePSA tests (they are built into the system, so there is no need to USB-boot anything) and run the hardware tests.

Thanks!

I've acked the Icinga host-down alarm with a link to this task.

Step 1:
Logged in to the iDRAC to check the log files. The log shows a memory error on DIMM_A1: "Correctable memory error rate exceeded for DIMM_A1".

Selection_011.png (576×993 px, 88 KB)

Step 2:
The first ePSA test came back with an error; see below.

Selection_012.png (442×850 px, 58 KB)

Step 3:
Cleared the log and rebooted the system.

Step 4:
The second ePSA test came back with no errors.

Selection_013.png (375×749 px, 52 KB)

Step 5:
Running the remaining memory test; I will check the result once home.

The full memory test came back with no errors. I went ahead and updated the iDRAC firmware as well. The system is back up and online.

akosiaris claimed this task.

I am willing to bet this will show up again. Memory errors don't just go away, no matter what ePSA says. In any case, I guess we can close this and reopen it when it does (might take a while).
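
One way to notice a recurrence without waiting for another reboot is to poll the kernel's EDAC counters. The sketch below is only an illustration under stated assumptions: it reads the `ce_count` and `ue_count` files that the Linux EDAC subsystem exposes per memory controller under `/sys/devices/system/edac/mc/`, the function name `read_edac_counts` is hypothetical, and it is not something that was deployed for this task.

```python
from pathlib import Path

# Default location of the EDAC memory-controller sysfs tree on Linux.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"ce_count": int, "ue_count": int}} for each mcN found.

    ce_count is the number of corrected errors and ue_count the number of
    uncorrected errors reported by that memory controller.
    """
    counts = {}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter_file = mc_dir / name
            if counter_file.is_file():
                entry[name] = int(counter_file.read_text().strip())
        if entry:
            counts[mc_dir.name] = entry
    return counts


if __name__ == "__main__":
    # On a host without EDAC loaded this simply prints nothing.
    for controller, entry in read_edac_counts().items():
        print(controller, entry)
```

A non-zero and growing `ce_count` on `mc0` after the reboot would be a reasonable trigger to reopen this task; the threshold for acting on it is a judgment call and is not defined here.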

Mentioned in SAL (#wikimedia-operations) [2017-11-17T10:49:44Z] <akosiaris> sync wmf-config/db-eqiad.php for T180724