
mw2206.codfw.wmnet memory issues
Closed, Resolved (Public)

Description

[Wed Feb  6 10:41:17 2019] mce: [Hardware Error]: Machine check events logged
[Wed Feb  6 10:41:17 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Wed Feb  6 10:41:17 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 11: 8c000051000800c2
[Wed Feb  6 10:41:17 2019] EDAC sbridge MC0: TSC 0
[Wed Feb  6 10:41:17 2019] EDAC sbridge MC0: ADDR 2bb8b1000
[Wed Feb  6 10:41:17 2019] EDAC sbridge MC0: MISC 90000080008228c
[Wed Feb  6 10:41:17 2019] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1549449712 SOCKET 0 APIC 0
[Wed Feb  6 10:41:17 2019] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x2bb8b1 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c2 socket:0 ha:0 channel_mask:2 rank:0)
[Wed Feb  6 13:37:43 2019] mce: [Hardware Error]: Machine check events logged
[Wed Feb  6 13:37:43 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Wed Feb  6 13:37:43 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 11: 8c000051000800c2
[Wed Feb  6 13:37:43 2019] EDAC sbridge MC0: TSC 0
[Wed Feb  6 13:37:43 2019] EDAC sbridge MC0: ADDR 2bb8b1000
[Wed Feb  6 13:37:43 2019] EDAC sbridge MC0: MISC 90000080008228c
[Wed Feb  6 13:37:43 2019] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1549460298 SOCKET 0 APIC 0
[Wed Feb  6 13:37:43 2019] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x2bb8b1 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c2 socket:0 ha:0 channel_mask:2 rank:0)
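The "Bank 11" value in these reports can be unpacked to confirm what kind of event the kernel saw. The following is a minimal sketch, assuming the architectural IA32_MCi_STATUS bit layout from the Intel SDM (flags in bits 63-58, corrected error count in bits 52-38, model-specific code in bits 31-16, MCA code in bits 15-0). The function name and dictionary keys are illustrative and not part of any tool used on this task.

```python
# Memory transaction types (the "MMM" field) for MCA memory controller
# error codes of the form 000F 0000 1MMM CCCC.
MEMORY_TRANSACTION_TYPES = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit IA32_MCi_STATUS value into its main fields."""
    mca_code = status & 0xFFFF
    fields = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
        "memory_transaction": None,
    }
    # Bit 7 set, with everything else but the filter bit (12), the MMM
    # field and the low nibble clear, marks a memory controller error.
    if (mca_code & 0xEF80) == 0x0080:
        mmm = (mca_code >> 4) & 0x7
        fields["memory_transaction"] = MEMORY_TRANSACTION_TYPES.get(mmm, "reserved")
    return fields


if __name__ == "__main__":
    # Status value copied from the "Bank 11" lines above.
    for name, value in decode_mci_status(0x8C000051000800C2).items():
        print(f"{name}: {value}")
```

Under these assumptions the value decodes as a valid, corrected (not uncorrected) memory scrubbing error with a count of 1, which agrees with the "1 CE memory scrubbing error" and "err_code:0008:00c2" text that EDAC prints.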

Server has not been depooled

Event Timeline

jijiki triaged this task as Medium priority. Feb 6 2019, 2:02 PM
jijiki created this task.
Restricted Application added a subscriber: Aklapper. Feb 6 2019, 2:02 PM

next steps:

  • verify there is no ongoing SWAT or config change windows being pushed to the mw cluster
  • update the firmware of the bios
  • clear all memory errors from the SEL after copying to this task
  • run dell epsa memory testing to establish if the memory will error again easily
  • if it does error, have chris swap slots to see if the memory or the slot/mainboard is bad
  • if no error, return the host to service and stall this task with a 14-day due date, then check again for failures

Mentioned in SAL (#wikimedia-operations) [2019-03-13T16:42:11Z] <robh> mw2206.codfw.wmnet is being powered down for firmware update, relying on auto depool function from clean shutdown for mw api server via T215415

Mentioned in SAL (#wikimedia-operations) [2019-03-13T16:56:47Z] <robh> mw2206.codfw.wmnet is being powered down for firmware update, relying on auto depool function from clean shutdown for mw api server via T215415

Record:      1
Date/Time:   01/15/2015 23:20:17
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   03/08/2019 11:56:56
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A2.
-------------------------------------------------------------------------------

Ok, updating the firmware:

current bios on system: 2.3.3
newest bios version: 2.6.0

drac installed: 1.4.2.12
drac available: 2.61.60.60

I've now tried to update both the BIOS and the DRAC with zero success. Both attempts fail with "RED007: Package verification failed." using the packages I downloaded for the service tag from support.dell.com: https://www.dell.com/support/home/us/en/19/product-support/servicetag/b14k842/drivers

Perhaps the package is getting corrupted during the upload?

Next steps are as follows:

  • system powered back up and returned to service post attempts (done)
  • @Papaul to sync with dell support and determine why this system won't accept firmware updates (please try to update and ensure it wasn't just @RobH's connection causing issues.)
    • this host can come offline for short periods outside of swat, @Papaul can work with @RobH or another opsen to schedule this work
  • once firmware update is complete, wipe out SEL and run dell ePSA tests to test memory
    • if it errors, swap dimm slot A2 with another, note the swap on task, and we can see if the failure reoccurs on the new slot (bad dimm) or old slot (bad mainboard)
    • start hardware replacement process with dell for defective part

Actually, this is out of warranty as of Jan. 19, 2018.

So we may just want to decommission this host, unless we want to try swapping in memory from a decommissioned system on site?

iDRAC firmware update complete. Please depool the server for the BIOS upgrade.

Power State:              ON
System Model:             PowerEdge R420
System Revision:          I
System Host Name:
Operating System:
Operating System Version:
Service Tag:              B14K842
Express Service Code:     24012733970
BIOS Version:             2.3.3
Firmware Version:         2.61.60.60
IP Address(es):           10.193.2.106

Mentioned in SAL (#wikimedia-operations) [2019-03-13T18:51:35Z] <jijiki> Depool mw1280 and mw2206 to hardware issues - T215415 T218006

@Papaul Server has been depooled, thank you!

Service Tag:              B14K842
Express Service Code:     24012733970
BIOS Version:             2.6.0
Firmware Version:         2.61.60.60
IP Address(es):           10.193.2.106
iDRAC MAC Address:

System log cleared.
Server can be repooled.
Monitoring for DIMM errors.

We are still having errors, I am depooling. @Papaul

[Thu Mar 14 11:56:00 2019] perf: interrupt took too long (4960 > 4946), lowering kernel.perf_event_max_sample_rate to 40250
[Thu Mar 14 13:53:36 2019] mce: [Hardware Error]: Machine check events logged
[Thu Mar 14 13:53:36 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Thu Mar 14 13:53:36 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 11: 8c000051000800c2
[Thu Mar 14 13:53:36 2019] EDAC sbridge MC0: TSC 0
[Thu Mar 14 13:53:36 2019] EDAC sbridge MC0: ADDR 2bb8b1000
[Thu Mar 14 13:53:36 2019] EDAC sbridge MC0: MISC 90000080008228c
[Thu Mar 14 13:53:36 2019] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1552571617 SOCKET 0 APIC 0
[Thu Mar 14 13:53:36 2019] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x2bb8b1 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c2 socket:0 ha:0 channel_mask:2 rank:0)

@jijiki the server log has not reported any error messages since 3-13. I will go ahead and replace the memory with a module from one of the decom servers and we will go from there.

Also, the error I have here does not tell me which memory row or channel it refers to, so it's difficult to tell which DIMM to replace. Maybe the memory is only starting to fail, which would explain why it is not logged in the HW log yet. I will take the system down and run memtest to see if that helps me find the bad DIMM.

memtest complete with no errors

@Papaul then the issue is somewhere else:

[Thu Mar 21 11:02:31 2019] mce: [Hardware Error]: Machine check events logged
[Thu Mar 21 11:02:31 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Thu Mar 21 11:02:31 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 11: 8c000051000800c2
[Thu Mar 21 11:02:31 2019] EDAC sbridge MC0: TSC 0
[Thu Mar 21 11:02:31 2019] EDAC sbridge MC0: ADDR 2bb8b1000
[Thu Mar 21 11:02:31 2019] EDAC sbridge MC0: MISC 90000080008228c
[Thu Mar 21 11:02:31 2019] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1553166149 SOCKET 0 APIC 0
[Thu Mar 21 11:02:31 2019] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x2bb8b1 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c2 socket:0 ha:0 channel_mask:2 rank:0)

What other options do we have?

Could it simply be a broken CPU? If we have that CPU type in a decom host, we could loot it from there.

The memory address is the same in all of these error reports.

That suggests to me that one of the DIMMs has a 'stuck' bit and that it is unlikely to be a CPU issue.
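The "same address every time" observation can be checked mechanically rather than by eye. The following is a rough sketch, assuming 4 KiB pages (so physical address = page << 12 | offset) and the EDAC line format seen in this task; the regex and the function names are illustrative only.

```python
import re

# Matches the summary line EDAC prints for a corrected error (CE).
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE .*? on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+) "
    r"page:(?P<page>0x[0-9a-f]+) offset:(?P<offset>0x[0-9a-f]+)"
)

# Three of the CE lines pasted in this task (Feb 6, Mar 14, Mar 21).
SAMPLE_LOG = """\
[Wed Feb  6 10:41:17 2019] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x2bb8b1 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c2 socket:0 ha:0 channel_mask:2 rank:0)
[Thu Mar 14 13:53:36 2019] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x2bb8b1 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c2 socket:0 ha:0 channel_mask:2 rank:0)
[Thu Mar 21 11:02:31 2019] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x2bb8b1 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c2 socket:0 ha:0 channel_mask:2 rank:0)
"""


def parse_ce_events(log_text: str) -> list:
    """Extract controller, channel, slot and address from each CE line."""
    events = []
    for m in CE_LINE.finditer(log_text):
        page = int(m["page"], 16)
        offset = int(m["offset"], 16)
        events.append({
            "mc": int(m["mc"]),
            "channel": int(m["channel"]),
            "slot": int(m["slot"]),
            "label": m["label"],
            # Assumes 4 KiB pages, as on x86.
            "address": (page << 12) | offset,
        })
    return events


def all_same_location(events: list) -> bool:
    """True if every event hit the same DIMM label and physical address."""
    return bool(events) and len({(e["label"], e["address"]) for e in events}) == 1


if __name__ == "__main__":
    events = parse_ce_events(SAMPLE_LOG)
    for e in events:
        print(e["label"], hex(e["address"]))
    print("all same location:", all_same_location(events))
```

The computed address, 0x2bb8b1000, matches the "ADDR 2bb8b1000" line in each report, which supports reading this as one bad location on one DIMM rather than scattered errors.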

There are some Linux utilities that might help us map (memory controller 0, channel 1, slot 0) to a physical DIMM location. I'll take a look.
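One place to start is the kernel's EDAC sysfs tree, which exposes a label, location and corrected-error count per DIMM. The following is a minimal sketch, assuming the usual layout of /sys/devices/system/edac/mc/mcN/dimmM/ with dimm_label, dimm_location and dimm_ce_count files; the function name is hypothetical. Whether the labels match the silkscreen names on the board (such as DIMM_A2) depends on whether labels have been loaded for this mainboard, so the output may only repeat the generic CPU_SrcID#... names.

```python
from pathlib import Path

EDAC_ROOT = "/sys/devices/system/edac/mc"


def list_edac_dimms(root: str = EDAC_ROOT) -> list:
    """Return one dict per DIMM directory found under the EDAC sysfs root."""

    def read(path: Path):
        # Attributes may be missing on some kernels or drivers.
        try:
            return path.read_text().strip()
        except OSError:
            return None

    dimms = []
    for dimm_dir in sorted(Path(root).glob("mc*/dimm*")):
        if not dimm_dir.is_dir():
            continue
        ce_count = read(dimm_dir / "dimm_ce_count")
        dimms.append({
            "mc": dimm_dir.parent.name,
            "dimm": dimm_dir.name,
            "label": read(dimm_dir / "dimm_label"),
            "location": read(dimm_dir / "dimm_location"),
            "ce_count": int(ce_count) if ce_count and ce_count.isdigit() else None,
        })
    return dimms


if __name__ == "__main__":
    for d in list_edac_dimms():
        print(d["mc"], d["dimm"], d["label"], d["location"], d["ce_count"])
```

On a host without EDAC loaded the function simply returns an empty list. A non-zero ce_count on exactly one entry would point at the DIMM to swap.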

Also, the error I have here does not tell me which memory row or channel it refers to, so it's difficult to tell which DIMM to replace. Maybe the memory is only starting to fail, which would explain why it is not logged in the HW log yet. I will take the system down and run memtest to see if that helps me find the bad DIMM.

The original hardware log from "racadm getsel" was for DIMM_A2. Do we have a compatible DIMM module from a decommed server?

@MoritzMuehlenhoff yes we do have some. Will replace A2 once on site.

@Papaul thank you! Pooling ...

Will reopen if there are issues. Thank you all!