pc2010 possibly broken memory
Closed, ResolvedPublic

Description

Looks like pc2010 had a memory issue:

[Mon Jul  8 15:20:13 2019] mce: [Hardware Error]: Machine check events logged
[Mon Jul  8 15:20:13 2019] mce: Uncorrected hardware memory error in user-access at 3b134a4880
[Mon Jul  8 15:20:13 2019] {1}Hardware error detected on CPU11
[Mon Jul  8 15:20:13 2019] {1}event severity: recoverable
[Mon Jul  8 15:20:13 2019] {1} Error 0, type: recoverable
[Mon Jul  8 15:20:13 2019] {1} fru_text: B1
[Mon Jul  8 15:20:13 2019] {1}  section_type: memory error
[Mon Jul  8 15:20:13 2019] {1}  error_status: 0x0000000000000400
[Mon Jul  8 15:20:13 2019] {1}  physical_address: 0x0000003b134a4880
[Mon Jul  8 15:20:13 2019] {1}  node: 2 card: 0 module: 0 rank: 0 bank: 1 row: 41698 column: 152
[Mon Jul  8 15:20:13 2019] {1}  DIMM location: not present. DMI handle: 0x0000
[Mon Jul  8 15:20:13 2019] Memory failure: 0x3b134a4: Killing mysqld:3693 due to hardware memory corruption
[Mon Jul  8 15:20:13 2019] Memory failure: 0x3b134a4: recovery action for dirty LRU page: Recovered
[Mon Jul  8 15:20:42 2019] MCE: Killing mysqld:3760 due to hardware memory corruption fault at 7f0534f68880
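For reference, the value in the `Memory failure:` lines is the page frame number of the failing physical address, i.e. the address shifted right by the page size. A minimal sketch of that arithmetic, assuming the 4 KiB pages that are the default on x86-64 (the function and constant names are just for illustration):

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages (2**12 bytes)


def page_frame_number(physical_address: int) -> int:
    """Return the page frame number that contains the given physical address."""
    return physical_address >> PAGE_SHIFT


def page_offset(physical_address: int) -> int:
    """Return the byte offset of the address inside its page."""
    return physical_address & ((1 << PAGE_SHIFT) - 1)


addr = 0x3B134A4880  # physical_address from the log above
print(hex(page_frame_number(addr)))  # 0x3b134a4
print(hex(page_offset(addr)))        # 0x880
```

So the `0x3b134a4` page that the kernel poisoned is the one holding the reported `physical_address`.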

I rebooted it to see if the error would show up during boot and in the HW logs, and it did:

UEFI0079: One or more uncorrectable Memory errors occurred in the previous
boot.
Check the System Event Log (SEL) to identify the non-functional DIMM, and then
replace the DIMM.


Available Actions:
F1 to Continue and Retry Boot Order
F2 for System Setup (BIOS)
F10 for LifeCycle Controller
- Enable/Configure iDRAC
- Update or Backup/Restore Server Firmware
- Help Install an Operating System
F11 for Boot Manager

I pressed F1 and the boot continued.

The HW logs now show the issue:

		CreationTimestamp = 20190708220429.000000-300
		ElementName = System Event Log Entry
		RecordData = Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
		RecordFormat = string Description
		RecordID = 2

		CreationTimestamp = 20190709045530.000000-300
		ElementName = System Event Log Entry
		RecordData = Correctable memory error logging disabled for a memory device at location DIMM_B1.
		RecordFormat = string Description
		RecordID = 3

@Papaul should we move that DIMM to another position so we can see if it is the DIMM or the mainboard in case it happens again?

The amount of memory reported by the server looks correct:

root@pc2010:~# free -m
              total        used        free      shared  buff/cache   available
Mem:         257622         682      256597           9         342      255649
Swap:          7628           0        7628

Event Timeline

Restricted Application added subscribers: Liuxinyu970226, Aklapper.
Marostegui moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2019-07-09T05:13:47Z] <marostegui> Rebooting pc2010 for a second time as per papaul's suggestion T227552

As per my chat with @Papaul I rebooted the host a second time and the previous error didn't show up.

@Papaul and I chatted about this and the plan is to:

  • Clear logs (I just did)
  • Upgrade firmware, BIOS etc
  • Leave this task open for a week to see if it happens again and if not close it for now.

Create Dispatch: Success
You have successfully submitted request SR994795023.

Your dispatch request has been successfully created and will be reviewed by our team. You can monitor its progress on your Dell EMC TechDirect dashboard.

Mentioned in SAL (#wikimedia-operations) [2019-07-22T15:38:08Z] <marostegui> Stop mysql and power off pc2010 for on-site maintenance - T227552

DIMM replaced and firmware upgraded

Before
Operating System Version
Service Tag DHPR0S2
Asset Tag DHR0S2
Express Service Code 29369346482
BIOS Version 1.5.4
Lifecycle Controller Firmware 3.21.21.21
System Revision I
IDSDM Firmware Version N/A

After
Service Tag DHPR0S2
Asset Tag DHR0S2
Express Service Code 29369346482
BIOS Version 2.2.11
Lifecycle Controller Firmware 3.34.34.34
System Revision I
IDSDM Firmware Version N/A

Thanks Papaul.
Memory looking good.

root@pc2010:~# free -m
              total        used        free      shared  buff/cache   available
Mem:         257392         674      256516           9         201      255498
Swap:          7628           0        7628

I have also upgraded the kernel.
Going to start MySQL and all that jazz

MySQL caught up - all good. Thanks again Papaul!

Marostegui reassigned this task from Marostegui to Papaul.

Looks like this happened again and MySQL crashed.
@Papaul could this be the memory slot? Should we swap the DIMM with another existing DIMM? If the same error is reported again, it is the slot; if a different DIMM location is reported, it is the DIMM itself (even though it is supposed to be new). These are the HW logs:

-------------------------------------------------------------------------------
Record:      4
Date/Time:   07/24/2019 08:20:41
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B1.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   07/24/2019 08:20:42
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   07/24/2019 08:20:42
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B1.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   07/24/2019 08:20:43
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B1.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   07/24/2019 08:20:44
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B1.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   07/24/2019 08:20:44
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B1.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   07/24/2019 08:20:45
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B1.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   07/24/2019 08:20:46
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B1.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   07/24/2019 08:20:47
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B1.
-------------------------------------------------------------------------------
Record:      13
Date/Time:   07/24/2019 08:20:55
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B1.
-------------------------------------------------------------------------------
Record:      14
Date/Time:   07/24/2019 08:20:55
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B1.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   07/24/2019 08:21:03
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B1.
-------------------------------------------------------------------------------
Record:      16
Date/Time:   07/24/2019 08:21:04
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B1.
-------------------------------------------------------------------------------
Record:      17
Date/Time:   07/24/2019 08:21:05
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B1.
-------------------------------------------------------------------------------
Record:      18
Date/Time:   07/24/2019 08:21:06
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B1.
-------------------------------------------------------------------------------
Record:      19
Date/Time:   07/24/2019 08:21:06
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B1.
-------------------------------------------------------------------------------
Record:      20
Date/Time:   07/24/2019 08:21:07
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B1.
-------------------------------------------------------------------------------
Record:      21
Date/Time:   07/24/2019 08:21:07
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B1.
-------------------------------------------------------------------------------
Record:      22
Date/Time:   07/24/2019 08:21:09
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B1.
-------------------------------------------------------------------------------
Record:      23
Date/Time:   07/24/2019 08:21:09
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B1.
-------------------------------------------------------------------------------
Record:      24
Date/Time:   07/24/2019 08:21:10
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------
Record:      25
Date/Time:   07/24/2019 08:21:10
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      26
Date/Time:   07/24/2019 08:21:10
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      27
Date/Time:   07/24/2019 08:21:10
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      28
Date/Time:   07/24/2019 08:21:10
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      29
Date/Time:   07/24/2019 08:21:10
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      30
Date/Time:   07/24/2019 08:21:10
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      31
Date/Time:   07/24/2019 08:21:10
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      32
Date/Time:   07/24/2019 08:21:11
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      33
Date/Time:   07/24/2019 08:21:11
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      34
Date/Time:   07/24/2019 08:21:11
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      35
Date/Time:   07/24/2019 08:21:11
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      36
Date/Time:   07/24/2019 08:21:11
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      37
Date/Time:   07/24/2019 08:21:11
Source:      system
Severity:    Critical
Description: Correctable memory error logging disabled for a memory device at location DIMM_B1.
-------------------------------------------------------------------------------
Record:      38
Date/Time:   07/24/2019 08:21:11
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B1.
-------------------------------------------------------------------------------
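With this many records, it is easier to count which DIMM locations show up in the Critical entries than to read them one by one. A rough sketch, assuming the dump has exactly the `Record:` / `Date/Time:` / `Source:` / `Severity:` / `Description:` layout pasted above (the function names and the `SAMPLE` excerpt are only for illustration):

```python
import re
from collections import Counter

SEPARATOR = re.compile(r"^-{5,}[ \t]*$", re.MULTILINE)
DIMM = re.compile(r"DIMM_[A-Z]\d+")
KNOWN_KEYS = {"Record", "Date/Time", "Source", "Severity", "Description"}


def parse_sel(text):
    """Split a SEL dump on its dashed separators and return one dict per record."""
    records = []
    for chunk in SEPARATOR.split(text):
        fields = {}
        for line in chunk.splitlines():
            # Split on the first colon only, so timestamps stay intact.
            key, sep, value = line.partition(":")
            if sep and key.strip() in KNOWN_KEYS:
                fields[key.strip()] = value.strip()
        if "Record" in fields:
            records.append(fields)
    return records


def critical_dimms(records):
    """Count DIMM locations mentioned in records whose severity is Critical."""
    counts = Counter()
    for record in records:
        if record.get("Severity") == "Critical":
            counts.update(DIMM.findall(record.get("Description", "")))
    return counts


SAMPLE = """\
-------------------------------------------------------------------------------
Record:      23
Date/Time:   07/24/2019 08:21:09
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B1.
-------------------------------------------------------------------------------
Record:      24
Date/Time:   07/24/2019 08:21:10
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------
Record:      37
Date/Time:   07/24/2019 08:21:11
Source:      system
Severity:    Critical
Description: Correctable memory error logging disabled for a memory device at location DIMM_B1.
-------------------------------------------------------------------------------
"""

records = parse_sel(SAMPLE)
print(len(records))             # 3
print(critical_dimms(records))  # Counter({'DIMM_B1': 2})
```

Run against the full dump, this should show whether every Critical memory entry points at the same location.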

And the OS logs:

[Wed Jul 24 08:15:19 2019] soft offline: 0x237c882 page already poisoned
[Wed Jul 24 08:15:28 2019] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Wed Jul 24 08:15:28 2019] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
[Wed Jul 24 08:15:28 2019] {3}[Hardware Error]: event severity: corrected
[Wed Jul 24 08:15:28 2019] {3}[Hardware Error]:  Error 0, type: corrected
[Wed Jul 24 08:15:28 2019] {3}[Hardware Error]:  fru_text: B1
[Wed Jul 24 08:15:28 2019] {3}[Hardware Error]:   section_type: memory error
[Wed Jul 24 08:15:28 2019] {3}[Hardware Error]:   error_status: 0x0000000000000400
[Wed Jul 24 08:15:28 2019] {3}[Hardware Error]:   physical_address: 0x000000237c882080
[Wed Jul 24 08:15:28 2019] {3}[Hardware Error]:   node: 2 card: 0 module: 0 rank: 0 bank: 0 device: 0 row: 8594 column: 992
[Wed Jul 24 08:15:28 2019] {3}[Hardware Error]:   error_type: 3, multi-bit ECC
[Wed Jul 24 08:15:28 2019] {3}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000
[Wed Jul 24 08:15:46 2019] MCE: Killing mysqld:2375 due to hardware memory corruption fault at 7f50fa64c080

This host crashed again. This time it was totally frozen and I had to reset it via iDRAC.
These are the HW logs, same issue:

-------------------------------------------------------------------------------
Record:      52
Date/Time:   07/25/2019 03:36:45
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      53
Date/Time:   07/25/2019 03:36:45
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      54
Date/Time:   07/25/2019 03:36:45
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      55
Date/Time:   07/25/2019 03:36:45
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      56
Date/Time:   07/25/2019 03:36:45
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      57
Date/Time:   07/25/2019 03:36:45
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      58
Date/Time:   07/25/2019 03:36:45
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      59
Date/Time:   07/25/2019 03:36:45
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      60
Date/Time:   07/25/2019 03:36:45
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      61
Date/Time:   07/25/2019 03:36:45
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      62
Date/Time:   07/25/2019 03:36:45
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      63
Date/Time:   07/25/2019 03:36:46
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      64
Date/Time:   07/25/2019 03:39:11
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      65
Date/Time:   07/25/2019 03:39:11
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------

Marostegui raised the priority of this task from Medium to High. (Jul 26 2019, 5:13 AM)

MySQL crashed again:

[Thu Jul 25 16:10:27 2019] mce: Uncorrected hardware memory error in user-access at 336c902080
[Thu Jul 25 16:10:27 2019] {1}Hardware error detected on CPU4
[Thu Jul 25 16:10:27 2019] {1}event severity: recoverable
[Thu Jul 25 16:10:27 2019] {1} Error 0, type: recoverable
[Thu Jul 25 16:10:27 2019] {1}  section_type: general processor error
[Thu Jul 25 16:10:27 2019] {1}  processor_type: 0, IA32/X64
[Thu Jul 25 16:10:27 2019] {1}  processor_isa: 2, X64
[Thu Jul 25 16:10:27 2019] {1}  error_type: 0x01
[Thu Jul 25 16:10:27 2019] {1}  cache error
[Thu Jul 25 16:10:27 2019] {1}  operation: 0, unknown or generic
[Thu Jul 25 16:10:27 2019] {1}  version_info: 0x0000000000050654
[Thu Jul 25 16:10:27 2019] {1}  processor_id: 0x0000000000000002
[Thu Jul 25 16:10:27 2019] Memory failure: 0x336c902: Killing mysqld:3138 due to hardware memory corruption
[Thu Jul 25 16:10:27 2019] Memory failure: 0x336c902: recovery action for dirty LRU page: Recovered
[Thu Jul 25 16:10:57 2019] MCE: Killing mysqld:3234 due to hardware memory corruption fault at 7fb45f7df080
[Thu Jul 25 18:04:05 2019] perf: interrupt took too long (5012 > 4995), lowering kernel.perf_event_max_sample_rate to 39750
[Fri Jul 26 03:55:06 2019] perf: interrupt took too long (6272 > 6265), lowering kernel.perf_event_max_sample_rate to 31750

I am increasing the priority to "High" as this is now happening several times a day.
This time the host itself didn't crash, but it logged HW errors at the same time:

-------------------------------------------------------------------------------
Record:      80
Date/Time:   07/25/2019 16:12:03
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
Record:      81
Date/Time:   07/25/2019 16:12:03
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      82
Date/Time:   07/25/2019 16:12:04
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      83
Date/Time:   07/25/2019 16:12:04
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      84
Date/Time:   07/25/2019 16:12:04
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      85
Date/Time:   07/25/2019 16:12:04
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      86
Date/Time:   07/25/2019 16:12:04
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      87
Date/Time:   07/25/2019 16:12:04
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      88
Date/Time:   07/25/2019 16:12:04
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      89
Date/Time:   07/25/2019 16:12:04
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      90
Date/Time:   07/25/2019 16:12:04
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      91
Date/Time:   07/25/2019 16:12:04
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      92
Date/Time:   07/25/2019 16:12:04
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------

Another crash just happened:

[Mon Jul 29 04:55:14 2019] mce: [Hardware Error]: Machine check events logged
[Mon Jul 29 04:55:14 2019] mce: Uncorrected hardware memory error in user-access at 331c802080
[Mon Jul 29 04:55:14 2019] {2}Hardware error detected on CPU23
[Mon Jul 29 04:55:14 2019] {2}event severity: recoverable
[Mon Jul 29 04:55:14 2019] {2} Error 0, type: recoverable
[Mon Jul 29 04:55:14 2019] {2}  section_type: general processor error
[Mon Jul 29 04:55:14 2019] {2}  processor_type: 0, IA32/X64
[Mon Jul 29 04:55:14 2019] {2}  processor_isa: 2, X64
[Mon Jul 29 04:55:14 2019] {2}  error_type: 0x01
[Mon Jul 29 04:55:14 2019] {2}  cache error
[Mon Jul 29 04:55:14 2019] {2}  operation: 0, unknown or generic
[Mon Jul 29 04:55:14 2019] {2}  version_info: 0x0000000000050654
[Mon Jul 29 04:55:14 2019] {2}  processor_id: 0x0000000000000036
[Mon Jul 29 04:55:14 2019] Memory failure: 0x331c802: Killing mysqld:135948 due to hardware memory corruption
[Mon Jul 29 04:55:14 2019] Memory failure: 0x331c802: recovery action for dirty LRU page: Recovered
[Mon Jul 29 04:55:49 2019] MCE: Killing mysqld:136043 due to hardware memory corruption fault at 7f110cc8c080

And the HW logs match it:

-------------------------------------------------------------------------------
Record:      93
Date/Time:   07/29/2019 05:08:48
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Record:      79
Date/Time:   07/25/2019 04:53:41
Source:      system
Severity:    Critical
Description: Correctable memory error logging disabled for a memory device at location DIMM_B1.
-------------------------------------------------------------------------------

Swapped DIMM B1 with DIMM A1 to see if we get the same problem on DIMM A1. If we do, we will have to replace the main board.
@Marostegui Please let me know if the system crashes again.

Thanks

Thanks @Papaul - going to start MySQL and let's see what happens in the next few days

@Marostegui This system crashed again. This time the error is on DIMM A1, see below.

	"Correctable memory error logging disabled for a memory device at location DIMM_A1. 	Mon 29 Jul 2019 21:18:34"

I will open a case with Dell to request a main board replacement.

@Papaul interesting, that crash didn't make MySQL crash or the host freeze this time. Good catch!
It did kill other processes:

[Tue Jul 30 00:47:38 2019] mce: Uncorrected hardware memory error in user-access at 318802080
[Tue Jul 30 00:47:38 2019] {1}Hardware error detected on CPU28
[Tue Jul 30 00:47:38 2019] {1}event severity: recoverable
[Tue Jul 30 00:47:38 2019] {1} Error 0, type: recoverable
[Tue Jul 30 00:47:38 2019] {1}  section_type: general processor error
[Tue Jul 30 00:47:38 2019] {1}  processor_type: 0, IA32/X64
[Tue Jul 30 00:47:38 2019] {1}  processor_isa: 2, X64
[Tue Jul 30 00:47:38 2019] {1}  error_type: 0x01
[Tue Jul 30 00:47:38 2019] {1}  cache error
[Tue Jul 30 00:47:38 2019] {1}  operation: 0, unknown or generic
[Tue Jul 30 00:47:38 2019] {1}  version_info: 0x0000000000050654
[Tue Jul 30 00:47:38 2019] {1}  processor_id: 0x0000000000000002
[Tue Jul 30 00:47:38 2019] Memory failure: 0x318802: Killing prometheus-pupp:117891 due to hardware memory corruption
[Tue Jul 30 00:47:38 2019] Memory failure: 0x318802: recovery action for dirty LRU page: Recovered

@Marostegui I was wrong in my comment about the main board. We replaced DIMM B1 on July 22nd and the server crashed again on the 24th. On the 29th I swapped B1 with A1, and on the 30th the server crashed again, but this time the error was on A1 and no longer on B1. So the DIMM that Dell sent us was bad, not the main board, and I have requested another DIMM.
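The swap test boils down to one rule: if the error follows the module to its new slot, the module is bad; if it stays in the original slot, the slot or the main board is suspect. A small sketch of that rule (the function name and the returned labels are made up for illustration):

```python
def diagnose_after_swap(original_slot: str, new_slot: str, error_slot: str) -> str:
    """Interpret where a memory error is reported after moving the suspect DIMM.

    original_slot: slot that reported errors before the swap
    new_slot:      slot the suspect DIMM was moved to
    error_slot:    slot that reports errors after the swap
    """
    if error_slot == new_slot:
        return "dimm"  # the error followed the module
    if error_slot == original_slot:
        return "slot_or_mainboard"  # the error stayed with the slot
    return "inconclusive"  # a third location: no conclusion from this test


# pc2010: errors on DIMM_B1, module moved to A1, errors then reported on DIMM_A1
print(diagnose_after_swap("DIMM_B1", "DIMM_A1", "DIMM_A1"))  # dimm
```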

Dell EMC Case SR 31497223

Ah, thanks for the heads up. Better then, it is easier to replace the DIMM than the main board :)

Mentioned in SAL (#wikimedia-operations) [2019-07-31T16:12:45Z] <marostegui> Poweroff pc2010 for on-site maintenance T227552

Thanks @Papaul
I have started MySQL again, let's monitor the host for a few days

I checked IDRAC logs this morning, all looks good so far

Hehe, yeah, I checked too. Let's give it till Monday
Cross your fingers!

No OS or idrac errors since the memory was replaced, so I am closing this as resolved.
If it happens again, I will re-open

Thanks @Papaul!