db2097 memory errors leading to crash
Closed, ResolvedPublic

Description

Message on restart:

462 - Uncorrectable Memory Error Threshold Exceeded (Processor 2, DIMM 4).  The
DIMM is mapped out and is currently not available.
Action: Take corrective action for the failing DIMM. Re-map all DIMMs back into
the memory map in RBSU. If the issue persists, contact support.


511 - One or more DIMMs have been mapped out due to a memory error, resulting
in an unbalanced memory configuration across memory controllers. This may
result in non-optimal memory performance.
Action: See the Integrated Management Log (IML) for information on the memory
error.  Consult documentation for memory population guidelines.

HW logs:

/system1/log1/record38
  Targets
  Properties
    number=38
    severity=Critical
    date=05/12/2020
    time=01:50:28
    description=Uncorrectable Machine Check Exception (Processor 2, APIC ID 0x00000022, Bank 0x00000008, Status 0xBC000000'01010091, Address 0x00000044'78E5EF40, Misc 0x200405C2'88202086).
  Verbs


/system1/log1/record39
  Targets
  Properties
    number=39
    severity=Critical
    date=05/12/2020
    time=01:50:28
    description=DIMM Failure - Uncorrectable Memory Error (Processor 2, DIMM 4)


/system1/log1/record40
  Targets
  Properties
    number=40
    severity=Critical
    date=05/12/2020
    time=01:50:52
    description=Uncorrectable Memory Error Threshold Exceeded (Processor 2, DIMM 4).  The DIMM is mapped out and is currently not available.
  Verbs

/system1/log1/record41
  Targets
  Properties
    number=41
    severity=Informational
    date=05/12/2020
    time=01:54:52
    description=One or more DIMMs have been mapped out due to a memory error, resulting in an unbalanced memory configuration across memory controllers. This may result in non-optimal memory performance.
  Verbs
    cd version exit show

/system1/log1/record42
  Targets
  Properties
    number=42
    severity=Repaired
    date=05/12/2020
    time=01:56:07
    description=HPE Ethernet 1Gb 4-port 331i Adapter - NIC Connectivity status changed to OK for adapter in slot 0, port 1
  Verbs
    cd version exit show
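
To triage a dump like the one above, the records can be turned into dictionaries and filtered by severity. The following is only a minimal sketch: the function name `parse_iml_records` is hypothetical, and it assumes the text layout is exactly what is pasted here (a `/system1/log1/recordN` path line followed by `key=value` properties).

```python
import re

# Property keys seen in the pasted iLO log records.
KEYS = {"number", "severity", "date", "time", "description"}

def parse_iml_records(text):
    """Parse pasted '/system1/log1/recordN' blocks into a list of dicts."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if re.fullmatch(r"/system1/log1/record\d+", line):
            # A path line starts a new record.
            current = {"path": line}
            records.append(current)
        elif current is not None and "=" in line:
            # Split on the first '=' only; descriptions may contain more.
            key, _, value = line.partition("=")
            if key in KEYS:
                current[key] = value
    return records

sample = """\
/system1/log1/record39
  Targets
  Properties
    number=39
    severity=Critical
    date=05/12/2020
    time=01:50:28
    description=DIMM Failure - Uncorrectable Memory Error (Processor 2, DIMM 4)

/system1/log1/record42
  Targets
  Properties
    number=42
    severity=Repaired
    date=05/12/2020
    time=01:56:07
    description=HPE Ethernet 1Gb 4-port 331i Adapter - NIC Connectivity status changed to OK for adapter in slot 0, port 1
  Verbs
    cd version exit show
"""

records = parse_iml_records(sample)
critical = [r for r in records if r.get("severity") == "Critical"]
for r in critical:
    print(r["date"], r["time"], r["description"])
```

Filtering on `severity` leaves only the memory-related entries and drops the NIC recovery record.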

This seems the same as T225378#5245612, but a different DIMM is reporting errors this time.
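
For reference, the `Status` value from record 38 can be decoded by hand. This is a rough sketch that assumes the value follows the architectural IA32_MCi_STATUS layout from the Intel SDM (flag bits 63 to 57, and the "memory controller error" compound code `0000 0000 1MMM CCCC` in the low 16 bits). The names `FLAGS`, `MEM_OPS` and `decode_mci_status` are made up for illustration. The DIMM mapping itself comes from the HPE firmware, not from this decoding.

```python
# Architectural flag bits of IA32_MCi_STATUS (assumed layout, per Intel SDM).
FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
}

# MMM field of the memory controller compound error code.
MEM_OPS = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}

def decode_mci_status(status):
    """Decode the flag bits and, if applicable, the memory controller code."""
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {"flags": flags, "mca_error_code": code}
    # Memory controller errors match 0000 0000 1MMM CCCC.
    if (code & 0xFF80) == 0x0080:
        result["memory_op"] = MEM_OPS.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        # 0xF means "channel not specified".
        result["channel"] = None if channel == 0xF else channel
    return result

# Status value copied from record 38; the apostrophe is a digit separator.
status = int("0xBC000000'01010091".replace("'", ""), 16)
decoded = decode_mci_status(status)
print(decoded)
```

Under these assumptions the value reads as a valid, uncorrected memory read error with valid ADDR and MISC registers, which matches the "Uncorrectable Memory Error" wording in record 39.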

Previous summary:

01:54 <+icinga-wm> PROBLEM - Host db2097 is DOWN: PING CRITICAL - Packet loss = 100%
01:56 <+icinga-wm> RECOVERY - Host db2097 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms
01:59 <+icinga-wm> PROBLEM - MariaDB read only s6 on db2097 is CRITICAL: Could not connect to localhost:3316 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
01:59 <+icinga-wm> PROBLEM - MariaDB Slave IO: s1 on db2097 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
02:00 <+icinga-wm> PROBLEM - MariaDB Slave IO: s6 on db2097 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
02:00 <+icinga-wm> PROBLEM - mysqld processes on db2097 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
02:00 <+icinga-wm> PROBLEM - MariaDB Slave SQL: s1 on db2097 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
02:02 <+icinga-wm> PROBLEM - MariaDB Slave SQL: s6 on db2097 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
02:04 <+icinga-wm> PROBLEM - MariaDB read only s1 on db2097 is CRITICAL: Could not connect to localhost:3311 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
02:10 <+icinga-wm> PROBLEM - MariaDB Slave Lag: s6 on db2097 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave

Event Timeline

The same thing happened at T225378#5303653

root@db2097:~$ date
Tue May 12 05:36:21 UTC 2020
root@db2097:~$ free -m
              total        used        free      shared  buff/cache   available
Mem:         483434        1011      482043          17         378      480158
Swap:          7628           0        7628
Marostegui added subscribers: Papaul, wiki_willy.

From what I can see, this host is under warranty, so maybe we should get a replacement DIMM from HP? That is what we did when it happened in T225378.

I set the host up again from backups. I am going to generate a logical backup (a snapshot will also be generated later today) and then hand this over to DC ops.

Change 596171 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] icinga: Disable notifications for db2097 for maintenance

https://gerrit.wikimedia.org/r/596171

Change 596171 merged by Jcrespo:
[operations/puppet@production] icinga: Disable notifications for db2097 for maintenance

https://gerrit.wikimedia.org/r/596171

jcrespo renamed this task from db2097 (backup source) restarted itself to db2097 memory errors leading to crash. May 13 2020, 10:10 AM
jcrespo updated the task description.

@Papaul please help us out. This looks like an ordinary DIMM failure, but we need to do the usual swap to rule out the board/processor. This happened before in T225378!

The host is up and serving (with one memory stick automatically disabled, so less memory is available) because we depend on it for database backups to continue. Let me know when you are available at the DC so we can shut it down gracefully.

Papaul triaged this task as Medium priority. May 14 2020, 3:26 PM

Case Reference ID: 5347351645
Status: Case is generated and in Progress
Product: HPE ProLiant DL360 Gen10 8SFF Configure-to-order Server
Product number: 867959-B21
Serial number:
Subject: HPE ProLiant DL360 Gen10 - Bad DIMM

I will receive the DIMM tomorrow. The HP engineer recommended updating the firmware after the DIMM has been replaced.

Hello Papaul,

Greetings from Hewlett Packard Enterprise!

As discussed , as per the AHS logs :

Memory Failure is seen on Proc 2 DIMM 4.
Uncorrectable Machine Check exception is seen and as per that BIOS will need to be updated to at least 2.16 or later version.

Current firmware is at 2.00.
Latest firmware is 2.34

Also, iLO can also be updated to the latest.
Current iLO - 1.40.
Latest --2.15

Incremental Update is advised : URL for iLO :
https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_f0ad4364410c4ea19f326434aa

NOTE: BIOS update would require a reboot of the server , and also advised to have valid Data back up on the server.

Below is the URL for BIOS :
https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_8bf8f20d0838488e83b6b4a4ac

Incremental Update is advised :
2.00--> 2.04-->2.16--->2.30-->2.32-->2.34
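
The incremental path above can be written down as a small helper that lists the versions still to be flashed from a given starting point. This is only an illustrative sketch: `UPDATE_CHAIN` copies the versions from the email, and `remaining_steps` is a hypothetical name, not an HPE tool.

```python
# BIOS versions in the order advised in the email above.
UPDATE_CHAIN = ["2.00", "2.04", "2.16", "2.30", "2.32", "2.34"]

def remaining_steps(current, chain=UPDATE_CHAIN):
    """Return the versions to flash, in order, starting after `current`."""
    if current not in chain:
        raise ValueError(f"version {current!r} is not in the advised chain")
    return chain[chain.index(current) + 1:]

# The host was reported at BIOS 2.00.
steps = remaining_steps("2.00")
print(" -> ".join(steps))
```

A version that is not in the chain raises an error instead of guessing a path, since the email only advises these specific steps.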

Installation Instruction :

a] Download the .exe file from the above link.
b] Extract the .exe file to obtain a .signed.flash file
c] Login to the iLO. Navigate to Firmware-->Update Firmware
d] Click on choose file. Upload this .signed.flash in this location. And install the update.
e] This will update the BIOS firmware on the server and would require a reboot for the changes to take effect.

Should you need further assistance, let us know and I will call back as soon as possible.

The memory arrived yesterday.

No rush on our side. Just let us know the day before you go to the DC for this, so I can stop the server 24h in advance.

I will be onsite tomorrow.

I will stop the backup processes and then stop the server.

Mentioned in SAL (#wikimedia-operations) [2020-05-26T10:18:34Z] <jynus> stop db2097 for hw maintenance T252492

$ ssh db2097.mgmt
User:root logged-in to ILOMXQ91304KD.(10.193.2.204 / FE80::8230:E0FF:FE3E:F9A2)
iLO Standard 1.40 at  Feb 05 2019
Server Name: 
Server Power: Off

Host is down and ready for maintenance, @Papaul.

Memory replacement and firmware upgrade complete.

Return label information below

Mentioned in SAL (#wikimedia-operations) [2020-05-27T08:42:19Z] <jynus> starting again db2097 db instances T252492