
db2140 crashed due to HW memory errors
Closed, Resolved, Public

Description

db2140's mysql crashed due to HW memory errors:

mysql log trace:

Jan 03 00:32:04 db2140 mysqld[1320]: 210103  0:32:04 [ERROR] mysqld got signal 7 ;
Jan 03 00:32:04 db2140 mysqld[1320]: This could be because you hit a bug. It is also possible that this binary
Jan 03 00:32:04 db2140 mysqld[1320]: or one of the libraries it was linked against is corrupt, improperly built,
Jan 03 00:32:04 db2140 mysqld[1320]: or misconfigured. This error can also be caused by malfunctioning hardware.
Jan 03 00:32:04 db2140 mysqld[1320]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
Jan 03 00:32:04 db2140 mysqld[1320]: We will try our best to scrape up some info that will hopefully help
Jan 03 00:32:04 db2140 mysqld[1320]: diagnose the problem, but since we have already crashed,
Jan 03 00:32:04 db2140 mysqld[1320]: something is definitely wrong and this may fail.
Jan 03 00:32:04 db2140 mysqld[1320]: Server version: 10.4.13-MariaDB-log
Jan 03 00:32:04 db2140 mysqld[1320]: key_buffer_size=134217728
Jan 03 00:32:04 db2140 mysqld[1320]: read_buffer_size=131072
Jan 03 00:32:04 db2140 mysqld[1320]: max_used_connections=35

And the HW memory errors on the A7 memory DIMM:

[5222445.069567] mce: Uncorrected hardware memory error in user-access at 4432f124c0
[5222445.069584] mce: [Hardware Error]: Machine check events logged
[5222445.069757] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[5222445.070251] Memory failure: 0x4432f12: Killing mysqld:1320 due to hardware memory corruption
[5222445.070255] Memory failure: 0x4432f12: recovery action for dirty LRU page: Recovered
[5222445.093612] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[5222445.093616] {2}[Hardware Error]: event severity: corrected
[5222445.093618] {2}[Hardware Error]:  Error 0, type: corrected
[5222445.093619] {2}[Hardware Error]:  fru_text: A7
[5222445.093620] {2}[Hardware Error]:   section_type: memory error
[5222445.093623] {2}[Hardware Error]:   error_status: 0x0000000000000400
[5222445.093626] {2}[Hardware Error]:   physical_address: 0x0000004432f124c0
[5222445.093629] {2}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 1 bank: 2 device: 0 row: 38062 column: 848
[5222445.093630] {2}[Hardware Error]:   error_type: 3, multi-bit ECC
[5222445.093631] {2}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000
[5222445.093655] mce: [Hardware Error]: Machine check events logged
[5222447.552327] MCE: Killing mysqld:1386 due to hardware memory corruption fault at 7f4a729124c0
[5227685.337025] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[5227685.337026] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
[5227685.337027] {3}[Hardware Error]: event severity: corrected
[5227685.337028] {3}[Hardware Error]:  Error 0, type: corrected
[5227685.337029] {3}[Hardware Error]:  fru_text: A7
[5227685.337029] {3}[Hardware Error]:   section_type: memory error
[5227685.337030] {3}[Hardware Error]:   error_status: 0x0000000000000400
[5227685.337031] {3}[Hardware Error]:   physical_address: 0x00000043927524c0
[5227685.337032] {3}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 1 bank: 2 device: 0 row: 38062 column: 200
[5227685.337033] {3}[Hardware Error]:   error_type: 14, scrub uncorrected error
[5227685.337034] {3}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000
[5227685.337056] mce: [Hardware Error]: Machine check events logged
[5227786.150666] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[5227786.150668] {4}[Hardware Error]: It has been corrected by h/w and requires no further action
[5227786.150669] {4}[Hardware Error]: event severity: corrected
[5227786.150670] {4}[Hardware Error]:  Error 0, type: corrected
[5227786.150670] {4}[Hardware Error]:  fru_text: A7
[5227786.150672] {4}[Hardware Error]:   section_type: memory error
[5227786.150673] {4}[Hardware Error]:   error_status: 0x0000000000000400
[5227786.150673] {4}[Hardware Error]:   physical_address: 0x00000043aa7524c0
[5227786.150675] {4}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 1 bank: 2 device: 0 row: 38062 column: 296
[5227786.150676] {4}[Hardware Error]:   error_type: 13, scrub corrected error
[5227786.150677] {4}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000
[5227786.150702] mce: [Hardware Error]: Machine check events logged
[5227954.553286] {5}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[5227954.553288] {5}[Hardware Error]: It has been corrected by h/w and requires no further action
[5227954.553289] {5}[Hardware Error]: event severity: corrected
[5227954.553290] {5}[Hardware Error]:  Error 0, type: corrected
[5227954.553291] {5}[Hardware Error]:  fru_text: A7
[5227954.553292] {5}[Hardware Error]:   section_type: memory error
[5227954.553293] {5}[Hardware Error]:   error_status: 0x0000000000000400
[5227954.553294] {5}[Hardware Error]:   physical_address: 0x00000043d27524c0
[5227954.553296] {5}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 1 bank: 2 device: 0 row: 38062 column: 456
[5227954.553297] {5}[Hardware Error]:   error_type: 14, scrub uncorrected error
[5227954.553299] {5}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000
[5227954.553324] mce: [Hardware Error]: Machine check events logged
[5228023.908558] {6}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[5228023.908559] {6}[Hardware Error]: It has been corrected by h/w and requires no further action
[5228023.908560] {6}[Hardware Error]: event severity: corrected
[5228023.908562] {6}[Hardware Error]:  Error 0, type: corrected
[5228023.908562] {6}[Hardware Error]:  fru_text: A7
[5228023.908564] {6}[Hardware Error]:   section_type: memory error
[5228023.908565] {6}[Hardware Error]:   error_status: 0x0000000000000400
[5228023.908565] {6}[Hardware Error]:   physical_address: 0x00000043e2f524c0
[5228023.908567] {6}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 1 bank: 2 device: 0 row: 38062 column: 536
[5228023.908568] {6}[Hardware Error]:   error_type: 14, scrub uncorrected error
[5228023.908569] {6}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000
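
As a cross-check on the kernel log above, here is a minimal Python sketch (not part of the original task) that groups the APEI lines by their `{N}` sequence tag and pulls out `fru_text`, `row`, `physical_address`, and `error_type`. The names `parse_apei_records` and `page_frame` are hypothetical, the sample is a shortened excerpt of records {2} and {3}, and the 4 KiB page size (shift of 12) is an assumption based on the x86-64 default.

```python
import re
from collections import Counter

# Shortened excerpt of the kernel log above (records {2} and {3} only).
SAMPLE = """\
[5222445.093619] {2}[Hardware Error]:  fru_text: A7
[5222445.093626] {2}[Hardware Error]:   physical_address: 0x0000004432f124c0
[5222445.093629] {2}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 1 bank: 2 device: 0 row: 38062 column: 848
[5222445.093630] {2}[Hardware Error]:   error_type: 3, multi-bit ECC
[5227685.337029] {3}[Hardware Error]:  fru_text: A7
[5227685.337031] {3}[Hardware Error]:   physical_address: 0x00000043927524c0
[5227685.337032] {3}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 1 bank: 2 device: 0 row: 38062 column: 200
[5227685.337033] {3}[Hardware Error]:   error_type: 14, scrub uncorrected error
"""

LINE_RE = re.compile(r"\{(\d+)\}\[Hardware Error\]:\s+(.*)")
ROW_RE = re.compile(r"\brow: (\d+)")
WANTED = ("fru_text", "physical_address", "error_type")


def parse_apei_records(text):
    """Return {sequence_number: {field: value}} for APEI error lines."""
    records = {}
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue
        record = records.setdefault(int(match.group(1)), {})
        payload = match.group(2)
        row = ROW_RE.search(payload)
        if row:
            # The "node: ... row: ... column: ..." line carries the DRAM row.
            record["row"] = int(row.group(1))
            continue
        key, sep, value = payload.partition(": ")
        if sep and key in WANTED:
            record[key] = value.strip()
    return records


def page_frame(physical_address, page_shift=12):
    """Page frame number for an address, assuming 4 KiB pages by default."""
    return physical_address >> page_shift


records = parse_apei_records(SAMPLE)
print(Counter(r["fru_text"] for r in records.values()))
print({r["row"] for r in records.values()})
print(hex(page_frame(int(records[2]["physical_address"], 16))))  # 0x4432f12
```

The last line reproduces the page frame `0x4432f12` from the `Memory failure:` messages, which ties the killed mysqld process to the address reported for record {2}. In the full log every record names `fru_text: A7` with the same rank, bank, and row (38062), which points at one failing region of a single module.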

And these are the Dell HW logs, which show the errors occurring since December:

-------------------------------------------------------------------------------
Record:      2
Date/Time:   12/06/2020 05:50:15
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A7.
-------------------------------------------------------------------------------
------------------------------------------------------------------------------
Record:      23
Date/Time:   01/03/2021 02:05:11
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A7.
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Record:      6
Date/Time:   01/03/2021 00:32:12
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A7.
-------------------------------------------------------------------------------
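
To summarise hardware log records like the ones above, a small Python sketch along these lines could count Critical entries per DIMM and report the earliest one. It is only an illustration: `parse_sel`, `critical_by_dimm`, and `first_critical` are hypothetical names, the sample is abbreviated (descriptions shortened), and the `MM/DD/YYYY` date order is an assumption inferred from the 01/03/2021 entries matching the Jan 03 crash.

```python
import re
from collections import Counter
from datetime import datetime

# Abbreviated copy of the three records above (descriptions shortened).
SEL_SAMPLE = """\
-------------------------------------------------------------------------------
Record:      2
Date/Time:   12/06/2020 05:50:15
Source:      system
Severity:    Critical
Description: uncorrectable multi-bit memory errors at the location DIMM_A7.
-------------------------------------------------------------------------------
Record:      23
Date/Time:   01/03/2021 02:05:11
Source:      system
Severity:    Critical
Description: uncorrectable multi-bit memory errors at the location DIMM_A7.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   01/03/2021 00:32:12
Source:      system
Severity:    Critical
Description: uncorrectable multi-bit memory errors at the location DIMM_A7.
-------------------------------------------------------------------------------
"""

SEPARATOR_RE = re.compile(r"^-{10,}$", re.MULTILINE)
DIMM_RE = re.compile(r"DIMM_[A-Z]\d+")


def parse_sel(text):
    """Split the log on dashed separator lines and parse 'Key: value' pairs."""
    entries = []
    for block in SEPARATOR_RE.split(text):
        fields = {}
        for line in block.strip().splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "Record" in fields:
            entries.append(fields)
    return entries


def critical_by_dimm(entries):
    """Count Critical entries per DIMM location named in the description."""
    counts = Counter()
    for entry in entries:
        if entry.get("Severity") == "Critical":
            counts.update(DIMM_RE.findall(entry.get("Description", "")))
    return counts


def first_critical(entries):
    """Timestamp of the earliest Critical entry (assumes MM/DD/YYYY)."""
    stamps = [
        datetime.strptime(entry["Date/Time"], "%m/%d/%Y %H:%M:%S")
        for entry in entries
        if entry.get("Severity") == "Critical"
    ]
    return min(stamps) if stamps else None


entries = parse_sel(SEL_SAMPLE)
print(critical_by_dimm(entries))
print(first_critical(entries))
```

On this excerpt all three Critical entries name DIMM_A7, and the earliest is dated 6 December 2020, in line with the note that the errors go back to December.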

@Papaul should we swap that DIMM with another one and wait to see whether it fails again?

Event Timeline

Restricted Application added a subscriber: Aklapper.
Marostegui triaged this task as Medium priority. Jan 4 2021, 6:29 AM
Marostegui moved this task from Triage to In progress on the DBA board.

Change 654042 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2140: Disable notifications

https://gerrit.wikimedia.org/r/654042

Change 654042 merged by Marostegui:
[operations/puppet@production] db2140: Disable notifications

https://gerrit.wikimedia.org/r/654042

@Marostegui yes, we can swap the DIMM and see. Depool the server when you can and let me know.

@Marostegui
Swapped A7 with B6 and cleared the iDRAC log; no more errors for now.

Thanks Papaul! I am going to check the data and will close the task once I am done.
If it happens again we can reopen and ask Dell for a replacement.

Data was checked and came back up clean.
Closing this - thanks for getting on this so fast Papaul!

The server went down again:

Record:      1
Date/Time:   01/04/2021 15:42:11
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   01/06/2021 09:44:16
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B6.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   01/06/2021 09:44:16
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   01/06/2021 09:44:16
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   01/06/2021 09:44:16
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   01/06/2021 09:44:16
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B6.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   01/06/2021 09:47:09
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   01/06/2021 09:47:09
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B6.
-------------------------------------------------------------------------------

Given the flaky state of the hardware, I'm not powering up the server again. db2140 was depooled with dbctl.

@Papaul: Since the errors followed the faulty DIMM from A7 to B6, we should be eligible to get the DIMM replaced now?
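
The reasoning behind the swap test can be written down as a small decision rule. The following Python sketch is purely illustrative (the function `diagnose_swap` and its return labels are hypothetical, not a tool used in this task): if the error reappears in the slot the suspect module was moved to, the module is at fault; if it stays in the original slot, the slot or memory channel is the more likely culprit.

```python
def diagnose_swap(slot_before, slot_after, swapped_pair):
    """Classify a memory fault after two DIMMs have traded slots.

    slot_before:  slot that reported errors before the swap, e.g. "A7"
    slot_after:   slot that reported errors after the swap, or None
    swapped_pair: the two slots whose DIMMs were exchanged
    """
    first, second = swapped_pair
    if slot_before not in swapped_pair or slot_after is None:
        return "inconclusive"
    partner = second if slot_before == first else first
    if slot_after == partner:
        # The error moved with the module to its new slot.
        return "module"
    if slot_after == slot_before:
        # The error stayed put even though the module changed.
        return "slot"
    return "inconclusive"


# Errors on A7 before the swap, on B6 afterwards.
print(diagnose_swap("A7", "B6", ("A7", "B6")))  # module
```

Applied to this task, the errors were reported on DIMM_A7 before the swap and on DIMM_B6 afterwards, so the rule classifies the module itself as faulty.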

Thanks Moritz.
@Papaul let me know if you need anything else apart from the iDRAC logs to provide to Dell in order to get a replacement.

Change 654735 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2140: Disable notifications

https://gerrit.wikimedia.org/r/654735

Change 654735 merged by Marostegui:
[operations/puppet@production] db2140: Disable notifications

https://gerrit.wikimedia.org/r/654735

Create Dispatch: Success
You have successfully submitted request SR1048216249.

Thank you Papaul - once it arrives, feel free to replace the DIMM (the host is off) and power it back on.

The DIMM is on site; I will replace it tomorrow once I am there.

DIMM B6 replaced, server is back up.

Return tracking information below.

Thanks Papaul.
Going to start mysql, check its data, enable replication and later repool it. Will close this task once fully done.

Change 656104 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2140: Enable notifications

https://gerrit.wikimedia.org/r/656104

Mentioned in SAL (#wikimedia-operations) [2021-01-14T08:42:44Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db2140 T271084', diff saved to https://phabricator.wikimedia.org/P13764 and previous config saved to /var/cache/conftool/dbconfig/20210114-084243-marostegui.json

Change 656104 merged by Marostegui:
[operations/puppet@production] db2140: Enable notifications

https://gerrit.wikimedia.org/r/656104

Marostegui reassigned this task from Marostegui to Papaul.

Data check was ok.
Notifications enabled and host repooled.

Thanks Papaul for replacing its memory.