Bad ram on db1127
Closed, Resolved (Public)

Description

MariaDB crashed due to RAM errors. dmesg:

[5253958.365243] MCE: Killing mysqld:31372 due to hardware memory corruption fault at 7f3acf5a4042
[5254042.655070] MCE: Killing mysqld:31372 due to hardware memory corruption fault at 7f3acf5a4042
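
For triage, the two kill events above can be pulled out of a dmesg dump programmatically. The following is a minimal sketch, assuming the kernel message has exactly the shape shown in this excerpt; the function name `parse_mce_kills` and the dictionary keys are illustrative and not part of any existing tool.

```python
import re

# Pattern for the MCE kill message, modelled only on the two dmesg lines above.
MCE_KILL = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+MCE: Killing (?P<comm>\S+):(?P<pid>\d+) "
    r"due to hardware memory corruption fault at (?P<addr>[0-9a-f]+)"
)


def parse_mce_kills(dmesg_text):
    """Return one dict per MCE kill line found in dmesg_text (hypothetical helper)."""
    events = []
    for m in MCE_KILL.finditer(dmesg_text):
        events.append({
            "uptime_s": float(m.group("ts")),        # seconds since boot
            "process": m.group("comm"),
            "pid": int(m.group("pid")),
            "address": int(m.group("addr"), 16),     # faulting virtual address
        })
    return events


SAMPLE = """\
[5253958.365243] MCE: Killing mysqld:31372 due to hardware memory corruption fault at 7f3acf5a4042
[5254042.655070] MCE: Killing mysqld:31372 due to hardware memory corruption fault at 7f3acf5a4042
"""

events = parse_mce_kills(SAMPLE)
for e in events:
    print(e["uptime_s"], e["process"], e["pid"], hex(e["address"]))
```

On the excerpt above this yields two events for the same PID and the same address, roughly 84 seconds apart.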

ipmi-sel:

root@db1127:~# ipmi-sel
ID  | Date        | Time     | Name             | Type                        | Event
1   | Aug-03-2021 | 17:03:53 | SEL              | Event Logging Disabled      | Log Area Reset/Cleared
2   | Sep-15-2021 | 18:19:26 | Mem ECC Warning  | Memory                      | transition to Non-Critical from OK
3   | Sep-20-2021 | 16:45:30 | Mem ECC Warning  | Memory                      | transition to Critical from less severe

Host is depooled at the dbctl level.

/admin1-> racadm getsel
Record:      1
Date/Time:   08/03/2021 17:03:53
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   09/15/2021 18:19:26
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   09/20/2021 16:45:30
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
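
To see at a glance which DIMM the log points at, the `racadm getsel` output above can be split into records and tallied per slot. This is a sketch under the assumption that records are `Key: value` lines separated by dashed rules, as in the paste above; `parse_racadm_sel` and `count_dimm_errors` are hypothetical helper names, not racadm features.

```python
import re
from collections import Counter


def parse_racadm_sel(text):
    """Split 'racadm getsel' text into a list of {field: value} dicts."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("---"):          # dashed rule closes a record
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:                             # last record without a trailing rule
        records.append(current)
    return records


def count_dimm_errors(records):
    """Count records whose description names a DIMM slot such as DIMM_A3."""
    counts = Counter()
    for rec in records:
        m = re.search(r"DIMM_\w+", rec.get("Description", ""))
        if m:
            counts[m.group(0)] += 1
    return counts


SAMPLE = """\
Record:      1
Date/Time:   08/03/2021 17:03:53
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   09/15/2021 18:19:26
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   09/20/2021 16:45:30
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
"""

records = parse_racadm_sel(SAMPLE)
dimm_counts = count_dimm_errors(records)
print(dict(dimm_counts))
```

For the log above, both memory events resolve to the same slot, DIMM_A3.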

Related Objects

Status   | Subtype | Assigned  | Task
Resolved |         | Cmjohnson |
Resolved |         | Kormat    |

Event Timeline

Change 725450 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] db1127: Disable notifications

https://gerrit.wikimedia.org/r/725450

Change 725450 merged by Kormat:

[operations/puppet@production] db1127: Disable notifications

https://gerrit.wikimedia.org/r/725450

Change 725451 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] db1127: Use new notifications-disable format

https://gerrit.wikimedia.org/r/725451

Change 725451 merged by Kormat:

[operations/puppet@production] db1127: Use new notifications-disable format

https://gerrit.wikimedia.org/r/725451

Joe triaged this task as High priority. Oct 4 2021, 5:56 AM

Updated description with iDRAC output.

Created a Dell dispatch ticket: "You have successfully submitted request SR1071944241."

@Kormat the db1127 DIMM is on-site; I need to take the server offline to replace it.

@Cmjohnson Kormat is away today; if you give me enough time I can put it offline for you. :-)

Icinga downtime set by jynus@cumin1001 for 1 day, 0:00:00 1 host(s) and their services with reason: hw maintenance

db1127.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2021-10-06T16:35:21Z] <jynus> stopping db1127 for hw maintenance T292366

@Cmjohnson you can proceed. The host is powered off according to racadm, but I didn't power it off: either it crashed or something happened before I could stop mysql cleanly. I will know more when it comes up again.

DIMM replaced, cleared the error logs, everything looks good from my end. @jynus I am resolving the task to remove it from our queue.

Change 730146 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] db1127: Re-enable notifications

https://gerrit.wikimedia.org/r/730146

Change 730146 merged by Kormat:

[operations/puppet@production] db1127: Re-enable notifications

https://gerrit.wikimedia.org/r/730146