db1155 HW memory errors
Closed, Resolved · Public

Description

[9049296.216488] mce: Uncorrected hardware memory error in user-access at 506a064900
[9049296.216496] mce: [Hardware Error]: Machine check events logged
[9049296.219992] Memory failure: 0x506a064: Sending SIGBUS to mysqld:4072 due to hardware memory corruption
[9049296.233458] Memory failure: 0x506a064: recovery action for dirty LRU page: Recovered
[9049307.944609] MCE: Killing mysqld:4072 due to hardware memory corruption fault at 7fcfec064904
Record:      80
Date/Time:   05/19/2025 06:14:34
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B5.
-------------------------------------------------------------------------------
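For readers cross-checking the kernel messages above, here is a minimal Python sketch that pulls the physical address, page frame number, and affected process out of two of the log lines. It assumes 4 KiB pages (a shift of 12 bits, the usual x86-64 default). The regular expressions and the `summarize` helper are illustrative only and are not part of any tooling used on this host.

```python
import re

# Two lines copied verbatim from the kernel log in this task's description.
LOG = """\
[9049296.216488] mce: Uncorrected hardware memory error in user-access at 506a064900
[9049296.219992] Memory failure: 0x506a064: Sending SIGBUS to mysqld:4072 due to hardware memory corruption
"""

PAGE_SHIFT = 12  # assumption: 4 KiB pages

# Hypothetical patterns written for these two message shapes only.
MCE_RE = re.compile(r"mce: Uncorrected hardware memory error in \S+ at ([0-9a-f]+)")
FAIL_RE = re.compile(r"Memory failure: (0x[0-9a-f]+): Sending SIGBUS to (\S+):(\d+)")


def summarize(log):
    """Return the faulting address, its page frame number, and the signalled process."""
    mce = MCE_RE.search(log)
    fail = FAIL_RE.search(log)
    if mce is None or fail is None:
        return None
    phys_addr = int(mce.group(1), 16)
    return {
        "phys_addr": phys_addr,
        # Dropping the low 12 bits of the physical address gives the page frame number.
        "pfn_from_addr": phys_addr >> PAGE_SHIFT,
        "pfn_logged": int(fail.group(1), 16),
        "process": fail.group(2),
        "pid": int(fail.group(3)),
    }


summary = summarize(LOG)
print(summary)
```

Under that page-size assumption, the frame number derived from the `mce:` address matches the one in the `Memory failure:` line, which confirms both messages describe the same physical page.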

Event Timeline

Marostegui triaged this task as Medium priority. (May 19 2025, 5:28 AM)
Marostegui moved this task from Triage to In progress on the DBA board.

Any chance we can replace this DIMM with another one from a decommissioned server?

Hey @Marostegui we certainly do! Is there a preferred time or date to swap out the memory? Let us know, thanks!

@VRiley-WMF thanks! I'll have the host ready for you tomorrow if that's ok?

@Marostegui That works for me. I'll plan for it then. Thanks!

Change #1147921 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1155: Disable notifications

https://gerrit.wikimedia.org/r/1147921

Mentioned in SAL (#wikimedia-operations) [2025-05-20T04:51:22Z] <marostegui> Stop mariadb on db1155, wiki replicas will show lag on: s2, s4, s6 and s7 T394624

Change #1147921 merged by Marostegui:

[operations/puppet@production] db1155: Disable notifications

https://gerrit.wikimedia.org/r/1147921

@VRiley-WMF db1155 is now off and ready for you to replace the memory whenever you want.

When is this expected to be solved? Because of this problem many important maintenance and monitoring tools are broken. This should have UBN priority.

Please note that we are working on it, as you can see above. Also, keep in mind that the wiki replicas infrastructure isn't considered a mission-critical environment, and lag can happen at any time; it is not ideal, but it can happen (https://wikitech.wikimedia.org/wiki/Help:Wiki_Replicas#Identifying_lag). Luckily, we've been running this infra for years with pretty much no outages, but these things can happen and we work on them when we can. Generally they get resolved faster, but again, this is not a production-critical environment.

VRiley-WMF changed the task status from Open to In Progress. (May 21 2025, 5:27 PM)

Taking this unit down for the memory swap.

This is completed

Thanks!

I started the mariadb daemons.