db1262 is down
Closed, Resolved, Public

Description

Placeholder task for investigating db1262 which stopped answering ping at 00:08. We also got paged for TransitPeeringOutboundSaturation at 00:10, but at the moment we think that's unrelated.

Event Timeline

RLazarus triaged this task as High priority.

Mentioned in SAL (#wikimedia-operations) [2025-11-06T00:29:22Z] <cdobbins@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1262.eqiad.wmnet with reason: HW issues, T409374

MariaDB crashed due to bug. We need to report this to upstream I think:

Nov 06 00:05:47 db1262 mysqld[2341753]: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050)[0x7fbcee85a050]
Nov 06 00:05:36 db1262 mysqld[2341753]: /opt/wmf-mariadb1011/bin/mysqld(handle_fatal_signal+0x1a3)[0x55f399d9f433]
Nov 06 00:05:34 db1262 mysqld[2341753]: /opt/wmf-mariadb1011/bin/mysqld(my_print_stacktrace+0x2e)[0x55f39a2eefee]
Nov 06 00:05:21 db1262 mysqld[2341753]: stack_bottom = 0x7f5c95258000 thread_stack 0x30000
Nov 06 00:05:21 db1262 mysqld[2341753]: 2025-11-06  0:05:21 0 [Note] Event Scheduler: Stopped
Nov 06 00:05:21 db1262 mysqld[2341753]: 2025-11-06  0:05:21 0 [Note] Event Scheduler: Waiting for the scheduler thread to reply
Nov 06 00:05:21 db1262 mysqld[2341753]: 2025-11-06  0:05:21 0 [Note] Event Scheduler: Killing the scheduler thread, thread id 2
Nov 06 00:05:21 db1262 mysqld[2341753]: Thread pointer: 0x7f5a500015a8
Nov 06 00:05:21 db1262 mysqld[2341753]: (note: Retrieving this information may fail)
Nov 06 00:05:21 db1262 mysqld[2341753]: Attempting backtrace. Include this in the bug report.
Nov 06 00:05:21 db1262 mysqld[2341753]: Following these instructions will help MariaDB developers provide a fix quicker.
Nov 06 00:05:21 db1262 mysqld[2341753]: contains instructions to obtain a better version of the backtrace below.
Nov 06 00:05:21 db1262 mysqld[2341753]: The information page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mariadbd/
Nov 06 00:05:21 db1262 mysqld[2341753]: 2025-11-06  0:05:21 0 [Note] /opt/wmf-mariadb1011/bin/mysqld (initiated by: unknown): Normal shutdown
Nov 06 00:05:21 db1262 mysqld[2341753]: Server version: 10.11.14-MariaDB-log source revision: 053f9bcb5b147bf00edb99e1310bae9125b7f125
Nov 06 00:05:21 db1262 mysqld[2341753]: information below.
Nov 06 00:05:21 db1262 mysqld[2341753]: Please include the information from the server start above, to the end of the
Nov 06 00:05:21 db1262 mysqld[2341753]: a bug on https://jira.mariadb.org/.
Nov 06 00:05:21 db1262 mysqld[2341753]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs about how to report
Nov 06 00:05:21 db1262 mysqld[2341753]: Your assistance in bug reporting will enable us to fix this for the next release.
Nov 06 00:05:21 db1262 mysqld[2341753]: Sorry, we probably made a mistake, and this is a bug.
Nov 06 00:05:21 db1262 mysqld[2341753]: 251106  0:05:21 [ERROR] /opt/wmf-mariadb1011/bin/mysqld got signal 7 ;

I haven't seen anything in kern.log, so I don't think it's HW, but I haven't looked too deep.

It's downtimed for 3 days. I can try to restart the daemon to see if that brings it back online, but I'd rather leave it as is to ease reporting.
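
The signal number in the trace above can be decoded as a quick sanity check before filing anything upstream. This is a minimal sketch, assuming Linux x86-64 signal numbering; the lookup table and the `decode_signal` helper are hand-written for illustration and are not taken from the host or from MariaDB.

```python
import re

# Signal numbers as defined on Linux x86-64 (assumption: the host uses this ABI).
LINUX_X86_64_SIGNALS = {6: "SIGABRT", 7: "SIGBUS", 11: "SIGSEGV"}

# Crash line copied from the mysqld log above.
LINE = "251106  0:05:21 [ERROR] /opt/wmf-mariadb1011/bin/mysqld got signal 7 ;"


def decode_signal(line):
    """Return (number, name) for a 'got signal N' crash line, or None if absent."""
    match = re.search(r"got signal (\d+)", line)
    if match is None:
        return None
    number = int(match.group(1))
    return number, LINUX_X86_64_SIGNALS.get(number, "unknown")


print(decode_signal(LINE))
```

On Linux, SIGBUS is one of the signals the kernel can deliver when a process touches a page with an uncorrected memory error, so a signal 7 crash is a reason to check the kernel machine-check messages as well as the database trace.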

Marostegui subscribed.

I will take a look, but that is just a generic bug trace, and I doubt they'd be able to do much with just that unless we can reproduce it or report something more meaningful.

Marostegui added a subscriber: wiki_willy.

The host went down, so it is not really a mariadb bug in that sense.
This is more likely to be a memory error:

2025-11-06T00:05:21.598940+00:00 db1262 kernel: [7808363.580007] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
2025-11-06T00:05:21.598965+00:00 db1262 kernel: [7808363.861746] Disabling lock debugging due to kernel taint
2025-11-06T00:05:21.598966+00:00 db1262 kernel: [7808363.868784] {1}[Hardware Error]: event severity: recoverable
2025-11-06T00:05:21.617980+00:00 db1262 kernel: [7808363.868802] mce: Uncorrected hardware memory error in user-access at 9feca2a00
2025-11-06T00:05:21.618002+00:00 db1262 kernel: [7808363.874615] {1}[Hardware Error]:  Error 0, type: recoverable
2025-11-06T00:05:21.618002+00:00 db1262 kernel: [7808363.874618] {1}[Hardware Error]:  fru_text: A4
2025-11-06T00:05:21.618003+00:00 db1262 kernel: [7808363.882020] mce: [Hardware Error]: Machine check events logged
2025-11-06T00:05:21.618004+00:00 db1262 kernel: [7808363.887841] {1}[Hardware Error]:   section_type: memory error
2025-11-06T00:05:21.618004+00:00 db1262 kernel: [7808363.887843] {1}[Hardware Error]:    error_status: Storage error in DRAM memory (0x0000000000000400)
2025-11-06T00:05:21.618005+00:00 db1262 kernel: [7808363.887845] {1}[Hardware Error]:   physical_address: 0x00000009feca2e80
2025-11-06T00:05:21.618017+00:00 db1262 kernel: [7808363.887846] {1}[Hardware Error]:   physical_address_mask: 0xffffffffffffffc0
2025-11-06T00:05:21.618018+00:00 db1262 kernel: [7808363.887848] {1}[Hardware Error]:   node:0 card:6 module:0 rank:0 bank:8 device:0 row:19442 column:184
2025-11-06T00:05:21.628572+00:00 db1262 kernel: [7808363.892473] mce: [Hardware Error]: CPU 10: Machine Check Exception: 7 Bank 1: bd80000000100134
2025-11-06T00:05:21.628582+00:00 db1262 kernel: [7808363.898388] {1}[Hardware Error]:   error_type: 3, multi-bit ECC
2025-11-06T00:05:21.628583+00:00 db1262 kernel: [7808363.898390] {1}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000
2025-11-06T00:05:21.689765+00:00 db1262 kernel: [7808363.953692] mce: [Hardware Error]: RIP 33:<000055f39a12ce14>
2025-11-06T00:05:21.689788+00:00 db1262 kernel: [7808363.959614] mce: [Hardware Error]: TSC 530d6eb4c066bb ADDR 9feca2a00 MISC 86 PPIN 209b3a6f9f74a4ee
2025-11-06T00:05:21.699026+00:00 db1262 kernel: [7808363.968834] mce: [Hardware Error]: PROCESSOR 0:606a6 TIME 1762387521 SOCKET 0 APIC 10 microcode d000404
2025-11-06T00:05:21.710463+00:00 db1262 kernel: [7808363.978411] mce: [Hardware Error]: Machine check events logged
2025-11-06T00:05:21.710484+00:00 db1262 kernel: [7808363.980303] Memory failure: 0x9feca2: Sending SIGBUS to mysqld:2351350 due to hardware memory corruption
2025-11-06T00:05:21.710494+00:00 db1262 kernel: [7808363.980360] EDAC skx MC3: HANDLING MCE MEMORY ERROR
2025-11-06T00:05:21.728024+00:00 db1262 kernel: [7808363.989956] EDAC skx MC3: CPU 0: Machine Check Event: 0x0 Bank 255: 0xbc0000000000009f
2025-11-06T00:05:21.728041+00:00 db1262 kernel: [7808363.989958] EDAC skx MC3: TSC 0x0
2025-11-06T00:05:21.728043+00:00 db1262 kernel: [7808363.989959] EDAC skx MC3: ADDR 0x9feca2e80
2025-11-06T00:05:21.728046+00:00 db1262 kernel: [7808363.989960] Memory failure: 0x9feca2: recovery action for dirty LRU page: Recovered
2025-11-06T00:05:21.728047+00:00 db1262 kernel: [7808363.989960] EDAC skx MC3: MISC 0x86
2025-11-06T00:05:21.728049+00:00 db1262 kernel: [7808363.997791] EDAC skx MC3: PROCESSOR 0:0x606a6 TIME 1762387521 SOCKET 0 APIC 0x0
2025-11-06T00:05:21.728051+00:00 db1262 kernel: [7808363.997799] EDAC MC3: 1 UE memory read error on CPU_SrcID#0_MC#3_Chan#0_DIMM#0 (channel:0 slot:0 page:0x9feca2 offset:0xe80 grain:32 -  err_code:0x0000:0x009f  SystemAddress:0x9feca2e80 ProcessorSocketId:0x0 MemoryControllerId:0x3 ChannelAddress:0x12fd94580 ChannelId:0x0 RankAddress:0x97eca580 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x4bf2 Column:0xb8 Bank:0x0 BankGroup:0x2 ChipSelect:0x0 ChipId:0x0)
2025-11-06T00:05:21.728053+00:00 db1262 kernel: [7808363.997812] Memory failure: 0x9feca2: already hardware poisoned
2025-11-06T00:05:58.126323+00:00 db1262 kernel: [7808400.388284] mce: Uncorrected hardware memory error in user-access at 29fecc6b00
2025-11-06T00:05:58.126337+00:00 db1262 kernel: [7808400.388297] mce: [Hardware Error]: CPU 25: Machine Check Exception: 7 Bank 1: bd80000000100134
2025-11-06T00:05:58.126338+00:00 db1262 kernel: [7808400.388499] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
2025-11-06T00:05:58.135091+00:00 db1262 kernel: [7808400.395791] mce: [Hardware Error]: RIP 33:<000055f39a307d7e>
2025-11-06T00:05:58.135098+00:00 db1262 kernel: [7808400.404563] {2}[Hardware Error]: event severity: recoverable
2025-11-06T00:05:58.135099+00:00 db1262 kernel: [7808400.404565] {2}[Hardware Error]:  Error 0, type: recoverable
2025-11-06T00:05:58.135099+00:00 db1262 kernel: [7808400.404566] {2}[Hardware Error]:  fru_text: A4
2025-11-06T00:05:58.149455+00:00 db1262 kernel: [7808400.412996]
2025-11-06T00:05:58.149465+00:00 db1262 kernel: [7808400.418914] {2}[Hardware Error]:   section_type: memory error
2025-11-06T00:05:58.149466+00:00 db1262 kernel: [7808400.418916] {2}[Hardware Error]:    error_status: Storage error in DRAM memory (0x0000000000000400)
2025-11-06T00:05:58.149468+00:00 db1262 kernel: [7808400.418917] {2}[Hardware Error]:   physical_address: 0x00000029fecc6b80
2025-11-06T00:05:58.149470+00:00 db1262 kernel: [7808400.418918] {2}[Hardware Error]:   physical_address_mask: 0xffffffffffffffc0
2025-11-06T00:05:58.149472+00:00 db1262 kernel: [7808400.418920] {2}[Hardware Error]:   node:0 card:6 module:0 rank:0 bank:8 device:0 row:84979 column:400
2025-11-06T00:05:58.161127+00:00 db1262 kernel: [7808400.424748] mce: [Hardware Error]: TSC 530d8828e57f99
2025-11-06T00:05:58.161140+00:00 db1262 kernel: [7808400.430580] {2}[Hardware Error]:   error_type: 3, multi-bit ECC
2025-11-06T00:05:58.161142+00:00 db1262 kernel: [7808400.430582] {2}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000
2025-11-06T00:05:58.234809+00:00 db1262 kernel: [7808400.492957] ADDR 29fecc6b00 MISC 86 PPIN 20772c6f5d2e759d
2025-11-06T00:05:58.234836+00:00 db1262 kernel: [7808400.492960] mce: [Hardware Error]: PROCESSOR 0:606a6 TIME 1762387557 SOCKET 1 APIC 41 microcode d000404
2025-11-06T00:05:58.234855+00:00 db1262 kernel: [7808400.504352] EDAC skx MC3: HANDLING MCE MEMORY ERROR
2025-11-06T00:05:58.234856+00:00 db1262 kernel: [7808400.504355] EDAC skx MC3: CPU 0: Machine Check Event: 0x0 Bank 255: 0xbc0000000000009f
2025-11-06T00:05:58.234858+00:00 db1262 kernel: [7808400.504356] EDAC skx MC3: TSC 0x0
2025-11-06T00:05:58.234860+00:00 db1262 kernel: [7808400.504357] EDAC skx MC3: ADDR 0x29fecc6b80
2025-11-06T00:05:58.234861+00:00 db1262 kernel: [7808400.504358] EDAC skx MC3: MISC 0x86
2025-11-06T00:05:58.234863+00:00 db1262 kernel: [7808400.504358] EDAC skx MC3: PROCESSOR 0:0x606a6 TIME 1762387558 SOCKET 0 APIC 0x0
2025-11-06T00:05:58.234865+00:00 db1262 kernel: [7808400.504366] EDAC MC3: 1 UE memory read error on CPU_SrcID#0_MC#3_Chan#0_DIMM#0 (channel:0 slot:0 page:0x29fecc6 offset:0xb80 grain:32 -  err_code:0x0000:0x009f  SystemAddress:0x29fecc6b80 ProcessorSocketId:0x0 MemoryControllerId:0x3 ChannelAddress:0x52fd98c80 ChannelId:0x0 RankAddress:0x297eccc80 PhysicalRankId:0x0 DimmSlotId:0x0 Row:0x14bf3 Column:0x190 Bank:0x0 BankGroup:0x2 ChipSelect:0x0 ChipId:0x0)
2025-11-06T00:05:58.238072+00:00 db1262 kernel: [7808400.504940] Memory failure: 0x29fecc6: Sending SIGBUS to mysqld:2341772 due to hardware memory corruption
2025-11-06T00:05:58.253201+00:00 db1262 kernel: [7808400.514707] Memory failure: 0x29fecc6: recovery action for dirty LRU page: Recovered
2025-11-06T00:05:58.253216+00:00 db1262 kernel: [7808400.522712] Memory failure: 0x29fecc6: already hardware poisoned
Record:      30
Date/Time:   11/05/2025 23:05:35
Source:      system
Severity:    Critical
Description: A critical diagnostic event occurred in the memory device at A4. Contact your service provider for assistance in replacing the device. (Extended ID: 0x4E42).
-------------------------------------------------------------------------------
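
The APEI and EDAC lines above describe the same failing cell in different notations (decimal versus hex), which can be cross-checked mechanically. This is a rough sketch, assuming 4 KiB pages; the `fields` helper is hypothetical, and the EDAC line is shortened to the fields used here.

```python
import re

# Two lines copied from the kernel log above (EDAC line shortened for readability).
APEI_LINE = ("{1}[Hardware Error]:   node:0 card:6 module:0 rank:0 bank:8 "
             "device:0 row:19442 column:184")
EDAC_LINE = ("EDAC MC3: 1 UE memory read error on CPU_SrcID#0_MC#3_Chan#0_DIMM#0 "
             "(channel:0 slot:0 page:0x9feca2 offset:0xe80 grain:32 "
             "SystemAddress:0x9feca2e80 Row:0x4bf2 Column:0xb8)")


def fields(line):
    """Return every key:value pair in a log line as a dict of ints."""
    out = {}
    for key, value in re.findall(r"(\w+):(0x[0-9a-fA-F]+|\d+)", line):
        base = 16 if value.startswith("0x") else 10
        out[key] = int(value, base)
    return out


apei = fields(APEI_LINE)
edac = fields(EDAC_LINE)

# Assumption: 4 KiB pages, so page number and in-page offset rebuild the address.
rebuilt = (edac["page"] << 12) | edac["offset"]
address_matches = rebuilt == edac["SystemAddress"]

# APEI reports row/column in decimal, EDAC in hex; they should agree.
same_cell = apei["row"] == edac["Row"] and apei["column"] == edac["Column"]

print(hex(rebuilt), address_matches, same_cell)
```

Both reports agreeing on the address, row, and column supports reading them as one event on one DIMM rather than two unrelated errors.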

@wiki_willy can we get Dell to ship us a new DIMM, as the iDRAC error above suggests?

For what it's worth, this is a super new host (T400214). It's been in production for just over a month.

Change #1202381 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1262: Disable notifications

https://gerrit.wikimedia.org/r/1202381

Marostegui lowered the priority of this task from High to Medium. (Nov 6 2025, 6:58 AM)

Change #1202381 merged by Marostegui:

[operations/puppet@production] db1262: Disable notifications

https://gerrit.wikimedia.org/r/1202381

I've started MariaDB, but once the memory has been replaced we should simply reclone this host.

@Jclark-ctr - can you help out @Marostegui with getting an RMA for the DIMM?

Change #1202913 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1262: Add note about memory issues

https://gerrit.wikimedia.org/r/1202913

Change #1202913 merged by Marostegui:

[operations/puppet@production] db1262: Add note about memory issues

https://gerrit.wikimedia.org/r/1202913

Confirmed: Service Request 218364102 was successfully submitted.

This host crashed again - so I am leaving mariadb stopped as it will need to be recloned anyway.

@Marostegui I finally got tracking confirmation for the replacement memory. It should be onsite by end of day tomorrow, unless delayed by the holiday. Can I replace the memory as soon as I receive it?

Yes, anytime. The host is not in production at the moment and its data will be cloned.

@Marostegui It arrived yesterday afternoon. I will be replacing it first thing this morning.

@Marostegui Memory has been replaced. The server is back up and all yours.

Thank you

Thank you - I will reclone the host

Started cloning db1241.eqiad.wmnet to db1262.eqiad.wmnet - marostegui@cumin1003

Completed depool of db1241 - Depool db1241.eqiad.wmnet to then clone it to db1262.eqiad.wmnet - marostegui@cumin1003

Start pool of db1241 gradually with 4 steps - Pool db1241.eqiad.wmnet in after cloning - marostegui@cumin1003

Completed pool of db1241 gradually with 4 steps - Pool db1241.eqiad.wmnet in after cloning - marostegui@cumin1003

Finished cloning db1241.eqiad.wmnet to db1262.eqiad.wmnet - marostegui@cumin1003

Host recloned, waiting until Monday to repool.

Change #1206062 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1262: Enable notifications

https://gerrit.wikimedia.org/r/1206062

Change #1206062 merged by Marostegui:

[operations/puppet@production] db1262: Enable notifications

https://gerrit.wikimedia.org/r/1206062

Start pool of db1262 slowly with 10 steps - Repooling after replacing the DIMM - marostegui@cumin1003

Completed pool of db1262 slowly with 10 steps - Repooling after replacing the DIMM - marostegui@cumin1003