db1189 broken memory
Closed, Resolved (Public)

Description

db1189 got rebooted and the HW logs show memory errors, can we get a new DIMM?

Record:      1
Date/Time:   06/26/2022 19:11:19
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   09/13/2022 15:44:41
Source:      system
Severity:    Critical
Description: The system memory has uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A10. Immediately replace the DIMM.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   09/13/2022 15:44:42
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   09/13/2022 15:44:42
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   09/13/2022 15:44:42
Source:      system
Severity:    Critical
Description: Multi-bit memory errors are detected on the memory device at location(s) DIMM_A10. Immediately replace the DIMM.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   09/13/2022 15:44:42
Source:      system
Severity:    Critical
Description: The system memory has uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A10. Immediately replace the DIMM.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   09/13/2022 15:44:42
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   09/13/2022 15:44:42
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   09/13/2022 15:44:44
Source:      system
Severity:    Critical
Description: The system memory has uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A10. Immediately replace the DIMM.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   09/13/2022 15:44:44
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   09/13/2022 15:44:44
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   09/13/2022 15:44:44
Source:      system
Severity:    Critical
Description: Multi-bit memory errors are detected on the memory device at location(s) DIMM_A10. Immediately replace the DIMM.
-------------------------------------------------------------------------------
Record:      13
Date/Time:   09/13/2022 15:47:26
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      14
Date/Time:   09/13/2022 15:47:26
Source:      system
Severity:    Critical
Description: Multi-bit memory errors are detected on the memory device at location(s) DIMM_A10. Immediately replace the DIMM.
-------------------------------------------------------------------------------
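
For triage, a log in this format can be summarized by counting which DIMM locations appear in Critical records. The following is a minimal sketch, not a tool used in this task: the function names `parse_records` and `critical_dimms` are hypothetical, and it assumes records are separated by dashed lines and use `Key: value` fields exactly as in the excerpt above. The sample embeds records 2, 3 and 5 from that excerpt.

```python
import re
from collections import Counter

# Records 2, 3 and 5 copied from the hardware log above.
SAMPLE_LOG = """\
Record:      2
Date/Time:   09/13/2022 15:44:41
Source:      system
Severity:    Critical
Description: The system memory has uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A10. Immediately replace the DIMM.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   09/13/2022 15:44:42
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   09/13/2022 15:44:42
Source:      system
Severity:    Critical
Description: Multi-bit memory errors are detected on the memory device at location(s) DIMM_A10. Immediately replace the DIMM.
-------------------------------------------------------------------------------
"""


def parse_records(text):
    """Split the log on dashed separator lines and return one dict per record."""
    records = []
    for block in re.split(r"^-{5,}$", text, flags=re.MULTILINE):
        fields = {}
        for line in block.splitlines():
            # Split on the first colon only, so timestamps stay intact.
            key, sep, value = line.partition(":")
            if sep and key.strip():
                fields[key.strip()] = value.strip()
        if fields:
            records.append(fields)
    return records


def critical_dimms(records):
    """Count DIMM locations (e.g. DIMM_A10) named in Critical records."""
    counts = Counter()
    for rec in records:
        if rec.get("Severity") == "Critical":
            for dimm in re.findall(r"DIMM_[A-Z]\d+", rec.get("Description", "")):
                counts[dimm] += 1
    return counts


if __name__ == "__main__":
    for dimm, count in critical_dimms(parse_records(SAMPLE_LOG)).items():
        print(dimm, count)
```

Run against the full log, a summary like this makes it easy to see whether all Critical events point at a single slot.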

Event Timeline

Marostegui created this task.

Started mysql for now. Will do a data check but will leave the host depooled.
@Cmjohnson @Jclark-ctr once the DIMM is received and ready to be replaced, please let us know so we can power off the host for you.

wiki_willy subscribed.

@Cmjohnson - just a heads up, this was just recently installed, so it's under warranty and an RMA can be submitted with Dell. Thanks, Willy

Change 831922 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1189: Disable notifications

https://gerrit.wikimedia.org/r/831922

Change 831922 merged by Marostegui:

[operations/puppet@production] db1189: Disable notifications

https://gerrit.wikimedia.org/r/831922

Submitted a ticket with Dell. Confirmed: Service Request 151636326 was successfully submitted.

I think mysql went down again.

Parts should be in tomorrow or Friday

It did whilst working on templatelinks:

Sep 14 12:22:18 db1189 mysqld[3398]: Query (0x7ed24c017810): OPTIMIZE TABLE templatelinks

Let's not spend time on this host. We'll reclone it from one that already had the change, once the memory is changed.

Mentioned in SAL (#wikimedia-operations) [2022-09-15T05:12:54Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1189.eqiad.wmnet with reason: down T317662

Mentioned in SAL (#wikimedia-operations) [2022-09-15T05:12:58Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1189.eqiad.wmnet with reason: down T317662

@Jclark-ctr the host is powered off, you can change the memory when it arrives. Please power it back ON when done.
Thank you!

Mentioned in SAL (#wikimedia-operations) [2022-09-15T05:32:12Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1189.eqiad.wmnet with reason: down T317662

Mentioned in SAL (#wikimedia-operations) [2022-09-15T05:32:16Z] <marostegui@cumin1001> END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 7 days, 0:00:00 on db1189.eqiad.wmnet with reason: down T317662

Tracking shows the memory arriving today. I will be on site all day and will take care of it when it arrives.

Thanks - I will reclone this host now and put it back in production

@wiki_willy @Jclark-ctr looks like we are having memory issues again and the host crashed. Could it be the mainboard?

[Mon Sep 19 13:20:29 2022] EDAC MC1: 1 UE memory read error on CPU_SrcID#0_MC#1_Chan#1_DIMM#1 (channel:1 slot:1 page:0x4477dbf offset:0xf80 grain:32 -  err_code:0x0000:0x009f socket:0 imc:1 rank:1 bg:0 ba:3 row:0x1a06e col:0x3f8 retry_rd_err_log[0001b80f 00000000 00000200 044ac140 0001a06e] correrrcnt[0000 0000 0000 0000 0000 0000 0000 0000])

And the hardware logs:

-------------------------------------------------------------------------------
Record:      80
Date/Time:   09/19/2022 13:20:27
Source:      system
Severity:    Critical
Description: The system memory has uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A10. Immediately replace the DIMM.
-------------------------------------------------------------------------------
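
To compare the kernel's view with the hardware log, the EDAC message above can be broken into its parts. This is a rough sketch under assumptions: the helper name `parse_edac` is hypothetical, the regular expression is fitted only to the single line pasted above and may not cover other EDAC drivers or message layouts, and `UE` is read in the usual EDAC sense of an uncorrectable error (`CE` being correctable).

```python
import re

# The kernel message pasted above, verbatim.
EDAC_LINE = (
    "[Mon Sep 19 13:20:29 2022] EDAC MC1: 1 UE memory read error on "
    "CPU_SrcID#0_MC#1_Chan#1_DIMM#1 (channel:1 slot:1 page:0x4477dbf "
    "offset:0xf80 grain:32 -  err_code:0x0000:0x009f socket:0 imc:1 rank:1 "
    "bg:0 ba:3 row:0x1a06e col:0x3f8 retry_rd_err_log[0001b80f 00000000 "
    "00000200 044ac140 0001a06e] correrrcnt[0000 0000 0000 0000 0000 0000 "
    "0000 0000])"
)

# Assumed layout: "EDAC <MCn>: <count> <UE|CE> <message> on <label> (<details>)"
HEADER = re.compile(r"EDAC (MC\d+): (\d+) (UE|CE) (.+?) on (\S+) \(")


def parse_edac(line):
    """Return the header parts and key:value details of one EDAC line, or None."""
    match = HEADER.search(line)
    if match is None:
        return None
    controller, count, kind, message, label = match.groups()
    # Only look inside the parentheses, so the timestamp is not mistaken
    # for key:value pairs.
    details = line.partition("(")[2]
    fields = dict(re.findall(r"(\w+):(\S+)", details))
    return {
        "controller": controller,
        "count": int(count),
        "kind": kind,
        "message": message,
        "label": label,
        "fields": fields,
    }


if __name__ == "__main__":
    parsed = parse_edac(EDAC_LINE)
    print(parsed["kind"], parsed["label"], parsed["fields"]["channel"], parsed["fields"]["slot"])
```

The channel and slot values identify the position as the kernel sees it; mapping that position to a silkscreen label such as DIMM_A10 depends on the platform and is not attempted here.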

Dell had not responded to the ticket. Opened a new ticket. Confirmed: Service Request 152257151 was successfully submitted.

@Marostegui Any chance we can take this server down so I can shift the memory around, to see if the error follows the bad DIMM to its new location?

Sure! I will power it off tomorrow in the EU morning and leave it off so you can change it anytime you like.

@Jclark-ctr the host is now off. Proceed as needed, thank you.

@Marostegui I swapped DIMM A10 and A6, performed a hardware diagnostic on the memory, and pulled a TSR report; no errors at this time. We will need to put the server back into service to see if any errors return. I am hopeful that reseating the memory fixes the issue.

Thank you John, I just started mysql. Closing this for now. I will reopen if this crashes again.

@Jclark-ctr the host went down again. Now DIMM A6:

-------------------------------------------------------------------------------
Record:      85
Date/Time:   09/26/2022 07:13:39
Source:      system
Severity:    Critical
Description: The system memory has uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A6. Immediately replace the DIMM.
-------------------------------------------------------------------------------
Record:      86
Date/Time:   09/26/2022 07:13:39
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      87
Date/Time:   09/26/2022 07:13:39
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      88
Date/Time:   09/26/2022 07:13:39
Source:      system
Severity:    Critical
Description: Multi-bit memory errors are detected on the memory device at location(s) DIMM_A6. Immediately replace the DIMM.
-------------------------------------------------------------------------------
Record:      89
Date/Time:   09/26/2022 07:16:54
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      90
Date/Time:   09/26/2022 07:16:54
Source:      system
Severity:    Critical
Description: Multi-bit memory errors are detected on the memory device at location(s) DIMM_A6. Immediately replace the DIMM.
-------------------------------------------------------------------------------

@Marostegui Thank you. So I did make a typo in my previous comment: I had swapped it with A6. I have pulled another TSR report and submitted it to Dell. Thank you for your assistance.

Thanks John. I am leaving the host ON, but with mysql stopped, so you can proceed and power it off anytime you want to swap in the new DIMM.

@Jclark-ctr did Dell come back to you with any update on what to do next?

Sorry, yes. Dell is shipping out another memory stick; waiting on the part right now.

Great thank you. The host is off, so please feel free to replace it whenever you like.

Was just notified by the data center of a delivery from Dell.

Thanks John - I will take it from here and ping you if we have more issues!