hw troubleshooting: memory stick failure (uncorrectable error + reduced available memory) for db1102
Closed, Resolved · Public · Request

Description

db1102, on reboot, reported an uncorrectable memory error and rejected one memory stick, so the host started with reduced memory:

UEFI0107: One or more memory errors have occurred on memory slot: B4.
Remove input power to the system, reseat the DIMM module and restart the
system. If the issues persist, replace the faulty memory module identified in
the message.

UEFI0081: Memory configuration has changed from the last time the system was
started.
If the change is expected, no action is necessary. Otherwise, check the DIMM
population inside the system and memory settings in System Setup.

UEFI0058: Uncorrectable Memory Error has occurred because a Dual Inline Memory
Module (DIMM) is not functioning.
Check the System Event Log (SEL) to identify the non-functioning DIMM, and then
replace it.
 

Available Actions:
F1 to Continue and Retry Boot Order
F2 for System Setup (BIOS)
F10 for LifeCycle Controller
- Enable/Configure iDRAC
- Update or Backup/Restore Server Firmware
- Help Install an Operating System
F11 for Boot Manager

iDRAC IP:  10.65.2.147


Lifecycle Controller: Collecting System Inventory...
Lifecycle Controller: Done
Booting...

The host is pooled into production as it is in use for backup generation (but should be easy to depool on request).

I don't know if db1102 is under warranty; please check your docs first. Depending on that, ping me to discuss the best way forward (in the worst-case scenario, we will want at least to confirm the memory failure and remove the faulty stick).

Not urgent, as the host will continue in production until it can be serviced (non-fatal error).
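
To confirm which module failed, the System Event Log that UEFI0058 points at can be read out-of-band through the iDRAC. A minimal sketch, assuming standard ipmitool or Dell racadm tooling is available (credentials are placeholders):

# Read the SEL via the iDRAC (10.65.2.147, per the boot output above):
ipmitool -I lanplus -H 10.65.2.147 -U root -P <password> sel elist | grep -i dimm
# Or, on the host itself, with Dell's racadm installed:
racadm getsel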

Event Timeline

Change 742153 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mariadb: Reduce memory allocation for dbs at db1102 due to hw failure

https://gerrit.wikimedia.org/r/742153

Change 742153 merged by Jcrespo:

[operations/puppet@production] mariadb: Reduce memory allocation for dbs at db1102 due to hw failure

https://gerrit.wikimedia.org/r/742153
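
The patch itself is not quoted here; in effect it runs MariaDB with a smaller InnoDB buffer pool so the instances fit into the reduced RAM. A rough sketch of the equivalent manual change; the 300G figure is a placeholder, not the value actually used on db1102:

# Check the current buffer pool size, in GiB:
sudo mysql -e "SELECT @@innodb_buffer_pool_size/1024/1024/1024 AS buffer_pool_gib;"
# MariaDB >= 10.2 can also resize it online, without a restart:
sudo mysql -e "SET GLOBAL innodb_buffer_pool_size = 300 * 1024 * 1024 * 1024;"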

@jcrespo db1102 is out of warranty. We're starting to see this on servers from this batch; db1107 just had a memory stick replaced. @wiki_willy @RobH, can we buy a replacement DIMM, please?

Hi @jcrespo & @LSobanski - it looks like this machine is due to be refreshed next quarter (line 97 on the procurement doc for "refresh of db[1096-1106]"). Are you ok leaving this server as is, until the refresh happens in Q3?

Thanks,
Willy

Are you ok leaving this server as is, until the refresh happens in Q3

Absolutely. Could we, however, remove the bad stick, if that is not a problem? We can live with a smaller amount of memory, but my main worry is that on the next reboot the stick gets "re-added" and starts throwing memory errors, leading to a MySQL crash. I guess it could also be disabled in the BIOS, but I think that would take about the same amount of disruption. Could we schedule the downtime (without urgency)?
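
One way to check whether the firmware has quietly re-added the stick after a reboot is to compare the DIMM population the OS sees; a sketch:

# Slot B4 should report "No Module Installed" (or be absent)
# once the stick is pulled or disabled:
sudo dmidecode -t memory | grep -E 'Locator|Size'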

This seems reasonable to me! I'm assigning this over to Chris so he can pull the bad DIMM. My only concern is that it may want the memory to be mirrored, so we may have to pull the opposing DIMM for the other CPU. We'll know if it POSTs with an error after the first DIMM is pulled.

My only concern is it may want the memory to be mirrored

Indeed.

I can definitely see that affecting the multi-channel behaviour of the server (I wonder if it affects only that DIMM set or all memory?). But a priori, given that it boots, with errors, with just that one stick faulty: for these specific servers (database backups) I would rather have more RAM at half the speed than the other way around, as slow memory is still far better than uncached requests (this may not apply to core MediaWiki DBs, where high performance is required). Obviously, 100% agreed; let's wait, as you said, to see how the server responds after the first DIMM is removed, and re-evaluate then whether further action is needed.
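
Whether a single DIMM can be pulled cleanly depends on the configured memory operating mode. On these Dell hosts it should be readable via racadm; the attribute name below is the usual one for this generation, but treat it as an assumption:

# "OptimizerMode" means standard (non-mirrored) operation; "MirrorMode"
# would require pulling DIMMs in matched pairs across channels:
racadm get BIOS.MemSettings.MemOpMode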

@jcrespo Can we schedule this for Friday, or would Monday be better for you? The earliest available time either day is 1500 UTC.

@Cmjohnson Today would be preferred, as Monday I will be off and won't be able to take it down and bring it back up. I will shut down the server, and if it cannot be serviced today, we can do it any day on or after the 9th.
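
For reference, the shutdown itself is essentially the following sketch; the unit name is the stock single-instance one (multi-instance hosts use per-section units), and WMF specifics such as depooling and downtiming alerts are omitted:

# Stop MariaDB cleanly so InnoDB flushes and shuts down consistently,
# then power off so DC-ops can pull the DIMM:
sudo systemctl stop mariadb
sudo poweroff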

The host is down and ready to be serviced. Please let us know whether the stick could be removed successfully, or if any other issue arises, as it may require puppet memory adjustments before putting it back into production.

I have marked the host as failed on netbox: https://netbox.wikimedia.org/dcim/devices/1743/
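
The same status change can also be made through the standard Netbox REST API (this one was presumably done via the web UI); $NETBOX_TOKEN is a placeholder for a valid API token:

# Mark device 1743 (db1102) as "failed":
curl -X PATCH https://netbox.wikimedia.org/api/dcim/devices/1743/ \
  -H "Authorization: Token $NETBOX_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"status": "failed"}'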

Change 745464 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Add x1 to db1116 backup source host

https://gerrit.wikimedia.org/r/745464

Change 745464 merged by Jcrespo:

[operations/puppet@production] dbbackups: Add x1 to db1116 backup source host

https://gerrit.wikimedia.org/r/745464

Change 745466 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Add s2 to db1139 (backup source)

https://gerrit.wikimedia.org/r/745466

Change 745469 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Switchover db backup generation away from db1102

https://gerrit.wikimedia.org/r/745469

Change 745469 merged by Jcrespo:

[operations/puppet@production] dbbackups: Switchover db backup generation away from db1102

https://gerrit.wikimedia.org/r/745469

Change 745466 merged by Jcrespo:

[operations/puppet@production] dbbackups: Add s2 to db1139 (backup source)

https://gerrit.wikimedia.org/r/745466

Change 745480 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Add s3 to db1145 (backup source)

https://gerrit.wikimedia.org/r/745480

Change 745480 merged by Jcrespo:

[operations/puppet@production] dbbackups: Add s3 to db1145 (backup source)

https://gerrit.wikimedia.org/r/745480

@jcrespo I found a replacement DIMM in a decommissioned server. The server booted to the OS without issue. Resolving the task.