
Faulty memory on es2004 (purchase one module)
Closed, Resolved (Public)

Description

We need to acquire a memory module for es2004 https://racktables.wikimedia.org/index.php?page=object&tab=default&object_id=1393 as one is faulty and out of warranty.

I do not know the exact type and size, @Papaul probably does.

Old description:

2 out of 2 reboots of es2004.codfw.wmnet have resulted in a memory check error for DIMM B2.

The server actually has 64 GB of RAM, all of which was usable before rebooting, but due to the error only 56 GB is available:

                                                               F2 = System Setup
Phoenix ROM BIOS PLUS Version 1.10 1.8.2                   F10 = System Services
Copyright 1985-1988 Phoenix Technologies Ltd.            F11 = BIOS Boot Manager
Copyright 1990-2011 Dell Inc.                                     F12 = PXE Boot
All Rights Reserved

Dell System PowerEdge R510
www.dell.com
Testing memory.  Please wait.
Two 2.66 GHz Quad-core Processors, Bus Speed:5.86 GT/s, L2/L3 Cache:1 MB/12 MB
System running at 2.66 GHz
System Memory Size: 56.0 GB, System Memory Speed: 1067 MHz, Voltage: 1.35V


Memory Initialization Warning: Memory size may be reduced.

MEMTEST lane failure detected on DIMM B2

MEMBIST failure - The following DIMM has been disabled
by BIOS: DIMM B2
Broadcom NetXtreme II Ethernet Boot Agent v6.0.11
Copyright (C) 2000-2010 Broadcom Corporation
All rights reserved
Press Ctrl-S to enter Configuration Menu
[...] on valid memory configurations, please see the Hardware [...] website.

The server can be rebooted without problem right now (not in production).

Recommended course of action: stop the server, take the memory module out and blow into the slot, retest. :-) If that doesn't work, replace the module.

After maintenance: if this is fixed, make sure to run puppet once before restarting MySQL (otherwise less memory will be available to it), then repool the server.
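Before restarting MySQL, it may help to confirm that the OS sees the full amount of RAM again. The following is only a minimal sketch, not part of the actual maintenance procedure: it parses the `MemTotal` line of `/proc/meminfo`, and the function names, the 2 GiB tolerance and the sample values are all assumptions made for illustration. The kernel reserves some memory, so `MemTotal` is normally a little below the installed size.

```python
def mem_total_gib(meminfo_text):
    """Return MemTotal from /proc/meminfo-style text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # the value is reported in kB (KiB)
            return kib / (1024 * 1024)
    raise ValueError("MemTotal not found")


def memory_looks_complete(meminfo_text, expected_gib, tolerance_gib=2.0):
    """True if MemTotal is within tolerance_gib of the installed size.

    tolerance_gib is an assumed allowance for kernel-reserved memory.
    """
    return expected_gib - mem_total_gib(meminfo_text) <= tolerance_gib


if __name__ == "__main__":
    # On the host itself, read the real file instead of a sample string.
    with open("/proc/meminfo") as f:
        text = f.read()
    print("MemTotal: %.1f GiB" % mem_total_gib(text))
    print("64 GiB visible:", memory_looks_complete(text, 64))
```

With one 8 GB module disabled, as in the BIOS output above, a check like this should report the memory as incomplete.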

Event Timeline

jcrespo raised the priority of this task to Needs Triage.
jcrespo updated the task description.
jcrespo added projects: ops-codfw, acl*sre-team, DBA.
jcrespo added subscribers: jcrespo, Springle.

Swapped the memory between slots (DIMM B2 to DIMM B3 and DIMM B3 to DIMM B2). The error is now on DIMM B3 instead of DIMM B2, so the memory itself is bad and needs to be replaced. I will call Dell to send a replacement module.

papaul@papaul-XPS-L322X: ~_047.png (462×722 px, 61 KB)

jcrespo triaged this task as Medium priority. Jun 26 2015, 10:41 AM

Discussed this with Jynus on IRC; I am waiting on Jynus to get memory purchase permission.

jcrespo renamed this task from Faulty memory on es2004 to Faulty memory on es2004 (purchase one module). Jul 1 2015, 2:16 PM
jcrespo updated the task description.
jcrespo added a project: hardware-requests.
jcrespo set Security to None.

Can you guys paste the actual memory speed and capacity per stick? @Papaul: please advise.

Never mind, the system is online, so I can poll that via software:

description: DIMM DDR3 Synchronous 1333 MHz (0.8 ns)
product: HMT31GR7BFR4A-H9
vendor: 00AD00B380AD
physical id: 6
serial: 18549C24
slot: DIMM_B3
size: 8GiB
width: 64 bits
clock: 1333MHz (0.8ns)

Chris checked his spares stock and has: 4GB 1333MHz ECC Reg Memory 2Rx8 Hynix HMT351R7BFR8A-H9 * 4

This is what I can get from the OS about the existing memory modules on the host: DIMM DDR3 1333 MHz 8GB "Hynix Semiconductor", but I would wait for physical confirmation, if available:

$ dmidecode
[...]
Memory Device
	Array Handle: 0x1000
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 8192 MB
	Form Factor: DIMM
	Set: 1
	Locator: DIMM_A1 
	Bank Locator: Not Specified
	Type: DDR3
	Type Detail: Synchronous Registered (Buffered)
	Speed: 1333 MHz
	Manufacturer: 00AD00B380AD
	Serial Number: 18249C23
	Asset Tag: 01113161
	Part Number: HMT31GR7BFR4A-H9  
	Rank: 2
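To compare all sticks at once rather than reading the output by eye, the "Memory Device" blocks could be parsed into records. This is a rough sketch only: `parse_memory_devices` and `SAMPLE` are names made up here, the sample text is the excerpt pasted above, and the parser assumes blocks are separated by blank lines, as dmidecode normally prints them.

```python
SAMPLE = """\
Memory Device
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 8192 MB
	Locator: DIMM_A1
	Type: DDR3
	Speed: 1333 MHz
	Manufacturer: 00AD00B380AD
	Part Number: HMT31GR7BFR4A-H9
	Rank: 2
"""


def parse_memory_devices(text):
    """Return one dict per "Memory Device" block in dmidecode-style text."""
    devices = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            current = {}          # start a new record
            devices.append(current)
        elif not stripped:
            current = None        # a blank line ends the current block
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return devices


if __name__ == "__main__":
    for dev in parse_memory_devices(SAMPLE):
        print(dev["Locator"], dev["Size"], dev["Speed"], dev["Part Number"])
```

On the host, the input would come from the real command output (for example, saved to a file) instead of `SAMPLE`.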

(wiki edit conflict :-))

My main concern is having the same amount of memory on hosts within the same service and datacenter. If we can achieve that somehow, no matter the configuration, I do not care about the physical memory placement or anything else.

We haven't migrated the entire approvals process into phabricator quite yet, so the purchase approvals are being handled via: https://rt.wikimedia.org/Ticket/Display.html?id=9467

(stealing this since I've escalated it into purchase approvals.)

RobH changed the task status from Open to Stalled. Jul 7 2015, 9:03 PM

Stalled awaiting mgmt approval for purchase on https://rt.wikimedia.org/Ticket/Display.html?id=9467

The memory for this has been ordered and is being shipped to codfw.

Assigning this task to Papaul.

@Papaul: Please shut down es2004 and replace the faulty memory. I ordered it in pairs, since you often have to replace the memory pair, not just the faulty stick.

Replaced the memory in slots A2 and B2 with new memory. The server now reads 64 GB, but at boot I am getting the message "Unsupported memory configuration. DIMM mismatch across slots detected", maybe because we are mixing Kingston and Hynix memory.
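To see which slots hold which module type, per-DIMM records (for example, gathered from dmidecode) could be grouped by part number. This is only an illustrative sketch: it does not reproduce whatever check the BIOS performs, the function names are made up, and the Kingston part number below is a hypothetical placeholder because the real one is not recorded in this task.

```python
from collections import defaultdict

# Only the DIMM_A1 part number comes from this task; the other entries
# use a hypothetical placeholder for the new Kingston modules.
SAMPLE_DEVICES = [
    {"Locator": "DIMM_A1", "Part Number": "HMT31GR7BFR4A-H9"},
    {"Locator": "DIMM_A2", "Part Number": "KINGSTON-EXAMPLE"},
    {"Locator": "DIMM_B2", "Part Number": "KINGSTON-EXAMPLE"},
]


def group_by_part(devices):
    """Map each part number to the list of slots that hold it."""
    groups = defaultdict(list)
    for dev in devices:
        groups[dev["Part Number"]].append(dev["Locator"])
    return dict(groups)


def is_mixed(devices):
    """True if more than one distinct part number is installed."""
    return len(group_by_part(devices)) > 1


if __name__ == "__main__":
    for part, slots in group_by_part(SAMPLE_DEVICES).items():
        print(part, "->", ", ".join(slots))
    print("mixed modules:", is_mixed(SAMPLE_DEVICES))
```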

If this is not a problem, please advise so I can close the ticket.

Thanks.

I had to repool the server to MediaWiki to close everything.

MySQL sees all the memory with no problem, and it is not a critical server, so we will keep an eye out for any new issues, but I will close this task. Thanks!