cp1046 is crashing and becoming unresponsive
Closed, Resolved (Public)

Description

I rebooted it once due to this behavior and didn't see anything obvious so I left it in service and it happened again shortly thereafter. @faidon depooled on palladium and I did a hard reboot.

Suspect hardware issue

Event Timeline

chasemp raised the priority of this task to High.
chasemp updated the task description.
chasemp added projects: acl*sre-team, Traffic.
chasemp added subscribers: chasemp, BBlack, faidon.

I don't see anything damning in dmesg and I can't find anything that is narrowing down a hardware issue.

The best thought I have is that MEM used seems to have spiked before the last crash:

Screen Shot 2015-09-24 at 1.28.31 PM.png (369×220 px, 23 KB)

It jumped from 3.8G used to 168G in what seems like one polling cycle (60s, I think), and then diamond stopped reporting in and the server became unresponsive. It also looked somewhat like gradual resource decay: I could still telnet to varnish from LVS for a bit after ssh was hanging and icinga was alerting, and then that finally stopped working too. I don't see any OOM activity, though.
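For anyone wanting to look for the same pattern in other hosts' metrics, here is a minimal sketch of flagging a jump between two consecutive polling samples. The function name `find_spikes`, the 10x threshold, and the sample list are all assumptions for illustration; only the 3.8G and 168G values come from the graph above, and this is not how diamond or graphite does anything.

```python
def find_spikes(samples, factor=10.0):
    """Return (index, previous, current) for every sample that is at least
    `factor` times larger than the sample immediately before it."""
    spikes = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        # Skip zero/negative baselines so the ratio test stays meaningful.
        if prev > 0 and cur >= prev * factor:
            spikes.append((i, prev, cur))
    return spikes


# Hypothetical per-cycle "memory used" samples in G; the first value is made up,
# the last two are the values read off the graph.
samples_g = [3.7, 3.8, 168.0]
print(find_spikes(samples_g))
```

The threshold is arbitrary; a real check would probably also want to look at how many cycles the value stays elevated before the host stops reporting.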

I committed the depool on palladium for LVS, and also depooled it in confd for cache<->cache stuff. It's crashed twice in recent history now, so there's clearly a hw problem we should address here before turning it back on.

See also T113184 for previous crash. Apparently it crashed again this morning (while still depooled). Will downtime it in icinga and start tracking down what the real problem is here in this ticket.

Checked the serial console, and it's showing this:

Alert!  System fatal error during previous boot
 PCI Express Error
 Uncorrectable Memory Error

Management Engine Mode                : Active
Management Engine Firmware Version    : 0002.0001
                           Patch      : 0005
                           Build      : 006D
 Strike the F1 key to continue, F2 to run the system setup program

@Cmjohnson can you take a look at this machine locally when you get a chance?

BBlack lowered the priority of this task from High to Medium.
BBlack added a project: ops-eqiad.

A pull of the iDRAC log revealed this error:

Record: 50
Date/Time: 09/25/2015 02:12:31
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM1,DIMM2,DIMM3,DIMM4,DIMM5,DIMM6,DIMM7,DIMM8.

So mem errors on all DIMMs, and the other one on the console about the PCI bus. Bad CPU? Bad board?
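If it helps when comparing against other records, here is a rough sketch of pulling the fields and the affected DIMM locations out of a record pasted in the format above. The helper names `parse_record` and `dimm_locations` are hypothetical, and the parsing assumes the simple "Key: value" layout shown in this ticket rather than any documented iDRAC export format.

```python
import re

# The record exactly as pasted in this ticket.
RECORD = """\
Record: 50
Date/Time: 09/25/2015 02:12:31
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM1,DIMM2,DIMM3,DIMM4,DIMM5,DIMM6,DIMM7,DIMM8.
"""


def parse_record(text):
    """Split each 'Key: value' line on the first colon into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def dimm_locations(description):
    """Return every DIMM<n> token mentioned in a description string."""
    return re.findall(r"DIMM\d+", description)


fields = parse_record(RECORD)
print(fields["Severity"], dimm_locations(fields["Description"]))
```

Splitting on the first colon only is deliberate, since the Date/Time value contains colons of its own.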

Steps Taken

- Removed all of the B-side DIMMs and CPU
- Cleared the log
- Rebooted; the kernel booted without any issues
- Swapped cpu2 with cpu1 and booted without issue
- Added the B-side DIMMs back and rebooted; the kernel loaded with the OS and the system is back up

Let's see if the problem returns, but right now I do not have anything definitive to report to Dell.

I've left cp1046 depooled for the past several days since @Cmjohnson did the hardware work above. So far it's been stable under no load. Today I've wiped its long-term storage again (in case of missed purges, or corruption from the previous crash) and re-pooled it to see how it fares under load. Will close this if it makes it into early next week without incident.