cp1083 crashed
Closed, Resolved · Public

Authored By

	• ema
	May 6 2019, 2:47 PM

Description

cp1083 suddenly crashed today at 14:37. Nothing in console. It came back online fine after a power cycle. We should check and see if it has any hardware issues.

SEL was apparently cleared today too, though it's not clear why and by whom.

/admin1-> racadm getsel
Record:      1
Date/Time:   06/06/2018 13:30:25
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
/admin1->
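For the hardware check, SEL output in the layout above can be turned into records and filtered by severity. This is only a minimal sketch: it assumes the `Key: value` lines and dashed record separator shown above, and `parse_sel` is a hypothetical helper name, not part of racadm.

```python
def parse_sel(text):
    """Parse `racadm getsel`-style output into a list of dicts.

    Assumes the layout shown in this task: `Key: value` lines, with a
    line of dashes closing each record.
    """
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("---"):
            # Separator line: close the current record, if any.
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep and not line.startswith("/"):
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


sample = """Record:      1
Date/Time:   06/06/2018 13:30:25
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------"""

records = parse_sel(sample)
# Entries worth a closer look are those whose severity is not "Ok".
non_ok = [r for r in records if r.get("Severity") != "Ok"]
```

With the cleared log above, `non_ok` is empty, so the SEL offers no hardware evidence for this crash.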

Event Timeline

• ema created this task. May 6 2019, 2:47 PM

Restricted Application added a project: SRE. May 6 2019, 2:47 PM

• ema triaged this task as Medium priority. May 6 2019, 2:48 PM

• ema moved this task from Backlog to Hardware on the Traffic board. May 7 2019, 8:21 AM

• ema updated the task description. May 7 2019, 8:31 AM

• ema updated the task description.

Mentioned in SAL (#wikimedia-operations) [2019-05-07T08:39:28Z] <ema> repool cp1083 T222620

Interestingly, there was a memory usage spike right before the host crashed.

Screenshot from 2019-05-07 10-42-11.png (1×2 px, 265 KB)

In T222620#5163577, @ema wrote:

Interestingly, there was a memory usage spike right before the host crashed.

I think that is just a strange monitoring artifact. If you zoom in more closely you see the 'cached' timeseries get duplicated for some reason, while the values of each individual timeseries stay the same.

In T222620#5164117, @CDanis wrote:

In T222620#5163577, @ema wrote:

Interestingly, there was a memory usage spike right before the host crashed.

I think that is just a strange monitoring artifact. If you zoom in more closely you see the 'cached' timeseries get duplicated for some reason, while the values of each individual timeseries stay the same.

I agree that it's a monitoring artifact. It looks like it's due to our use of recording rules to expose node_exporter metrics under both their old and new names. My guess is that the host went offline before Prometheus could evaluate both the "real" metric from the host and the one from the recording rules.
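To illustrate the suspected artifact: if a stacked graph briefly receives the same 'cached' value under two series names, the stacked total jumps even though no individual series changed. The sketch below is a toy model with made-up numbers and hypothetical series names; it is not the actual Prometheus rules or dashboard query.

```python
def stacked_total(samples):
    """Sum all series at one timestamp, as a stacked graph would."""
    return sum(value for _name, value in samples)


def dedup_by_alias(samples, alias):
    """Keep one value per logical series, folding aliased names together."""
    merged = {}
    for name, value in samples:
        merged.setdefault(alias.get(name, name), value)
    return list(merged.items())


# Hypothetical names: the same quantity exposed under an old and a new name.
alias = {"cached_old_name": "cached"}

# Made-up values at two timestamps; each individual series is unchanged.
normal = [("used", 40), ("cached", 50)]
glitch = [("used", 40), ("cached", 50), ("cached_old_name", 50)]

spike = stacked_total(glitch) - stacked_total(normal)
fixed = stacked_total(dedup_by_alias(glitch, alias))
```

Here `spike` equals the duplicated 'cached' value, and `fixed` matches the normal total, which is consistent with an apparent memory spike that is not a real change in usage.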

It's been up for ~15 days now without incident, but depooled for frontend traffic. Re-pooling it today to see if we can get a recurrence or not.

Never mind, apparently it was already repooled; I was looking at the wrong thing here.

• Cmjohnson moved this task from Backlog to Stalled on the ops-eqiad board. May 28 2019, 2:55 PM

The host has been in production for weeks without issues now. Closing.

	F28951427: Screenshot from 2019-05-07 10-42-11.png
	May 7 2019, 8:42 AM

	F28953344: image.png
	May 7 2019, 12:10 PM
