cp1083 crashed
Closed, Resolved, Public

Description

cp1083 suddenly crashed today at 14:37. Nothing in console. It came back online fine after a power cycle. We should check and see if it has any hardware issues.

SEL was apparently cleared today too, though it's not clear why and by whom.

/admin1-> racadm getsel
Record:      1
Date/Time:   06/06/2018 13:30:25
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
/admin1->
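
To check a longer SEL for hardware events, the text output can be parsed and filtered by severity. The following is only a minimal sketch: it assumes the `Key: value` layout and dashed separator seen in the output above, and the helper names `parse_sel` and `non_ok` are hypothetical, not part of racadm.

```python
# Sample input copied from the `racadm getsel` output in this task.
SAMPLE = """\
Record:      1
Date/Time:   06/06/2018 13:30:25
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
"""


def parse_sel(text):
    """Split getsel-style text into a list of dicts, one per record.

    Assumes each record is a run of 'Key: value' lines terminated by a
    line of dashes; lines without a colon (e.g. prompts) are ignored.
    """
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("---"):
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def non_ok(records):
    """Return the records whose Severity is anything other than 'Ok'."""
    return [r for r in records if r.get("Severity", "").lower() != "ok"]


records = parse_sel(SAMPLE)
print(records)
print(non_ok(records))
```

With the log as it stands, the only record is the "Log cleared" entry, so the filter returns nothing.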

Event Timeline

ema triaged this task as Medium priority. May 6 2019, 2:48 PM
ema updated the task description.

Interestingly, there was a memory usage spike right before the host crashed.

Screenshot from 2019-05-07 10-42-11.png (1×2 px, 265 KB)

I think that is just a strange monitoring artifact. If you zoom in more closely you see the 'cached' timeseries get duplicated for some reason, while the values of each individual timeseries stay the same.

image.png (286×912 px, 39 KB)

I agree that it is a monitoring artifact. It looks like it is caused by our use of both old and new names for node_exporter metrics, with recording rules providing one set of names from the other. My guess is that the host went offline before Prometheus could evaluate both the "real" metric from the host and the one produced by the recording rules.
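
Why a duplicated series looks like a usage spike on a stacked graph can be shown with a small sketch. The series names and values below are made up for illustration and are not taken from cp1083; the only point is that the stacked total rises while every individual series keeps its value.

```python
def stacked_total(series):
    """Height of a stacked graph at one point in time: the sum of all series."""
    return sum(series.values())


# Hypothetical memory breakdown (arbitrary units), one value per series.
before = {"used": 40, "cached": 50, "free": 10}

# During the glitch the same 'cached' value appears under a second name,
# e.g. once from the host metric and once from a recording rule.
during = dict(before)
during["cached (duplicate)"] = before["cached"]

print(stacked_total(before))
print(stacked_total(during))

# Every original series is unchanged; only the stacked sum grew.
unchanged = all(during[name] == value for name, value in before.items())
print(unchanged)
```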

It's been up for ~15 days now without incident, but depooled for frontend traffic. Re-pooling it today to see if we can get a recurrence or not.

Never mind, apparently it was already repooled; I was looking at the wrong thing here.

ema claimed this task.

The host has been in production for weeks without issues now. Closing.