
cp3037 is currently unreachable
Closed, Resolved, Public

Description

cp3037 is unreachable via both production and management network interfaces.

The production interface appears to be physically up, as reported by the switch, but it is not responding to ICMP or SSH.

The management interface currently replies to ICMP and completes the TCP three-way handshake on 22/tcp, but we are unable to get an SSH session.

A manual power drain is required to recover access to the server.
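
For reference, the difference between "the port completes a TCP handshake" and "an SSH server is actually answering" can be checked with a small probe. This is only a minimal sketch: `probe_ssh` and its return labels are hypothetical names, not an existing tool, and it only looks for the SSH identification banner. A session can still fail later (for example during authentication) even when the banner arrives.

```python
import socket

def probe_ssh(host, port=22, timeout=5.0):
    """Classify an endpoint as 'unreachable', 'tcp_only', or 'ssh_banner'.

    Hypothetical helper for illustration; the labels are assumptions.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # No TCP handshake: refused, timed out, or no route.
        return "unreachable"
    with sock:
        sock.settimeout(timeout)
        try:
            # An SSH server sends its identification string first ("SSH-2.0-...").
            banner = sock.recv(256)
        except OSError:
            # Handshake completed, but nothing was sent before the timeout.
            return "tcp_only"
    return "ssh_banner" if banner.startswith(b"SSH-") else "tcp_only"
```

A result of `tcp_only` would match the symptom described for the management interface: the handshake on 22/tcp succeeds, but the service behind it never starts talking.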

Event Timeline

Vgutierrez triaged this task as Medium priority. Jun 12 2018, 9:33 AM
Vgutierrez moved this task from Backlog to Hardware on the Traffic board.

Mentioned in SAL (#wikimedia-operations) [2018-06-12T09:41:55Z] <vgutierrez> cp3037 has been depooled due to unknown hardware issues T196974

The host and its management interface are back online after a power drain performed by remote hands.

This looks like a thermal issue. Here are the kernel logs from the time of the crash:

Jun 12 06:26:51 cp3037 kernel: [8169606.570606] CPU13: Package temperature above threshold, cpu clock throttled (total events = 1)
[...]
Jun 12 07:11:51 cp3037 kernel: [8172306.980868] CPU13: Package temperature above threshold, cpu clock throttled (total events = 181920)
Jun 12 07:11:51 cp3037 kernel: [8172306.991163] CPU13: Package temperature/speed normal

And then silence.
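
To put a number on the throttling, the "total events" counter in those log lines can be turned into an average rate. The following is a rough sketch: `throttle_rate` and `LINE_RE` are hypothetical names, the year is an assumption because syslog timestamps omit it, and only the first and last matching lines are used, so the elided `[...]` lines do not matter.

```python
import re
from datetime import datetime

# Matches lines like:
# Jun 12 06:26:51 cp3037 kernel: [8169606.570606] CPU13: Package temperature
# above threshold, cpu clock throttled (total events = 1)
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: \[\s*[\d.]+\] "
    r"CPU\d+: Package temperature above threshold, "
    r"cpu clock throttled \(total events = (?P<count>\d+)\)"
)

def throttle_rate(lines, year=2018):
    """Return average throttle events per second between the first and last
    matching lines, or None if fewer than two lines match."""
    samples = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            # The year is assumed; syslog timestamps do not carry it.
            ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            samples.append((ts, int(m["count"])))
    if len(samples) < 2:
        return None
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    elapsed = (t1 - t0).total_seconds()
    return (c1 - c0) / elapsed if elapsed > 0 else None
```

Fed the two "above threshold" lines quoted above, this gives 181919 events over 2700 seconds, or roughly 67 throttle events per second sustained for 45 minutes before the logs stop.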

Vvjjkkii renamed this task from cp3037 is currently unreachable to t7aaaaaaaa. Jul 1 2018, 1:04 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
ema renamed this task from t7aaaaaaaa to cp3037 is currently unreachable. Jul 2 2018, 8:55 AM
ema updated the task description.
ema lowered the priority of this task from High to Medium. Jul 2 2018, 11:29 AM
ema added a subscriber: Aklapper.
ema claimed this task.

The host is back online and pooled.