cp3048 down, mgmt console not reachable
Closed, Resolved · Public

Description

cp3048 went down and didn't recover; the mgmt console seems unreachable (ssh hangs while trying to connect).

Joe already depooled it, so if it suddenly recovers it will not serve traffic.

Event Timeline

elukey created this task. Jul 20 2017, 6:02 AM
Restricted Application added a subscriber: Aklapper. Jul 20 2017, 6:02 AM

I had the same symptom with oxygen a few days ago, and a "racadm racreset" fixed the mgmt for me.

From ipmitool sel I got a lot of these:

7b | 07/20/2017 | 01:06:48 | Processor #0x0d | Transition to Non-recoverable | Asserted
7c | 07/20/2017 | 01:06:49 | Unknown #0x28 |  | Asserted
7d | 07/20/2017 | 01:06:49 | Unknown #0x28 |  | Asserted
7e | 07/20/2017 | 01:06:50 | Unknown #0x28 |  | Asserted
7f | 07/20/2017 | 01:06:50 | Unknown #0x28 |  | Asserted
80 | 07/20/2017 | 01:06:50 | Unknown #0x28 |  | Asserted
81 | 07/20/2017 | 01:06:50 | Unknown #0x28 |  | Asserted
82 | 07/20/2017 | 01:06:50 | Unknown #0x28 |  | Asserted
83 | 07/20/2017 | 01:06:50 | Unknown #0x28 |  | Asserted
84 | 07/20/2017 | 01:06:50 | Unknown #0x28 |  | Asserted
85 | 07/20/2017 | 01:06:50 | Unknown #0x28 |  | Asserted
86 | 07/20/2017 | 01:06:50 | Unknown #0x28 |  | Asserted
87 | 07/20/2017 | 01:06:50 | Unknown #0x28 |  | Asserted
88 | 07/20/2017 | 01:06:50 | Processor #0x0d | Transition to Non-recoverable | Asserted
89 | 07/20/2017 | 01:06:50 | Unknown #0x28 |  | Asserted
8a | 07/20/2017 | 01:06:50 | Unknown #0x28 |  | Asserted
8b | 07/20/2017 | 01:06:51 | Unknown #0x28 |  | Asserted
8c | 07/20/2017 | 01:06:51 | Unknown #0x28 |  | Asserted
8d | 07/20/2017 | 01:06:51 | Unknown #0x28 |  | Asserted
8e | 07/20/2017 | 01:06:52 | Unknown #0x28 |  | Asserted
8f | 07/20/2017 | 01:06:52 | Unknown #0x28 |  | Asserted
90 | 07/20/2017 | 01:06:52 | Unknown #0x28 |  | Asserted
91 | 07/20/2017 | 01:06:52 | Unknown #0x28 |  | Asserted
92 | 07/20/2017 | 01:06:53 | Unknown #0x28 |  | Asserted
93 | 07/20/2017 | 01:06:53 | Unknown #0x28 |  | Asserted
94 | 07/20/2017 | 01:06:53 | Unknown #0x28 |  | Asserted
95 | 07/20/2017 | 03:48:30 | Voltage #0x2c | State Asserted | Asserted
96 | 07/20/2017 | 06:02:36 | Voltage #0x2c | State Asserted | Asserted
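
With this many repeated entries, a tally per sensor and event is easier to read than the raw dump. Below is a minimal Python sketch that assumes the six-column, pipe-delimited layout shown above; `parse_sel_line` and `summarize` are hypothetical helper names made up for illustration, not part of ipmitool.

```python
from collections import Counter


def parse_sel_line(line):
    """Split one pipe-delimited SEL row into named fields.

    Assumes six columns: id, date, time, sensor, event, state.
    Returns None for rows that do not match that layout.
    """
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 6:
        return None
    record_id, date, time, sensor, event, state = fields
    return {
        "id": record_id,
        "date": date,
        "time": time,
        "sensor": sensor,
        "event": event,  # may be empty, as for the "Unknown #0x28" rows
        "state": state,
    }


def summarize(lines):
    """Count SEL rows per (sensor, event) pair, skipping malformed rows."""
    counts = Counter()
    for line in lines:
        entry = parse_sel_line(line)
        if entry is not None:
            counts[(entry["sensor"], entry["event"])] += 1
    return counts


if __name__ == "__main__":
    # Three rows copied from the log above; paste the full dump to tally it all.
    sample = [
        "7b | 07/20/2017 | 01:06:48 | Processor #0x0d | Transition to Non-recoverable | Asserted",
        "7c | 07/20/2017 | 01:06:49 | Unknown #0x28 |  | Asserted",
        "95 | 07/20/2017 | 03:48:30 | Voltage #0x2c | State Asserted | Asserted",
    ]
    for (sensor, event), count in summarize(sample).most_common():
        print(f"{count:4d}  {sensor}  {event or '(no description)'}")
```

Grouping on the (sensor, event) pair rather than on the sensor alone keeps the two Processor #0x0d "Non-recoverable" entries distinct from the flood of undescribed #0x28 entries.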

Mentioned in SAL (#wikimedia-operations) [2017-07-20T07:41:37Z] <elukey> powercycle cp3048 - mgmt reachable - T171145

ema moved this task from Triage to Caching on the Traffic board. Jul 20 2017, 8:46 AM
ema closed this task as Resolved. Jul 20 2017, 8:58 AM
ema claimed this task.
ema added a subscriber: ema.

So, as @MoritzMuehlenhoff mentioned on IRC, the mgmt issues might have been due to T171041.

The host is back online and looks fine at the moment so I've repooled it. Feel free to re-open this bug if the machine goes down again of course.