cp3048 went down and didn't recover; the mgmt console doesn't seem to be reachable (ssh hangs while trying to connect).
Joe already depooled it so in case it recovers all of a sudden it will not serve traffic.
I had the same symptom with oxygen a few days ago and a "racadm racreset" fixed the mgmt for me.
From ipmitool sel I got a lot of these:
7b | 07/20/2017 | 01:06:48 | Processor #0x0d | Transition to Non-recoverable | Asserted
7c | 07/20/2017 | 01:06:49 | Unknown #0x28 | | Asserted
7d | 07/20/2017 | 01:06:49 | Unknown #0x28 | | Asserted
7e | 07/20/2017 | 01:06:50 | Unknown #0x28 | | Asserted
7f | 07/20/2017 | 01:06:50 | Unknown #0x28 | | Asserted
80 | 07/20/2017 | 01:06:50 | Unknown #0x28 | | Asserted
81 | 07/20/2017 | 01:06:50 | Unknown #0x28 | | Asserted
82 | 07/20/2017 | 01:06:50 | Unknown #0x28 | | Asserted
83 | 07/20/2017 | 01:06:50 | Unknown #0x28 | | Asserted
84 | 07/20/2017 | 01:06:50 | Unknown #0x28 | | Asserted
85 | 07/20/2017 | 01:06:50 | Unknown #0x28 | | Asserted
86 | 07/20/2017 | 01:06:50 | Unknown #0x28 | | Asserted
87 | 07/20/2017 | 01:06:50 | Unknown #0x28 | | Asserted
88 | 07/20/2017 | 01:06:50 | Processor #0x0d | Transition to Non-recoverable | Asserted
89 | 07/20/2017 | 01:06:50 | Unknown #0x28 | | Asserted
8a | 07/20/2017 | 01:06:50 | Unknown #0x28 | | Asserted
8b | 07/20/2017 | 01:06:51 | Unknown #0x28 | | Asserted
8c | 07/20/2017 | 01:06:51 | Unknown #0x28 | | Asserted
8d | 07/20/2017 | 01:06:51 | Unknown #0x28 | | Asserted
8e | 07/20/2017 | 01:06:52 | Unknown #0x28 | | Asserted
8f | 07/20/2017 | 01:06:52 | Unknown #0x28 | | Asserted
90 | 07/20/2017 | 01:06:52 | Unknown #0x28 | | Asserted
91 | 07/20/2017 | 01:06:52 | Unknown #0x28 | | Asserted
92 | 07/20/2017 | 01:06:53 | Unknown #0x28 | | Asserted
93 | 07/20/2017 | 01:06:53 | Unknown #0x28 | | Asserted
94 | 07/20/2017 | 01:06:53 | Unknown #0x28 | | Asserted
95 | 07/20/2017 | 03:48:30 | Voltage #0x2c | State Asserted | Asserted
96 | 07/20/2017 | 06:02:36 | Voltage #0x2c | State Asserted | Asserted
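In case it helps when skimming a long SEL dump like this one, here is a rough Python sketch that counts entries per sensor and event description. It assumes the pipe-separated six-column layout seen in the entries above (id, date, time, sensor, event, state); `summarize_sel` is a hypothetical helper name, not part of ipmitool, and the sample lines are copied from this log.

```python
from collections import Counter


def summarize_sel(lines):
    """Count SEL entries per (sensor, event) pair.

    Assumes each line has six '|'-separated fields:
    id | date | time | sensor | event | state
    Lines with fewer fields are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 6:
            continue  # not a SEL entry in the assumed layout
        sensor = fields[3]
        event = fields[4] or "(no description)"
        counts[(sensor, event)] += 1
    return counts


# A few entries copied from the log above.
sample = [
    "7b | 07/20/2017 | 01:06:48 | Processor #0x0d | Transition to Non-recoverable | Asserted",
    "7c | 07/20/2017 | 01:06:49 | Unknown #0x28 | | Asserted",
    "7d | 07/20/2017 | 01:06:49 | Unknown #0x28 | | Asserted",
    "95 | 07/20/2017 | 03:48:30 | Voltage #0x2c | State Asserted | Asserted",
]

summary = summarize_sel(sample)
for (sensor, event), count in summary.most_common():
    print(f"{count:3d}  {sensor}  {event}")
```

Grouping on the sensor and event columns makes the rare entries (the two Processor #0x0d transitions and the later Voltage #0x2c events) stand out from the flood of Unknown #0x28 lines.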
Mentioned in SAL (#wikimedia-operations) [2017-07-20T07:41:37Z] <elukey> powercycle cp3048 - mgmt reachable - T171145
As @MoritzMuehlenhoff mentioned on IRC, the mgmt issues might have been due to T171041.
The host is back online and looks fine at the moment, so I've repooled it. Feel free to re-open this bug if the machine goes down again.