
asw-c8 rebooting
Closed, Resolved · Public

Description

asw-c-eqiad has rebooted on its own 4 times so far:

May 16 16:40:35  FPC 8 removed
May 16 19:05:43  FPC 8 removed
May 16 20:27:16  FPC 8 removed
May 16 21:37:02  FPC 8 removed

This happens after logs indicating PEM1 flapping:
send: red alarm set, device Power Supply 41, reason FPC 8 PEM 1 is not powered

Current theory is that PEM0 reports as healthy but is not, so each time PEM1 flaps the switch member reboots.

Production hosts on the switch stack:

cp1099
cp1055
cp1054
cp1053
cp1052
cp1051
cp1050
cp1049
cp1048
cp1047
cp1046
cp1045

paravoid> 2/4 of misc, 4/8 of text and 4/11 of upload are there.
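
For a rough sense of exposure, the fractions quoted above can be turned into percentages. This is only a minimal sketch: the dictionary restates the numbers from the quote, and the names (on_stack, share_on_stack) are illustrative, not taken from any tooling.

```python
from fractions import Fraction

# (hosts on the affected stack, total hosts in the cluster), as quoted above.
on_stack = {"misc": (2, 4), "text": (4, 8), "upload": (4, 11)}


def share_on_stack(counts):
    """Return the rounded percentage of each cluster that sits on the stack."""
    return {
        name: round(100 * Fraction(n, total))
        for name, (n, total) in counts.items()
    }


shares = share_on_stack(on_stack)
for name, pct in shares.items():
    print(f"{name}: {pct}% of hosts on the stack")
```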

The reboot also caused the following alarm:

Class  Description
Major  Upgrade bank is empty or corrupted for FPC 8, please do standard upgrade sequence

The fix would mean re-applying Junos 11.4R6.5 and restarting the switch member. I don't believe the alarm currently causes any production issue (I have seen switches run fine for a long time with that error), but the fix would cause ~10-20 min of downtime.

Suggested first step to fix the reboot issue is to replace both PEM1 and PEM0.

Event Timeline

ayounsi triaged this task as High priority. May 16 2018, 9:09 PM
ayounsi created this task.
Restricted Application added a subscriber: Aklapper.

Chris replaced the 2 PEMs around 22:00 UTC.

If the reboots follow the trend we should know before 23:00 UTC.
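
The "trend" can be checked with some quick arithmetic on the "FPC 8 removed" timestamps from the description. A minimal sketch, assuming the log timestamps are in UTC and from 2018 (the year of the task), and assuming the next gap would be no longer than the most recent one, since the gaps have been shrinking:

```python
from datetime import datetime, timedelta

# "FPC 8 removed" timestamps from the description (May 16; UTC and 2018 assumed).
reboots = ["16:40:35", "19:05:43", "20:27:16", "21:37:02"]
times = [datetime.strptime(f"2018-05-16 {t}", "%Y-%m-%d %H:%M:%S") for t in reboots]

# Time elapsed between consecutive reboots.
gaps = [later - earlier for earlier, later in zip(times, times[1:])]

# Naive projection: last reboot plus the most recent (shortest) gap.
expected_next = times[-1] + gaps[-1]

for gap in gaps:
    print("gap:", gap)
print("next reboot expected by about:", expected_next.strftime("%H:%M:%S"))
```

The gaps come out as 2:25:08, 1:21:33 and 1:09:46, so under that assumption another reboot would be expected by roughly 22:46:48, which is in line with the 23:00 UTC cutoff.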

ayounsi closed this task as Resolved. Edited May 17 2018, 4:15 AM
ayounsi claimed this task.

No more reboots, I'm going to call it solved.
We should be able to live with the Junos alarm until the new cp servers are racked.