asw2-d2-eqiad crash
Closed, Resolved · Public

Description

Opened JTAC case 2019-0923-0593 and provided them with logs and RSI (during/after outage).

From the faulty device's logs:

Sep 20 23:55:01  asw2-d-eqiad /usr/sbin/cron[400]: (root) CMD (   /usr/libexec/atrun)  <-- last log, routine log
Sep 21 01:28:35  asw2-d-eqiad eventd[1298]: SYSTEM_OPERATIONAL: System is operational  <-- first bootup log
[...]
Sep 21 01:28:35  asw2-d-eqiad /kernel: savecore: Reboot reason(s): 0x1: power cycle/failure
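
For reference, the silent window between the last routine log and the first bootup log can be computed from the two timestamps quoted above. This is only a minimal sketch: the helper names `parse_syslog_ts` and `log_gap` are made up for illustration, and the year 2019 is an assumption taken from the task dates, since the syslog lines carry no year. The result is the gap in the logs, not necessarily the exact duration of the outage.

```python
from datetime import datetime, timedelta

# Assumption: syslog timestamps have no year; 2019 is taken from the task dates.
YEAR = 2019


def parse_syslog_ts(line: str, year: int = YEAR) -> datetime:
    """Parse the leading 'Mon DD HH:MM:SS' timestamp of a syslog line."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def log_gap(last_line: str, first_line: str) -> timedelta:
    """Time between the last log before the crash and the first log after boot."""
    return parse_syslog_ts(first_line) - parse_syslog_ts(last_line)


# The two log lines quoted in the description (trailing annotations dropped).
last_before = "Sep 20 23:55:01  asw2-d-eqiad /usr/sbin/cron[400]: (root) CMD (   /usr/libexec/atrun)"
first_after = "Sep 21 01:28:35  asw2-d-eqiad eventd[1298]: SYSTEM_OPERATIONAL: System is operational"

print(log_gap(last_before, first_after))  # 1:33:34
```

So the device logged nothing for roughly an hour and a half before reporting itself operational again.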

The other members only logged keepalive failures and failover events.

Asked JTAC what happened and whether we should replace the device (given the risk of it happening again).

Event Timeline

ayounsi triaged this task as High priority. (Sep 23 2019, 6:07 PM)
ayounsi created this task.
Restricted Application added a project: Operations. (Sep 23 2019, 6:07 PM)
Restricted Application added a subscriber: Aklapper.
ayounsi updated the task description. (Sep 23 2019, 6:10 PM)

The logs rolled over during the weekend...

According to JTAC, neither the logs shipped to central logging nor the RSI had any useful information.

From there we can:
1/ replace the switch with a spare and keep that switch running with some basic monitoring in case it fails again
2/ stay as we are right now

Bstorm added a subscriber: Bstorm. (Sep 24 2019, 11:45 PM)
ayounsi closed this task as Resolved. (Oct 1 2019, 3:27 PM)

Discussed during the Monday meeting; we will leave it as is.