asw2-d2-eqiad crash
Closed, Resolved · Public

Description

Opened JTAC case 2019-0923-0593 and provided them with the logs and RSI (request support information) collected during and after the outage.

From the faulty device's logs:

Sep 20 23:55:01  asw2-d-eqiad /usr/sbin/cron[400]: (root) CMD (   /usr/libexec/atrun)  <-- last log, routine log
Sep 21 01:28:35  asw2-d-eqiad eventd[1298]: SYSTEM_OPERATIONAL: System is operational  <-- first bootup log
[...]
Sep 21 01:28:35  asw2-d-eqiad /kernel: savecore: Reboot reason(s): 0x1: power cycle/failure
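The silent window between the last routine log and the first bootup log can be measured with a small script. This is only a sketch: the helper names (`parse_timestamp`, `largest_gap`, `reboot_reason`) are hypothetical, and the year 2019 is an assumption taken from the JTAC case number, since syslog timestamps carry no year.

```python
import re
from datetime import datetime

# Classic syslog prefix: "Sep 20 23:55:01  hostname rest-of-message"
LOG_LINE = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+(\S+)\s+(.*)$")
REBOOT_REASON = re.compile(r"Reboot reason\(s\):\s*(.+)$")

SAMPLE_LINES = [
    "Sep 20 23:55:01  asw2-d-eqiad /usr/sbin/cron[400]: (root) CMD (   /usr/libexec/atrun)",
    "Sep 21 01:28:35  asw2-d-eqiad eventd[1298]: SYSTEM_OPERATIONAL: System is operational",
    "[...]",
    "Sep 21 01:28:35  asw2-d-eqiad /kernel: savecore: Reboot reason(s): 0x1: power cycle/failure",
]


def parse_timestamp(line, year):
    """Return the line's timestamp, or None if it is not a syslog line.

    The year is supplied by the caller (assumption: all lines fall in
    the same year, so a Dec 31 -> Jan 1 wrap is not handled).
    """
    match = LOG_LINE.match(line)
    if match is None:
        return None
    month, day, clock = match.group(1), match.group(2), match.group(3)
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def largest_gap(lines, year):
    """Return (gap, line_before, line_after) for the longest silence."""
    stamped = []
    for line in lines:
        ts = parse_timestamp(line, year)
        if ts is not None:
            stamped.append((ts, line))
    if len(stamped) < 2:
        return None
    pairs = zip(stamped, stamped[1:])
    (t0, before), (t1, after) = max(pairs, key=lambda p: p[1][0] - p[0][0])
    return t1 - t0, before, after


def reboot_reason(lines):
    """Return the text after 'Reboot reason(s):', or None if absent."""
    for line in lines:
        match = REBOOT_REASON.search(line)
        if match:
            return match.group(1).strip()
    return None


if __name__ == "__main__":
    gap, before, after = largest_gap(SAMPLE_LINES, 2019)
    print("silent for:", gap)
    print("reboot reason:", reboot_reason(SAMPLE_LINES))
```

For the excerpt above, the device was silent for 1 h 33 min 34 s between the cron entry and the bootup message.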

The other members only logged keepalive failures and failover events.

Asked JTAC what happened and whether we should replace the device, given the risk of it happening again.

Event Timeline

ayounsi triaged this task as High priority.
Restricted Application added a subscriber: Aklapper.

The on-device logs rolled over during the weekend.

According to JTAC, neither the logs shipped to central logging nor the RSI contained any useful information.

From there we can:
1/ replace the switch with a spare and keep that switch running with some basic monitoring in case it fails again
2/ stay as we are right now

Discussed during the Monday meeting: we will leave it as is.