mr1-codfw is down
Closed, Resolved (Public)

Description

mr1-codfw has been down for the past 24 hours, making the codfw management network unreachable (and/or flaky). Unfortunately, we have received many thousands of alert emails as a result.

I haven't been able to log in to it via either its regular or its OOB hostname, so I'm not sure what's wrong with it. It's likely a hardware problem. Please power cycle it and investigate further ASAP.

Event Timeline

Just noting that I've also tried to connect to it today and failed. Nothing more to report.

faidon renamed this task from "mr1-codfw is flapping" to "mr1-codfw is down". Jul 4 2016, 1:23 PM
faidon updated the task description.

Power cycle complete on device. @faidon or @akosiaris please try to see if you can access the device.
Thanks.

The box did come back up, but it booted from the backup partition. I ran "request system snapshot media internal slice alternate" to copy the backup to the primary, then rebooted it. It seems to be back up and the alarms have cleared. Thanks @Papaul!
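For reference, the recovery described above could look roughly like the following Junos operational-mode session. This is only a sketch, assuming a device with dual-root partitioning. The `user@mr1-codfw>` prompt, the verification commands (`show system storage partitions`, `show system alarms`) and the `#` annotation lines are illustrative assumptions; only the snapshot command is taken from the comment above.

```
# Assumption: check which root partition the device is currently booted from.
user@mr1-codfw> show system storage partitions

# From the comment above: copy the running (backup) root to the alternate (primary) slice.
user@mr1-codfw> request system snapshot media internal slice alternate

# Reboot so the device boots from the refreshed primary partition.
user@mr1-codfw> request system reboot

# Assumption: after the reboot, confirm that no boot-related alarm is still active.
user@mr1-codfw> show system alarms
```

Running the partition check again after the reboot is one way to confirm that the device is booting from the primary slice again.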

YES! The device is operational again. Unfortunately, there is not much in the log files: they stop at Jul 2 16:55:02 (the last few messages are from sshd).