Broken disk in labstore2001
Closed, ResolvedPublic

Description

During investigation of a startup failure it was noticed that labstore2001 has a broken disk.

Event Timeline

Please also have a look at labstore2003: on system boot it shows the message "All of the disks from your previous configuration are gone. If this is an unexpected message, then please power off your system and check your cables to ensure all disks are present". It might be benign or a false positive, but please have a look.

labstore2001

I don't know how this system was designed in the first place, but here is the problem: in the RAID controller settings, the option “Enable BIOS stop on Error” was checked, which is why we were not getting to the next screen to see any error. After unchecking the option, I got an error saying that the system cannot find the boot device. Back in the RAID configuration, I noticed that the boot device was set to VD 2, which is not a boot partition. There are 2 VDs: VD 0 = os1 and VD 11 = os2.
If I set the bootable VD to 0, the system boots with no problem.
If I set it to VD 11, it boots to DHCP and tries to install the OS.

So the problem was the system couldn't find the boot device.

Note: disk in slot 5 is bad

Labstore2003 and labstore2004 are back up

Labstore2001 is still down. When @Papaul sshes into the management console and runs console com2, it says:

Login incorrect.
Give root password for maintenance
(or type Control-D to continue):

Not sure what the next step is. cc: @MoritzMuehlenhoff @chasemp

I tried now -

madhuvishy@puppetmaster1001:~$ ssh root@labstore2001.mgmt.codfw.wmnet
root@labstore2001.mgmt.codfw.wmnet's password:
/admin1-> console com2
console: Serial Device 2 is currently in use

I thought this meant nothing, but it came up!

RECOVERY - DPKG on labstore2001 is OK: All packages OK
1:16 AM RECOVERY - configured eth on labstore2001 is OK: OK - interfaces up
1:16 AM RECOVERY - MD RAID on labstore2001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
1:16 AM RECOVERY - dhclient process on labstore2001 is OK: PROCS OK: 0 processes with command name dhclient
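The MD RAID recovery line above comes from a monitoring check whose implementation is not shown in this task. As a rough, independent way to look at the same thing by hand, the sketch below scans /proc/mdstat-style text for arrays whose member map contains a "_" (a missing member). The function name md_degraded is hypothetical, and this is only a sketch, not the check that produced the alert.

```shell
# Print the names of md arrays whose member map (e.g. [U_]) shows a missing disk.
# Reads /proc/mdstat-style text on stdin. md_degraded is a hypothetical helper name.
md_degraded() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                      # remember the current array
        /\[[U_]*_[U_]*\]/ && name != "" { print name }   # "_" marks a missing member
    '
}

# Usage on the host:
#   md_degraded < /proc/mdstat
```

No output means no array reported a missing member.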

I logged in over mgmt and the network interface was down; there's no clue why that happened, though. I started the networked services manually and removed /run/nologin (which prevented SSH logins), and all should be fine again.
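For reference, the /run/nologin step can be sketched as below. While that file exists, pam_nologin refuses logins for non-root users. The helper name clear_nologin and its optional path argument are hypothetical and only there to make the step repeatable; which services were started by hand is not recorded in this task, so that part is left out.

```shell
# Remove the nologin flag file if it exists and report what was done.
# clear_nologin is a hypothetical helper; the path defaults to /run/nologin.
clear_nologin() {
    flag="${1:-/run/nologin}"
    if [ -e "$flag" ]; then
        rm -f -- "$flag" && echo "removed $flag"
    else
        echo "$flag not present"
    fi
}

# Usage on the host (as root):
#   clear_nologin
```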

Note that the host also shows a warning about a high battery temperature in 3w-sas (and it has been out of warranty since January 2015).

I chatted with Moritz on IRC; he mentioned that it is okay to resolve this task.

Papaul removed Papaul as the assignee of this task.
Papaul subscribed.

Re-opening this task since disk in slot 5 on labstore2001 is bad.

MoritzMuehlenhoff renamed this task from labstore2001 doesn't boot to Broken disk in labstore2001. Nov 2 2016, 10:36 AM
MoritzMuehlenhoff assigned this task to Papaul.
MoritzMuehlenhoff triaged this task as Medium priority.
MoritzMuehlenhoff updated the task description.

@Papaul, have you replaced the failed drive? If not, can you please do so?

I believe it can be seen via:

megacli -PDInfo -PhysDrv [32:5] -a0
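The -PDInfo output is long, so it can help to filter it down to the fields that usually indicate a failing drive. The sketch below assumes the field labels that MegaCli typically prints (Slot Number, Media Error Count, Other Error Count, Predictive Failure Count, Firmware state); the helper name pd_health is hypothetical, and the labels should be checked against the actual output on this controller.

```shell
# Keep only the health-related fields from `megacli -PDInfo` output read on stdin.
# pd_health is a hypothetical helper; the field labels are assumed, not taken from this host.
pd_health() {
    grep -E '^(Slot Number|Media Error Count|Other Error Count|Predictive Failure Count|Firmware state):'
}

# Usage on the host (needs root and the megacli binary):
#   megacli -PDInfo -PhysDrv [32:5] -a0 | pd_health
```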

I mentioned this yesterday ("Re-opening this task since disk in slot 5 on labstore2001 is bad.") and I opened a task, T149693.
Rob is already working on ordering some disks.

chasemp added a subtask: Unknown Object (Task). Nov 2 2016, 2:25 PM

Disk has been replaced

RobH closed subtask Unknown Object (Task) as Resolved. Jun 12 2017, 7:55 PM