During investigation of a startup failure, it was noticed that labstore2001 has a broken disk.
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | Papaul | T149567 Broken disk in labstore2001 |
| Unknown Object (Task) | | | |
Event Timeline
Please also have a look at labstore2003. On system boot it shows the message "All of the disks from your previous configuration are gone. If this is an unexpected message, then please power off your system and check your cables to ensure all disks are present". It might be benign or a false positive, but please have a look.
labstore2001
I don't know how this system was designed in the first place, but here is the problem: in the RAID controller settings the option "Enable BIOS stop on Error" was checked, which is why we never got to the next screen to see any error. After unchecking the option, I got an error saying that the system cannot find the boot device. Back in the RAID configuration, I noticed that the boot device was set to VD 2, which is not a boot partition. There are 2 VDs: VD 0 = os1 and VD 11 = os2.
If I set the bootable VD to 0, the system boots with no problem.
If I set it to VD 11, it boots to DHCP and tries to install the OS.
So the problem was that the system couldn't find the boot device.
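For reference, the boot VD and the BIOS error handling described above can usually also be inspected from a running OS rather than from the boot-time configuration utility. The following is only a sketch: it assumes the controller is a MegaRAID adapter 0 that responds to megacli (this task does not confirm that), the function names are made up for illustration, and the option spelling should be checked against the installed MegaCli version.

```shell
# Assumption: MegaRAID controller, adapter 0, megacli available in PATH.
MEGACLI="${MEGACLI:-megacli}"

# Show which virtual drive the controller boots from, and the BIOS
# settings (including whether it stops on error).
show_boot_settings() {
    "$MEGACLI" -AdpBootDrive -Get -a0
    "$MEGACLI" -AdpBIOS -Dsply -a0
}

# Point the controller at a given virtual drive number, e.g. 0 for os1.
set_boot_vd() {
    "$MEGACLI" -AdpBootDrive -Set "-L$1" -a0
}
```

Calling show_boot_settings first and only then set_boot_vd 0 keeps the change deliberate, since a wrong boot VD is exactly what kept this host from starting.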
Note: disk in slot 5 is bad
Labstore2003 and labstore2004 are back up
Labstore2001 is still down. When @Papaul connects to the management console over ssh and runs console com2, it says:
Login incorrect.
Give root password for maintenance
(or type Control-D to continue):
Not sure what the next step is. cc: @MoritzMuehlenhoff @chasemp
I tried just now:
madhuvishy@puppetmaster1001:~$ ssh root@labstore2001.mgmt.codfw.wmnet
root@labstore2001.mgmt.codfw.wmnet's password:
/admin1-> console com2
console: Serial Device 2 is currently in use
I thought this meant nothing, but it came up!
RECOVERY - DPKG on labstore2001 is OK: All packages OK
1:16 AM RECOVERY - configured eth on labstore2001 is OK: OK - interfaces up
1:16 AM RECOVERY - MD RAID on labstore2001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
1:16 AM RECOVERY - dhclient process on labstore2001 is OK: PROCS OK: 0 processes with command name dhclient
I logged in over the mgmt console and the network interface was down; there's no clue why that happened, though. I started the networked services manually and removed /run/nologin (which prevented SSH logins), and all should be fine again.
Note that the host also shows a warning about a high battery temperature in 3w-sas (and it's been out of warranty since January 2015).
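The manual recovery described above can be sketched roughly as follows, run from the serial console as root. The interface name eth0 is an assumption (the comment does not name the interface), and the networked services that were restarted are not listed in this task, so they are left out. Setting RUN=echo prints the commands instead of executing them.

```shell
# Assumption: eth0 is hypothetical; substitute the host's real interface.
IFACE="${IFACE:-eth0}"
# Set RUN=echo for a dry run that only prints the commands.
RUN="${RUN:-}"

recover_login() {
    # Bring the network interface back up.
    $RUN ip link set dev "$IFACE" up
    # Remove the flag file that makes pam_nologin refuse non-root logins.
    $RUN rm -f /run/nologin
}
```

A dry run with RUN=echo recover_login is a cheap way to confirm the interface name before touching a host that is only reachable over the console.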
@Papaul have you replaced the failed drive? If not, can you please do so?
I believe it can be seen via:
megacli -PDInfo -PhysDrv [32:5] -a0
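The -PDInfo output is long, so a small filter helps when checking one slot. This is a sketch under assumptions: enclosure 32 and slot 5 come from the command above, the function name is hypothetical, and the field labels are the ones MegaCli commonly prints, which may differ between versions.

```shell
# Assumption: megacli in PATH and adapter 0, as in the command above.
MEGACLI="${MEGACLI:-megacli}"

# Print only the health-related fields for one physical drive.
# Arguments: enclosure ID (default 32) and slot number (default 5).
disk_health() {
    "$MEGACLI" -PDInfo -PhysDrv "[${1:-32}:${2:-5}]" -a0 |
        grep -E '^(Slot Number|Media Error Count|Other Error Count|Predictive Failure Count|Firmware state)'
}
```

Running disk_health 32 5 should then show the error counters and firmware state for the suspect drive, if the controller reports them under these labels.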
I mentioned this yesterday ("Re-opening this task since disk in slot 5 on labstore2001 is bad.") and I opened task T149693.
Rob is already working on ordering some disks.