wdqs1002 does not reboot, stops at "Scanning for devices"
Closed, ResolvedPublic

Description

Just as the title says... wdqs1002.eqiad.wmnet does not come up after a routine reboot.

@Joe and @Volans did the investigation. What we know so far:

  • The server was booting from PXE, not from the HDD. Luckily the PXE installation did not complete because of the missing swap partition.
  • LVs were checked and look good (no partition was erased during the failed install, data is still there). The check was done by dropping to a shell during the PXE installation.
  • RAM, disks and CPU look good when checked in the BIOS.
  • POST serial redirection was on the whole time; nothing strange was logged.
  • There was a message along the lines of "no supported device found" at reboot, but it went away too quickly to read.

Event Timeline

I've disabled the lifecycle controller completely, so it's not the issue. After more testing, I think the problem comes from the initial load into the installer described here: "server was booting in pxe, not on HDD. Luckily pxe installation did not complete due to missing swap partition." When it hit that screen, I think it wiped the MBR, so while the data is in place, there is no master boot record.

The system will likely require reinstall.

When it asks about no swap, you simply confirm you do not want a swap and the installer will continue.

If it's getting reinstalled, can it also be linked to T120714, i.e. extending disk space, setting up LVM correctly, etc.?

Please note: T120714

This should have an entirely new partman recipe designed to accommodate the new disks previously installed in this system.

So, a few points:

  • If the MBR was wiped, it was not wiped by the installation process (the MBR stays untouched until the very end of the install process, thank god).
  • If the MBR was wiped, we can simply boot from PXE, get to the point where the installer stops, get a shell and reinstall the bootloader in the MBR.
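The recovery path in the second point could look roughly like the sketch below. It only builds and prints the command sequence (a dry run) so it can be reviewed before anything is typed into the installer shell. The root LV path `/dev/mapper/vg-root` is a hypothetical placeholder, `/dev/sda` as the target disk is inferred from `/dev/sda1` being the boot partition, and the mount point `/mnt` is an arbitrary choice; none of these were confirmed on wdqs1002.

```python
def bootloader_repair_commands(root_dev, boot_dev, disk, mnt="/mnt"):
    """Return the commands (as argv lists) to reinstall GRUB from a rescue shell.

    This is a dry-run planner: it executes nothing. All device paths are
    supplied by the caller and must be checked against the real layout.
    """
    cmds = [
        # Activate LVM volume groups so the logical volumes become visible.
        ["vgchange", "-ay"],
        # Mount the installed system: root first, then /boot on top of it.
        ["mount", root_dev, mnt],
        ["mount", boot_dev, mnt + "/boot"],
    ]
    # Bind-mount the pseudo filesystems that grub-install needs in the chroot.
    for pseudo in ("/dev", "/proc", "/sys"):
        cmds.append(["mount", "-o", "bind", pseudo, mnt + pseudo])
    # Rewrite the boot code in the MBR, then regenerate the GRUB menu.
    cmds.append(["chroot", mnt, "grub-install", disk])
    cmds.append(["chroot", mnt, "update-grub"])
    return cmds


# Hypothetical device names, for illustration only.
plan = bootloader_repair_commands("/dev/mapper/vg-root", "/dev/sda1", "/dev/sda")
for argv in plan:
    print(" ".join(argv))
```

Since this leaves the partitions and logical volumes untouched, it would preserve the data, at the cost of not addressing the new partition layout asked for in T120714.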

Since these machines hold tons of data I guess it would be *really* preferable not to wipe them out if possible.

It's really strange that it initiated a PXE install: both wdqs* servers were rebooted w/o problems for the keyctl security bug (and no additional BIOS config was needed).

Why were we rebooting this system to start with?

AFAIK something to do with kernel upgrade, @Gehel should know the details.

The reboot was for the upgrade to Linux 4.4

@Joe FYI: yesterday I was able to get a shell in the installer and mount /dev/sda1, and there was a boot partition with 4 kernels there, so the data was not wiped out IMHO.
I was also able to see the other 2 partitions, and lvdisplay was showing 3 logical volumes.

So, could it be that the package upgrade failed to correctly install GRUB to the MBR?
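One way to test that hypothesis from the installer shell would be to look at the first sector of the disk directly. The sketch below is a heuristic, not a definitive check: it assumes a classic MBR layout (440 bytes of boot code, four 16-byte partition entries starting at offset 446, signature `0x55 0xAA` at offset 510) and treats the ASCII string "GRUB" in the boot code as a hint that GRUB's first stage is installed. The function names are illustrative.

```python
SECTOR_SIZE = 512
BOOT_CODE_SIZE = 440          # boot code area of a classic MBR
PARTITION_TABLE_OFFSET = 446  # four 16-byte entries follow
SIGNATURE_OFFSET = 510


def read_first_sector(path):
    """Read the first 512 bytes of a disk or image (needs read access)."""
    with open(path, "rb") as f:
        return f.read(SECTOR_SIZE)


def inspect_mbr(sector):
    """Summarise what the first sector of a disk looks like."""
    if len(sector) < SECTOR_SIZE:
        raise ValueError("need at least 512 bytes")
    sector = bytes(sector[:SECTOR_SIZE])
    boot_code = sector[:BOOT_CODE_SIZE]
    # Byte 4 of each partition entry is the type; 0 means the slot is unused.
    used = sum(
        1
        for i in range(4)
        if sector[PARTITION_TABLE_OFFSET + 16 * i + 4] != 0
    )
    return {
        "has_boot_signature": sector[SIGNATURE_OFFSET:SIGNATURE_OFFSET + 2] == b"\x55\xaa",
        "boot_code_empty": boot_code.count(0) == len(boot_code),
        "mentions_grub": b"GRUB" in boot_code,  # heuristic only
        "used_partition_entries": used,
    }
```

For example, `inspect_mbr(read_first_sector("/dev/sda"))` (device path assumed) returning an intact partition table together with `boot_code_empty: True` would be consistent with the observations above: partitions and LVs present, but nothing to boot from.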

Change 282946 had a related patch set uploaded (by Gehel):
remove wdqs1002 from varnish during reinstall / fix

https://gerrit.wikimedia.org/r/282946

Change 282946 merged by Gehel:
remove wdqs1002 from varnish during reinstall / fix

https://gerrit.wikimedia.org/r/282946

Full reinstall is in progress as we needed to account for new disks anyway.

Change 283485 had a related patch set uploaded (by Gehel):
Revert "remove wdqs1002 from varnish during reinstall / fix"

https://gerrit.wikimedia.org/r/283485

Change 283485 merged by Gehel:
Revert "remove wdqs1002 from varnish during reinstall / fix"

https://gerrit.wikimedia.org/r/283485

Mentioned in SAL [2016-04-15T09:01:04Z] <gehel> reenabling wdqs1002 in varnish rotation after reinstall (T132387)

Smalyshev closed this task as Resolved. Apr 15 2016, 5:46 PM

I think this is done? Thanks @Gehel for all the work.