Rack and set up ms-be1016-1018
Closed, ResolvedPublic

Description

Received the new swift servers. Rack them in B5
Switch cfg has been finished in private vlan
DNS has been completed both production and mgmt
dhcp file update
netboot.cfg was already able to accept these
iLO has been setup
cabled
racktables completed

Event Timeline

Cmjohnson claimed this task.
Cmjohnson raised the priority of this task from to Needs Triage.
Cmjohnson updated the task description.
Cmjohnson subscribed.

Reassigning to @filippo for install

installing, status so far:

  1. uefi boot must be disabled in favor of legacy boot
  2. the raid array is not configured, and launching hp SSA utility from bios results in this:
Welcome to GRUB!
                
error: no such device: HPEZCD201.

I tried to work around that by downloading the hp ssa cli (http://downloads.linux.hp.com/SDR/repo/mcp/ubuntu/pool/non-free/hpssacli-2.0-16.0_amd64.deb) and running it in the d-i shell to configure each PD as its own LD in raid0:

set target ctrl slot=1
# order is important, we expect the SSD to be last (sdm/sdn)
create type=arrayr0 drivetype=sata
create type=arrayr0 drivetype=ss_sata

This creates all the expected LDs. However, setting which logical drive the controller boots from is only supported in "offline" mode, which is what is failing in 2. above, so the controller tries to boot from LD 1 and fails.
I'm not sure how to fix SSA not booting as described above; even googling the error doesn't yield much.
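
To make the ordering expectation concrete, here is a minimal Python sketch (not part of the install tooling). It assumes logical drives are enumerated in creation order and that there are 12 spinning disks followed by 2 SSDs; the count of 12 is inferred from the SSDs being expected at sdm/sdn, and the function name is hypothetical.

```python
import string


def expected_device_names(n_spinning, n_ssd):
    """Map drive classes to Linux device names.

    Assumption: logical drives show up in creation order, spinning
    disks first and SSDs last, as the hpssacli commands above intend.
    """
    total = n_spinning + n_ssd
    if total > 26:
        raise ValueError("sketch only covers single-letter names sda..sdz")
    names = ["sd" + letter for letter in string.ascii_lowercase[:total]]
    return {"spinning": names[:n_spinning], "ssd": names[n_spinning:]}


# Assumed counts: 12 spinning disks + 2 SSDs.
layout = expected_device_names(n_spinning=12, n_ssd=2)
print(layout["ssd"])  # ['sdm', 'sdn']
```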

Tried booting SSA in offline mode via virtual media; however, that requires an iLO license:

</>hpiLO-> vm cdrom insert http://208.80.154.151:9999/hpssaoffline-1.50-4.0.iso
                                                     
status=2
status_tag=COMMAND PROCESSING FAILED
error_tag=COMMAND ERROR-UNSPECIFIED
Fri Feb 27 07:06:06 2015
                        
iLO Advanced License required for this functionality.


An iLO 4 License key is required to use Virtual Media.



</>hpiLO->

@Cmjohnson was able to set LD m and LD n as primary/secondary from onsite. Rebooting now yields this after "booting in legacy bios mode":

304-Keyboard or System Unit Error

I removed the keyboard from the server and the error cleared. However, when I tried to install, I got an error installing grub on /dev/sdm:

  ┌┤ [!!] Install the GRUB boot loader on a hard disk ├┐
┌──────────│                                                    │ ────────┐
│          │         Unable to install GRUB in /dev/sdm         │         │
│          │ Executing 'grub-install /dev/sdm' failed.          │         │
│          │                                                    │         │
│ Running "│ This is a fatal error.                             │         │
│          │                                                    │         │
└──────────│     <Go Back>                       <Continue>     │ ────────┘
           │                                                    │
           └────────────────────────────────────────────────────┘

I've investigated a bit more and it seems the raid controller will reorder disks as presented to the operating system, with sda / sdb being the primary and secondary drives selected in the controller. This is unfortunate because it isn't what we are expecting (i.e. that the order is left unchanged irrespective of which drive we are booting from). It seems the hp bios doesn't present all of the controller's drives when in legacy bios mode, but it does in uefi mode.

I wasn't able to work around this in the hp firmware (i.e. to select logical drive 12 and 13 to boot from in bios legacy mode, as opposed to just the controller) and I'm not sure it is possible at all with the hp firmware.

Since our installation and swift expect the ssd to be sdm and sdn we should avoid the renaming, thus two options:

  • bios legacy boot
    • let the controller boot from logical disk 1 and 2 (sda and sdb, spinning disks)
    • have /boot on a separate raid1 spanning sda2 and sdb2 (partition 1 is for swift)
    • / can live on the usual raid1 made from sdm and sdn
    • this should be safe upon replacing sda or sdb because the software raid1 would get degraded and related alarms fire
  • uefi boot
    • this is AFAIK uncharted territory, but it would involve at least:
    • PXE-booting a uefi boot loader that is able to load debian-installer
    • having two GPT partitions on sdm and sdn with grub-efi, and letting the bios boot from either disk in order

Given that we have more control with the former option, I think that's the way we should go.

to expand on this, as I understand the situation:

  • in legacy bios mode, the hp bios interface only lets you select one drive to boot from (IOW it lets the raid controller handle it via the primary/secondary boot disk setting)
    • the raid controller in turn will reorder and present the selected primary/secondary boot drives as sda/sdb as seen by linux, even though we expect sdm/sdn
  • in uefi mode, hp bios interface presents all of the controller virtual disks to be booted from, however this comes with uefi caveats outlined in the previous comment
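
The legacy-mode renaming can be sketched in a few lines of Python. This is only a model of the observed behaviour, not the controller's actual logic: it assumes the selected primary/secondary boot drives are presented first and that the remaining drives keep their original relative order (the second part is an assumption). The drive labels (`hdd01`, `ssd1`, ...) and function names are hypothetical.

```python
def presented_order(logical_drives, primary, secondary):
    """Return drives in the order the OS is assumed to see them:
    boot drives first, everything else in original order."""
    boot = [primary, secondary]
    for ld in boot:
        if ld not in logical_drives:
            raise ValueError("unknown logical drive: %s" % ld)
    rest = [ld for ld in logical_drives if ld not in boot]
    return boot + rest


def device_names(order):
    """Assign sda, sdb, ... following the presented order."""
    return {ld: "sd" + chr(ord("a") + i) for i, ld in enumerate(order)}


# Hypothetical labels: 12 spinning disks followed by 2 SSDs.
drives = ["hdd%02d" % i for i in range(1, 13)] + ["ssd1", "ssd2"]

unchanged = device_names(drives)
reordered = device_names(presented_order(drives, "ssd1", "ssd2"))

print(unchanged["ssd1"], unchanged["ssd2"])  # sdm sdn (what we expect)
print(reordered["ssd1"], reordered["ssd2"])  # sda sdb (what linux sees)
```

Under this model every spinning disk also shifts by two letters (`hdd01` becomes sdc), which is why the renaming would affect the swift device configuration and not just the boot disks.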

There is a third option, which is to have another profile where sda/sdb are the SSDs and the rest are spindles, configured like this everywhere (d-i, puppet, swift ring files). Tampa servers used to be configured like that; I think the config may even be in our git history.

Of all of those options I think I'd prefer a UEFI boot but as you say, that's uncharted territory. Maybe spend e.g. a day on it then fall back to one of our plan Bs?

Change 197526 had a related patch set uploaded (by Filippo Giunchedi):
swift: provision ms-be101[678]

https://gerrit.wikimedia.org/r/197526

Giving uefi boot a try: the pxelinux / syslinux uefi versions currently run into this: http://www.syslinux.org/archives/2015-February/023178.html, after which they proceed with loading the kernel and initrd, only to crash/reboot after (finishing?) loading the initrd. I've sent a followup message to the upstream list to double check. Both pxelinux from jessie and the latest upstream git yield the same result. Not sure if this is the fault of syslinux or of the uefi implementation. Next try is with grub2, which supports efi loading.

grub-efi boots fine; however, it (obviously) doesn't support the path prefix passed from dhcp to pxelinux and thus tries to load files relative to the tftp root directory. I've worked around this temporarily by symlinking debian-installer from jessie inside /srv/tftpboot, using this configuration in grub.cfg:

serial --unit=1 --speed=115200
terminal_input serial 
terminal_output serial
terminal serial

menuentry 'Install' {         
    set background_color=black 
    linux    debian-installer/amd64/linux vga=normal auto-install/enable=true preseed/url=http://apt.wikimedia.org/autoinstall/preseed.cfg DEBCONF_DEBUG=5 netcfg/choose_interface=auto netcfg/get_hostname=unassigned netcfg/get_domain=unassigned netcfg/dhcp_timeout=60 --- console=ttyS1,115200n8              
    initrd   debian-installer/amd64/initrd.gz
}
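
As a rough illustration of the path-prefix problem worked around above, the Python sketch below shows how the same file name from a boot config resolves with and without a prefix. The prefix value `jessie-installer/` and the function name are hypothetical; the real prefix is whatever dhcp hands to pxelinux.

```python
import posixpath


def resolve_tftp_path(requested, prefix=""):
    """Return the path, relative to the tftp root, that a boot loader
    would request for a file named in its configuration."""
    return posixpath.normpath(posixpath.join(prefix, requested))


kernel = "debian-installer/amd64/linux"

# pxelinux honours the prefix from dhcp (hypothetical value here).
print(resolve_tftp_path(kernel, prefix="jessie-installer/"))
# grub-efi ignores it, so the file must exist under the tftp root itself,
# hence the symlink inside /srv/tftpboot.
print(resolve_tftp_path(kernel))
```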

debian-installer starts and asks "Force UEFI installation?", after which it fails to find the EFI partition that UEFI needs in order to locate the boot files:

┌───────────┤ [!!] Partition disks ├───────────┐
│                                              │
│ No EFI partition was found.                  │
│                                              │
│ Go back to the menu and resume partitioning? │
│                                              │
│     <Go Back>              <Yes>    <No>     │
│                                              │
└──────────────────────────────────────────────┘

Since we'd need to change the swift partition scheme anyway to accommodate this partition, I think we're better off falling back to what Faidon proposed re: sda/sdb. Related UEFI work is tracked in T93208.

Change 197526 merged by Filippo Giunchedi:
swift: provision ms-be101[678]

https://gerrit.wikimedia.org/r/197526

Change 198222 had a related patch set uploaded (by Filippo Giunchedi):
swift: update partman allocation for HP

https://gerrit.wikimedia.org/r/198222

Change 198222 merged by Filippo Giunchedi:
swift: update partman allocation for HP

https://gerrit.wikimedia.org/r/198222

Change 198227 had a related patch set uploaded (by Filippo Giunchedi):
install-server: upgrade kernel on swift HP machines

https://gerrit.wikimedia.org/r/198227

machines are in service so this is complete, modulo remaining gerrit patches.

the implemented solution was what Faidon proposed: use sda/sdb instead of sdm/sdn for the SSD disks and reflect that in puppet and the swift ring files.

Change 198227 abandoned by Filippo Giunchedi:
install-server: upgrade kernel on swift HP machines

Reason:
all packages should get upgraded, see also https://phabricator.wikimedia.org/T94177

https://gerrit.wikimedia.org/r/198227

machines in service, resolving