Allow UEFI DHCP configs
Open, Low, Public

Description

It seems like a good time to test whether UEFI could be used in our infrastructure. The rationale is that more and more issues have arisen over the past years when dealing with Legacy BIOS / lpxelinux / TFTP / etc. (see T363576 for some details). Most vendors ask us whether we have tested with UEFI, since Legacy BIOS is no longer fully supported.

In order to be able to use UEFI, we should:

  1. Add syslinux-efi to our DHCP install servers.
  2. Verify whether anything needs to be added to Spicerack's dhcp module to allow UEFI. In theory it should be a matter of configuring filename and some option pxelinux.something, which is already possible.
  3. Configure the provision cookbook to set UEFI instead of Legacy BIOS (opt-in with a parameter).
  4. Configure the reimage cookbook to use UEFI as well (same opt-in parameter).

Once we have a running host with UEFI that works with provision and reimage we can think about next steps.

None of this covers the security review that will be needed if we choose UEFI; in my opinion that should be tracked in a separate task.

Event Timeline

Add syslinux-efi to our DHCP install servers.

Syslinux development seems to have halted upstream in 2019; should we consider using another bootloader for EFI?

@jhathaway definitely. I thought it could be a good start, even just for quick tests, but we can test something else too. We can customize what DHCP sends to the NIC doing PXE via spicerack/cookbook, so we could have multiple EFI boot loaders installed and test them separately. Do you have any suggestions for others to try?

In spicerack we now have dhcp_options and dhcp_filename to pass to DHCPConfMac and DHCPConfOpt82. This allows us to override filename and various option $something settings in cookbooks.

Something like:

filename "efi64/syslinux.efi";
option pxelinux.pathprefix "http://XXX.XXX.XXX.XXX/efi64/";

is possible just by passing the right parameters when creating the class. It could be used in the provision cookbook when --force-uefi is set, for example.
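
As a rough illustration of the idea, the sketch below renders the two statements above from a filename and a dictionary of options, and only applies them when UEFI is requested. Both `render_dhcp_overrides` and `uefi_overrides` are hypothetical helpers written for this example, not Spicerack code; the actual `DHCPConfMac` / `DHCPConfOpt82` signatures should be checked in the dhcp module, and the IP address is a documentation placeholder.

```python
def render_dhcp_overrides(filename, options):
    """Render ISC dhcpd statements for a boot filename and extra options.

    Hypothetical helper for illustration only. It assumes every option
    value is a string that must be quoted.
    """
    lines = [f'filename "{filename}";']
    for name, value in options.items():
        lines.append(f'option {name} "{value}";')
    return "\n".join(lines)


def uefi_overrides(force_uefi, http_server):
    """Return (filename, options) for UEFI, or None to keep the defaults."""
    if not force_uefi:
        return None
    options = {"pxelinux.pathprefix": f"http://{http_server}/efi64/"}
    return "efi64/syslinux.efi", options


if __name__ == "__main__":
    # 192.0.2.10 is a placeholder address, not a real install server.
    overrides = uefi_overrides(force_uefi=True, http_server="192.0.2.10")
    if overrides is not None:
        print(render_dhcp_overrides(*overrides))
```

Returning `None` in the non-UEFI case keeps the opt-in explicit: callers leave the existing Legacy BIOS settings untouched unless the flag is set.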

Change #1077377 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/cookbooks@master] sre.hosts.provision: initial UEFI support

https://gerrit.wikimedia.org/r/1077377

Change #1077497 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/cookbooks@master] sre.hosts.reimage: add UEFI HTTP Boot support

https://gerrit.wikimedia.org/r/1077497

Change #1078020 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] efi: add efi boot files on apt server

https://gerrit.wikimedia.org/r/1078020

Change #1078020 merged by JHathaway:

[operations/puppet@production] efi: add efi boot files on apt server

https://gerrit.wikimedia.org/r/1078020

Forgive the drive-by comment, but I'm wondering if we have evaluated any other NICs besides Broadcom? We've lost countless hours to their firmware bugs (at least ~100 of my team's hosts have been affected in the ~3 years I've worked here). That's a pretty significant cost if you think about our salaries, opportunity costs, etc.

Broadcom had a very low reputation at previous places I've worked, so I thought it might be helpful to evaluate Intel (or any other brand) of NIC to see what's a problem with legacy BIOS vs what's a problem with Broadcom. Apologies if this has already been considered.

@bking I think it's a question worth asking, but probably not in this task :) Could you open a dedicated one for the Procurement/DCops team?

Change #1077377 merged by Elukey:

[operations/cookbooks@master] sre.hosts.provision: initial UEFI support

https://gerrit.wikimedia.org/r/1077377

Change #1077497 merged by JHathaway:

[operations/cookbooks@master] sre.hosts.reimage: add UEFI HTTP Boot support

https://gerrit.wikimedia.org/r/1077497

Change #1087865 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.hosts.reimage: fix _validate() when using UEFI

https://gerrit.wikimedia.org/r/1087865

Change #1087865 merged by Elukey:

[operations/cookbooks@master] sre.hosts.reimage: fix _validate() when using UEFI

https://gerrit.wikimedia.org/r/1087865

I found two issues while reimaging ms-be2083 (Supermicro):

  • The cookbook can't tell whether it is in d-i or not, since /proc/cmdline doesn't contain "debian-installer" as expected, but instead:
~ # cat /proc/cmdline 
linux initrd=one.gz vga=normal auto-install/enable=true preseed/url=http://apt.wikimedia.org/autoinstall/preseed.cfg DEBCONF_DEBUG=5 netcfg/choose_interface=auto netcfg/get_hostname=unassigned netcfg/get_domain=unassigned netcfg/dhcp_timeout=60 --- console=ttyS1,115200n8 raid0.default_layout=2
  • The recipe for ms-be2083 may be wrong since during boot it can't find any media present, and forces PXE again (that triggers d-i etc...).
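
One way to make the first check less dependent on a single string is to look for installer-specific parameters in the kernel command line. The sketch below is only an illustration under that assumption: `looks_like_installer` and the marker list are hypothetical, derived from the output above, and this is not necessarily what the cookbook ends up doing.

```python
# Substrings that suggest the host booted into the Debian installer.
# The list is an assumption based on the /proc/cmdline output above.
INSTALLER_MARKERS = ("debian-installer", "preseed/url=", "auto-install/enable=")


def looks_like_installer(cmdline):
    """Return True if the kernel command line contains an installer marker."""
    return any(marker in cmdline for marker in INSTALLER_MARKERS)


if __name__ == "__main__":
    # On a live host the command line would be read from /proc/cmdline.
    with open("/proc/cmdline", encoding="utf-8") as handle:
        print(looks_like_installer(handle.read()))
```

With the UEFI command line shown above, the `preseed/url=` and `auto-install/enable=` markers match even though "debian-installer" is absent.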

Change #1087869 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.hosts.reimage: fix remote command to use to test if d-i started

https://gerrit.wikimedia.org/r/1087869

Change #1087869 merged by Elukey:

[operations/cookbooks@master] sre.hosts.reimage: fix remote command to use to test if d-i started

https://gerrit.wikimedia.org/r/1087869

  • The recipe for ms-be2083 may be wrong since during boot it can't find any media present, and forces PXE again (that triggers d-i etc...).

No idea what happened, but I retried a reimage after https://gerrit.wikimedia.org/r/1087869 and everything went fine. So for the moment the first real test with ms-be2083 seems to be a success! We'll see what DP wants to do next, but we'll surely have more hosts to battle-test the new UEFI support.

Change #1099740 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] partman: add recipe for UEFI 4-disk SW RAID-10

https://gerrit.wikimedia.org/r/1099740

Change #1099740 merged by Bking:

[operations/puppet@production] partman: add recipe for UEFI 4-disk SW RAID-10

https://gerrit.wikimedia.org/r/1099740

Change #1101095 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs1025: Configure partitions for UEFI

https://gerrit.wikimedia.org/r/1101095

Change #1101095 merged by Bking:

[operations/puppet@production] wdqs1025: Configure partitions for UEFI

https://gerrit.wikimedia.org/r/1101095