add HBA355i support to installer
Closed, Resolved, Public

Description

The newest line of PowerEdge R450 servers has moved off the 'onboard SATA controller' and now requires either a full-scale PERC H750 controller (overkill) or the cheaper HBA355i (when using software RAID).

We went with the HBA355i for the new R450 cp hosts, and have pending delivery of 16 of them to eqsin next week, as well as 16 of them delivered in ulsfo. This issue was discovered during setup of cp4045 via T317244.

Dell driver website for the HBA355i: https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=4cjy9&oscode=naa&productcode=poweredge-r450

Event Timeline

RobH renamed this task from "add hbs330 support to installer" to "add HBA355i support to installer". Sep 30 2022, 6:22 PM
RobH updated the task description.
RobH added a subscriber: MoritzMuehlenhoff.

Screen Shot 2022-09-30 at 11.17.48 AM.png (684×972 px, 118 KB)

The last time I had an issue with driver support in the installer, I recall @MoritzMuehlenhoff being the person to help me out. Moritz, is this still the case, and are you able to assist with driver support for the HBA355i controller?

I did a little digging from the install_console shell on this host.

lspci output for this adapter is:

~ # lspci -v -s 65:00.0 -nn
65:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI Fusion-MPT 12GSAS/PCIe Secure SAS38xx [1000:00e6]
	DeviceName: SL3 NonRAID
	Subsystem: Dell HBA355i Front [1028:200c]
	Flags: bus master, fast devsel, latency 0, IRQ 11, NUMA node 0
	Memory at bbc00000 (64-bit, prefetchable) [size=1M]
	Memory at bbd00000 (64-bit, prefetchable) [size=1M]
	Memory at bbe00000 (32-bit, non-prefetchable) [size=1M]
	I/O ports at a000 [size=256]
	Expansion ROM at <ignored> [disabled]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
	Capabilities: [70] Express Endpoint, MSI 00
	Capabilities: [b0] MSI-X: Enable- Count=128 Masked-
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [148] Power Budgeting <?>
	Capabilities: [158] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [168] Secondary PCI Express
	Capabilities: [188] Physical Layer 16.0 GT/s <?>
	Capabilities: [1b0] Lane Margining at the Receiver <?>
	Capabilities: [218] Dynamic Power Allocation <?>
	Capabilities: [248] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
	Capabilities: [348] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
	Capabilities: [380] Data Link Feature <?>
	Kernel modules: mpt3sas

Note that even lspci knows the driver is mpt3sas. Also, that driver appears to already be present in the installer, as I'm able to manually modprobe it in and then the disks show up as sda and sdb:

~ # modprobe mpt3sas
~ # cat /proc/partitions
major minor  #blocks  name

 259        1 6251223384 nvme0n1
 259        3 6251223384 nvme1n1
   8        0  468851544 sda
   8        1  468849664 sda1
   8       16  468851544 sdb
   8       17  468849664 sdb1
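
To confirm from a script that the disks appeared after the modprobe, the /proc/partitions listing above can be filtered for whole sd* disks. This is only a sketch: the function name list_sd_disks is hypothetical, and it assumes a POSIX-style shell with `read -r` and `case`, which may or may not match the installer's busybox shell.

```shell
#!/bin/sh
# Hypothetical helper: print whole SCSI/SAS disks (sda, sdb, ...) found in a
# /proc/partitions-style file, skipping partitions (sda1) and other devices.
list_sd_disks() {
    partitions_file=${1:-/proc/partitions}
    # Columns are: major minor #blocks name. The header and the blank line
    # fall through the case statement without matching anything.
    while read -r major minor blocks name; do
        case $name in
            sd*[0-9]) ;;                # partition such as sda1: skip
            sd[a-z]*) echo "$name" ;;   # whole disk such as sda
        esac
    done < "$partitions_file"
}
```

Called without arguments it reads /proc/partitions; an empty result after `modprobe mpt3sas` would suggest the driver did not bind to the controller.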

The question is why the Debian installer didn't load this automagically, and how we fix that so it doesn't prompt during reimaging.

Further updates on this thread:

  1. The installation attempts and debugging above were on bullseye, but our cp puppetization is actually still on buster
  2. When trying a reimage with buster, even loading mpt3sas manually from install console doesn't make /dev/sd[ab] appear. Probably the buster version of this driver lacks the PCI ID and/or support code for this newer card.
  3. It's not very feasible for us to suddenly rush the forward-porting of the whole cp software stack to bullseye yet, either, and we have a ton of these nodes already landed in ulsfo and landing shortly in eqsin.
  4. We're probably going to need to use a newer kernel and/or driver on buster for these installs!
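
One way to test the guess in point 2 (that the buster driver lacks the PCI ID) would be to look for the card's vendor:device pair, 1000:00e6 from the lspci output, in the kernel's modules.alias file. The following is a sketch under assumptions: the function names are hypothetical, the installer environment may not ship a modules.alias file at all, and an alias entry only shows that the module advertises the ID, not that the card is fully supported.

```shell
#!/bin/sh
# Hypothetical helpers; not part of any existing tooling.

# Build the modalias prefix for a PCI vendor/device pair, e.g.
# "1000 00e6" becomes pci:v00001000d000000E6 (upper-case hex, padded to 8).
pci_alias_prefix() {
    vendor=$(printf '%s' "$1" | tr 'a-f' 'A-F')
    device=$(printf '%s' "$2" | tr 'a-f' 'A-F')
    printf 'pci:v%8sd%8s' "$vendor" "$device" | tr ' ' '0'
}

# Succeed if the alias file maps that exact vendor/device pair to the module.
# Lines in modules.alias look like:
#   alias pci:v00001000d000000E6sv*sd*bc*sc*i* mpt3sas
module_claims_pci_id() {
    alias_file=$1
    module=$2
    prefix=$(pci_alias_prefix "$3" "$4")
    grep -q "^alias ${prefix}sv.* ${module}\$" "$alias_file"
}
```

For example, `module_claims_pci_id "/lib/modules/$(uname -r)/modules.alias" mpt3sas 1000 00e6` could be run on an installed buster host with the 4.19 kernel and compared against one running 5.10.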

I see our buster actually has linux-image-5.10.0-0.deb10.17-amd64 available in its repos. It may just be a matter of figuring out how to launch an installer on that kernel, and have it installed as the runtime one as well.

I've also found some other breadcrumbs. Runtime buster + 5.10 support is puppetized in modules/profile/manifests/base/linux510.pp. There's instructions about updating installer images for newer kernels in https://wikitech.wikimedia.org/wiki/Updating_netboot_image_with_newer_kernel , but I'm not sure any of that can be followed verbatim. We probably don't want to impact all buster installs, and instead want to create a new "os" named something like buster510 that can be passed as the --os argument to the reimage script.

The bits for the reimage cookbook are trivial to do: Spicerack already has support for custom images, see the media_type argument to https://doc.wikimedia.org/spicerack/master/api/spicerack.dhcp.html#spicerack.dhcp.DHCPConfOpt82
It's currently not passed by the cookbook, which lets Spicerack use the default value, but it's trivial to add an --image CLI argument that defaults to installer and can be set to something else, as long as that image is present on the install hosts (ideally added via the puppet volatile).
cc @jBond_WMF

Volans added a subscriber: Unknown Object (User). Oct 3 2022, 8:47 PM

I've also found some other breadcrumbs. Runtime buster + 5.10 support is puppetized in modules/profile/manifests/base/linux510.pp. There's instructions about updating installer images for newer kernels in https://wikitech.wikimedia.org/wiki/Updating_netboot_image_with_newer_kernel , but I'm not sure any of that can be followed verbatim. We probably don't want to impact all buster installs, and instead want to create a new "os" named something like buster510 that can be passed as the --os argument to the reimage script.

I can look into preparing a 4.19+5.10 boot image. Back when we last used this, the DHCP config was still managed in Puppet, so we simply set the different boot image in there; with the new scheme it will need to be added to the cookbook.

The question is why the Debian installer didn't load this automagically, and how we fix that so it doesn't prompt during reimaging.

This is in fact puzzling. My initial hunch was that mpt3sas might be missing from the udebs (d-i first fetches a trimmed-down selection of packages specifically for the installer), but mpt3sas is in fact present in scsi-modules-5.10.0-16-amd64.udeb.

d-i will see some radical changes in the near future, to the extent that it will simply switch to using standard debs (and allow for Python in the installer), since the current design, focused on minimal images, originated from size-constrained architectures that are no longer relevant or supported. Once the d-i userspace/kernel is pretty much identical to the eventual installed image, I'm sure this issue goes away entirely, but that's nothing which will help us immediately.

I think we can best "fix" this for now by updating modules/install_server/files/early_command.sh to modprobe mpt3sas if /proc/bus/pci/65/00.0 exists (we don't yet have lspci in early installer)? It's not very elegant, but should allow for an uninterrupted installation at least.
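
A minimal sketch of what that workaround might look like follows. It is not the actual content of early_command.sh: the function name and the PROC_PCI_ROOT and MODPROBE overrides are hypothetical and exist only so the logic can be exercised outside the installer, and the 65/00.0 address is the one observed on cp4045, which may differ on other hosts.

```shell
#!/bin/sh
# Hypothetical sketch of the proposed early_command.sh hack.
load_hba_driver() {
    # Overridable for testing; default to the real locations.
    pci_root=${PROC_PCI_ROOT:-/proc/bus/pci}
    modprobe_cmd=${MODPROBE:-modprobe}
    # Only load the driver when a device exists at the bus/slot where the
    # HBA355i was seen (65:00.0 in the lspci output).
    if [ -e "$pci_root/65/00.0" ]; then
        "$modprobe_cmd" mpt3sas
    fi
}
```

Keying on a fixed bus address is brittle by design here; matching on the vendor and device IDs would be more robust, at the cost of more parsing in the early installer environment.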

So, we have a need to move on this pretty quickly, as we have 16 new cache hosts in ulsfo pending installs on this, and then 16 more in eqsin right on their heels. In both cases we're facing limited rack space (have to decom-before-install as we go) and we can't store this stuff or put off the hardware swaps for very long. Trying to summarize our easiest path forward based on the above, I see a few things to tackle and some dependencies that kinda put them in this logical order:

  • Create a buster-based 4.19+5.10 boot image (I guess this means both are available in the base buster installer for everyone, subject to some boot/install-time param to use 510)
  • Add support for it (it being whatever it takes to switch to 5.10) to the reimage cookbook stuff
  • Try cp4045 installation again to confirm whether we still have an install-time problem with mpt3sas (if so, hack early_command or whatever to work around it)
  • For the runtime part: conditionally blend profile::base::linux510 into the p11n of our newer-hardware cp hosts (we might move towards this for all of them just for consistency, but that may take longer and isn't blocking anything)

Does this sound like a reasonable path and interpretation of the above, and can we get a little help on especially the first bits?

I'll take care of "Create a buster-based 4.19+5.10 boot image " tomorrow.

Thank you!

Change 839627 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.hosts.reimage: support different installers

https://gerrit.wikimedia.org/r/839627

  • Add support for it (it being whatever it takes to switch to 5.10) to the reimage cookbook stuff

@BBlack the above patch should have all that's needed to specify a different installer to the reimage cookbook.
So basically, if the new installer will be named buster-installer.510 then you'll have to pass --pxe-media 'installer.510' to the reimage cookbook to use it.

Change 839627 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.reimage: support different installers

https://gerrit.wikimedia.org/r/839627

Change 840103 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.hosts.dhcp: support different installers

https://gerrit.wikimedia.org/r/840103

Change 840125 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Switch profile::base::linux510 to the new meta package

https://gerrit.wikimedia.org/r/840125

I created a new netinst environment based on the latest buster plus the 5.10.136 Linux kernel under /var/lib/puppet/volatile/tftpboot/buster-installer510. A reimage can be kicked off using

sudo cookbook sre.hosts.reimage $HOSTNAME --os buster --pxe-media installer510
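
Before kicking off a reimage, it may be worth checking that the boot environment directory actually exists. The sketch below assumes the directory name is simply the --os value and the --pxe-media value joined by a hyphen, which is inferred from the path and command above rather than confirmed. The function names and the TFTPBOOT_ROOT override are hypothetical, and the check would have to run on a host that holds the tftpboot tree.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; naming scheme <os>-<pxe-media> is assumed.

# Print the expected boot environment path for an OS and PXE media name.
pxe_env_dir() {
    tftp_root=${TFTPBOOT_ROOT:-/var/lib/puppet/volatile/tftpboot}
    printf '%s/%s-%s\n' "$tftp_root" "$1" "$2"
}

# Succeed only if that directory exists; report the path either way.
check_pxe_env() {
    dir=$(pxe_env_dir "$1" "$2")
    if [ -d "$dir" ]; then
        echo "found $dir"
    else
        echo "missing $dir" >&2
        return 1
    fi
}
```

For the environment described here the call would be `check_pxe_env buster installer510`.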

Currently it doesn't run fully non-interactively yet; a dialogue is prompted: No kernel modules were found. This refers to pulling additional kernel udebs (minified debs used in d-i). These are the packages listed as *-modules* at https://tracker.debian.org/pkg/linux for the main Linux package. For the 5.10 kernel on Buster these udebs are not built, which makes sense since there is no installer with 5.10 for Buster: https://tracker.debian.org/pkg/linux-5.10
The dialogue can simply be ignored by selecting "Yes" via the serial console and d-i proceeds just fine.

I'll poke around in d-i to see how/if we can preseed it to not prompt at all.

Another patch which needs to be merged is https://gerrit.wikimedia.org/r/c/operations/puppet/+/840125. It ensures that the correct meta package gets installed to pull in 5.10. When this has landed, we also need to update the roles of the servers we want to install with Buster+5.10 to include profile::base::linux510 (and, given that we want to stick with 4.19 outside of ulsfo/eqsin, conditionalise this as well via Hiera).

Then we can test the mpt3sas autoloading on one of the new servers and if still necessary puppetise a hack which modprobes mpt3sas in modules/install_server/files/early_command.sh for [ -f /proc/bus/pci/65/00.0 ]

This unblocks us with the installs and in parallel I'll work with the Debian kernel and installer people to hopefully fix autoloading in Bullseye proper (which will trickle into Buster via the rebuilds into src:linux-5.10).

Thanks for the update and for working on this!

Another patch which needs to be merged is https://gerrit.wikimedia.org/r/c/operations/puppet/+/840125. It ensures that the correct meta package gets installed to pull in 5.10. When this has landed, we also need to update the roles of the servers we want to install with Buster+5.10 to include profile::base::linux510 (and, given that we want to stick with 4.19 outside of ulsfo/eqsin, conditionalise this as well via Hiera).

That's a good idea -- I will take care of the conditional part for ulsfo/eqsin so that we can only apply the profile for those sites.

Change 840162 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:base: configure Linux 5.10 on buster via Hiera

https://gerrit.wikimedia.org/r/840162

Change 840103 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.dhcp: support different installers

https://gerrit.wikimedia.org/r/840103

Change 840125 merged by Muehlenhoff:

[operations/puppet@production] Switch profile::base::linux510 to the new meta package

https://gerrit.wikimedia.org/r/840125

Change 841134 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Switch profile::base::linux510 to the new meta package

https://gerrit.wikimedia.org/r/841134

Change 840162 merged by Ssingh:

[operations/puppet@production] P:base: configure Linux 5.10 on buster via Hiera

https://gerrit.wikimedia.org/r/840162

Change 841536 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Switch profile::base::linux510 to the new meta package

https://gerrit.wikimedia.org/r/841536

Change 841536 merged by Muehlenhoff:

[operations/puppet@production] Switch profile::base::linux510 to the new meta package

https://gerrit.wikimedia.org/r/841536

Change 841134 abandoned by Muehlenhoff:

[operations/puppet@production] Switch profile::base::linux510 to the new meta package

Reason:

Different patch was merged

https://gerrit.wikimedia.org/r/841134

Change 841923 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: use Linux 5.10 on cp4045 (buster)

https://gerrit.wikimedia.org/r/841923

Change 841923 merged by Ssingh:

[operations/puppet@production] hiera: use Linux 5.10 on cp4045 (buster)

https://gerrit.wikimedia.org/r/841923

Change 841936 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Install 5.10 in late_setup.sh for next Gen PowerEdges

https://gerrit.wikimedia.org/r/841936

Change 841936 merged by Muehlenhoff:

[operations/puppet@production] Install 5.10 in late_setup.sh for next Gen PowerEdges

https://gerrit.wikimedia.org/r/841936

Change 841952 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp4045: update site.pp and related configs for cp role

https://gerrit.wikimedia.org/r/841952

Thanks to @MoritzMuehlenhoff and @Volans for their help in resolving the buster Linux 5.10 issue!

sukhe@cp4045:~$ uname -r
5.10.0-0.deb10.17-amd64

Change 841952 merged by Ssingh:

[operations/puppet@production] cp4045: update site.pp and related configs for cp role

https://gerrit.wikimedia.org/r/841952

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4045.ulsfo.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4045.ulsfo.wmnet with OS buster executed with errors:

  • cp4045 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4045.ulsfo.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4045.ulsfo.wmnet with OS buster executed with errors:

  • cp4045 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4045.ulsfo.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4045.ulsfo.wmnet with OS buster executed with errors:

  • cp4045 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

@MoritzMuehlenhoff wrote:
Currently it doesn't run fully non-interactively yet; a dialogue is prompted: No kernel modules were found. This refers to pulling additional kernel udebs (minified debs used in d-i). These are the packages listed as *-modules* at https://tracker.debian.org/pkg/linux for the main Linux package. For the 5.10 kernel on Buster these udebs are not built, which makes sense since there is no installer with 5.10 for Buster: https://tracker.debian.org/pkg/linux-5.10
The dialogue can simply be ignored by selecting "Yes" via the serial console and d-i proceeds just fine.

I did some spelunking in the various d-i git repos, but unfortunately there is no preseed setting which would mitigate this; the detection of the missing kernel modules happens within anna (a minified version of apt and one of the few d-i components written in C) and there is no option to override the check (or have it ignore the outcome). The check _is_ actually disabled on Hurd/GNU Mach kernels, but we can't easily rebuild d-i either since the udeb for anna is fetched from the Debian repository during early boot.

So unless the Traffic team wants to move the caches/LVSes to Hurd we'll have to live with prompting the dialogue over the serial console when installing buster with Linux 5.10 :-)

Just as an additional datapoint: if you connect to the console and answer the question while the cookbook is running, it will happily continue once d-i has finished its job, as long as it's within the cookbook's polling timeout.

Is it possible to fake this out with a bunch of trivially-built empty udebs that are in our repo? Or does it have to come straight from debian?

Thanks for looking into it and the additional context!

On the Traffic side, the image + cookbook patch is working for us. The only issue being -- and I am curious if it was just me or this is a known issue -- is that because the setup process depends on the connection to the serial console, it is somewhat brittle. Yesterday, twice, the cookbook stalled because it was waiting for the host to come up after the reboot and the host never did.

Enumerating Boot options...
Enumerating Boot options... Done
Loading Lifecycle Controller Drivers...
Loading Lifecycle Controller Drivers...Done
Lifecycle Controller: Collecting System Inventory...
iDRAC IPV4:  10.128.128.14 [Dedicated]
Lifecycle Controller: Done
Booting...

I did a manual reboot of the host though and it came back up just fine. This makes me wonder if the installer was waiting for me to "press yes" and I lost the output from the serial console for that part.

Anyway, a subsequent run later that night with the exact same commands and no other changes worked without any issues, so I am not sure if it was a one-off (two, rather) or some other issue with keeping the connection to the serial console and its output being rendered for us to press "yes". If this happens again, I will update this ticket.

Is it possible to fake this out with a bunch of trivially-built empty udebs that are in our repo? Or does it have to come straight from debian?

Not easily: it fetches the udebs from the Debian mirror (which is our unmodified Debian mirror, mirrors.wikimedia.org), and we can't instruct it to fetch the udebs from apt.wikimedia.org instead.

On the Traffic side, the image + cookbook patch is working for us. The only issue being -- and I am curious if it was just me or this is a known issue -- is that because the setup process depends on the connection to the serial console, it is somewhat brittle. Yesterday, twice, the cookbook stalled because it was waiting for the host to come up after the reboot and the host never did.

Maybe I misunderstand, but the cookbook waits for completion of the reboot, so what made the host not come back up? Was that because you initiated the serial console connection after the server had already prompted the UI question?

Change 849202 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] cp4038: update site.pp and related configs for cp role

https://gerrit.wikimedia.org/r/849202

Change 849202 merged by Ssingh:

[operations/puppet@production] cp4038: update site.pp and related configs for cp role

https://gerrit.wikimedia.org/r/849202

I think we can close this task and mark it as resolved. The original purpose for which this was required has now been met and future reimages for the cp hosts are on bullseye so this won't be required.

Marking as resolved; if you disagree, please feel free to reopen.