Add support for Broadcom RAID controllers using storcli
Closed, Resolved (Public)

Description

In T391854 we tested a new Broadcom RAID controller using storcli instead of perccli. To use this in production we need to

  • extend modules/raid/lib/facter/raid.rb. One complication: our test device from T391854 uses the same PCI ID as the systems we currently run with perccli (the name in lspci is also identical, so the hardware is probably very similar under the hood). Given that the current matching operates solely on the PCI ID, we'll need an additional Hiera flag to opt in to the use of storcli over perccli for selected servers
  • add a modules/raid/manifests/storcli.pp class which installs storcli
  • set up monitoring similar to what we do for existing RAID controllers
  • adapt the tooling which opens Phabricator tasks on RAID failures to also cover storcli
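To make the first bullet concrete, here is a minimal Python sketch of the selection logic, assuming a per-host boolean opt-in. The names `select_raid_cli`, `BROADCOM_PCI_IDS` and `use_storcli` are hypothetical and are not taken from raid.rb (which is Ruby); the PCI ID is the one quoted for the perccli systems further down in this task.

```python
# Hypothetical sketch: none of these names come from raid.rb.
# "100010e2" is the PCI ID this task quotes for the current perccli systems.
BROADCOM_PCI_IDS = {"100010e2"}


def select_raid_cli(pci_ids, use_storcli=False):
    """Return the CLI to use for a host, or None if no known controller is found.

    pci_ids:     iterable of vendor+device strings such as "100010e2"
    use_storcli: per-host opt-in (the Hiera flag proposed above, name assumed)
    """
    if not BROADCOM_PCI_IDS & set(pci_ids):
        # No matching controller, so neither CLI applies.
        return None
    # Same PCI ID for both tools: only the opt-in flag can tell them apart.
    return "storcli" if use_storcli else "perccli"
```

The point of the flag is visible in the last line: since the PCI ID alone cannot distinguish the two cases, the default stays perccli and only the selected hosts switch.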

Event Timeline

My 2c: before starting we should decide which controller we want to use, because in T391854 it seems that we may be leaning towards buying the new one and upgrading all the Supermicro hosts that we have for ms-be. If so, the PCI ID will be different, so there will be no clash with the existing SAS ones.

But does it really have a new one? The controller in ms-be1091 has the same PCI ID as the existing perccli systems, doesn't it?

Maybe I got the wrong PCI via lspci, but I see:

elukey@ms-be1091:~$ lspci -nn | grep -i sas
98:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI Fusion-MPT 12GSAS/PCIe Secure SAS38xx [1000:00e6]

vs

elukey@ms-be1090:~$ lspci -nn | grep -i sas
98:00.0 RAID bus controller [0104]: Broadcom / LSI MegaRAID 12GSAS/PCIe Secure SAS39xx [1000:10e2]

In raid.rb we have 100010e2 for perccli, while the new controller would need 100000e6. Am I missing something?
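For reference, the mapping from the `lspci -nn` output above to the IDs used in raid.rb can be sketched as follows. This is only an illustration in Python; the helper name `pci_id_from_lspci_line` is made up here and is not part of the existing tooling.

```python
import re

# `lspci -nn` prints the device class as [0107] and the vendor:device pair as
# [1000:00e6]; only the latter contains a colon, so this pattern skips the class.
PCI_ID_RE = re.compile(r"\[([0-9a-fA-F]{4}):([0-9a-fA-F]{4})\]")


def pci_id_from_lspci_line(line):
    """Return vendor+device as one lowercase string (e.g. "100010e2"), or None."""
    match = PCI_ID_RE.search(line)
    if match is None:
        return None
    return (match.group(1) + match.group(2)).lower()


ms_be1091 = ("98:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI "
             "Fusion-MPT 12GSAS/PCIe Secure SAS38xx [1000:00e6]")
ms_be1090 = ("98:00.0 RAID bus controller [0104]: Broadcom / LSI MegaRAID "
             "12GSAS/PCIe Secure SAS39xx [1000:10e2]")

print(pci_id_from_lspci_line(ms_be1091))  # 100000e6
print(pci_id_from_lspci_line(ms_be1090))  # 100010e2
```

With the two lines pasted above, the IDs differ, which is what the question hinges on.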

Volans triaged this task as Medium priority. May 5 2025, 2:20 PM

Today I was reviewing the alerts for perccli-related nagios checks, and I found non-ms-be nodes that will likely keep the current controller:

  1. db2243, db1257 - Supermicro config E hosts
  2. backup1012, backup2012 - Supermicro config J hosts, but without JBOD requirements.

For those we'll need to implement what Moritz was suggesting via the Hiera flag, so I'll get started on the work.

Change #1142518 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] raid: update facter and get-raid-status to allow storcli

https://gerrit.wikimedia.org/r/1142518

Change #1142518 merged by Elukey:

[operations/puppet@production] raid: update facter and get-raid-status to allow storcli

https://gerrit.wikimedia.org/r/1142518

Change #1142978 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] raid::broadcom: fix perccli package name

https://gerrit.wikimedia.org/r/1142978

Change #1142978 merged by Elukey:

[operations/puppet@production] raid::broadcom: fix perccli package name

https://gerrit.wikimedia.org/r/1142978

Change #1143023 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] raid: allow OK in general state for get-raid-status-broadcom.py

https://gerrit.wikimedia.org/r/1143023

Change #1143023 merged by Elukey:

[operations/puppet@production] raid: allow OK in general state for get-raid-status-broadcom.py

https://gerrit.wikimedia.org/r/1143023

Change #1143026 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] raid: fix get-raid-status-broadcom.py script

https://gerrit.wikimedia.org/r/1143026

Change #1143026 merged by Elukey:

[operations/puppet@production] raid: fix get-raid-status-broadcom.py script

https://gerrit.wikimedia.org/r/1143026

Change #1143052 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] icinga: update raid_handler.py with 'broadcom' instead of 'perccli'

https://gerrit.wikimedia.org/r/1143052

Change #1143052 merged by Elukey:

[operations/puppet@production] icinga: update raid_handler.py with 'broadcom' instead of 'perccli'

https://gerrit.wikimedia.org/r/1143052

elukey claimed this task.

Summary:

  • Renamed the perccli nagios check to the more generic "broadcom", which uses storcli where available and perccli otherwise.
  • Added some puppet code to install storcli if the manufacturer fact is Supermicro.
  • Fixed raid_handler.py to support 'broadcom' instead of 'perccli'.
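As an illustration of the "storcli where available, perccli otherwise" behaviour, a minimal Python sketch could look like the one below. It is not the code from get-raid-status-broadcom.py: the function name `pick_broadcom_cli` and the bare binary names `storcli` and `perccli` are assumptions (the installed executables may be named or located differently).

```python
import shutil


def pick_broadcom_cli(candidates=("storcli", "perccli"), which=shutil.which):
    """Return the path of the first candidate CLI found on PATH, or None.

    candidates: binary names in order of preference (assumed names)
    which:      lookup function, injectable so the logic can be tested
                without either tool being installed
    """
    for name in candidates:
        path = which(name)
        if path:
            return path
    return None
```

Usage would be along the lines of `cli = pick_broadcom_cli()`, with the check reporting an error if the result is `None`. Keeping the preference order in one tuple means hosts with only perccli installed keep working unchanged.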

Since the nagios check is now generic rather than perccli-specific, I don't believe that any more actions are needed. Please reopen if necessary!