Spicerack: expand Supermicro support in the Redfish module
Closed, Resolved (Public)

Description

Right now the Redfish support for Supermicro in Spicerack is minimal and doesn't have dedicated support for configuring BIOS, BMC, network cards and PXE. It also lacks any specific support for firmware upgrades.

With the introduction of Supermicro hosts in the fleet we need to start expanding our support, so that the sre.hosts.provision and sre.hardware.upgrade-firmware cookbooks can also handle Supermicro hosts.

Implementation details to be investigated and discussed.

Details

Related Changes in Gerrit:
Repo | Branch | Lines +/-
operations/cookbooks | master | +9 -11
operations/cookbooks | master | +2 -1
operations/cookbooks | master | +395 -271
operations/cookbooks | master | +8 -0
operations/cookbooks | master | +64 -32
operations/cookbooks | master | +4 -3
operations/cookbooks | master | +1 -1
operations/cookbooks | master | +15 -4
operations/cookbooks | master | +55 -32
operations/cookbooks | master | +27 -2
operations/cookbooks | master | +32 -20
operations/cookbooks | master | +4 -3
operations/cookbooks | master | +1 -0
operations/cookbooks | master | +145 -38
operations/software/spicerack | master | +21 -7
operations/software/spicerack | master | +6 -5
operations/software/spicerack | master | +8 -5
operations/software/spicerack | master | +8 -5
operations/software/spicerack | master | +92 -4
operations/software/netbox-extras | master | +27 -5
operations/software/spicerack | master | +108 -14
operations/cookbooks | master | +12 -14
operations/cookbooks | master | +87 -50
operations/software/spicerack | master | +38 -79
operations/software/spicerack | master | +28 -0
operations/software/spicerack | master | +63 -18

Event Timeline

IIUC we are missing DHCP option 12 (Hostname) from the BMC's DHCP client. On Dell hosts we expect something like:

Hostname Option 12, length 13: “idrac-ABC1234”

For Supermicro, this is still unknown. I'll try to explicitly reset sretest1001's BMC (Dell) to factory defaults, to double check whether Dell BMCs still send option 12 after a reset or not.

Mentioned in SAL (#wikimedia-operations) [2024-06-13T12:39:18Z] <elukey> reset BIOS/BMC to factory default on sretest1001 - T365372

I can confirm that sretest1001's BMC sends this:

DHCP-Message (53), length 1: Discover
Hostname (12), length 13: "idrac-XXXXX"
Vendor-Class (60), length 5: "iDRAC"

So it seems that Supermicro's BMC, by default, doesn't do it :(

From https://www.supermicro.com/support/faqs/faq.cfm?faq=24257 it seems as if Supermicro's BMC sends the Hostname option only if a value is provided by the admin.

Note to self: this is an example of the snippet generated by the provision cookbook to instruct the DHCP server to assign an IP to iDRAC mgmt interfaces:

elukey@install1004:~$ cat /etc/dhcp/automation/proxies/mgmt-eqiad.conf
# Automatically generated by dhcpincludes for /etc/dhcp/automation/mgmt-eqiad/
include "/etc/dhcp/automation/mgmt-eqiad/sretest1001.mgmt.eqiad.wmnet.conf";

elukey@install1004:~$ cat /etc/dhcp/automation/mgmt-eqiad/sretest1001.mgmt.eqiad.wmnet.conf

class "sretest1001.mgmt.eqiad.wmnet" {
    match if (lcase(option host-name) = "idrac-XXXXXXX");
}
pool {
    allow members of "sretest1001.mgmt.eqiad.wmnet";
    range 10.65.1.13 10.65.1.13;
}
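
A minimal sketch of how a class/pool pair like the one above could be rendered, to make the matching logic explicit. The helper name render_mgmt_dhcp_snippet and its parameters are hypothetical and this is not the spicerack implementation; it only reproduces the layout shown in the generated file, matching on DHCP option 12.

```python
def render_mgmt_dhcp_snippet(fqdn: str, bmc_hostname: str, ipv4: str) -> str:
    """Render a dhcpd class/pool pair that pins one address to one BMC.

    Hypothetical helper for illustration only: the BMC is identified by the
    Hostname (option 12) value it sends, compared case-insensitively.
    """
    return (
        f'class "{fqdn}" {{\n'
        # lcase() lowercases the received option, so compare with lowercase.
        f'    match if (lcase(option host-name) = "{bmc_hostname.lower()}");\n'
        "}\n"
        "pool {\n"
        f'    allow members of "{fqdn}";\n'
        # A single-address range means only this BMC can get this IP.
        f"    range {ipv4} {ipv4};\n"
        "}\n"
    )


if __name__ == "__main__":
    print(render_mgmt_dhcp_snippet(
        "sretest1001.mgmt.eqiad.wmnet", "idrac-ABC1234", "10.65.1.13"))
```

This only works if the BMC actually sends option 12, which is exactly what Supermicro BMCs don't do by default.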

Change #1043804 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/software/spicerack@master] redfish: add property for storage manager URI

https://gerrit.wikimedia.org/r/1043804

Change #1043804 merged by jenkins-bot:

[operations/software/spicerack@master] redfish: add property for storage manager URI

https://gerrit.wikimedia.org/r/1043804

Change #1046734 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] redfish: simplify interface of Redfish classes

https://gerrit.wikimedia.org/r/1046734

Change #1046734 merged by jenkins-bot:

[operations/software/spicerack@master] redfish: simplify interface of Redfish classes

https://gerrit.wikimedia.org/r/1046734

Current status:

  • We are following up with Supermicro to customize the default root password for the BMC (at the moment it is unique to each server), and we are trying to push them to ship a different default mgmt DHCP config (similar to what Dell has, namely sending a proper Hostname field in DHCPREQUEST packets).
  • In T365167#9932231 Papaul mentioned a licensing issue: in order to use Redfish we need a special license on the servers. It is unclear how the license is applied (whether provision will have to do it or not); more details will hopefully follow soon.

The task is basically blocked at least until we understand the licensing problem, but IIUC Papaul and Willy are already working on it.

Things to decide:

  • In the near future we'll receive the 10-20 hosts that we have already ordered, and of course they will not have any of the changes that we requested. So we'll have two possible code paths for the provision cookbook, and we'll need to decide what to support. For example, if we want to support these nodes and their default DHCP config is not usable by our spicerack automation, we'll have to match on the MAC address for this special use case.
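
For hosts whose BMC doesn't send a Hostname option, a MAC-based match could look like the sketch below. It renders a standard ISC dhcpd host declaration (hardware ethernet plus fixed-address); the helper name, the validation and the example MAC are assumptions for illustration, and this is not necessarily what spicerack's DHCP module generates.

```python
import re

# Six colon-separated lowercase hex octets.
MAC_RE = re.compile(r"^([0-9a-f]{2}:){5}[0-9a-f]{2}$")


def render_mac_dhcp_snippet(fqdn: str, mac: str, ipv4: str) -> str:
    """Render a dhcpd host stanza that pins an IP to a MAC address.

    Hypothetical helper: accepts ':' or '-' separated MACs in any case.
    """
    normalized = mac.strip().lower().replace("-", ":")
    if not MAC_RE.match(normalized):
        raise ValueError(f"invalid MAC address: {mac!r}")
    return (
        f"host {fqdn} {{\n"
        f"    hardware ethernet {normalized};\n"
        f"    fixed-address {ipv4};\n"
        "}\n"
    )


if __name__ == "__main__":
    # The MAC below is a made-up example value.
    print(render_mac_dhcp_snippet(
        "sretest2001.mgmt.codfw.wmnet", "AA-BB-CC-DD-EE-FF", "10.193.2.198"))
```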

Change #1052311 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/software/spicerack@master] redfish: add the add_account function

https://gerrit.wikimedia.org/r/1052311

It seems clear that for the foreseeable future (next 6-8 months) we will not have the DHCP hostname configured on Supermicro servers. The cleanest option is to use the MAC address, but since we don't save it in Netbox we'll need to find a way to include it where needed.

Two options:

The DCops team asked, if possible, to implement the first option. Usually they run the network provision script when they are not in the DC, whereas they add the device when they are in the DC (so they have all the info available).

@ayounsi We started a chat in #dcops about this, lemme know your preference, or if there is a better way to do it.

I'd suggest abstracting the device creation behind a custom script or cookbook. This could run additional safeguards, ask for additional information (and store it where necessary), and hide the unnecessary fields.

The first option unfortunately goes against proper modelling of our infrastructure.

There is possibly a variant of option 1:

  • Create a new custom script to add devices, which has a field for the MAC address
  • Have that script use the entered data to both add the device, and create the mgmt interface on it, setting the MAC on it.

Overall I don't feel strongly. If it's more convenient for dc-ops to enter the MAC when adding the device, rather than when adding its network link, then let's capture it at that point. All else being equal I do think it's better to store the mgmt int MAC on the mgmt int, rather than in a custom field, but not a big deal if we use a custom field either.

@Papaul the proposal that would be the best compromise is to add a "mgmt mac-address" field to https://netbox.wikimedia.org/extras/scripts/provision_server.ProvisionServerNetwork, rather than doing it when adding the device. IIRC your proposal was to copy/paste the mgmt MAC address into the racking task, so that it would be available to DCops even when working outside the DC. Is this something that we can agree on? If so I'll proceed with the implementation :)

Next steps:

Change #1057927 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.hosts.provision: fix dell_config_changes

https://gerrit.wikimedia.org/r/1057927

Change #1057927 merged by Elukey:

[operations/cookbooks@master] sre.hosts.provision: fix dell_config_changes

https://gerrit.wikimedia.org/r/1057927

Change #1052311 merged by jenkins-bot:

[operations/software/spicerack@master] redfish: add the add_account function

https://gerrit.wikimedia.org/r/1052311

Change #1057826 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/software/netbox-extras@master] provision_server.py: add mac address to network provision script

https://gerrit.wikimedia.org/r/1057826

Change #1057826 merged by jenkins-bot:

[operations/software/netbox-extras@master] provision_server.py: add mac address to network provision script

https://gerrit.wikimedia.org/r/1057826

Updates:

  • The Netbox custom script for network provisioning now asks for a MAC address (for the mgmt interface), mandatory for every Supermicro host.
  • spicerack's redfish module is now able to create admin users in the BMC (only for Supermicro).

Next steps:

Change #1060854 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/software/spicerack@master] dhcp: allow empty distro for DHCPConfMac and DHCPConfOpt82

https://gerrit.wikimedia.org/r/1060854

Encountered an issue with the BMC's network config:

supermicro_mgmt_network_changes = {
    "HostName": "sretest2001",
    "FQDN": "sretest2001",
    "DHCPv4": {
        "DHCPEnabled": False,
        "FallbackAddress": "None",
        "UseDNSServers": False,
        "UseDomainName": False,
        "UseGateway": False,
        "UseNTPServers": False,
        "UseStaticRoutes": False
    },
    "IPv4Addresses": {
        "Address": "10.193.2.198",
        "AddressOrigin": "Static",
        "Gateway": "10.193.0.1",
        "SubnetMask": "255.255.0.0"
    },
    "NameServers": ["10.3.0.1"],
    "StatelessAddressAutoConfig": {
        'IPv6AutoConfigEnabled': False
    }
}

If I try to patch the /redfish/v1/Managers/1/EthernetInterfaces/1 URI, the response says that some properties are read-only: DHCPv4, IPv4Addresses, NameServers. Some others, like HostName, work if patched alone.

From https://www.supermicro.com/support/faqs/faq.cfm?faq=33898 it seems that this is supposed to work, but the post is a little old so something might have changed.

EDIT: After checking the live config, I found some undocumented Static versions of the attributes that I want to modify. This is what I came up with:

supermicro_mgmt_network_changes = {
    "HostName": "sretest2001",
    "FQDN": "sretest2001",
    "IPv4StaticAddresses": [{
        "Address": "10.193.2.198",
        "Gateway": "10.193.0.1",
        "SubnetMask": "255.255.0.0"
    }],
    "StaticNameServers": ["10.3.0.1"],
    "StatelessAddressAutoConfig": {
        "IPv6AutoConfigEnabled": False
    }
}

DHCP has no static option to set, so I guess we can skip it.
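
To make the shape of the request concrete, here is a sketch that builds (but does not send) the PATCH for the Static attributes using only the standard library. The function name and the base URL are hypothetical; authentication, TLS settings and error handling are left out on purpose, and the real cookbook goes through spicerack's redfish module rather than urllib.

```python
import json
import urllib.request


def build_bmc_network_patch(base_url, hostname, address, gateway, netmask, nameservers):
    """Build a PATCH request for the BMC EthernetInterface resource.

    Hypothetical helper: only the writable "Static" attributes are set,
    the read-only DHCPv4/IPv4Addresses/NameServers keys are left out.
    """
    payload = {
        "HostName": hostname,
        "FQDN": hostname,
        "IPv4StaticAddresses": [
            {"Address": address, "Gateway": gateway, "SubnetMask": netmask}
        ],
        "StaticNameServers": list(nameservers),
        "StatelessAddressAutoConfig": {"IPv6AutoConfigEnabled": False},
    }
    return urllib.request.Request(
        url=f"{base_url}/redfish/v1/Managers/1/EthernetInterfaces/1",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


if __name__ == "__main__":
    req = build_bmc_network_patch(
        "https://bmc.example", "sretest2001",
        "10.193.2.198", "10.193.0.1", "255.255.0.0", ["10.3.0.1"])
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the payload before touching a real BMC.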

Change #1060854 merged by jenkins-bot:

[operations/software/spicerack@master] dhcp: allow empty distro for DHCPConfMac and DHCPConfOpt82

https://gerrit.wikimedia.org/r/1060854

Change #1070217 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/software/spicerack@master] redfish: introduce the AccountManager URI for DELL

https://gerrit.wikimedia.org/r/1070217

Change #1070217 merged by jenkins-bot:

[operations/software/spicerack@master] redfish: introduce the AccountManager URI for DELL

https://gerrit.wikimedia.org/r/1070217

Change #1070263 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/software/spicerack@master] redfish: catch no-json-responses in change_user_password

https://gerrit.wikimedia.org/r/1070263

Change #1070263 merged by jenkins-bot:

[operations/software/spicerack@master] redfish: catch no-json-responses in change_user_password

https://gerrit.wikimedia.org/r/1070263

Change #1070868 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/software/spicerack@master] redfish: allow 200 responses in chassis_reset

https://gerrit.wikimedia.org/r/1070868

Change #1070907 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/software/spicerack@master] tests: add more tests for Redfish's module change user

https://gerrit.wikimedia.org/r/1070907

Change #1070868 merged by jenkins-bot:

[operations/software/spicerack@master] redfish: allow 200 responses in chassis_reset

https://gerrit.wikimedia.org/r/1070868

Change #1070907 merged by jenkins-bot:

[operations/software/spicerack@master] tests: add more tests for Redfish's module change user

https://gerrit.wikimedia.org/r/1070907

I've released spicerack 8.13.0, which collects the latest changes for the redfish module, and installed it on cumin2002. The cookbook seems ready to go (https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1037806) but I'd like to test it on sretest2001 first. I have factory-reset it, but now I think it is missing the Redfish license, so I need to wait for DCops to redeploy it.

Change #1037806 merged by Elukey:

[operations/cookbooks@master] sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro

https://gerrit.wikimedia.org/r/1037806

Great news, the first version of the Supermicro support in provision is live on cumin nodes (namely the cookbook now supports it).

Change #1071553 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.hosts.provision: enable virtualization for Supermicro

https://gerrit.wikimedia.org/r/1071553

Change #1071913 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/cookbooks@master] sre.hosts.provision: Fix --no-users

https://gerrit.wikimedia.org/r/1071913

Change #1071913 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.provision: Fix --no-users

https://gerrit.wikimedia.org/r/1071913

Nasty issue found for sretest2001: T365167#10140713

In the provision cookbook we loop through the NICs and look for the one with link status up, setting it as the default PXE NIC to use. In this case Redfish for Supermicro doesn't return any usable value to us, so our logic cannot be applied. It is unclear where the problem lies; we'll have to check more hosts to confirm.
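
A rough sketch of that selection logic, assuming each NIC is a dict shaped like a Redfish EthernetInterface resource with Id and LinkStatus fields. The function name and the sample data are made up, and the real cookbook code differs.

```python
from typing import Optional


def pick_pxe_nic(nics: list[dict]) -> Optional[str]:
    """Return the Id of the first NIC reporting LinkUp, or None.

    Hypothetical helper: None means that no NIC reported a usable link
    status, which is the failure mode described above.
    """
    for nic in nics:
        if nic.get("LinkStatus") == "LinkUp":
            return nic.get("Id")
    return None


if __name__ == "__main__":
    # Made-up sample data: the second NIC has link, the first does not.
    healthy = [
        {"Id": "1", "LinkStatus": "NoLink"},
        {"Id": "2", "LinkStatus": "LinkUp"},
    ]
    # Made-up sample data: no usable LinkStatus is reported at all.
    broken = [{"Id": "1", "LinkStatus": None}, {"Id": "2"}]
    print(pick_pxe_nic(healthy))
    print(pick_pxe_nic(broken))
```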

Change #1072553 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.hosts.provision: refactor _config_dell_pxe()

https://gerrit.wikimedia.org/r/1072553

Follow-up on the PXE NIC issue found on sretest2001 (T365167#10140713): after a chat with Papaul, it seems that this is an issue with 10G NICs that is also present on Dell hosts :(

Change #1071553 merged by Elukey:

[operations/cookbooks@master] sre.hosts.provision: improve Supermicro's bios settings

https://gerrit.wikimedia.org/r/1071553

Change #1072553 merged by Elukey:

[operations/cookbooks@master] sre.hosts.provision: refactor _config_dell_pxe()

https://gerrit.wikimedia.org/r/1072553

Cross-posting from T365167#10148384, where I am testing a reimage for sretest2001.

On sretest2001 we have 10G/25G capable NICs that are listed in Redfish with:

"RSC_WR_6SLOT1PCI_E4_0X16OPROM": "EFI"
"RSC_W_66G4SLOT1PCI_E4_0X16OPROM": "EFI"
"RSC_W_66G4SLOT2PCI_E4_0X16OPROM": "EFI"

As Jenn pointed out:

The RSC-W-66G4 is a riser for PCIe card and the 10G SFP card is on that one. So the options for it are in there.

I've also discovered this guide for the fixed boot order sequence, which is not available for hosts older than X13 (all the ones that we have ordered up to now IIUC, maybe except the ML ones). In any case, even if it were available, we wouldn't be able to use it, since it appears to me that the NICs are all set to use EFI by default (unless we specifically set Legacy).

I am not able to check other Supermicro nodes at the moment (namely, their Redfish values) since we are still waiting for the licenses, but I can think of a workaround that could make things work: the provision cookbook could loop through all the BIOS key/value pairs and, whenever it finds a value of EFI, add the related key to the ones to modify, with the value Legacy. The provision cookbook would then enable PXE on all ports, not only on the one that we care about: a downside, but not a terrible one imho.
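
The workaround boils down to something like this sketch. The function name is hypothetical and the input is assumed to be the flat BIOS attributes dict as returned by Redfish; the keys in the example are the ones listed above for sretest2001.

```python
def efi_to_legacy_changes(bios_attributes: dict) -> dict:
    """Return the BIOS changes that flip every EFI option to Legacy.

    Hypothetical helper: every attribute whose current value is exactly
    "EFI" is selected, regardless of which port or slot it refers to.
    """
    return {
        key: "Legacy"
        for key, value in bios_attributes.items()
        if value == "EFI"
    }


if __name__ == "__main__":
    current = {
        "RSC_WR_6SLOT1PCI_E4_0X16OPROM": "EFI",
        "RSC_W_66G4SLOT1PCI_E4_0X16OPROM": "EFI",
        "RSC_W_66G4SLOT2PCI_E4_0X16OPROM": "EFI",
        "SerialPort2Attribute": "SOL",
    }
    for key, value in efi_to_legacy_changes(current).items():
        print(key, "->", value)
```

An empty result means nothing needs to change, which is also a cheap way to skip an unnecessary reboot.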

Change #1073249 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.hosts.provision: add PXE settings for Supermicro

https://gerrit.wikimedia.org/r/1073249

Change #1073249 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.provision: add PXE settings for Supermicro

https://gerrit.wikimedia.org/r/1073249

Change #1078439 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.hosts.provision: avoid a reboot if BIOS settings are already good

https://gerrit.wikimedia.org/r/1078439

Some notes:

ml-serve* Supermicro nodes are AMD CPU based, so some BIOS settings don't apply to them. At the moment the following shouldn't be tried (otherwise the Redfish call will fail):

"SerialPort2Attribute": "SOL",
"IntelVirtualizationTechnology": "Enable"

The only clear way forward is to keep, in the provision cookbook, a list of the Netbox device types that run AMD CPUs, and to skip these options in those cases.
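
As a sketch of that idea: the slug in AMD_DEVICE_TYPE_SLUGS is a placeholder (not a real Netbox device type), and the names are assumptions rather than the cookbook's actual code.

```python
# Placeholder value: the real list would contain Netbox device type slugs.
AMD_DEVICE_TYPE_SLUGS = frozenset({"example-amd-device-type"})

# Settings that make the Redfish call fail on AMD-based hosts.
SETTINGS_SKIPPED_ON_AMD = {
    "SerialPort2Attribute": "SOL",
    "IntelVirtualizationTechnology": "Enable",
}


def bios_settings_for(device_type_slug: str, common_settings: dict) -> dict:
    """Return the BIOS settings to apply for a given Netbox device type.

    Hypothetical helper: AMD device types only get the common settings,
    all the others also get the settings that AMD hosts reject.
    """
    settings = dict(common_settings)
    if device_type_slug not in AMD_DEVICE_TYPE_SLUGS:
        settings.update(SETTINGS_SKIPPED_ON_AMD)
    return settings


if __name__ == "__main__":
    common = {"BootSelect": "Legacy"}
    print(bios_settings_for("example-amd-device-type", common))
    print(bios_settings_for("example-intel-device-type", common))
```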

Need also to test https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1078439, but it is not blocking anything on the DC Ops side.

Another test to run: on 10G hosts, DCops reported that OnboardLAN2OptionROM needed to be disabled to make everything boot. I want to test via Redfish whether this is still the case.

Change #1078613 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.hosts.provision: vary BIOS settings for Supermicro

https://gerrit.wikimedia.org/r/1078613

Change #1078439 merged by Elukey:

[operations/cookbooks@master] sre.hosts.provision: avoid a reboot if BIOS settings are already good

https://gerrit.wikimedia.org/r/1078439

Change #1078613 merged by Elukey:

[operations/cookbooks@master] sre.hosts.provision: vary BIOS settings for Supermicro

https://gerrit.wikimedia.org/r/1078613

Change #1078636 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.hosts.provision: fix self.device_model_slug

https://gerrit.wikimedia.org/r/1078636

Change #1078636 merged by Elukey:

[operations/cookbooks@master] sre.hosts.provision: fix self.device_model_slug

https://gerrit.wikimedia.org/r/1078636

The new version of the cookbook is deployed, I am running it on insetup hosts listed in T376121 so we can apply the same canonical config before they reach production traffic.

Change #1078667 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.hosts.provision: fix supermicro amd virtualization settings

https://gerrit.wikimedia.org/r/1078667

Change #1078667 merged by Elukey:

[operations/cookbooks@master] sre.hosts.provision: fix supermicro amd virtualization settings

https://gerrit.wikimedia.org/r/1078667

Change #1078726 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.hosts.provision: avoid Redfish calls before DHCP for Supermicro

https://gerrit.wikimedia.org/r/1078726

Found a new interesting issue when running the provision cookbook for mc-misc2001:

"Message":"The value 'null' for the property P1_AIOMAOC_AG_i2LAN1OPROM is of a different type than the property can accept.

So far IIUC the message is a bit misleading, since it means "the value that you provided is not allowed". I jumped to the BIOS settings, but before doing that I had to battle a bit with UEFI, since by default I was ending up in the UEFI Shell (TIL: "exit" issued in the EFI Shell brought me to the BIOS).

Anyway, the BIOS looked a bit different (options-wise) from what I was used to working with, and indeed P1_AIOMAOC_AG_i2LAN1OPROM listed only two values ("EFI" or "Disable"). The weird thing was that none of the other options allowed "Legacy" anymore either.

I went to the Boot panel and forced "Legacy" in BootSelect, and all of a sudden all the options started to offer "Legacy" again. The funny thing is that we do force BootSelect: Legacy in the cookbook, but I believe that some Supermicros are shipped with UEFI-only settings by default, and even the options in Redfish are consistent with that.

The workaround that I have in mind is the following:

  • Send a PATCH request to Redfish that sets only BootSelect.
  • Wait 3 mins for the chassis reset.
  • Then set the other options that we manage (at that point they will accept the Legacy value).
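
The steps above can be sketched as follows. patch_bios and reset_chassis are hypothetical callables standing in for the real Redfish calls, and sleep is injectable so the flow can be tried without waiting; whether a further reset is needed to apply the second batch is left to the caller.

```python
import time
from typing import Callable


def apply_bios_in_two_phases(
    patch_bios: Callable[[dict], None],
    reset_chassis: Callable[[], None],
    settings: dict,
    wait_seconds: int = 180,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Apply BootSelect first, reset, then apply the remaining settings.

    Hypothetical sketch: the other options only accept "Legacy" once
    BootSelect has been switched and the host has gone through a reset.
    """
    first = {k: v for k, v in settings.items() if k == "BootSelect"}
    rest = {k: v for k, v in settings.items() if k != "BootSelect"}
    if first:
        patch_bios(first)
        reset_chassis()
        # Give the host time to come back with the new option set.
        sleep(wait_seconds)
    if rest:
        patch_bios(rest)


if __name__ == "__main__":
    apply_bios_in_two_phases(
        patch_bios=lambda changes: print("PATCH", changes),
        reset_chassis=lambda: print("chassis reset"),
        settings={"BootSelect": "Legacy", "P1_AIOMAOC_AG_i2LAN1OPROM": "Legacy"},
        sleep=lambda seconds: print(f"waiting {seconds}s (skipped in this demo)"),
    )
```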

Very interesting, since it will be useful when we provide the UEFI options as well.

Change #1078726 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.provision: avoid Redfish calls before DHCP for Supermicro

https://gerrit.wikimedia.org/r/1078726

Change #1078961 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.hosts.provision: warn when the BMC firmware is old for Supermicro

https://gerrit.wikimedia.org/r/1078961

The last issue worth reporting is T371416#10214548. The backup1012 host seems to have a very old firmware, from 2022, that doesn't accept some of the BMC network configs that work for the rest. I filed https://gerrit.wikimedia.org/r/1078961 to warn users, but we should upgrade the firmware on it and see if that fixes the problem.

Change #1078961 merged by Elukey:

[operations/cookbooks@master] sre.hosts.provision: warn when the BMC firmware is old for Supermicro

https://gerrit.wikimedia.org/r/1078961

Change #1080456 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.hosts.provision: first refactor with vendor-specific classes

https://gerrit.wikimedia.org/r/1080456

Change #1081901 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.hosts.provision: raise RuntimeError if Redfish returns an error

https://gerrit.wikimedia.org/r/1081901

Change #1080456 merged by Elukey:

[operations/cookbooks@master] sre.hosts.provision: first refactor with vendor-specific classes

https://gerrit.wikimedia.org/r/1080456

Change #1081901 merged by Elukey:

[operations/cookbooks@master] sre.hosts.provision: raise RuntimeError if Redfish returns an error

https://gerrit.wikimedia.org/r/1081901

Change #1084706 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.hosts.provision: improve supermicro class

https://gerrit.wikimedia.org/r/1084706

Change #1084706 merged by Elukey:

[operations/cookbooks@master] sre.hosts.provision: improve supermicro class

https://gerrit.wikimedia.org/r/1084706

elukey claimed this task.

I think that we can declare this task completed. We are still finding things to fix from time to time, but that is becoming part of normal/regular maintenance. Multiple Supermicro hosts are in production and the provision/reimage cookbooks worked nicely on them.