Spicerack: expand Supermicro support in the Redfish module
Open, MediumPublic

Description

Right now the Redfish support for Supermicro in Spicerack is minimal and doesn't have dedicated support for configuring BIOS, BMC, network cards and PXE. It also lacks any specific support for firmware upgrades.

With the introduction of Supermicro hosts in the fleet we need to start expanding our support, so that the sre.hosts.provision and sre.hardware.upgrade-firmware cookbooks can also handle Supermicro hosts.

Implementation details to be investigated and discussed.

Event Timeline

Volans triaged this task as Medium priority. May 20 2024, 4:09 PM
Volans created this task.
Restricted Application added a subscriber: Aklapper.

Change #1036704 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/software/spicerack@master] redfish: expand support for Supermicro hosts

https://gerrit.wikimedia.org/r/1036704

Change #1036704 merged by jenkins-bot:

[operations/software/spicerack@master] redfish: expand support for Supermicro hosts

https://gerrit.wikimedia.org/r/1036704

Change #1037573 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.host.provision: no-op refactor to highlight DELL-specific confs

https://gerrit.wikimedia.org/r/1037573

Change #1037806 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] WIP: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro

https://gerrit.wikimedia.org/r/1037806

Network config for kubernetes2054 as seen by Redfish (supermicro):

>>> pprint(a.request("get", "/redfish/v1/Managers/1/EthernetInterfaces/1").json())
{'@odata.id': '/redfish/v1/Managers/1/EthernetInterfaces/1',
 '@odata.type': '#EthernetInterface.v1_5_3.EthernetInterface',
 'AutoNeg': True,
 'DHCPv4': {'DHCPEnabled': False,
            'FallbackAddress': 'None',
            'UseDNSServers': False,
            'UseDomainName': False,
            'UseGateway': False,
            'UseNTPServers': False,
            'UseStaticRoutes': False},
 'DHCPv6': {'OperatingMode': 'Stateless',
            'UseDNSServers': False,
            'UseDomainName': False,
            'UseNTPServers': False,
            'UseRapidCommit': False},
 'Description': 'Management Network Interface',
 'FQDN': '',
 'FullDuplex': True,
 'HostName': 'kubernetes2054',
 'IPv4Addresses': [{'Address': '10.193.2.82',
                    'AddressOrigin': 'Static',
                    'Gateway': '10.193.0.1',
                    'SubnetMask': '255.255.0.0'}],
 'IPv4StaticAddresses': [{'Address': '10.193.2.82',
                          'Gateway': '10.193.0.1',
                          'SubnetMask': '255.255.0.0'}],
 'IPv6Addresses': [{'Address': 'fe80:0:0:0:7ec2:55ff:fe50:fca8',
                    'AddressOrigin': 'LinkLocal',
                    'AddressState': 'Preferred',
                    'PrefixLength': 64}],
 'IPv6StaticAddresses': [{'Address': '::', 'PrefixLength': 64},
                         {'Address': '::', 'PrefixLength': 64},
                         {'Address': '::', 'PrefixLength': 64},
                         {'Address': '::', 'PrefixLength': 64},
                         {'Address': '::', 'PrefixLength': 64}],
 'Id': '1',
 'LinkStatus': 'LinkUp',
 'MACAddress': '7C:C2:55:50:FC:A8',
 'MTUSize': 1500,
 'MaxIPv6StaticAddresses': 5,
 'Name': 'Manager Ethernet Interface',
 'NameServers': ['8.8.8.8', '0.0.0.0', '::', '::'],
 'Oem': {'Supermicro': {'@odata.type': '#SmcEthernetInterfaceExtensions.v1_0_0.EthernetInterface',
                        'IPProtocolStatus': 'Dual'}},
 'SpeedMbps': 1000,
 'SpeedMbps@Redfish.AllowableValues': [100, 1000],
 'StatelessAddressAutoConfig': {'IPv4AutoConfigEnabled': False,
                                'IPv6AutoConfigEnabled': True},
 'Status': {'Health': 'OK', 'State': 'Enabled'},
 'VLAN': {'VLANEnable': False, 'VLANId': 1}}
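
To compare the BMC network settings of several hosts, the few fields that matter for provisioning can be pulled out of a payload like the one above. This is only a minimal sketch, assuming the JSON keeps the shape shown for kubernetes2054; `summarize_mgmt_interface` is a hypothetical helper, not something that exists in Spicerack.

```python
def summarize_mgmt_interface(payload: dict) -> dict:
    """Extract the mgmt network fields relevant for provisioning.

    Assumes the EthernetInterface layout shown above; missing keys
    fall back to empty/None values instead of raising.
    """
    addresses = payload.get("IPv4Addresses") or [{}]
    first = addresses[0]
    return {
        "hostname": payload.get("HostName", ""),
        "mac": payload.get("MACAddress", "").lower(),
        "dhcp_enabled": payload.get("DHCPv4", {}).get("DHCPEnabled", False),
        "address": first.get("Address"),
        "gateway": first.get("Gateway"),
        "netmask": first.get("SubnetMask"),
        "origin": first.get("AddressOrigin"),
    }


# Trimmed copy of the response pasted above.
sample = {
    "DHCPv4": {"DHCPEnabled": False},
    "HostName": "kubernetes2054",
    "IPv4Addresses": [
        {
            "Address": "10.193.2.82",
            "AddressOrigin": "Static",
            "Gateway": "10.193.0.1",
            "SubnetMask": "255.255.0.0",
        }
    ],
    "MACAddress": "7C:C2:55:50:FC:A8",
}
summary = summarize_mgmt_interface(sample)
```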

I checked the BIOS settings of kubernetes2054 (a Supermicro node already configured by DCops) and there are hundreds of them, so it is not easy to figure out which ones would need to end up in the provision cookbook. Riccardo also mentioned that what DCops configures via the vendor UI may not correspond 1:1 with the settings as seen by Redfish, which complicates things. The only reliable way forward is to wait for a new Supermicro node and check its factory settings against the ones on kubernetes2054 - the diff will be the starting point for the settings that we'll add to the cookbook.
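
The factory-vs-configured comparison could be done with a plain dictionary diff once both sets of BIOS attributes are dumped via Redfish. A rough sketch, assuming each dump is a flat name-to-value mapping; the attribute names in the sample are made-up placeholders, not real Supermicro settings, and `diff_bios_settings` is a hypothetical helper.

```python
def diff_bios_settings(factory: dict, configured: dict) -> dict:
    """Return {name: (factory_value, configured_value)} for every
    attribute whose value differs or that exists on one side only."""
    changed = {}
    for name in sorted(set(factory) | set(configured)):
        before = factory.get(name)
        after = configured.get(name)
        if before != after:
            changed[name] = (before, after)
    return changed


# Placeholder attribute names, for illustration only.
factory_dump = {"ExampleSettingA": "Enabled", "ExampleSettingB": "Auto"}
configured_dump = {
    "ExampleSettingA": "Disabled",
    "ExampleSettingB": "Auto",
    "ExampleSettingC": "Legacy",
}
delta = diff_bios_settings(factory_dump, configured_dump)
```

Attributes present on one side only show up with `None` on the other side, which would also flag settings that appear after a firmware change.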

Next steps:

  • Refactor the provision cookbook to be less DELL-specific and allow other vendors, like Supermicro. We are not going to make it fully generic and future-proof yet; the goal is just to proceed in small steps and enable new functionality.
  • Add steps to configure the BIOS and the mgmt network for Supermicro and test them, so that DCops will not need to be physically close to the node to configure it (once the mgmt network is configured the node is reachable from anywhere, which simplifies management).

Mentioned in SAL (#wikimedia-operations) [2024-06-05T13:46:35Z] <elukey> factory reset for sretest1001 to test the new provision cookbook - T365372

Change #1037573 merged by jenkins-bot:

[operations/cookbooks@master] sre.host.provision: no-op refactor to highlight DELL-specific confs

https://gerrit.wikimedia.org/r/1037573

First roadblock: https://www.supermicro.com/en/support/BMC_Unique_Password

It seems that every Supermicro host is set with an ADMIN password that is available only on the label shipped with the host (so it doesn't have a known default value). This complicates our workflow a little, since for DELLs we assume calvin as the default password and we use it to set ours.

Summary so far:

  • The current draft for Supermicro support in the provision cookbook is https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1037806
  • Missing functionality for provision:
    • DHCP support: the spicerack dhcp module needs to know what hostname the mgmt/BMC sends (based on the serial, etc.) to create the correct initial config. We need to tcpdump on installXXXX when a new node arrives.
    • PXE settings: do we need to set anything like LegacyBootOption on Supermicro to enable PXE on a certain NIC? The only Supermicro node that we have, kubernetes2054, has two 1G NICs and it didn't need any special config. We'll get new nodes with 10G NICs as well, so things will change, but we cannot really test right now.
    • Console redirection seems to be working already, but it is better to double check.
    • HW RAID: we need to understand the format of a config.

I think we can pause the provision work until a new Supermicro node arrives. We are in a good state, but without testing it will be difficult to come up with working code.

elukey moved this task from In Progress to Stalled on the User-Elukey board.

We have sretest2001 racked and connected to mgmt network, and it is a Supermicro node. I tried to tcpdump the DHCP traffic on install2004 with Riccardo and we got this:

elukey@install2004:~$ sudo tcpdump -vvvv 'udp and (src port 67 or src port 68 or src port 69)' -A
[..]
    mr1-codfw.mgmt.codfw.wmnet.bootps > install2004.wikimedia.org.bootps: [udp sum ok] BOOTP/DHCP, Request from 7c:c2:55:52:4a:1c (oui Unknown), length 321, xid 0x1e71954, secs 7626, Flags [none] (0x0000)
	  Gateway-IP mr1-codfw.mgmt.codfw.wmnet
	  Client-Ethernet-Address 7c:c2:55:52:4a:1c (oui Unknown)
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message (53), length 1: Discover
	    Client-ID (61), length 7: ether 7c:c2:55:52:4a:1c
	    MSZ (57), length 2: 576
	    Parameter-Request (55), length 7: 
	      Subnet-Mask (1), Default-Gateway (3), Domain-Name-Server (6), Hostname (12)
	      Domain-Name (15), BR (28), NTP (42)
	    Vendor-Class (60), length 12: "udhcp 1.32.1"
	    Agent-Information (82), length 19: 
	      Unknown SubOption 9, length 17: 
		0x0000:  0000 0a4c 0c04 0a67 652d 302f 302f 302e
		0x0010:  30
	    END (255), length 0
	    PAD (0), length 0, occurs 20

The Hostname field seems empty, and the above should be sretest2001's BMC (at least, the MAC address resolves to Supermicro and we don't have other hosts racked atm).
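
As a side note, the "Unknown SubOption 9" in the capture above can be decoded by hand. Assuming it follows the relay agent Vendor-Specific Information layout (4-byte enterprise number, 1-byte data length, then vendor-defined data), the sketch below splits the 17 bytes from the hexdump. How the vendor structures the inner data is not specified here, so it is returned raw; `parse_vendor_suboption` is a hypothetical helper.

```python
import struct


def parse_vendor_suboption(payload: bytes) -> tuple:
    """Split an option 82 suboption 9 payload into (enterprise, data).

    Assumes a single enterprise block: 4-byte enterprise number,
    1-byte length, then `length` bytes of vendor-defined data.
    """
    enterprise, data_len = struct.unpack_from("!IB", payload, 0)
    data = payload[5:5 + data_len]
    if len(data) != data_len:
        raise ValueError("truncated vendor suboption")
    return enterprise, data


# Bytes copied from the tcpdump hexdump above.
raw = bytes.fromhex("00000a4c0c040a67652d302f302f302e30")
enterprise, data = parse_vendor_suboption(raw)
# Printable tail of the vendor data, skipping its first two bytes.
tail = data[2:].decode("ascii")
```

Here the enterprise number is 2636 (Juniper's IANA enterprise number) and the printable tail is `ge-0/0/0.0`, the same string that shows up as Circuit-ID in the eqiad capture.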

Riccardo got suspicious, and the same thing seems to happen in eqiad for what appears to be a DELL iDRAC:

elukey@install1004:~$ sudo tcpdump -vvvv 'udp and (src port 67 or src port 68 or src port 69)' -A
[..]
    mr1-eqiad.mgmt.eqiad.wmnet.bootps > install1004.wikimedia.org.bootps: [udp sum ok] BOOTP/DHCP, Request from c4:5a:b1:1a:64:06 (oui Unknown), length 295, xid 0xb4be1b5b, secs 12706, Flags [none] (0x0000)
	  Gateway-IP mr1-eqiad.mgmt.eqiad.wmnet
	  Client-Ethernet-Address c4:5a:b1:1a:64:06 (oui Unknown)
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message (53), length 1: Discover
	    Client-ID (61), length 7: ether c4:5a:b1:1a:64:06
	    MSZ (57), length 2: 576
	    Parameter-Request (55), length 8: 
	      Subnet-Mask (1), Default-Gateway (3), Domain-Name-Server (6), Hostname (12)
	      Domain-Name (15), BR (28), NTP (42), Unknown (248)
	    Vendor-Class (60), length 12: "udhcp 1.21.1"
	    Agent-Information (82), length 12: 
	      Circuit-ID SubOption 1, length 10: ge-0/0/0.0
	    END (255), length 0

Possible issues:

  • Luca or Riccardo don't know how to use tcpdump and this is why we don't see the Hostname
  • MR devices are filtering DHCP traffic for some $reason related to $new-settings.

@cmooney @ayounsi Would you mind checking when you have a moment? I suspect this is my PEBCAK, but on the off chance it isn't, lemme know your thoughts :D

IIUC we are missing DHCP option 12 from the BMC's client. On DELLs we expect something like:

Hostname Option 12, length 13: “idrac-ABC1234”

For Supermicro, this is still unknown. I'll try to explicitly reset sretest1001's BMC (Dell) to factory defaults, to double check whether option 12 gets dropped on Dells or not.

Mentioned in SAL (#wikimedia-operations) [2024-06-13T12:39:18Z] <elukey> reset BIOS/BMC to factory default on sretest1001 - T365372

I can confirm that the sretest1001's BMC sends this:

DHCP-Message (53), length 1: Discover
Hostname (12), length 13: "idrac-XXXXX"
Vendor-Class (60), length 5: "iDRAC"

So it seems that Supermicro's BMC, by default, doesn't do it :(

From https://www.supermicro.com/support/faqs/faq.cfm?faq=24257 it seems as if Supermicro's BMC sends the Hostname option only if a value is provided by the admin.
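
To make the difference concrete: DHCP options are a sequence of code/length/value triplets (code 0 is padding, code 255 ends the list), and the automation depends on option 12 being present. The sketch below walks such a sequence and returns the hostname if there is one. The two byte strings are hand-built approximations of the Discover packets discussed above (options only, after the magic cookie), not real captures, and the function names are hypothetical.

```python
from typing import Optional


def parse_dhcp_options(raw: bytes) -> dict:
    """Parse the options area of a DHCP packet into {code: value}."""
    options = {}
    i = 0
    while i < len(raw):
        code = raw[i]
        if code == 0:        # PAD: single byte, no length
            i += 1
            continue
        if code == 255:      # END: stop parsing
            break
        if i + 1 >= len(raw):
            raise ValueError("option without a length byte")
        length = raw[i + 1]
        value = raw[i + 2:i + 2 + length]
        if len(value) != length:
            raise ValueError("truncated option")
        options[code] = value
        i += 2 + length
    return options


def dhcp_hostname(raw: bytes) -> Optional[str]:
    """Return the Hostname (option 12) value, or None if absent."""
    value = parse_dhcp_options(raw).get(12)
    return value.decode("ascii") if value is not None else None


# Dell-like Discover: message type, hostname, vendor class, end.
dell_like = (bytes([53, 1, 1]) + bytes([12, 13]) + b"idrac-ABC1234"
             + bytes([60, 5]) + b"iDRAC" + bytes([255]))
# Supermicro-like Discover: no option 12 at all, trailing padding.
supermicro_like = (bytes([53, 1, 1]) + bytes([61, 7, 1])
                   + bytes.fromhex("7cc255524a1c")
                   + bytes([60, 12]) + b"udhcp 1.32.1"
                   + bytes([255]) + bytes(20))

dell_name = dhcp_hostname(dell_like)
supermicro_name = dhcp_hostname(supermicro_like)
```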

Note for me - this is an example of snippet generated by the provision cookbook to instruct the DHCP server to assign an IP to idrac mgmts:

elukey@install1004:~$ cat /etc/dhcp/automation/proxies/mgmt-eqiad.conf
# Automatically generated by dhcpincludes for /etc/dhcp/automation/mgmt-eqiad/
include "/etc/dhcp/automation/mgmt-eqiad/sretest1001.mgmt.eqiad.wmnet.conf";

elukey@install1004:~$ cat /etc/dhcp/automation/mgmt-eqiad/sretest1001.mgmt.eqiad.wmnet.conf

class "sretest1001.mgmt.eqiad.wmnet" {
    match if (lcase(option host-name) = "idrac-XXXXXXX");
}
pool {
    allow members of "sretest1001.mgmt.eqiad.wmnet";
    range 10.65.1.13 10.65.1.13;
}
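
For reference, a snippet with that shape can be produced from three inputs: the mgmt FQDN, the hostname the BMC sends in option 12, and the IP to hand out. The sketch below only mirrors the output shown above; it is not the template Spicerack actually uses, and `render_mgmt_snippet` is a hypothetical name.

```python
import ipaddress

MGMT_TEMPLATE = """
class "{fqdn}" {{
    match if (lcase(option host-name) = "{bmc_hostname}");
}}
pool {{
    allow members of "{fqdn}";
    range {ipv4} {ipv4};
}}
"""


def render_mgmt_snippet(fqdn: str, bmc_hostname: str, ipv4: str) -> str:
    """Render a class/pool pair matching on the BMC's DHCP hostname."""
    ipaddress.IPv4Address(ipv4)  # raises ValueError on a malformed address
    # The config lowercases the incoming option, so lowercase our side too.
    return MGMT_TEMPLATE.format(
        fqdn=fqdn, bmc_hostname=bmc_hostname.lower(), ipv4=ipv4
    )


snippet = render_mgmt_snippet(
    "sretest1001.mgmt.eqiad.wmnet", "idrac-XXXXXXX", "10.65.1.13"
)
```

The whole approach hinges on `bmc_hostname` being known in advance, which is exactly what is missing for Supermicro BMCs in their default configuration.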

Change #1043804 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/software/spicerack@master] redfish: add property for storage manager URI

https://gerrit.wikimedia.org/r/1043804

Change #1043804 merged by jenkins-bot:

[operations/software/spicerack@master] redfish: add property for storage manager URI

https://gerrit.wikimedia.org/r/1043804

Change #1046734 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] redfish: simplify interface of Redfish classes

https://gerrit.wikimedia.org/r/1046734

Change #1046734 merged by jenkins-bot:

[operations/software/spicerack@master] redfish: simplify interface of Redfish classes

https://gerrit.wikimedia.org/r/1046734

Current status:

  • We are following up with Supermicro to customize the default root password for the BMC (since right now it is unique for each server), and we are trying to push them to ship a different default mgmt DHCP config (similar to what Dell has, namely sending a proper Hostname field in DHCPREQUEST packets).
  • In T365167#9932231 Papaul mentioned an issue with licensing: in order to use Redfish we need to have a special license on the servers. It is unclear how the license is applied (whether provision will have to do it or not); more details will hopefully follow soon.

The task is basically blocked at least until we understand the licensing problem, but IIUC Papaul and Willy are already working on it.

Things to decide:

  • In the near future we'll receive 10/20 hosts that we have already ordered, and they will not have any of the changes that we requested of course. So we'll have two possible code paths for the provision cookbook, and we'll need to decide what to support. For example, if we want to provide support for these nodes and their default DHCP config is not usable by our spicerack automation, we'll have to use the mac-address for this special use case.

Change #1052311 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/software/spicerack@master] redfish: add the add_account function

https://gerrit.wikimedia.org/r/1052311

It seems clear that for the foreseeable future (next 6/8 months) we will not have the DHCP hostname configured in Supermicro servers. The cleanest option is to use the mac-address, but since we don't save it in Netbox we'll need to find a way to include it if needed.
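
For illustration, a MAC-based variant could use a standard ISC dhcpd host declaration instead of matching on the hostname. The sketch below normalizes the MAC and renders such a declaration; it is an assumption about how this might look, not the actual Spicerack output, and the IP is a documentation-range placeholder since the real one is not known here. `normalize_mac` and `render_mac_snippet` are hypothetical names.

```python
import ipaddress
import re


def normalize_mac(mac: str) -> str:
    """Return the MAC as lowercase, colon-separated hex pairs."""
    digits = re.sub(r"[^0-9A-Fa-f]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"not a valid MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def render_mac_snippet(fqdn: str, mac: str, ipv4: str) -> str:
    """Render a host declaration that pins an IP to a MAC address."""
    ipaddress.IPv4Address(ipv4)  # raises ValueError on a malformed address
    return (
        f"host {fqdn} {{\n"
        f"    hardware ethernet {normalize_mac(mac)};\n"
        f"    fixed-address {ipv4};\n"
        f"}}\n"
    )


# MAC taken from the sretest2001 capture; 192.0.2.10 is a placeholder IP.
mac_snippet = render_mac_snippet(
    "sretest2001.mgmt.codfw.wmnet", "7C:C2:55:52:4A:1C", "192.0.2.10"
)
```

Whatever the final format is, the MAC has to come from somewhere, which is why the question of where to store it comes up next.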

Two options:

The DCops team asked, if possible, to implement the first option. Usually they run the network provision script when they are not in the DC, meanwhile they add the device when they are in the DC (so they have all info available).

@ayounsi We started a chat in #dcops about this, lemme know your preference, or if there is a better way to do it.

I'd suggest abstracting the device creation behind a custom script or cookbook. This could run additional safeguards, ask for additional information (and store it where necessary), and hide the unnecessary fields.

The first option unfortunately goes against proper modelling of our infrastructure.

There is possibly a variant of option 1:

  • Create a new custom script to add devices, which has a field for the MAC address
  • Have that script use the entered data to both add the device, and create the mgmt interface on it, setting the MAC on it.

Overall I don't feel strongly. If it's more convenient for dc-ops to enter the MAC when adding the device, rather than when adding its network link, then let's capture it at that point. All else being equal I do think it's better to store the mgmt int MAC on the mgmt int, rather than in a custom field, but not a big deal if we use a custom field either.

@Papaul the proposal that would be the best compromise is to add a "mgmt mac-address" field to https://netbox.wikimedia.org/extras/scripts/provision_server.ProvisionServerNetwork, rather than doing it when adding the device. IIRC your proposal was to copy/paste the mgmt MAC address in the racking task, so that it would be available to DCops even when working outside the DC. Is it something that we can agree on? If so I'll proceed with the implementation :)