
Allow idrac tftp fetching of firmware updates (either to existing tftp or new solution)
Open, Medium, Public

Description

Currently all firmware updates are applied via one of three means, all labor-intensive in one form or another:

  • Dell: via the HTTPS interface of the iDRAC, logging in via the web interface and clicking through two different menus to reach the firmware upload screen, then three more click-throughs and upload options, then monitoring.
  • Dell: via an onsite tech running their own TFTP server from a laptop plugged directly into the mgmt network; they can then use a single command to apply firmware via TFTP (see the sketch below).
  • HP: BIOS and iLO via HTTPS; everything else (also including BIOS and iLO) via the HP SPP ISO and a crash cart.
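
For reference, the "single command" in the second Dell bullet is roughly the following, run with racadm against a TFTP server reachable from the iDRAC (a sketch only; exact flags vary by iDRAC generation, and the IP and path here are placeholders):

racadm fwupdate -g -u -a <tftp_server_ip> -d <firmware_dir_on_tftp_server>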

Proposed Solutions:
A) Single TFTP server solution: punch the required holes in our network ACLs to allow our mgmt network to access the TFTP servers we already run for PXE loads.
Pros: scales within our current infrastructure (no unicorns)
Cons: the mgmt network is accessible from our install servers; currently this is restricted to our bastion and cumin hosts for SSH, and to the cumin hosts for other protocols.

B) Dual TFTP server solution: create a ganeti instance (on our existing ganeti cluster) and a TFTP role to run just our firmware updates for dcops.
Pros: this ganeti instance could use an internal IP and thus be further isolated from the outside world, keeping non-SSH access to mgmt restricted to internal-IP hosts only (cumin, plus this proposed host)
Cons: we have to build and maintain another TFTP server role within our infrastructure

This solution would also likely be leveraged by automation in future projects to automate firmware updates for hosts (currently an unassigned long-term project). Since it would then feed into that automation, we likely want that team's input on this as well. Papaul has regular sync-ups with that team and must regularly update firmware on hosts, so he has been added to this task as a subscriber. Chris has been added as someone who has to regularly update firmware on hosts, and Willy has been included to keep him apprised of potential changes to our workflows. (In this case, it would likely simplify workflows, reducing all Dell firmware updates to a single iDRAC command line, versus multiple GUI steps or the command-line solution requiring a local laptop.)

Event Timeline

RobH triaged this task as Medium priority. Wed, May 26, 10:34 PM
RobH created this task.

Oh, I added Arzhel so he is aware of any network asks and can potentially point out any hard blockers in what's proposed.

I plan to bring this up in our next DC ops meeting, just in case this gets no input until then.

As one who worked on the installserver puppet roles in the past, I'd say the cons of B aren't so bad. We should be able to reuse existing profiles and just combine the needed ones into a new role class.

So from the puppet side this doesn't seem very hard. Almost all the work seems to be in creating an entire ganeti cluster (for just one VM).

I am thinking maybe it would be easier to NOT make it a virtual host and just find old hardware which can take a puppet role and afterwards be patched into the mgmt network?

> As one who worked on the installserver puppet roles in the past, I'd say the cons of B aren't so bad. We should be able to reuse existing profiles and just combine the needed ones into a new role class.
>
> So from the puppet side this doesn't seem very hard. Almost all the work seems to be in creating an entire ganeti cluster (for just one VM).
>
> I am thinking maybe it would be easier to NOT make it a virtual host and just find old hardware which can take a puppet role and afterwards be patched into the mgmt network?

The use of a dedicated host doesn't scale to allow for easy failover, and I want to avoid using dedicated hardware for such a lightweight role if possible!

But wouldn't the dedicated ganeti server need hardware then instead?

> But wouldn't the dedicated ganeti server need hardware then instead?

My initial task description says 'ganeti instance', not dedicated ganeti hardware.

Edit addition: I say 'create a ganeti server' at one point and 'this ganeti instance' on the next line, so this is perhaps the confusion; I'm correcting the first part of that line to read 'ganeti instance' like it does in the follow-up.

> B) Dual TFTP server solution: create a ganeti server and TFTP role to run just our firmware updates for dcops
> Pros: this ganeti instance could use an internal IP and thus be further isolated from the outside world, keeping non-SSH access to mgmt restricted to internal-IP hosts only (cumin and then this proposed host)

I don't think we should have to dedicate any resources to such a lightweight ask when we're already running our own ganeti clusters and our own TFTP servers on our install hosts. I may be overruled on this due to mgmt network security concerns.

ACK, sorry, I read "create a ganeti server" as a server running ganeti that then hosts VMs on it.

Yea, so then the biggest part of this would be the "make the mgmt network a network that Ganeti knows about and that can be selected by the makevm cookbook" part, I think. That would need ganeti networking work.
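
If we go this way, the Ganeti side would presumably be something along these lines (a sketch only; the exact gnt-network syntax depends on our Ganeti version, and the subnet and bridge names here are made up for illustration):

# declare the mgmt subnet so Ganeti (and the makevm cookbook) can allocate from it
sudo gnt-network add --network=10.65.0.0/16 mgmt
# attach it to a node group via the bridge the mgmt VLAN hangs off
sudo gnt-network connect -N mode=bridged,link=br-mgmt mgmt default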

@RobH
I have only one question for now: what is, or will be, your approach to keeping the TFTP server up to date with the latest firmware?

As you said, it would be a good idea to see how it fits into the big automation picture: first by detailing precisely the current workflows, identifying the pain points, and designing the ideal workflow, especially what can and should be automated and what should be kept manual.

Note that (A) means mgmt accessing the install servers, not the other way around. So as long as the firmwares are not corrupted, it's fine security-wise. That's why I think it's the best short-term solution to improve the DC-ops experience while we think about a longer-term option.

> @RobH
> I have only one question for now: what is, or will be, your approach to keeping the TFTP server up to date with the latest firmware?

I'm making assumptions here, basically that everyone in DC Ops checks the firmware revision before uploading to a host. This is a fairly safe assumption, though; no one should be uploading old firmware rather than the latest (unless it's VERY old and you have to do the major-increment updates to get to the latest revision).

These assumptions could be wrong, and then the follow-up solutions are flawed.

I see a few options here regarding update cadence, but I would hesitate to spin this task off onto that discussion.

A) Everyone updates it when they push firmware updates:

Whenever someone in DC Ops goes to flash firmware on a Dell device, we currently go to the Dell website and download the latest firmware. Then it is uploaded to the server via the iDRAC (either HTTPS, or TFTP from an onsite's local laptop). This part wouldn't change, except whoever is doing the firmware update compares the firmware version on our TFTP server to the Dell site and updates it as needed at the time of firmware flashing.

Pro: keeps the latest revisions on TFTP; doesn't require any single DC opsen to be fully responsible for the firmware updates
Cons: no single owner of firmware revision updates; requires each onsite to regularly check the Dell website

B) Single owner updates every X days

One person owns firmware versioning in DC ops for our Dell devices and updates the TFTP server firmware images every X number of days. (30 seems too short, 90 seems too long; 60?)

Pro: one person can maintain and update a documentation page so other DC opsen only have to copy/paste the TFTP update commands into an iDRAC SSH session.
Cons: one person is a single point of failure.

C) Dual owners update every X days

We have a 5-person department, so we could assign 2 folks to option B and have them alternate checking for updates every 30 days. I actually like this option best, as it would only put the burden of firmware revision comparisons on a given person once every 60 days, but keep all firmware revisions on our TFTP server within 30 days of the latest update.

Pro: only 2 of 5 people have to regularly do firmware TFTP updates and documentation updates (keeping copy/paste command capability in the docs for the other DC opsen's firmware updates).
Con: I suppose the other 3 DC opsen won't be intimately familiar with the Dell support website, but does that really matter?

I am sure there are other solutions for ownership within DC ops. If we need to determine ownership within our team, I'd suggest we split that particular discussion into its own task, keeping this task focused on the technical issues of the project. (Mostly I don't think the other teams care who in DC ops owns this, so I'd rather not fill this task's conversation thread with that discussion ; )

Edit addition: I've been thinking about this further, and IMO the discussion of ownership within DC ops is really ideally suited to our monthly DC ops team meetings!

This comment was removed by RobH.

> As you said, it would be a good idea to see how it fits into the big automation picture: first by detailing precisely the current workflows, identifying the pain points, and designing the ideal workflow, especially what can and should be automated and what should be kept manual.
>
> Note that (A) means mgmt accessing the install servers, not the other way around. So as long as the firmwares are not corrupted, it's fine security-wise. That's why I think it's the best short-term solution to improve the DC-ops experience while we think about a longer-term option.

Should we file another task to poke the ACL holes and set up the files/directories for A, as a subtask of this one, do you think? Or should we let this sit in review a bit longer before acting?

I have some pending installs to do, hence my question's timing =]

RobH renamed this task from allow mgmt network to access tftp servers for firmware updates to Allow idrac tftp fetching of firmware updates (either to existing tftp or new solution). Thu, Jun 10, 3:21 PM

I'm not sure yet what the automation side of things will look like, but there is a good chance that it could use Redfish.
In that case it will not use any of the proposed solutions, as the Redfish firmware upgrade AFAICT works either by POSTing the files to the server via the Redfish API (push) or by telling the server to download them from an HTTP server (pull).
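
For reference, the pull variant maps to the standard Redfish SimpleUpdate action; a request would look roughly like the sketch below (generic Redfish schema; the exact endpoint, payload, and supported transfer protocols vary per vendor and firmware version):

curl -k -u root:<password> \
    -X POST https://<idrac_ip>/redfish/v1/UpdateService/Actions/UpdateService.SimpleUpdate \
    -H 'Content-Type: application/json' \
    -d '{"ImageURI": "http://<http_server>/firmimg.d7", "TransferProtocol": "HTTP"}'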

From IRC conversation:
We're going to do a one-off to ease DC ops' pain of upgrading a large number of firmwares. Once those 40 servers are done we will roll back those changes and re-evaluate.

Pushed the following temporary rule to mr1-eqiad:

[edit security address-book global]
     address wikimedia-private_6 { ... }
+    address install1003 208.80.154.32/32;
[edit security policies from-zone mgmt to-zone production]
      policy any--icmp { ... }
+     policy any--tftp {
+         match {
+             source-address any;
+             destination-address install1003;
+             application junos-tftp;
+         }
+         then {
+             permit;
+         }
+     }
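
With that rule in place, a quick sanity check that the TFTP service is actually serving the firmware file is a plain fetch with a TFTP client (this assumes the tftp-hpa client and a file already present in the tftp root, like the firmimg.d7 mentioned further down):

tftp 208.80.154.32 -c get firmimg.d7
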
RobH added subscribers: jbond, MoritzMuehlenhoff.

@jbond & @MoritzMuehlenhoff:

Would it be ok for me to temporarily push the Dell firmware files to our install server via puppet volatile file hosting, so I can push firmware updates easily for T273915?

Basically I'd just push them into a 'firmware' directory in the tftp store on puppetmaster1001?

Please advise!
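
To make the ask concrete, the workflow I have in mind is roughly the following (a sketch; the volatile path, hostname, and helper script here are my assumptions, not verified):

# copy the Dell firmware into the volatile area on the puppetmaster
scp firmimg.d7 puppetmaster1001.eqiad.wmnet:/var/lib/puppet/volatile/tftpboot/firmware/
# then run puppet on the install host so the file gets synced into the tftp root
sudo run-puppet-agent    # run on install1003 itself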

> @jbond & @MoritzMuehlenhoff:
>
> Would it be ok for me to temporarily push the Dell firmware files to our install server via puppet volatile file hosting, so I can push firmware updates easily for T273915?
> Basically I'd just push them into a 'firmware' directory in the tftp store on puppetmaster1001?

What size are these files? That's fine if it's not more than, say, 5G; the install* servers are 20G instances and e.g. install1003 has 9.9G left (the others 12-13G).

>> @jbond & @MoritzMuehlenhoff:
>>
>> Would it be ok for me to temporarily push the Dell firmware files to our install server via puppet volatile file hosting, so I can push firmware updates easily for T273915?
>> Basically I'd just push them into a 'firmware' directory in the tftp store on puppetmaster1001?
>
> What size are these files? That's fine if it's not more than, say, 5G; the install* servers are 20G instances and e.g. install1003 has 9.9G left (the others 12-13G).

The iDRAC firmware is 202MB, the BIOS 27MB, the network firmware 7MB; they are all quite tiny. There may also (in future) be RAID and 10G network firmware, but those are similar file sizes, also quite tiny!

I'll toss them in there and run puppet on the install host to see if it updates accordingly, and then test with one of the new mw hosts!

>> What size are these files? That's fine if it's not more than, say, 5G; the install* servers are 20G instances and e.g. install1003 has 9.9G left (the others 12-13G).
>
> The iDRAC firmware is 202MB, the BIOS 27MB, the network firmware 7MB; they are all quite tiny. There may also (in future) be RAID and 10G network firmware, but those are similar file sizes, also quite tiny!

Go for it then :-)

Change 699269 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] adding mgmt subnet to iptable rules

https://gerrit.wikimedia.org/r/699269

Change 699273 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Allow subset of eqiad mgmt range to connect to tftp servers.

https://gerrit.wikimedia.org/r/699273

Change 699273 abandoned by Cathal Mooney:

[operations/puppet@production] Allow subset of eqiad mgmt range to connect to tftp servers.

Reason:

Rob and I both submitted the same change; mixed-up comms.

https://gerrit.wikimedia.org/r/699273

Change 699269 merged by RobH:

[operations/puppet@production] Allow mgmt range to connect to tftp servers.

https://gerrit.wikimedia.org/r/699269

One other option is to use a bash script and a text file with the IP addresses of the nodes (see below).
Note: this was tested only for the iDRAC upgrade, and I recommend running it in screen.

  • Lab environment
papaul@install1001:~$ sudo lsb_release -a
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 10 (buster)
Release:	10
Codename:	buster
papaul@install1001:~$ sudo dpkg -l atftpd
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version                     Architecture Description
+++-==============-===========================-============-===========================
ii  atftpd         0.7.git20120829-3.2~deb10u1 amd64        advanced TFTP server
papaul@install1001:~$ cd /srv/tftpboot/
papaul@install1001:/srv/tftpboot$ ls -l
total 224952
drwxr-xr-x 6 nobody root      4096 Jun  7 19:32 bullseye-installer
drwxr-xr-x 6 nobody root      4096 Jun  4 17:50 buster-installer
-rwxr-xr-x 1 nobody root 114723046 Jun 10 19:46 firmimg.d7
-rwxr-xr-x 1 nobody root 115617117 Jun  3 23:17 initrd.gz

Before the upgrade

papaul@install1001:~$  sudo racadm  -r 10.192.0.50 -u root -p <IDRAC_password> getsysinfo
Firmware Version        = 2.75.100.75
Firmware Build          = 02
Last Firmware Update    = 02/27/2021 03:20:21

After the upgrade

papaul@install1001:~$  sudo racadm  -r 10.192.0.50 -u root -p <IDRAC_password> getsysinfo
Firmware Version        = 2.80.80.80
Firmware Build          = 09
Last Firmware Update    = 06/11/2021 04:08:36

The script and the ip_list.txt file need to be in the same directory.
update_idrac.sh

#!/bin/bash
# Upgrade the iDRAC firmware on a list of Dell hosts via racadm
# (c) Papaul Tshibamba, Wikimedia Foundation Inc. 2021
# Dell's racadm package needs to be installed on the server running this script

echo -n "Enter iDRAC root password (password will not be displayed):"
read -s DRACPASS
echo

# Host IP list file location
host_list=ip_list.txt
# Logfile to keep the logs of the execution
execlog=IDRAC_upgrade-exec.log
# Logfile for IP addresses of hosts where the iDRAC upgrade was successful
successlog=IDRAC_upgrade-success.log
# Logfile for IP addresses of hosts where the iDRAC upgrade failed
faillog=IDRAC_upgrade-failed.log

# get the list of IPs for the servers
function getServers () {
    cat "$host_list"
}

# check that the file with the host IPs exists
if [ ! -r "$host_list" ]; then
    echo "IP address file $host_list not found or cannot be read"
    exit 1
# Make the log files. The logfiles get overwritten at each execution
else
    echo "Starting Bash configuration of Dell DRAC for file $host_list" > $execlog
    echo "" > $successlog
    echo "" > $faillog
fi

for host_ip in $(getServers); do
    echo "========================================" >> $execlog
    echo "Upgrade iDRAC for $host_ip" >> $execlog

    # Dell racadm command
    sudo racadm -r "$host_ip" -u root -p "$DRACPASS" update -f /srv/tftpboot/firmimg.d7

    if [ $? -ne 0 ]; then
        echo "Failed. See logfile for details"
        echo "$host_ip" >> $faillog
    else
        echo "IDRAC upgrade successful on $host_ip"
        echo "$host_ip" >> $successlog
    fi
done

echo " complete"

ip_list.txt

10.192.0.50
10.192.0.56

When you run the script update_idrac.sh, you get the message below:

Security Alert: Certificate is invalid - self signed certificate
Continuing execution. Use -S option for racadm to stop execution on certificate-related errors.
        The Copying Operation has begun,                                     
	This might take several minutes to complete, depending on the network.
	Do not interrupt the operation.

At this point you should wait; this is why you have to run it in screen, since it takes around 10 minutes to copy the files, depending on the network. Once the copy is complete you get the message below:

RAC987: Firmware update job for firmimg.d7 is initiated.
This firmware update job may take several minutes to complete depending on the 
component or firmware being updated. To view the progress of the job, use the 
"racadm jobqueue view" command. If the job is scheduled, the system will require a manual reboot.
To reboot the system  manually, use the "racadm serveraction powercycle" command.

IDRAC upgrade successful on 10.192.0.50

Then the script will log in to the second node and run the same process. While the process is running on the second node, you can open another terminal and run:

papaul@install1001:~$ racadm -r 10.192.0.50 -u root -p <IDRAC_password> jobqueue view

output

[Job ID=JID_234004157294]
Job Name=Firmware Update: iDRAC
Status=Downloading
Start Time=[Not Applicable]
Expiration Time=[Not Applicable]
Message=[RED003: Downloading package.]
Percent Complete=[NA]

Once the upgrade is done on the first node, the same command will show:

[Job ID=JID_234004157294]
Job Name=Firmware Update: iDRAC
Status=Completed
Start Time=[Not Applicable]
Expiration Time=[Not Applicable]
Message=[RED001: Job completed successfully.]
Percent Complete=[100]

@RobH what are the conclusions of yesterday's experiment? Is it ok to roll back the network change?

@Papaul that looks useful and could potentially be a cookbook? (Relevant doc: https://www.dell.com/support/manuals/fr-fr/dell-xc6420/idrac9_4.00.00.00_ug_new/updating-firmware-using-remote-racadm?guid=guid-64c9bcc6-a701-4d08-949e-753d1e1a7c6f&lang=en-us)

@Papaul that looks like a nice approach. One thing we need to consider, @ayounsi, is that this makes the connection to the iDRAC from wherever it's run; so, running on the install servers, it's connecting from prod -> mgmt. But we could investigate options there, of course. We could go further and make a dedicated "controller node" for the iDRACs, manage it with puppet, and run scripts such as this one on the back of changes to the repo/Netbox etc. Just a thought; I've no idea how practical that might be.

I'll let Rob explain all the gory details, but the TFTP update for the iDRAC firmware succeeded after a few false starts (iptables on the install host had to be updated; there were issues with the path being requested; we needed to use the firmware image extracted from the supplied EXE file, not the EXE itself). The bad news is that it seems the BIOS and other system updates cannot be done from a TFTP source, only the iDRAC firmware. I believe the racadm tool may be able to do it from a different type of backend (NFS and some others), so maybe that adds to the case for a dedicated node for these.

From the very original post regarding a TFTP server in general, I think option B is the better choice. I'm also thinking we may want to investigate option B just for PXE booting servers and most other functions of the install server. Currently, as far as I can see, the only reason the install servers have an external address is that they run the Squid proxy. I think it would make sense from a security point of view to move the proxy configuration to a dedicated VM and reconfigure the install servers on internal IP addresses (sorry for the feature creep).

Looking at the example from Papaul, it seems that ultimately the command used to run the update is the following one:

$ sudo  racadm -r $host_ip -u root -p $DRACPASS update -f /srv/tftpboot/firmimg.d7

As @cmooney points out, this requires a connection from install_server -> mgmt network. However, I think we could also use the following to avoid this (it is pretty much the same as the link Arzhel posted):

NOTE: some of these commands may need massaging depending on iDRAC versions etc. This also doesn't help with iLO, although I suspect there is something similar.
$ sudo  racadm -r $host_ip -u root -p $DRACPASS fwupdate -g -u -a $INSTALL_SERVER_IP -d $TFTP_FOLDER

We may also be able to configure cfgRhostsFwUpdateIpAddr and cfgRhostsFwUpdatePath via the reimage script, which would enable us to just use:

$ sudo  racadm -r $host_ip -u root -p $DRACPASS fwupdate -g -u
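
For completeness, setting those two defaults via the legacy config interface would presumably look like this (an untested sketch; the group/object names are per older iDRAC racadm docs and may differ by version):

$ sudo racadm -r $host_ip -u root -p $DRACPASS config -g cfgRemoteHosts -o cfgRhostsFwUpdateIpAddr $INSTALL_SERVER_IP
$ sudo racadm -r $host_ip -u root -p $DRACPASS config -g cfgRemoteHosts -o cfgRhostsFwUpdatePath $TFTP_FOLDER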

In relation to turning this into a cookbook, I think there are two options:

  • Install racadm on the cumin servers and run updates from there. This has the downside that we will be pushing firmwares from the US to Singapore.
  • Install racadm on the install servers and have the cookbooks call the commands over SSH. We could possibly create a racadm spicerack module, but at that point I wonder if we should just spend the time on the Redfish implementation.

One last thought: it would be ideal if we could do this with IPMI, as we already have an IPMI spicerack module. However, although ipmi delloem looked promising, I couldn't actually find anything useful.

> From the very original post regarding a TFTP server in general, I think option B is the better choice. I'm also thinking we may want to investigate option B just for PXE booting servers and most other functions of the install server. Currently, as far as I can see, the only reason the install servers have an external address is that they run the Squid proxy. I think it would make sense from a security point of view to move the proxy configuration to a dedicated VM and reconfigure the install servers on internal IP addresses (sorry for the feature creep).

That makes sense. We had already unrolled the install* servers quite a bit when the PoP setup was created, but that would be the next logical step. I can look into creating separate proxies with bullseye in the next few weeks; then it would only be a matter of reimaging the install servers with an internal hostname.

OK, so in testing yesterday we got the iDRAC firmware to load over TFTP, but it seems Dell doesn't support TFTP for DUP files like BIOS, NIC, and RAID; only FTP for those, and TFTP for the iDRAC firmware.

So it appears this may need to shift from a standalone TFTP server to a standalone FTP server. I'll update more later today, but wanted to get this basic update pushed from yesterday's testing.
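
For what it's worth, newer racadm versions document an -l option on the update subcommand for pulling from a network repository, so the FTP variant would presumably look something like the line below. I haven't verified the exact flags for FTP, so treat this as an untested sketch and check the racadm manual for the firmware version in question:

sudo racadm -r <idrac_ip> -u root -p <IDRAC_password> update -f BIOS_9FG85_WN64_2.12.1_01.EXE -l ftp://<ftp_server>/firmware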

Change 699323 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] installserver/tftp: install tftp client on tftp servers for debugging

https://gerrit.wikimedia.org/r/699323

Change 699323 merged by Dzahn:

[operations/puppet@production] installserver/tftp: install tftp client on tftp servers for debugging

https://gerrit.wikimedia.org/r/699323

Change 699301 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] Revert "Allow mgmt range to connect to tftp servers."

https://gerrit.wikimedia.org/r/699301

Change 699301 merged by Dzahn:

[operations/puppet@production] Revert "Allow mgmt range to connect to tftp servers."

https://gerrit.wikimedia.org/r/699301

I reverted the firewall (ferm) change that allowed mgmt to connect to install, since, as the comments above say, it couldn't be used for firmware upgrades.

Papaul's suggestion "just works" and could be run from the cumin servers, which already have access to the mgmt network, so that seems a good option to me.

With the same lab environment, the same command will upgrade the BIOS just by changing the file:

sudo racadm  -r 10.192.0.50 -u root -p <IDRAC_password> update -f /srv/tftpboot/BIOS_9FG85_WN64_2.12.1_01.EXE
Security Alert: Certificate is invalid - self signed certificate
Continuing execution. Use -S option for racadm to stop execution on certificate-related errors.
        The Copying Operation has begun,                                     
	This might take several minutes to complete, depending on the network.
	Do not interrupt the operation.
^CRAC987: Firmware update job for BIOS_9FG85_WN64_2.12.1_01.EXE is initiated.
This firmware update job may take several minutes to complete depending on the 
component or firmware being updated. To view the progress of the job, use the 
"racadm jobqueue view" command. If the job is scheduled, the system will require a manual reboot.
To reboot the system  manually, use the "racadm serveraction powercycle" command.

> could be run from the cumin servers, which already have access to the mgmt network, so that seems a good option to me.

I'm not sure running these from the cumin servers is the best idea long term, as doing a remote site will likely take some time to upload the files. I know these files are not that big, but in my previous experience (@Papaul correct me if I'm wrong) uploading to DRAC interfaces is very slow; as such, upgrading a box in eqsin from eqiad may be too painful.

@jbond Yes, uploading to iDRAC interfaces is very slow.