
Upgrade BIOS/RBSU/etc on lvs1007
Closed, Declined · Public

Description

Chris - Can you upgrade the RBSU (and whatever other firmware) to the latest rev on lvs1007 please?

Context - I'm having issues getting the network adapter to work on this host (and others like it, but I'm experimenting on just this one for now). @fgiunchedi managed to get past the same issue by disabling "HP Shared Memory Features" for the NIC(s) somewhere in the RBSU menus. However, on these hosts there is no menu-based RBSU on the serial console (apparently it varies by model, RBSU firmware rev, etc.), only the CLI with the SET/SHOW commands, and the CLI has no such option. There are hints online that later versions of RBSU might either add a CLI option for this or enable the menu-based RBSU with fuller control. I've already tried disabling Intel VT-d (via the RBSU CLI) and SR-IOV (via the NIC Ctrl+S setup), but that's apparently not enough to get past this ethernet driver crash issue.

This is a deep rabbithole problem that's blocking a lot of other work indirectly (e.g. decomming old LVSes in eqiad, and blocking indirectly the ulsfo hardware deploys and asia cache DC as well).

Details

Related Gerrit Patches:
operations/puppet (production): lvs: rename lvs1007 eth interfaces

Event Timeline

BBlack created this task. Jun 7 2017, 2:05 PM
Restricted Application removed a project: Patch-For-Review. Jun 7 2017, 2:05 PM
ema moved this task from Triage to LoadBalancer on the Traffic board. Jun 12 2017, 9:53 AM
RobH added a subscriber: RobH. Jun 26 2017, 9:10 PM

So attempting to upload this over the web to the gui mgmt interface times out. It may work a bit better if done locally from eqiad. I'm pushing the file to my home directory on iron, so Chris can pull it from there.

Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board. Jun 29 2017, 3:13 PM
ayounsi moved this task from Backlog to Watching on the netops board. Jul 12 2017, 7:20 PM

Mentioned in SAL (#wikimedia-operations) [2017-07-25T15:31:40Z] <cmjohnson1> updating firmware lvs1007 T167299

The BIOS update that I have failed to install; looking at another solution.

RobH claimed this task. Aug 30 2017, 12:14 AM
RobH added a subscriber: Cmjohnson.

I've contacted Dasher about this system failing to take updates; will update the task when I have more.

RobH added a comment. Aug 31 2017, 12:37 AM

So, the BIOS and iLO have been updated to the latest version, but the firmware on the NIC can only be flashed via the HP SPP ISO boot.

Created the bootable image using the HP utility provided in the ISO. It is Windows software, so I had to borrow a machine from a family member. Booted the Service Pack and updated the needed firmware. Server rebooted; @RobH please verify.

Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts:

['lvs1007.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201709121519_bblack_22015.log.

RobH reassigned this task from RobH to Cmjohnson. Sep 12 2017, 4:41 PM

Ok, now we are in a bad state. We are trying to remotely enter the Bios, and get the following when telling it to enter bios on vsp:

ROM-Based Setup Utility initiated via a local key press,
please use the local monitor for graphical support.
BIOS Serial Console has been disabled until the next boot.

We need to have @Cmjohnson connect a crash cart, enter the GUI bios setup, and confirm the following:

09:39 < bblack> : somewhere in bios setup, there's an option (related to NICs? but I have no idea what menu it might be under) named "HP Shared Memory Features"
09:39 < bblack> : which needs to be disabled

Also please double check all serial redirection settings, so it redirects to 2, sets physical com port to 1, and the like. We should be able to load the graphical bios settings screen via vsp if all settings are correct.

Also, the NIC firmware update only applied to ports 2+3, but not ports 0+1. I don't suspect NIC firmware level was a leading candidate for the fix anyways, but having the ports at different firmware levels sounds particularly problematic on its own.

16:01 < bblack>      | <03:00:00> BCM57810 - EC:B1:D7:7B:C6:D8 MBA:v7.10.71 CCM:v7.10.71 |      
16:01 < bblack>      | <03:00:01> BCM57810 - EC:B1:D7:7B:C6:DC MBA:v7.10.71 CCM:v7.10.71 |      
16:01 < bblack>      | <04:00:00> BCM57810 - 8C:DC:D4:0C:97:D8 MBA:v7.14.10 CCM:v7.14.4  |      
16:01 < bblack>      | <04:00:01> BCM57810 - 8C:DC:D4:0C:97:DC MBA:v7.14.10 CCM:v7.14.4  |
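The mismatch noted above can be checked mechanically. The following is a minimal sketch, assuming the four banner lines pasted above (with the IRC prefix and box borders stripped) are the only input; the function name `group_by_firmware` and the regular expression are illustrative, not part of any HP or Broadcom tooling.

```python
import re
from collections import defaultdict

# Banner lines as pasted above (IRC prefixes and box borders removed).
BANNER = """\
<03:00:00> BCM57810 - EC:B1:D7:7B:C6:D8 MBA:v7.10.71 CCM:v7.10.71
<03:00:01> BCM57810 - EC:B1:D7:7B:C6:DC MBA:v7.10.71 CCM:v7.10.71
<04:00:00> BCM57810 - 8C:DC:D4:0C:97:D8 MBA:v7.14.10 CCM:v7.14.4
<04:00:01> BCM57810 - 8C:DC:D4:0C:97:DC MBA:v7.14.10 CCM:v7.14.4
"""

# Hypothetical pattern written against the four lines above only.
LINE_RE = re.compile(
    r"<(?P<addr>[0-9A-Fa-f:]+)>\s+(?P<chip>\S+)\s+-\s+"
    r"(?P<mac>(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2})\s+"
    r"MBA:(?P<mba>\S+)\s+CCM:(?P<ccm>\S+)"
)


def group_by_firmware(text):
    """Map each (MBA, CCM) version pair to the port addresses reporting it."""
    groups = defaultdict(list)
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            groups[(match["mba"], match["ccm"])].append(match["addr"])
    return dict(groups)


groups = group_by_firmware(BANNER)
for (mba, ccm), ports in sorted(groups.items()):
    print(f"MBA {mba} / CCM {ccm}: {', '.join(ports)}")
if len(groups) > 1:
    print("firmware levels differ across ports")
```

Because the pattern uses `search`, it should also tolerate the raw IRC-prefixed lines, though that has only been reasoned about against the paste above.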

Completed auto-reimage of hosts:

['lvs1007.eqiad.wmnet']

Of which those FAILED:

set(['lvs1007.eqiad.wmnet'])

Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts:

['lvs1007.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201709121923_bblack_31674.log.

Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts:

['lvs1007.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201709121927_bblack_32678.log.

RobH claimed this task. Edited Sep 18 2017, 4:54 PM

Ok, since the firmware updates, the host lvs1007 won't pxe boot.

I'll work with Chris as needed, and once PXE booting works, will kick it back to @BBlack.

RobH reassigned this task from RobH to Cmjohnson. Sep 18 2017, 5:19 PM

This is an HP Gen8, so I cannot actually load the BIOS remotely and check the PXE settings for the cards. This issue sounds like the NIC card BIOS doesn't have PXE enabled for the NIC port plugged into the network switch.

Please investigate and check bios settings for the card, and confirm if they needed tweaking.

I did the NIC card bios check last week when I first found the PXE booting problem. It is enabled there. My guess is either something else in BIOS settings got changed that affects this, or somehow disabling "HP Shared Memory Feature" with the new NIC firmware also kills the PXE functionality indirectly (which would be pretty awful, and might mean we have no solution but to move to stretch installs).

I checked the BIOS settings: everything is enabled and the standard boot order is correct:

  1. CDROM
  2. FLOPPY
  3. USB
  4. HARD DRIVE
  5. PCI SLOT 1 ETHERNET 10Gb

I also verified PXE was still selected on the NIC card

RobH added a comment. Sep 19 2017, 3:10 PM

@BBlack: So to confirm, if we disable the memory share, it won't pxe boot. If we enable it, it will pxe boot?

I don't know, I hadn't tried re-enabling the memory sharing stuff. All I really know is the sequence of events last week was approximately:

  1. It was PXE booting to installer, but failing network config inside the installer, with google hits saying to disable the HP Shared Memory Feature for the NICs to fix
  2. DCOps finished up going through BIOS and also disabling that HP Shared Memory Feature
  3. I went back to the host, confirmed NIC setting disabling the HP Shared Memory Feature (and NIC setting for PXE boot still on).
  4. Tried PXE boot to installer (a few different ways - "One Time Boot Option" manually from RBSU, or IPMI from wmf-reimage, etc), but it refuses to PXE and just boots from disk now.

My suspicion based on the above, is that for some convoluted reason, "HP Shared Memory Feature" being enabled might be a pre-requisite for RBSU+NIC to work together on PXE booting?
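For reference, the IPMI path in step 4 amounts to asking the BMC for a one-time PXE boot and then power cycling. Below is a minimal sketch of those two `ipmitool` invocations, built but not executed. What wmf-auto-reimage actually runs is not shown in this task, so the `lanplus` interface, the `root` user, and the management hostname are all assumptions for illustration.

```python
import shlex


def pxe_boot_commands(mgmt_host, user="root"):
    """Build ipmitool invocations for a PXE boot request plus a power cycle.

    The password is not placed on the command line: -E tells ipmitool to
    read it from the IPMI_PASSWORD environment variable.
    """
    base = ["ipmitool", "-I", "lanplus", "-H", mgmt_host, "-U", user, "-E"]
    return [
        base + ["chassis", "bootdev", "pxe"],   # request PXE for the next boot
        base + ["chassis", "power", "cycle"],   # restart so the request takes effect
    ]


# Hypothetical management hostname, for illustration only.
for cmd in pxe_boot_commands("lvs1007.mgmt.example"):
    print(shlex.join(cmd))
```

As step 4 shows, the firmware accepting this request does not prove that PXE will run: on lvs1007 the host still fell through to disk.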

@BBlack @RobH I went through and re-verified all the settings. This generation does not give an option of UEFI or Legacy like the newer generations; the BIOS is very basic. I enabled the HP Memory Share. Let's see if that works.

RobH added a comment. Edited Sep 19 2017, 4:33 PM

It looks like these are different versions though:

| <03:00:00> BCM57810 - EC:B1:D7:7B:C6:D8 MBA:v7.10.71 CCM:v7.10.71 |      
| <03:00:01> BCM57810 - EC:B1:D7:7B:C6:DC MBA:v7.10.71 CCM:v7.10.71 |      
| <04:00:00> BCM57810 - 8C:DC:D4:0C:97:D8 MBA:v7.14.10 CCM:v7.14.4  |      
| <04:00:01> BCM57810 - 8C:DC:D4:0C:97:DC MBA:v7.14.10 CCM:v7.14.4  |

Also, it seems eth0 was set to a different multi-function mode than the others:
Multi-Function Mode : NPAR1.5
While the other ports (which showed in PXE splash attempts) were set to:
Multi-Function Mode : SF

I think @Cmjohnson said before that they're at different revs because they're different pieces of hardware (onboard vs card), and those are the latest revs for each, respectively.

RobH added a comment. Sep 19 2017, 4:48 PM

Well, my thought about the multi-function mode being set incorrectly didn't pan out. I made it match all the other ports and it still doesn't PXE boot on eth0. Setting memory sharing to enabled on all ports to see if it PXE boots.

RobH added a comment. Edited Sep 19 2017, 5:10 PM

So, PXE isn't working now for eth0, mac address ec:b1:d7:7b:c6:d8. This is the eth0 that is also detected in the OS, so it doesn't appear to be an issue where the BIOS sees one port as eth0, but the OS sees another. Both see ec:b1:d7:7b:c6:d8 as eth0.

All ports have had the memory sharing set to both disabled and enabled, with attempts to PXE boot via eth0 under each setting. It never shows the actual PXE boot splash and attempt to contact dhcp, instead it skips right to disk.

If eth3 (the second NIC) has its boot mode set to pxe, it will show a pxe dhcp attempt. It is odd that eth0 suddenly will not.

RobH added a comment. Sep 19 2017, 6:12 PM

Ok, so there is something wrong with lvs1007 network firmware/settings.

lvs1007's boot order is missing the network device option. When selecting it as the one-time boot option, it never actually shows a PXE/DHCP POST screen and bypasses it to the disks. It should show up in the normal boot order menu; that it doesn't seems to indicate that it isn't being detected. Perhaps the firmware load is in a bad state or failed. @faidon suggests we attempt the firmware load a second time on both NICs in lvs1007 to see if it resolves the issue.

If not, we'll have to have Dasher/HP send a tech out to work onsite with @Cmjohnson to resolve this issue.

As you can see on lvs1008, which has not had firmware updated, it shows the network device 1 in the boot order, as well as one time options.

RobH added a comment. Edited Sep 19 2017, 6:57 PM

Per @BBlack's request, I've done a show config script on both lvs1007 and lvs1008 for comparison.

P6026 shows both.

The only difference is the virtualization is flipped on (immaterial) and lvs1007 lacks the line: SET IPL PXE 5

Googling SET IPL PXE 5 shows a lot of issues where folks lack PXE booting, even when it should do so. Nothing seems directly relevant to our specific issue, since they have that line in their config and we lack it on lvs1007 for some reason.
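The comparison of the two show-config dumps can be sketched as a set difference over their `SET ...` lines. This is a minimal sketch: the real dumps live in P6026 and are not reproduced here, so the inputs below are hypothetical placeholders, and only `SET IPL PXE 5` is taken from this task.

```python
def set_lines(config_text):
    """Return the whitespace-normalized 'SET ...' lines of a config dump."""
    return {
        " ".join(line.split())
        for line in config_text.splitlines()
        if line.strip().upper().startswith("SET ")
    }


def compare(a_text, b_text):
    """Return (lines only in a, lines only in b), each sorted."""
    a, b = set_lines(a_text), set_lines(b_text)
    return sorted(a - b), sorted(b - a)


# Hypothetical, heavily shortened dumps; the real ones are in P6026.
# Only "SET IPL PXE 5" comes from the task, the other lines are placeholders.
lvs1007 = """
SET EXAMPLE-OPTION-A Enabled
SET EXAMPLE-OPTION-B Disabled
"""
lvs1008 = """
SET EXAMPLE-OPTION-A Enabled
SET EXAMPLE-OPTION-B Disabled
SET IPL PXE 5
"""

only_1007, only_1008 = compare(lvs1007, lvs1008)
print("only on lvs1007:", only_1007)
print("only on lvs1008:", only_1008)
```

A set comparison ignores line order, so it would not catch two hosts that list the same boot devices in a different sequence; a line-by-line diff would be needed for that.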

Can we reset lvs1007 to all factory defaults and see if the NIC shows up as a boot option then? (We're not sure what got changed over time. Brandon suggests that something was different in the BIOS, and we can see that lvs1007 lacks 'SET IPL PXE 5'.)

For the record, one cannot simply enter 'SET IPL PXE 5' in RBSU; it doesn't like it: (105):Invalid Parameter(s) - Invalid String specified

I think the SET IPL PXE 5 is also a red herring, as it usually seems to mean pxe is enabled on a port other than eth0. Our issue is the network device doesn't show in the boot order menu at all!

I suggest resetting bios to defaults and see if that fixes it. If not, attempt reflash of the nic firmware as well as bios, and see if that resolves it. If that doesn't, we'll need to bring in Dasher/HP.

RobH added a comment. Sep 19 2017, 7:20 PM

IRC update:

@Cmjohnson went ahead and reset bios settings to defaults, and after power cycling the server, it hasn't resolved the network device not showing in the boot order.

Next step is to reflash the BIOS firmware as well as both NIC firmwares to the newest versions (a second time, even though they may already be on them) to ensure nothing screwed up during that process.

Swapped the NIC card with a new one that HP sent.

Still says 101-I/O ROM Error twice on every boot attempt, new NIC card has older firmware. PXE boot still doesn't work (tried setting Boot Strap Type to int19h in the ethernet card's firmware menu for eth0 as well just in case it was some BBS-specific failure, no dice).

I gave in and tried a stretch network install on lvs1009 for comparison. I didn't make any BIOS/firmware changes there, just used the RBSU console to onetimeboot netdev1, power reset, vsp and watched the installer. Hardware DHCP->PXE worked and loaded the installer, but the installer failed on network stuff (like jessie). Installer dmesg shows multiple driver panics in bnx2x. I just copied the end of one crash dump and the start of the next for the metadata here:

[   45.046695] bnx2x: [bnx2x_mc_assert:750(eno1)]Chip Revision: everest3, FW Version: 7_13_1                                                                    
[   45.046696] bnx2x: [bnx2x_panic_dump:1186(eno1)]end crash dump -----------------                                                                             
[   45.046721] bnx2x: [bnx2x_attn_int_deasserted2:4251(eno1)]FATAL HW block attention set2 0x20                                                                 
[   45.046722] bnx2x: [bnx2x_attn_int_deasserted2:4252(eno1)]driver assert      
[   45.046723] bnx2x: [bnx2x_panic_dump:923(eno1)]begin crash dump -----------------
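When comparing hosts it can help to pull the interface name, firmware version, and crash-dump count out of such output. The following is a minimal sketch that assumes the log lines have exactly the shape of the five lines above; the regular expression and the `summarize` helper are illustrative and have not been checked against other bnx2x driver versions.

```python
import re

# The excerpt above, with trailing padding removed.
DMESG = """\
[   45.046695] bnx2x: [bnx2x_mc_assert:750(eno1)]Chip Revision: everest3, FW Version: 7_13_1
[   45.046696] bnx2x: [bnx2x_panic_dump:1186(eno1)]end crash dump -----------------
[   45.046721] bnx2x: [bnx2x_attn_int_deasserted2:4251(eno1)]FATAL HW block attention set2 0x20
[   45.046722] bnx2x: [bnx2x_attn_int_deasserted2:4252(eno1)]driver assert
[   45.046723] bnx2x: [bnx2x_panic_dump:923(eno1)]begin crash dump -----------------
"""

# Hypothetical pattern: timestamp, driver tag, [function:line(interface)], message.
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+bnx2x:\s+"
    r"\[(?P<func>\w+):(?P<line>\d+)\((?P<iface>[^)]+)\)\]"
    r"(?P<msg>.*)"
)


def summarize(text):
    """Count 'begin crash dump' markers per interface and collect FW versions."""
    dumps = {}
    fw_versions = set()
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue
        msg = match["msg"].strip()
        if msg.startswith("begin crash dump"):
            dumps[match["iface"]] = dumps.get(match["iface"], 0) + 1
        fw = re.search(r"FW Version: (\S+)", msg)
        if fw:
            fw_versions.add(fw.group(1))
    return dumps, fw_versions


dumps, fw_versions = summarize(DMESG)
print("crash dumps started per interface:", dumps)
print("firmware versions reported:", sorted(fw_versions))
```

On the excerpt this counts one dump start for eno1, since only the beginning of the second dump was copied into the task.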

I figured that, as a next minimal testing step on lvs1009, I should just go into the ethernet firmware (Ctrl+S) and try disabling SR-IOV and/or HP Shared Memory Features, without any upgrades, if possible. However, the firmware/BIOS levels currently on this host don't work well enough with VSP to do that remotely (it doesn't process arrow keys correctly; you can only drill into the first item of each menu and toggle whatever's there...).

Got arrow keys working in Ctrl-S (thanks @fgiunchedi !) by re-setting the local terminal. There is no "HP Shared Memory Features" prompt in the current NIC firmware to disable. Went ahead and disabled SR-IOV on both cards and tried another netboot, still fails to bring up the interface in the stretch installer as before.

BBlack moved this task from LoadBalancer to Hardware on the Traffic board. Oct 23 2017, 5:53 PM
Cmjohnson lowered the priority of this task from High to Low. Nov 28 2017, 6:22 PM

Change 402859 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] lvs: rename lvs1007 eth interfaces

https://gerrit.wikimedia.org/r/402859

Change 402859 abandoned by Ema:
lvs: rename lvs1007 eth interfaces

Reason:
Fixed in /etc/udev/rules.d/70-persistent-net.rules instead.

https://gerrit.wikimedia.org/r/402859
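The task does not show what was actually written to that rules file. As a rough illustration only, the sketch below renders a MAC-to-name pinning rule in the general style of a persistent-net rules file. The exact match keys used on lvs1007 are an assumption; the single eth0 mapping is the one RobH reported earlier in this task.

```python
# Assumed rule shape; the real file on lvs1007 may use different match keys.
RULE_TEMPLATE = (
    'SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", '
    'ATTR{{address}}=="{mac}", ATTR{{type}}=="1", NAME="{name}"'
)


def persistent_net_rules(mapping):
    """Render one rule per interface name, with MACs lowercased for matching."""
    return "\n".join(
        RULE_TEMPLATE.format(mac=mac.lower(), name=name)
        for name, mac in sorted(mapping.items())
    )


# eth0's MAC is taken from this task; other ports are omitted because
# their name-to-MAC mapping is not stated anywhere in the thread.
rules = persistent_net_rules({"eth0": "EC:B1:D7:7B:C6:D8"})
print(rules)
```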

BBlack closed this task as Declined. Feb 27 2018, 1:37 PM

Gave up on these machines!