Multiple servers in codfw fail to respond to IPMI commands during reimaging
Closed, Resolved · Public

Description

Several mw* servers in codfw failed to reimage. The reimage fails with errors like:

Error: Unable to establish IPMI v2 / RMCP+ session
Error setting Chassis Boot Parameter 0
Error setting Chassis Boot Parameter 4

Affected hosts so far:
mw2086, mw2087, mw2102, mw2148, mw2149, mw2150, mw2151

Running "racadm config -g cfgIpmiLan -o cfgIpmiLanEnable" didn't make a difference, and Papaul also checked; IPMI should be enabled.
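For anyone who wants to test a host before starting a reimage, here is a minimal Python sketch of an IPMI reachability check. It is not what wmf-reimage does internally. The management hostname scheme in `mgmt_fqdn`, the `root` user, and the choice of `chassis power status` as the probe are assumptions for illustration; the error strings are the ones quoted above.

```python
import subprocess

# Error strings quoted in the task description.
KNOWN_IPMI_ERRORS = (
    "Unable to establish IPMI v2 / RMCP+ session",
    "Error setting Chassis Boot Parameter",
)


def mgmt_fqdn(host, site="codfw"):
    # Hypothetical naming scheme for the management interface.
    return f"{host}.mgmt.{site}.wmnet"


def ipmi_command(host, user="root"):
    # -I lanplus selects the IPMI v2 / RMCP+ interface; -E makes ipmitool
    # read the password from the IPMI_PASSWORD environment variable.
    return ["ipmitool", "-I", "lanplus", "-H", mgmt_fqdn(host),
            "-U", user, "-E", "chassis", "power", "status"]


def looks_broken(output):
    # True if the output contains one of the errors seen in this task.
    return any(err in output for err in KNOWN_IPMI_ERRORS)


def check_host(host, timeout=30):
    """Return True if the host answers the probe.

    Needs ipmitool installed and IPMI_PASSWORD set in the environment.
    """
    try:
        proc = subprocess.run(ipmi_command(host), capture_output=True,
                              text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0 and not looks_broken(proc.stdout + proc.stderr)
```

Calling `check_host("mw2148")` before a reimage would flag a host that cannot establish an RMCP+ session without touching its boot parameters.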

Event Timeline

RobH created this task. Aug 11 2016, 4:39 PM
RobH added a subscriber: faidon. Aug 11 2016, 4:43 PM

Additionally, I wasn't aware this reimage script used IPMI.

We had to disable the IPMI on the ilom interfaces across the Dell fleet over a year ago due to some security flaw. (My memory may not be correct, and it could have been 2 years ago.)

It seems now these scripts require IPMI. Are we ok to turn it back on fleet wide?

If I recall correctly, @faidon may have been involved in the discussion to disable it year(s) ago?

Basically we should decide if it can be on for all machines, and then ensure it's applied in a consistent manner.

It seems to be enabled for at least a range of servers, though? Luca has been reimaging several other mw* systems successfully with wmf-reimage (i.e. it worked there).

Joe added a subscriber: Joe. Aug 12 2016, 9:11 AM

Most of the mw* systems have IPMI enabled and always have; I was unaware of this "security flaw" in IPMI, and I honestly haven't seen any discussion about it since I joined 2 years ago.

We should certainly review what the issue is, if one exists, and act accordingly.

Pretty sure @RobH is referring to the IPMI cipher 0 vulnerability.

This was fixed across the fleet at the time by disabling cipher 0 (not by disabling IPMI in general). It was fixed by the vendors too, so servers with newer firmware (servers procured in, say, the last 1-2 years) shouldn't be affected by this vulnerability at all.
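To verify that cipher 0 really is disabled on a given BMC, one option is to parse the text printed by `ipmitool lan print`. The Python sketch below assumes that output contains an "RMCP+ Cipher Suites" line and a "Cipher Suite Priv Max" line, that the n-th privilege character belongs to the n-th listed suite, and that `X` marks a suite as unused. That is my reading of the ipmitool output, not something confirmed in this task, and the sample text is illustrative rather than captured from one of these hosts.

```python
def cipher_privileges(lan_print_output):
    """Map cipher suite ID -> max privilege character.

    Assumes the n-th character of "Cipher Suite Priv Max" applies to the
    n-th entry of the "RMCP+ Cipher Suites" list.
    """
    suites, privs = [], ""
    for line in lan_print_output.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "RMCP+ Cipher Suites":
            suites = [int(s) for s in value.split(",") if s.strip()]
        elif key == "Cipher Suite Priv Max":
            privs = value
    return dict(zip(suites, privs))


def cipher_zero_disabled(lan_print_output):
    # 'X' is taken to mean "unused"; a BMC that does not list suite 0
    # at all is also treated as having it disabled.
    return cipher_privileges(lan_print_output).get(0, "X") == "X"


# Illustrative sample, not real output from a codfw host.
SAMPLE = """\
RMCP+ Cipher Suites     : 0,1,2,3
Cipher Suite Priv Max   : Xaaaaaaaaaaaaaa
"""
```

With the sample above, `cipher_zero_disabled(SAMPLE)` is true because the first privilege character is `X`.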

RobH added a comment. Aug 12 2016, 3:31 PM

> Pretty sure @RobH is referring to the IPMI cipher 0 vulnerability.
> This was fixed across the fleet at the time by disabling cipher 0 (not by disabling IPMI in general). It was fixed by the vendors too, so servers with newer firmware (servers procured in, say, the last 1-2 years) shouldn't be affected by this vulnerability at all.

Yes, that was likely it. Thanks for the feedback! We'll just start ensuring IPMI is enabled on all servers going forward.

Now back to these two particular systems. @Papaul told me on IRC that he has enabled IPMI via their BIOS, but the commands still don't work. I'll double-check his work, and if it all checks out, we likely have some malfunctioning hardware. We'll see shortly.

Status update: mw2088/2089 (which are identical hardware) worked fine and I re-tried mw2087, but still to no avail. Maybe this is limited to a few hosts after all.

MoritzMuehlenhoff renamed this task from "mw2086 & mw2087 do not respond to IPMI commands" to "Multiple servers in codfw fail to respond to IPMI commands during reimaging". Aug 30 2016, 9:30 AM
MoritzMuehlenhoff updated the task description.

@Papaul: Given that this is affecting seven hosts as of now, this is likely not a hardware error, but probably some kind of configuration setting we're missing. Do we have the possibility to open a support ticket at Dell about this?

@MoritzMuehlenhoff I can open a support ticket with Dell.

@Papaul : mw2088 is working fine, mw2148 is not working. Both are powered off and depooled.

@MoritzMuehlenhoff Thanks, will start working on it.

@MoritzMuehlenhoff please see below for the comparison table between the 2 systems. The only things I can see that could cause this problem are:
1 - IDRAC IPMI over LAN is not enabled on mw2148
2 - IDRAC firmware version needs to be updated on mw2148

I will need a third non-working host to confirm the settings.

Options | mw2088 | mw2148
BIOS version | 1.4.6 | 2.3.3
Memory Mapped I/O above 4 GB | disable | enable
IDRAC settings version | 1.30 | 1.60
IDRAC Firmware Version | 2.21 | 1.57
IDRAC IPMI Over LAN | enable | disable
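The reason for wanting a third non-working host can be sketched in a few lines of Python: a setting is only a plausible cause if every broken host disagrees with the working one on it. The mw2088 and mw2148 values are taken from the comparison above (the firmware versions assume the fused table cells split as 2.21 and 1.57). The third host and its values are hypothetical and only show how the list of candidates shrinks.

```python
def candidate_causes(working, broken_hosts):
    """Settings on which every broken host differs from the working host."""
    return sorted(
        key for key, good in working.items()
        if broken_hosts and all(h.get(key) != good for h in broken_hosts)
    )


# Values from the comparison table (version split is an assumption).
mw2088 = {
    "Memory Mapped I/O above 4 GB": "disable",
    "IDRAC Firmware Version": "2.21",
    "IDRAC IPMI Over LAN": "enable",
}
mw2148 = {
    "Memory Mapped I/O above 4 GB": "enable",
    "IDRAC Firmware Version": "1.57",
    "IDRAC IPMI Over LAN": "disable",
}
# Hypothetical third broken host, invented for illustration only.
third_broken = {
    "Memory Mapped I/O above 4 GB": "disable",
    "IDRAC Firmware Version": "2.21",
    "IDRAC IPMI Over LAN": "disable",
}

one_host = candidate_causes(mw2088, [mw2148])
two_hosts = candidate_causes(mw2088, [mw2148, third_broken])
```

With only mw2148, all three settings remain candidates; adding a second broken host that matches mw2088 on firmware and memory mapping would leave just "IDRAC IPMI Over LAN".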

Where is that "IDRAC IPMI Over LAN" setting coming from, in idrac or some other config tool? Can you enable "IDRAC IPMI over LAN", so that I can test whether that fixes it? If not, we can test a firmware upgrade in a followup step.

The left window is mw2088 and the right window is mw2148. I enabled the setting on mw2148.

Thanks, I'll try an install on mw2148. If it doesn't work, we can narrow that down further with a third host.

@Papaul, that setting fixed it, mw2148 could now be reimaged. During the reimaging we'll likely run into further servers with the same problem, so let's use this ticket to track/fix them.

@Papaul: Do the servers need to be powered down to change the setting or can you change that for running hosts as well?

@Papaul, @Cmjohnson : I guess you have some kind of checklist for racking new hardware? Could you add an item to check that new servers always have "IDRAC IPMI Over LAN" enabled?

Papaul added a comment. Sep 1 2016, 2:14 PM

@MoritzMuehlenhoff I think we need to power down the hosts. Also, to be on the safe side and make sure that the settings are saved and applied by the hosts, I recommend that we do a reboot after changing the setting.

Yes, I can add enabling IPMI over LAN to the checklist for new servers.

Mentioned in SAL [2016-09-01T14:40:43Z] <moritzm> powered down several hosts for hardware maintenance (T142726): mw2087, mw2149-mw2151

Mentioned in SAL [2016-09-01T14:58:33Z] <moritzm> powered down several hosts for hardware maintenance (T142726): mw2099, mw2102, mw2117, mw2163-mw2199

Hi Papaul, first batch: These are depooled from the cluster and powered down:

mw2087
mw2099
mw2102
mw2117
mw2149-mw2151
mw2163-mw2199
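Batch lists like the one above mix single hosts with ranges such as mw2163-mw2199. A small Python helper can expand them into individual hostnames for scripting. This is an illustrative sketch; `expand_hosts` is a made-up name and is not part of wmf-reimage or any existing tooling.

```python
import re

# A host ("mw2087") or a range whose end repeats the prefix ("mw2149-mw2151").
_HOST_OR_RANGE = re.compile(r"^([a-z]+)(\d+)(?:-\1(\d+))?$")


def expand_hosts(spec):
    """Expand whitespace- or comma-separated hosts and ranges into a list."""
    hosts = []
    for item in re.split(r"[,\s]+", spec.strip()):
        if not item:
            continue
        match = _HOST_OR_RANGE.match(item)
        if match is None:
            raise ValueError(f"cannot parse host spec: {item!r}")
        prefix, start = match.group(1), match.group(2)
        end = match.group(3) or start
        if int(end) < int(start):
            raise ValueError(f"descending range: {item!r}")
        width = len(start)  # keep any zero padding from the start number
        hosts.extend(f"{prefix}{n:0{width}d}"
                     for n in range(int(start), int(end) + 1))
    return hosts


first_batch = expand_hosts(
    "mw2087 mw2099 mw2102 mw2117 mw2149-mw2151 mw2163-mw2199"
)
```

The same helper accepts the comma-separated form used in the task description.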

Mentioned in SAL [2016-09-06T07:40:25Z] <moritzm> shutting down mw2140-mw2214 for hardware maintenance (T142726)

Mentioned in SAL [2016-09-06T07:41:30Z] <moritzm> correction: shutting down mw2140-mw2147 and mw2200-mw2214 for hardware maintenance (T142726)

Mentioned in SAL [2016-09-06T09:12:27Z] <moritzm> shutting down mw2153-mw2162 for hardware maintenance (T142726)

Here's the second batch of machines to fix. They are depooled from the cluster and powered down. Please fix the IPMI setting and power them up when done:

mw2140-mw2147
mw2200-mw2214
mw2153-mw2162

And could you please double-check the setting on mw2170? It still failed to accept IPMI commands, while all the others from the first batch worked fine. Maybe it needs a firmware update similar to mw2087.

Papaul added a comment. Sep 6 2016, 2:57 PM

I checked the settings on mw2170 and disabled and re-enabled the IPMI over LAN setting; you can try again.

@MoritzMuehlenhoff second batch complete.

Mentioned in SAL [2016-09-07T14:16:40Z] <moritzm> shutting down mw2120-mw2139 for hardware maintenance (T142726)

Mentioned in SAL [2016-09-07T15:02:09Z] <moritzm> shutting down mw2080-mw2085 for hardware maintenance (T142726)

Thanks, third batch:
mw2120-mw2139
mw2080-mw2085

I've re-enabled mw2080, please skip that one; we can deal with it on Friday, when there's no deployment. So just mw2120-mw2139 and mw2081-mw2085.

Mentioned in SAL [2016-09-08T13:43:15Z] <moritzm> powering down mw2075-mw2079 for hardware maintenance (T142726)

We're closing in, fourth batch: mw2075-mw2079

Papaul added a comment. Sep 8 2016, 2:53 PM

Complete. IPMI over LAN was already enabled on all 5 hosts.

@Papaul would you mind checking mw2075? I wasn't able to run wmf-reimage because of IPMI errors :(

Papaul added a comment. Sep 9 2016, 1:59 PM

IPMI over LAN is enabled on mw2075.

MoritzMuehlenhoff closed this task as Resolved. Sep 12 2016, 2:52 PM
MoritzMuehlenhoff claimed this task.

All mw* systems in codfw have been fixed by Papaul, closing the ticket.