
Confirm support of PERC 750 raid controller
Closed, Resolved · Public

Description

Dell is forcing an upgrade off the older PERC H730 controller to the new H750 controller.

https://www.dell.com/en-sg/work/shop/perc-h750-adapter-low-profile-full-height/apd/405-abce

In the past, when we shifted to the H740, we had driver issues. Since we'll have to update to the H750, we should have driver support confirmed.

LINUX PERCCLI Utility For All Dell HBA/PERC Controllers: https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=j91yg

Testing Details

2022-04-21: Rob successfully tested on dumpsdata1007 with "PercCli SAS Customization Utility Ver 007.1910.0000.0000 Oct 08, 2021" on kernel 5.16.11-1~bpo11+1. A RAID1 OS array was set up for installation, had a disk removed while the OS was live (after being set to offline), and then had it reinserted and auto-rebuilt back to optimal status. The cadence is to set the disk to offline, which now handles missing/spindown, so there is no need to do those steps separately.
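For reference, the cadence above can be sketched as a short shell sequence. This is a hypothetical helper, not a supported procedure: the controller (/c0), enclosure (64), and slot (13) IDs are taken from the dumpsdata1007 output later in this task, and the script is dry-run by default (it echoes the commands) so it can be reviewed before being pointed at real hardware.

```shell
# Dry-run sketch of the disk-replacement test cadence.
# Set PERCCLI=perccli64 on a real host to execute for real.
PERCCLI="${PERCCLI:-echo perccli64}"
CTL=/c0     # controller 0 (single-controller host assumed)
ENC=64      # enclosure ID, as seen in 'perccli64 /c0/eall show'
SLOT=13     # slot of the disk under test

OUT=$(
  $PERCCLI $CTL/e$ENC/s$SLOT set offline   # fail the disk out of the array
  # ...physically pull and re-seat the drive at this point...
  $PERCCLI $CTL/e$ENC/s$SLOT show          # confirm the controller sees it again
  $PERCCLI $CTL/dall show                  # watch the array go Rbld -> Optl
)
echo "$OUT"
```

With the default dry run this only prints the three perccli64 invocations; on current firmware the `set offline` step replaces the separate missing/spindown steps, per the note above.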

Related Objects

Status | Subtype | Assigned | Task
Resolved MoritzMuehlenhoff
ResolvedJgreen
Resolved Cmjohnson
ResolvedPapaul
ResolvedMarostegui
ResolvedMarostegui
ResolvedRequestPapaul
ResolvedRequestPapaul
ResolvedRequestPapaul
ResolvedRequestPapaul
ResolvedRequestPapaul
ResolvedRequestPapaul
ResolvedRequestPapaul
ResolvedRequestPapaul
ResolvedRequestPapaul
ResolvedRequestPapaul
ResolvedRequestPapaul
ResolvedRequestPapaul
ResolvedRequestPapaul
ResolvedRequestPapaul
ResolvedRequestPapaul
ResolvedRequestPapaul
ResolvedRequestPapaul
ResolvedMarostegui
ResolvedRequestPapaul
ResolvedRequestPapaul
ResolvedRequestPapaul
ResolvedRequestPapaul
Resolved Cmjohnson
DuplicateMarostegui
ResolvedMarostegui
ResolvedMarostegui
ResolvedMarostegui
ResolvedMarostegui
ResolvedMarostegui
ResolvedMarostegui
Resolved Cmjohnson
Resolved Cmjohnson
Resolved Cmjohnson
ResolvedBTullis
Resolved MoritzMuehlenhoff
ResolvedArielGlenn
ResolvedRobH
ResolvedRobH
Resolved MoritzMuehlenhoff

Event Timeline


dumpsdata1007 is now running 5.16.11, can you please retest?

I'm not familiar with perccli myself; if there are any uncertainties with the docs/setup, let's clarify with Dell?

@RobH - Sorry to hijack this thread. Do you happen to know whether the PERC H750 will support JBOD mode?
In the past we've had to use single-drive RAID0 volumes to get the same effect. It works, but it's a bit of extra overhead, so JBOD support could bring a substantial benefit.
Thanks.

RobH mentioned this in Unknown Object (Task). Mar 22 2022, 7:50 PM
RobH mentioned this in Unknown Object (Task).
RobH mentioned this in Unknown Object (Task).
RobH mentioned this in Unknown Object (Task).
RobH mentioned this in Unknown Object (Task). Mar 22 2022, 8:20 PM
RobH mentioned this in Unknown Object (Task).
RobH changed the task status from Open to In Progress. Mar 23 2022, 5:53 PM

dumpsdata1007 is now running 5.16.11, can you please retest?

I'm not familiar with perccli myself; if there are any uncertainties with the docs/setup, let's clarify with Dell?

Now I'm not seeing the controller at all, where before it would see it:

robh@dumpsdata1007:~$ perccli show
CLI Version = 007.0529.0000.0000 Sep 18, 2018
Operating system = Linux 5.16.0-0.bpo.3-amd64
Status Code = 0
Status = Success
Description = None

Number of Controllers = 0
Host Name = dumpsdata1007
Operating System  = Linux 5.16.0-0.bpo.3-amd64
StoreLib IT Version = 07.0600.0200.0600
StoreLib IR3 Version = 16.03-0


robh@dumpsdata1007:~$ perccli /c0/dall show
CLI Version = 007.0529.0000.0000 Sep 18, 2018
Operating system = Linux 5.16.0-0.bpo.3-amd64
Controller = 0
Status = Failure
Description = Controller 0 not found

So I guess this kernel change broke it entirely?

So I guess this kernel change broke it entirely?

No, you were using the wrong command :-)

"perccli" is a 32-bit binary, and while it can be executed by our amd64 setup, I think it's looking in incorrect runtime paths. You need to use "perccli64" instead; then the controller is visible for me.
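The 32-bit/64-bit mixup is easy to confirm with `file`, which prints the ELF class of a binary. A small sketch; /bin/sh stands in for the perccli binaries so it runs anywhere, and on the affected host you would point it at `$(command -v perccli)` and `$(command -v perccli64)` instead.

```shell
# Print the ELF class of a binary; a 32-bit perccli on an amd64 host
# would report "ELF 32-bit", which explains the empty controller list.
elf_class() {
  file -L "$1"   # -L follows symlinks to the real binary
}

# On dumpsdata1007: elf_class "$(command -v perccli)"
#                   elf_class "$(command -v perccli64)"
CLASS=$(elf_class /bin/sh)   # stand-in so the sketch is runnable here
echo "$CLASS"
```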

RAID testing.

I can poll the controller for basic info (perccli64 /c0/dall show), get BBU info (perccli64 /c0/bbu show all), and show a single disk group (perccli64 /c0/d0 show).

I can't work out how to poll for the virtual disk IDs from the reference guide; it assumes you already know them.

root@dumpsdata1007:~# perccli64 /c0/dall show
CLI Version = 007.0529.0000.0000 Sep 18, 2018
Operating system = Linux 5.16.0-0.bpo.3-amd64
Controller = 0
Status = Success
Description = Show Diskgroup Succeeded


TOPOLOGY :
========

-----------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type  State BT       Size PDC  PI SED DS3  FSpace TR 
-----------------------------------------------------------------------------
 0 -   -   -        -   RAID1 Optl  N  446.625 GB dflt N  N   dflt N      N  
 0 0   -   -        -   RAID1 Optl  N  446.625 GB dflt N  N   dflt N      N  
 0 0   0   64:12    0   DRIVE Onln  N  446.625 GB dflt N  N   dflt -      N  
 0 0   1   64:13    4   DRIVE Onln  N  446.625 GB dflt N  N   dflt -      N  
-----------------------------------------------------------------------------

DG=Disk Group Index|Arr=Array Index|Row=Row Index|EID=Enclosure Device ID
DID=Device ID|Type=Drive Type|Onln=Online|Rbld=Rebuild|Dgrd=Degraded
Pdgd=Partially degraded|Offln=Offline|BT=Background Task Active
PDC=PD Cache|PI=Protection Info|SED=Self Encrypting Drive|Frgn=Foreign
DS3=Dimmer Switch 3|dflt=Default|Msng=Missing|FSpace=Free Space Present
TR=Transport Ready




root@dumpsdata1007:~# perccli64 /c0/v0 show
CLI Version = 007.0529.0000.0000 Sep 18, 2018
Operating system = Linux 5.16.0-0.bpo.3-amd64
Controller = 0
Status = Failure
Description = None

Detailed Status :
===============

----------------------------------
VD Status ErrCd ErrMsg            
----------------------------------
 0 Failed   255 Invalid Vd number 
----------------------------------



root@dumpsdata1007:~#
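As an aside, the TOPOLOGY table above screen-scrapes cleanly, which matters later for monitoring. A sketch with the dumpsdata1007 rows inlined as sample input (slot 64:13 flipped to Offln to exercise the check); a real check would read `perccli64 /c0/dall show` directly:

```shell
# List DRIVE rows in a '/c0/dall show' dump that are not Onln.
# Sample rows are from the dumpsdata1007 output in this task, with
# slot 64:13 flipped to Offln for demonstration.
SAMPLE='
 0 -   -   -        -   RAID1 Dgrd  N  446.625 GB dflt N  N   dflt N      N
 0 0   -   -        -   RAID1 Dgrd  N  446.625 GB dflt N  N   dflt N      N
 0 0   0   64:12    0   DRIVE Onln  N  446.625 GB dflt N  N   dflt -      N
 0 0   1   64:13    4   DRIVE Offln N  446.625 GB dflt N  N   dflt -      N
'
BAD=$(printf '%s\n' "$SAMPLE" | awk '$6 == "DRIVE" && $7 != "Onln" {print $4}')
echo "drives not online: ${BAD:-none}"
```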

I'm unable to get the disk to go into missing so I can spin it down, spin it back up, and set it to returned, to test rebuilding an array. I can set it to offline, and that's about it.

Also, I'm unable to determine how to poll for virtual disk IDs other than dropping into the RAID BIOS, which won't work for production. I need to keep hammering away at this, but I could use some help if anyone is familiar at all with perccli64.

Also, I'm unable to determine how to poll for virtual disk IDs other than dropping into the RAID BIOS, which won't work for production. I need to keep hammering away at this, but I could use some help if anyone is familiar at all with perccli64.

Can't we reach out to Dell? I mean, they are the ones who discontinued our existing, perfectly fine working controller, and they are the ones who want to sell us something...

RobH added a subscriber: Papaul.

Ok, I dug in some more and I've gotten some success, but not enough. I'm wondering if @Papaul may have some time to review this as well. I've attached a copy of the Dell guide I downloaded.

I can do the following successfully on dumpsdata1007:

perccli64 show
perccli64 /c0/dall show #shows all raid disks info
perccli64 /c0/vall show #shows all virtual drive info
perccli64 /c0/eall show #shows all enclosure info
perccli64 /c0/e64/s13 set offline # works to set a drive offline

Then I run into issues attempting to set the disk to missing, so it can be safely removed for a (simulated) swap with the spindown command:

perccli64 /c0/e64/s13 set missing # does not work to set a drive missing
perccli64 /c0/e64/s13 spindown # cannot get it to spin down or spinup

If Papaul cannot see what I'm missing (he has worked with some more troublesome configs in networking, so he may spot something here), then I'll have to schedule some time with the Dell team for support.

@Papaul, do you have some time this week to take a glance at dumpsdata1007 and see if you can figure out how to use the command line to set a disk to missing for removal, spin it down, spin it back up, mark it as available again, and add it back into the RAID array? We need to see the array in a degraded/failed-disk state and then return it to a normal state.

@RobH I took a quick look at this yesterday, no luck. Since it is a new product, I recommend getting Dell's help; maybe this will save us time.

Emailed our Dell team with our issues, will update as they respond.

Dell Team,

We're currently attempting to get the new RAID controllers to function for us so we can unblock orders of a few dozen (or more) servers using them. However, I'm running into issues attempting to set a drive to missing so I can spin it down, remove it, and replace it in a simulated disk failure.

Before we push this controller into production, we have to be able to successfully do the following:

  • set a disk to offline/missing, spin it down, remove it, add it back in, spin it back up, set it online, add it back into the array, and rebuild the array
  • create a RAID[1|6|10] array via the command line

I'm stuck from the "set it to missing" step onward in the 'failed disk' replacement test.

I've attached a text file of my output with info on the host and the working and failed commands. Can we get this escalated so someone can work with me to figure out what I'm doing wrong? I'm hoping that, with the info-gathering commands in the text file, you'll be able to spot the error in my command syntax and offer correct command-line examples.

Thanks in advance!
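For the array-creation requirement above, the perccli64 grammar (inherited from storcli) is along these lines. A hedged sketch: the exact type/drive arguments should be verified against the PERCCLI reference guide, and it is dry-run by default (it echoes the commands rather than running them).

```shell
# Dry-run sketch of command-line array creation with perccli64.
# Set PERCCLI=perccli64 to execute on real hardware; verify the
# storcli-style syntax against the PERCCLI reference first.
PERCCLI="${PERCCLI:-echo perccli64}"

OUT=$(
  $PERCCLI /c0 add vd type=raid1 drives=64:12,64:13            # RAID1 from two slots
  $PERCCLI /c0 add vd type=raid10 drives=64:0-11 pdperarray=6  # RAID10 across twelve
  $PERCCLI /c0/vall show                                       # list resulting VDs
)
echo "$OUT"
```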

RobH changed the task status from In Progress to Open. Mar 30 2022, 5:18 PM

@MoritzMuehlenhoff.

Dell states they want us to upgrade to the newest revision of the utility before they will offer support. I'm pushing back, but they say we should be using:

Version 7.1623.0.0, A11, Release date: 9/10/2021.
LINUX PERCCLI Utility For All Dell HBA/PERC Controllers: https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=j91yg
Release notes: https://dl.dell.com/FOLDER07575976M/1/PERCCLI_7.1623.00_A11_Release_Notes.txt

Can we get this put on dumpsdata1007 for testing? (I don't want to simply load it up without any kind of secondary security review.)

This was updated, same issue on dumpsdata1007 and sent info to our Dell team.

Dell suggested some alternate arguments for the command-line utility that didn't work, and then requested we open a case so they can escalate.

Service Request 1090168698

Sent case # to our team to escalate.

RobH added a subtask: Unknown Object (Task). Apr 12 2022, 7:13 PM

Summary update from multiple out of band support emails and conversations with our Dell account team:

  • confirmed that the missing/spin-down commands don't work for them either; the chipset manufacturer is now involved to determine when and why the feature was disabled/removed
  • Dell is aware of our test workflow and our need to do the following:
    • use perccli to add/remove/swap failed disks in an array
    • use perccli to set up new RAID arrays and edit existing ones in RAID[1|6|10]
    • use perccli commands to monitor RAID health

They'll attempt to get an update to us ASAP.

Current plan is to escalate all pending H750 orders by end of this month.

Ok, next steps:

I've set the disk's identify LED to flash with: perccli64 /c0/e64/s13 start locate

root@dumpsdata1007:~# perccli64 /c0/e64/s13 start locate
CLI Version = 007.1910.0000.0000 Oct 08, 2021
Operating system = Linux 5.16.0-0.bpo.3-amd64
Controller = 0
Status = Success
Description = Start Drive Locate Succeeded.

One of the two SSDs should now be flashing; it should be 'sdb' of the two. You can pop it out and ensure the host is still online (or ping me to do so), and once we confirm it stays online during the disk removal, push the disk back into the host and I'll take over rebuilding the array.

perccli64 /c0 show all also shows a physical disk list; we'll want to run it to confirm the controller sees the disk gone when removed.

RobH added a subscriber: Cmjohnson.

Update: Chris pulled the offline SSD and I confirmed the OS saw it go away; after 5 minutes he put it back into place and the system detected it and started an automatic rebuild of the array. Once that completes successfully, we'll consider a huge blocker gone for this distro/version combination and can expand testing to others.

RobH updated the task description.

Todo:

  • test on other distros we use
  • get partman to work with this, as our existing recipes expect the flexbays to be the sda virtual drive, while the new controller always puts them at a higher ID so they show up after the HDDs
  • get monitoring updated to leverage perccli64 for the Icinga RAID-health monitor
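For the monitoring to-do, a minimal Icinga-style probe only needs to assert that every virtual drive is in the Optl state. This is a hypothetical sketch, not the check that was eventually merged; the sample dump is inlined (with one VD degraded) so it runs anywhere, where a real probe would use `perccli64 /c0/vall show`:

```shell
# Icinga-style check: 0/OK if all VDs are Optl, 2/CRITICAL otherwise.
# VDLIST would normally be: VDLIST=$(perccli64 /c0/vall show)
VDLIST='
0/0   RAID1  Optl  RW     Yes     RWBD  -   ON  446.625 GB
1/1   RAID10 Dgrd  RW     Yes     RWBD  -   ON   10.913 TB
'
NOTOK=$(printf '%s\n' "$VDLIST" | awk '$2 ~ /^RAID/ && $3 != "Optl" {print $1}')
if [ -n "$NOTOK" ]; then
  echo "CRITICAL: VD(s) not optimal: $NOTOK"
  STATUS=2
else
  echo "OK: all virtual drives optimal"
  STATUS=0
fi
```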

That's promising progress! I have rebooted dumpsdata1007 back into Linux 5.10. This is the standard kernel we're running on Bullseye (we can ignore Buster/Stretch for testing the PERC 750 controller). Can you please re-test the RAID setup and failure scenario?

The "get monitoring updated to leverage perccli64 to use for icinga monitor of raid health" item can be handled by SRE IF; simply create a separate Phab task for it once we've established that the PERC 750 works for us.

RobH changed the status of subtask Unknown Object (Task) from Stalled to Open. Apr 25 2022, 9:32 PM
RobH changed the status of subtask Unknown Object (Task) from Stalled to Open.
RobH changed the status of subtask Unknown Object (Task) from Stalled to Open. Apr 26 2022, 9:37 PM

I am seeing the procurement tasks being processed already, does that mean we have established that this controller will work 100% for us then?

I am seeing the procurement tasks being processed already, does that mean we have established that this controller will work 100% for us then?

I think Rob's tests with 5.16 and 5.10 were fine, so that means it's supported by the default Bullseye kernel. It's unlikely to be supported by older OSes (which weren't tested), but we decided not to care about them.

There are still various followups. At least:

  • there's some work needed for Partman if I'm not mistaken (I lack the details, but Rob mentioned it on IRC)
  • we need to figure out with Legal if we can add perccli64 to apt.wikimedia.org (or failing that we might need to consider an additional internal repo)
  • perccli64 needs a deb package
  • we need to adapt monitoring to also support perccli64 in the RAID checks (and some Puppet integration like the raid Puppet fact)

But given the long lead times of server purchases I'd be cautiously optimistic that all of this is done when the servers arrive.
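On the raid-fact point, detecting whether a host needs perccli64 at all comes down to matching the controller in the PCI device list. A sketch with sample `lspci` output inlined; the exact ID string for the H750 (a Broadcom/LSI MegaRAID SAS39xx part) is an assumption here and should be checked against a real host:

```shell
# Sketch: decide whether to install perccli64 by matching lspci output.
# LSPCI would normally be: LSPCI=$(lspci)
# The device string below is an assumed example, not verified output.
LSPCI='
65:00.0 RAID bus controller: Broadcom / LSI MegaRAID 12GSAS/PCIe Secure SAS39xx (rev 01)
'
if printf '%s\n' "$LSPCI" | grep -q 'MegaRAID.*SAS39'; then
  TOOL="perccli"
else
  TOOL="none"
fi
echo "raid_mgmt_tool=$TOOL"
```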

RobH changed the status of subtask Unknown Object (Task) from Stalled to Open. Apr 29 2022, 7:08 PM
RobH added a subtask: Restricted Task. May 3 2022, 5:21 PM
RobH closed subtask Restricted Task as Declined. May 12 2022, 8:06 PM
Jclark-ctr closed subtask Unknown Object (Task) as Resolved. May 17 2022, 6:04 PM
Jclark-ctr closed subtask Unknown Object (Task) as Resolved. May 17 2022, 6:08 PM
Jclark-ctr closed subtask Unknown Object (Task) as Resolved. May 17 2022, 7:41 PM
Papaul closed subtask Unknown Object (Task) as Resolved. May 23 2022, 3:29 PM
Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Jun 2 2022, 11:52 PM

@jbond note to self, look at extending raid fact to support new controller

Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Jun 28 2022, 9:23 PM

Change 809602 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update the partman recipe for use with the new stat servers

https://gerrit.wikimedia.org/r/809602

Change 809602 merged by Btullis:

[operations/puppet@production] Update the partman recipe for use with the new stat servers

https://gerrit.wikimedia.org/r/809602

I have been doing some testing of an install of a server with an H750 card under ticket T307399 recently.

One thing that I have ascertained, with the help of @fgiunchedi, is that the swapping of /dev/sda and /dev/sdb can be reversed, so we don't need to make a lot of custom partman recipes for these hosts with H750 cards.

The key bit of information is this:

[screenshot from the PERC11 User Guide]

Which is from this page of the PERC11 User Guide.

I deleted both the RAID1 and the RAID10 virtual drives, then recreated them in the order:

  1. RAID10
  2. RAID1

Now when the server boots, the SFF RAID1 drives are detected as /dev/sda and the LFF RAID10 drives are detected as /dev/sdb.

I can add some information to https://wikitech.wikimedia.org/wiki/Raid_setup, but I thought I'd mention it here first, because I know that some people are now installing these servers and running up against the drive-ordering issue.
cc @RobH and @Andrew and, I think, @Papaul.

Change 809641 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] testing h750 recipes

https://gerrit.wikimedia.org/r/809641

Change 809641 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] testing h750 recipes

https://gerrit.wikimedia.org/r/809641

Bah, this was meant to reference the dumpsdata install, not this task; my bad.

So after the dumpsdata1007 install, Puppet fails due to the megaraid monitoring items, it seems?

So after the dumpsdata1007 install, Puppet fails due to the megaraid monitoring items, it seems?

That's expected; we still need to adapt the "raid" fact in Puppet so that it installs perccli (but for that we needed a running system with a PERC controller, so we can figure out the device names that allow Puppet to detect the controller). Just leave the system in that state and we'll use dumpsdata1007 for it?

So after the dumpsdata1007 install, Puppet fails due to the megaraid monitoring items, it seems?

That's expected; we still need to adapt the "raid" fact in Puppet so that it installs perccli (but for that we needed a running system with a PERC controller, so we can figure out the device names that allow Puppet to detect the controller). Just leave the system in that state and we'll use dumpsdata1007 for it?

Works for me, I'll put this comment reference on the setup task there. Thanks!

Change 809913 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Extend custom raid fact to support Perc 750

https://gerrit.wikimedia.org/r/809913

Papaul closed subtask Unknown Object (Task) as Resolved. Jun 30 2022, 12:14 PM

So I think this is now on Moritz to roll out the monitoring changes (as he is on the above patchset) and it's no longer blocked on my testing. I'm assigning this over to him until that rolls live; then it can either come back to me for review of any other pending RAID issues or be resolved entirely.

There was one more issue to address with these servers, which (thanks once again to @fgiunchedi) we have now identified and overcome.

It was related to the enumeration/ordering of the devices: when the server is configured to boot from the RAID controller, which logical disk would be selected for boot.

After deleting and re-creating the logical disks in reverse order (T297913#8037638), it seems that the first disk created was being selected. That corresponds to /dev/sdb as far as the operating system is concerned, which is not where GRUB was being installed.

We pressed F2 to get into System Setup and then selected the following options:

-> Device Settings
-> -> RAID Controller in Slot 6: Dell PERC H750 Adapter Configuration Utility
-> -> -> Main Menu
-> -> -> -> Controller Management
-> -> -> -> -> Select Boot Device

At this point we could choose which device would be presented to the OS as the boot device, and we changed the default setting to the RAID1 SFF drives.

[screenshot of the Select Boot Device screen]

I'll update https://wikitech.wikimedia.org/wiki/Raid_setup with this information. I don't know if this is something that could be done with the automatic setup or whether it would have to be a manual step.

Interestingly, I couldn't find a way to configure this setting from the HTTPS user interface.


We also had to change the default boot sequence to use the HDD first, although this is a less obscure setting.

[screenshot of the boot sequence setting]

Change 811667 had a related patch set uploaded (by Slyngshede; author: Slyngshede):

[operations/puppet@production] Add Nagios script for monitoring the Dell PERC RAID controller.

https://gerrit.wikimedia.org/r/811667

Change 812250 had a related patch set uploaded (by Slyngshede; author: Slyngshede):

[operations/puppet@production] c:raid::perccli add PowerEdge RAID Controller monitoring to Icinga.

https://gerrit.wikimedia.org/r/812250

Change 811667 merged by Slyngshede:

[operations/puppet@production] Add Nagios script for monitoring the Dell PERC RAID controller.

https://gerrit.wikimedia.org/r/811667

Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Jul 12 2022, 12:37 PM
RobH reopened subtask Unknown Object (Task) as Open. Jul 18 2022, 4:52 PM

Change 809913 abandoned by Muehlenhoff:

[operations/puppet@production] Extend custom raid fact to support Perc 750

Reason:

Obsoleted by the new raid_mgmt_tools fact

https://gerrit.wikimedia.org/r/809913

Change 809913 restored by Slyngshede:

[operations/puppet@production] Extend custom raid fact to support Perc 750

https://gerrit.wikimedia.org/r/809913

Change 809913 abandoned by Slyngshede:

[operations/puppet@production] Extend custom raid fact to support Perc 750

Reason:

https://gerrit.wikimedia.org/r/809913

Change 825728 had a related patch set uploaded (by Slyngshede; author: Slyngshede):

[operations/puppet@production] c:raid::perccli add PowerEdge RAID Controller monitoring to Icinga.

https://gerrit.wikimedia.org/r/825728

Change 812250 abandoned by Slyngshede:

[operations/puppet@production] c:raid::perccli add PowerEdge RAID Controller monitoring to Icinga.

Reason:

See: 825728

https://gerrit.wikimedia.org/r/812250

Change 825728 merged by Slyngshede:

[operations/puppet@production] c:raid::perccli add PowerEdge RAID Controller monitoring to Icinga.

https://gerrit.wikimedia.org/r/825728

Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Dec 6 2022, 7:08 PM

We've been using this controller for quite a while now; closing the task.