
Rebuild raids on labvirt1019 and 1020
Closed, Duplicate · Public

Description

labvirt1019 and labvirt1020 should each have one big hardware RAID 10 of 8+ TB. Right now their RAIDs are smaller, seemingly made up of only some of the available SSDs.

I would rebuild these myself but the remote raid tool hangs when I try to remove the existing logical drive.

I'm happy to do the OS installation &c. once the drives are set up.

Related Objects

Status      Assigned    Task
Resolved    chasemp
Duplicate   Cmjohnson

Event Timeline

Andrew triaged this task as Normal priority.Feb 14 2018, 8:26 PM
Andrew created this task.
Andrew added a project: ops-eqiad.
Andrew added a comment.EditedFeb 14 2018, 8:39 PM
=> ctrl slot=0 pd all show status

   physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 1.6 TB): OK
   physicaldrive 1I:1:2 (port 1I:box 1:bay 2, 1.6 TB): OK
   physicaldrive 1I:1:3 (port 1I:box 1:bay 3, 1.6 TB): OK
   physicaldrive 1I:1:4 (port 1I:box 1:bay 4, 1.6 TB): OK
   physicaldrive 2I:1:5 (port 2I:box 1:bay 5, 1.6 TB): OK
   physicaldrive 2I:1:6 (port 2I:box 1:bay 6, 1.6 TB): OK
   physicaldrive 2I:1:7 (port 2I:box 1:bay 7, 1.6 TB): OK
   physicaldrive 2I:1:8 (port 2I:box 1:bay 8, 1.6 TB): OK

That's eight 1.6 TB drives.

And yet... the system claims that it's using all 8 drives and only getting 5.8 TB.

=> ctrl slot=0 ld 1 show

Smart Array P440ar in Slot 0 (Embedded)

   Array A

      Logical Drive: 1
         Size: 5.8 TB
         Fault Tolerance: 1+0
         Heads: 255
         Sectors Per Track: 32
         Cylinders: 65535
         Strip Size: 256 KB
         Full Stripe Size: 1024 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Disabled
         Unique Identifier: 600508B1001C5FE68316F9D37116FC1D
         Disk Name: /dev/sda 
         Mount Points: None
         Logical Drive Label: 0210FE25PDNLH0BRH8227EAF39
         Mirror Group 1:
            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA SSD, 1.6 TB, OK)
            physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA SSD, 1.6 TB, OK)
            physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA SSD, 1.6 TB, OK)
            physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SATA SSD, 1.6 TB, OK)
         Mirror Group 2:
            physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SATA SSD, 1.6 TB, OK)
            physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SATA SSD, 1.6 TB, OK)
            physicaldrive 2I:1:7 (port 2I:box 1:bay 7, SATA SSD, 1.6 TB, OK)
            physicaldrive 2I:1:8 (port 2I:box 1:bay 8, SATA SSD, 1.6 TB, OK)
         Drive Type: Data
         LD Acceleration Method: SSD Smart Path
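
For reference, the reported size can be checked against plain RAID 1+0 arithmetic. The sketch below assumes two-way mirroring, drives of exactly 1.6 decimal TB, and that the controller tool prints binary units labelled "TB"; none of those assumptions is confirmed in this task, and the function name is made up for illustration.

```python
# Usable capacity of a RAID 1+0 array: half the raw capacity, because every
# block is stored on two drives. Illustrative only; names are hypothetical.

TB = 10**12    # decimal terabyte, as printed on a drive label
TIB = 2**40    # binary tebibyte, which some tools also label "TB"

DRIVE_BYTES = 1_600_000_000_000  # assumed: exactly 1.6 decimal TB per SSD


def raid10_usable_bytes(drive_count: int, drive_bytes: int) -> float:
    """Return usable bytes for RAID 1+0 with two-way mirroring."""
    if drive_count < 4 or drive_count % 2:
        raise ValueError("RAID 1+0 needs an even number of drives (>= 4)")
    return drive_count * drive_bytes / 2


for drives in (8, 10):
    usable = raid10_usable_bytes(drives, DRIVE_BYTES)
    print(f"{drives} drives: {usable / TB:.1f} TB decimal, {usable / TIB:.2f} TiB")
```

Under those assumptions, eight drives give 6.4 TB decimal (about 5.82 TiB), which is in line with the 5.8 TB shown above, while ten drives would give 8.0 TB decimal (about 7.28 TiB).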

The quote for the original order here seems to indicate 20 drives over 2 servers so 10 each?

OK, so step one is for Chris to crack open those cases and count the SSDs :(

RobH added a comment.Feb 14 2018, 8:49 PM

Indeed, the order shows 10 disks per system, not 8.

I wonder if the other 2 show up as non-RAID-configured disks and need to be set to RAID-configured? (I know this is something we do in Dell systems, not sure if HP also requires it.)

Andrew renamed this task from Rebuild hardware raids for labvirt1019 and 1020 to Count the SSDs inside of labvirt1019 and/or labvirt1020.Feb 14 2018, 9:02 PM
Andrew renamed this task from Count the SSDs inside of labvirt1019 and/or labvirt1020 to Rebuild raids on labvirt1019 and 1020.Feb 14 2018, 9:16 PM
Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board.Feb 16 2018, 3:48 PM

Confirmed 10 each. Assigning to @Andrew to resolve if satisfied.

chasemp reassigned this task from chasemp to Andrew.Feb 20 2018, 7:55 PM
Andrew reassigned this task from Andrew to Cmjohnson.Feb 20 2018, 8:33 PM

on labvirt1019:

=> ctrl slot=0 pd all show status

   physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 1.6 TB): OK
   physicaldrive 1I:1:2 (port 1I:box 1:bay 2, 1.6 TB): OK
   physicaldrive 1I:1:3 (port 1I:box 1:bay 3, 1.6 TB): OK
   physicaldrive 1I:1:4 (port 1I:box 1:bay 4, 1.6 TB): OK
   physicaldrive 2I:1:5 (port 2I:box 1:bay 5, 1.6 TB): OK
   physicaldrive 2I:1:6 (port 2I:box 1:bay 6, 1.6 TB): OK
   physicaldrive 2I:1:7 (port 2I:box 1:bay 7, 1.6 TB): OK
   physicaldrive 2I:1:8 (port 2I:box 1:bay 8, 1.6 TB): OK

So it can still only see eight... something needs to happen so that the raid utility can see all 10.
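
One way to compare what the controller reports with the 10 drives on the order is to count the `physicaldrive` lines per port. This is only a sketch: it assumes the status output has been saved as text in the format pasted above, and `count_drives_by_port` is a hypothetical helper, not part of any HP tool.

```python
import re
from collections import Counter

# Status output as pasted in this task (labvirt1019).
SAMPLE = """\
   physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 1.6 TB): OK
   physicaldrive 1I:1:2 (port 1I:box 1:bay 2, 1.6 TB): OK
   physicaldrive 1I:1:3 (port 1I:box 1:bay 3, 1.6 TB): OK
   physicaldrive 1I:1:4 (port 1I:box 1:bay 4, 1.6 TB): OK
   physicaldrive 2I:1:5 (port 2I:box 1:bay 5, 1.6 TB): OK
   physicaldrive 2I:1:6 (port 2I:box 1:bay 6, 1.6 TB): OK
   physicaldrive 2I:1:7 (port 2I:box 1:bay 7, 1.6 TB): OK
   physicaldrive 2I:1:8 (port 2I:box 1:bay 8, 1.6 TB): OK
"""

# Matches "physicaldrive <port>:<box>:<bay> (...): <status>".
PD_LINE = re.compile(r"^\s*physicaldrive\s+(\w+):(\d+):(\d+)\s+\(.*\):\s+(\S+)\s*$")


def count_drives_by_port(text: str) -> dict:
    """Count reported physical drives, grouped by controller port."""
    ports = Counter()
    for line in text.splitlines():
        match = PD_LINE.match(line)
        if match:
            ports[match.group(1)] += 1
    return dict(ports)


ports = count_drives_by_port(SAMPLE)
print(ports, "total:", sum(ports.values()))
```

For the output above this reports four drives on each of ports 1I and 2I, eight in total, so two of the ten installed drives are not visible to this controller.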

labvirt1020 is worse: when I try to enter HPSSA (which works on 1019 after a delay), 1020 immediately says "This item is currently not available. Press 'F1' to get more information about this item." F1 just dismisses the dialog and doesn't provide any more info.

Probably the best path forward is for Chris to attach to both those systems locally and re-create the raids.

I apologize, I recall an IRC conversation about this. I will need to reboot them into the RAID BIOS and reconfigure the RAID. Any issues with that?

Nope, you can reboot/rebuild them at any time.

I don't know what is wrong with these servers. I count 10 disks, but the controller is only seeing 8. I don't see any settings that would change that. I am going to have to get HP involved.

RobH added a comment.Mar 12 2018, 4:36 PM

Update: My understanding is that this is now awaiting Chris opening a support case with HP. Once we have that, if they don't provide a solution in an expedient manner, we can escalate to our Dasher/HP account team.

A case has been opened. Your case was successfully submitted. Please note your Case ID: 5328012773 for future reference.

This is still ongoing with HP... they wanted me to try a few things. The status is the same: broken.

Quick update: this is turning into a giant time suck. HP does not know what the problem is; now they're asking for pictures of the cabling inside the server.

faidon raised the priority of this task from Normal to High.Apr 5 2018, 9:08 AM

@Cmjohnson @RobH This has been going on for weeks now, and this is too much of a delay for setting up these systems. I'm elevating this task's priority, let's get to the bottom of this ASAP. A lot of the delays were just on our side, but I see that HPE is delaying this further too; please escalate within HPE and/or with me if you are not getting timely responses.

To the matter at hand, the RAID controller only seeing 8 disks seems suspicious, as many (most?) RAID controllers have 8 internal ports, so it seems likely to me that it's related to that. What kind of RAID controller(s) do these boxes have and how are the disks connected to them?

faidon added a comment.Apr 5 2018, 9:17 AM

OK, I just saw above that this is an HPE Smart Array P440ar controller. According to the specs, the controller has "Internal: 8 SAS/SATA physical links across 2 x4 ports". So I think each of the ports connects to one of the internal cages (1I and 2I), with each holding 4 disks. That's all normal and according to the specs, and 8 disks is the maximum that this controller can hold. Where are the other two disks located (front/back?), and where are they connected?

@faidon, that would explain the issue; the disks are in the front, in slots 4 and 6.

Do we buy a new controller or go with 8 ssds?

faidon added a comment.Apr 5 2018, 1:01 PM

I don't understand :) Could you clarify which disks are in which slots, and how/where are they connected?

I wouldn't go with 8 SSDs; we bought the system with 10 SSDs as a bundle and paid for it in its entirety. If the systems need a new RAID controller, we'll ask HPE to provide it at no additional cost. But let's make sure first that a new RAID controller is needed, this is just a theory so far :)

I received this last night from HP... I will try this first (I am hoping I have the cable):

Thank you for providing the screenshots.

Looks like 764630-B21 - HPE DL360 GEN9 2SFF SAS/SATA UNIVERSAL MEDIA BAY KIT is installed in the server which can accommodate another 2 drives.

NOTE: This kit contains a cable to attach the 2 drives to the internal B140i SATA controller.

The Embedded controller needs to be enabled in UEFI BIOS.

Please follow the steps:

  1. From the System Utilities screen, select System Configuration > BIOS/Platform Configuration (RBSU) > System Options > SATA Controller Options > Embedded SATA Configuration and press Enter.
  2. Ensure that you are using the correct AHCI or RAID system drivers for your SATA option.
  3. Select a setting and press Enter:
     a. Enable SATA AHCI Support: enables the embedded chipset SATA controller for AHCI.
     b. Enable Dynamic Smart Array RAID Support: enables the embedded chipset SATA controller for Dynamic Smart Array RAID.
  4. Press F10.
faidon added a comment.Apr 5 2018, 2:54 PM

Ah! That's a regular mainboard/SATA controller, so these two drives wouldn't be able to participate in RAID groups. We've done that before I think, at least with Dells, where we had the system drives connected separately.

@Andrew/WMCS, is a RAID group (or more) across all drives desired? If so, we can make a bigger fuss with HPE and see whether there's a RAID controller that we could swap that would support that. If not, then we can try with the hardware that we have ordered.

Andrew added a comment.Apr 5 2018, 6:41 PM

We'll end up wasting a fair bit of space if we have to break these up into separate volumes. Pinging @chasemp in case he thinks we can make this work with separate volumes but, yeah, I think we probably need a new controller.

I think we probably need a new controller.

Yep

bd808 added a comment.Apr 9 2018, 8:47 PM

Discussed briefly in the 2018-04-09 SRE team meeting. @RobH mentioned that he would look into getting a quote from HP on a RAID card that can support all 10 drives.

chasemp closed this task as a duplicate of Unknown Object (Task).Apr 27 2018, 6:52 PM
bd808 moved this task from Inbox to Done on the cloud-services-team (Kanban) board.May 6 2018, 6:49 PM
bd808 closed subtask Unknown Object (Task) as Resolved.May 15 2018, 8:23 PM