
diagnose failed disks on ms-be1027
Closed, ResolvedPublic

Description

It looks like ms-be1027 is reporting two failed disks (the SSDs) and I've seen puppet fail on sdc/sdd at least. The machine installed fine though, and @Cmjohnson reported blinking/failed disks.

The combination makes me think of a possibly broken hw controller or a more widespread failure. The machine isn't in service yet though.

Event Timeline

Disk was sent to SF office not data center. Working on getting the disk sent or a new one sent from Dell.

Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below.

Your request is being worked on under reference number 5310633913
STATUS: CASE IS GENERATED AND IN PROGRESS

Product description: HP ProLiant DL380 Gen9 12LFF Configure-to-order Server
Product number: 719061-B21
Serial number: MXQ62108H3
Subject: SCM_HW:2 Failed SSD's

I requested 2 SSDs to be sent and the confirmation email states 2 SSDs, but they actually sent me 2 4TB HDDs instead. A call to them has to take place.

Return tracking info for the wrong disks.

IMG_4185.JPG (3×4 px, 534 KB)

Received the 2 ssds and added them to ms-be1027

Mentioned in SAL [2016-08-18T17:15:23Z] <godog> reinstall ms-be1027 after ssd replaced T140374

I've re-enabled the two SSDs and reinstalled ms-be1027; afaict no errors are reported now. @Cmjohnson no disks blinking either now?

if so we can call this done!

or maybe not! just today there are four (!) faults reported

=> ld all show

Smart Array P840 in Slot 3
   array A
      logicaldrive 1 (186.3 GB, RAID 0, Failed)
   array B
      logicaldrive 2 (186.3 GB, RAID 0, Failed)
   array C
      logicaldrive 3 (2.7 TB, RAID 0, OK)
   array D
      logicaldrive 4 (2.7 TB, RAID 0, OK)
   array E
      logicaldrive 5 (2.7 TB, RAID 0, OK)
   array F
      logicaldrive 6 (2.7 TB, RAID 0, OK)
   array G
      logicaldrive 7 (2.7 TB, RAID 0, OK)
   array H
      logicaldrive 8 (2.7 TB, RAID 0, OK)
   array I
      logicaldrive 9 (2.7 TB, RAID 0, Failed)
   array J
      logicaldrive 10 (2.7 TB, RAID 0, Failed)
   array K
      logicaldrive 11 (2.7 TB, RAID 0, OK)
   array L
      logicaldrive 12 (2.7 TB, RAID 0, OK)
   array M
      logicaldrive 13 (2.7 TB, RAID 0, OK)
   array N
      logicaldrive 14 (2.7 TB, RAID 0, OK)
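A listing like the one above is easy to misread once it grows past a dozen drives. Below is a minimal sketch, assuming the controller CLI output keeps the `logicaldrive N (size, RAID level, status)` line shape shown here; the function name `failed_logical_drives` and the abbreviated `SAMPLE` text are illustrative and not part of any HP tooling.

```python
import re

# Abbreviated copy of the "ld all show" output pasted above.
SAMPLE = """\
Smart Array P840 in Slot 3
   array A
      logicaldrive 1 (186.3 GB, RAID 0, Failed)
   array C
      logicaldrive 3 (2.7 TB, RAID 0, OK)
   array I
      logicaldrive 9 (2.7 TB, RAID 0, Failed)
"""

# Assumed line shape: logicaldrive <n> (<size>, <raid level>, <status>)
LD_RE = re.compile(r"logicaldrive\s+(\d+)\s+\(([^,]+),\s*([^,]+),\s*([^)]+)\)")


def failed_logical_drives(text):
    """Return (drive number, size, status) for every drive whose status is not OK."""
    failed = []
    for line in text.splitlines():
        match = LD_RE.search(line)
        if match and match.group(4).strip() != "OK":
            failed.append(
                (int(match.group(1)), match.group(2).strip(), match.group(4).strip())
            )
    return failed


if __name__ == "__main__":
    for number, size, status in failed_logical_drives(SAMPLE):
        print(f"logicaldrive {number} ({size}): {status}")
```

Run against the full listing, this would flag logical drives 1, 2, 9 and 10, i.e. both SSDs plus two of the spinning disks.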

I have trouble believing those are real faults at this point; we should involve HP. @Cmjohnson what do you think?

I see the 4 failed disks (amber lights) on the server. I am finding it
hard to believe that this server was shipped with so many bad disks. It has to
be something else.

@Cmjohnson thoughts on this? We'd need to involve HP to debug further and/or replace parts.

Cmjohnson raised the priority of this task from Medium to High. Oct 11 2016, 4:16 PM

uploaded reports to the HP online portal

Quick update: I created a ticket with HP and supplied logs. I was contacted once for more information, which I provided, but I have not heard back in a few days. A phone call to HP is necessary.

Tried a reinstall, but only some of the spinning disks are seen and no SSDs (every disk reports a 3TB size)

~ # cat /proc/partitions 
major minor  #blocks  name

   8        0 2930233816 sda
   8        1   58592256 sda1
   8        2     976896 sda2
   8        3   97655808 sda3
   8        4 2773007360 sda4
   8       16 2930233816 sdb
   8       17   58592256 sdb1
   8       18     976896 sdb2
   8       19   97655808 sdb3
   8       20 2773007360 sdb4
   8       32 2930233816 sdc
   8       33 2930232320 sdc1
   8       48 2930233816 sdd
   8       49 2930232320 sdd1
   8       64 2930233816 sde
   8       65 2930232320 sde1
   8       80 2930233816 sdf
   8       81 2930232320 sdf1
   8       96 2930233816 sdg
   8       97 2930232320 sdg1
   9        1     976320 md1
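To make the "only 3TB disks, no SSDs" observation checkable, here is a small sketch that lists the whole disks in `/proc/partitions` with their sizes. It relies on the `#blocks` column being in 1024-byte units and on whole disks being named `sd` plus letters; the function name `whole_disks` and the abbreviated `SAMPLE` are illustrative.

```python
import re

# Abbreviated copy of the /proc/partitions output pasted above.
SAMPLE = """\
major minor  #blocks  name

   8        0 2930233816 sda
   8        1   58592256 sda1
   8       16 2930233816 sdb
   8       32 2930233816 sdc
   8       33 2930232320 sdc1
   9        1     976320 md1
"""


def whole_disks(text):
    """Map each whole disk (sdX, no partition number) to its size in decimal TB."""
    disks = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the header line and blank lines.
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        blocks, name = int(fields[2]), fields[3]
        if re.fullmatch(r"sd[a-z]+", name):
            # #blocks is counted in 1 KiB units.
            disks[name] = round(blocks * 1024 / 1e12, 2)
    return disks


if __name__ == "__main__":
    for name, size_tb in sorted(whole_disks(SAMPLE).items()):
        print(f"{name}: {size_tb} TB")
```

On the full output this yields seven disks (sda through sdg), all about 3 TB, compared with the 14 logical drives the controller listed earlier, and nothing near the 186.3 GB size of the SSDs.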

@fgiunchedi I spent the morning on the phone with HP. Good news and bad news: they're sending me a new system board, backplane and 2 new SSDs, but the system board is on backorder. I will update once I get the call from HP Field Services.

We can escalate this issue to Dasher, they may be able to expedite the system board replacement.

Do you have a case # I can pass along to Dasher?

Yes,

Support Case Number: 5314079417-531

@fgiunchedi I replaced both backplanes, the system board and the rear SSDs. I am able to PXE boot now. I will leave this open. If all goes well please resolve this task; if not, please update. Thanks

thanks @Cmjohnson for taking care of this! LGTM now, will progressively put the machine in service in T136631: rack/setup/deploy ms-be102[2-7]

I spoke way too soon, machine still reports failures on SSDs as in P4409 :(

Looks to me like it might just be DOA?

@RobH is there anything you can do with the vendor? We replaced the system board, both disk backplanes and the SSDs (several times). The only thing left is the RAID controller.

A new case was opened with HP to replace the raid card

Case ID: 5315048752
Case title: Failed Raid Card
Severity 3-Normal