
backup1001 can't address the disk shelf's drives
Closed, Resolved · Public

Description

akosiaris@backup1001:~$ sudo megacli -adpcount
                                     

Controller Count: 1.

Exit Code: 0x01
akosiaris@backup1001:~$ sudo megacli -PDList -aALL
                                     
Adapter #0


Exit Code: 0x00

So for some reason the adapter can't see the disk shelf's drives.

For comparison, the exact same hardware in codfw outputs

akosiaris@backup2001:~$ sudo megacli -PDList -aALL
                                     
Adapter #0

Enclosure Device ID: 65
Slot Number: 0
Drive's position: DiskGroup: 0, Span: 0, Arm: 0
Enclosure position: N/A
Device Id: 4
WWN: 50000398981BA724
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 5.458 TB [0x2baa0f4b0 Sectors]
Non Coerced Size: 5.457 TB [0x2ba90f4b0 Sectors]
Coerced Size: 5.457 TB [0x2ba900000 Sectors]

<And a lot more removed for brevity>
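The comparison above can be made mechanical by counting drive entries in the saved `-PDList` output of each host. This is a minimal sketch, not part of the original report: the function name `count_physical_drives` is hypothetical, it assumes every drive entry contains one `Slot Number:` line (as in the backup2001 excerpt), and it parses text that has already been captured rather than invoking megacli.

```python
import re


def count_physical_drives(pdlist_output: str) -> int:
    """Count drives in saved `megacli -PDList` text.

    Assumption: each physical drive entry has exactly one 'Slot Number:' line.
    """
    return len(re.findall(r"^Slot Number:\s*\d+", pdlist_output, flags=re.MULTILINE))


# Abbreviated outputs copied from the task description.
BACKUP1001 = """\
Adapter #0


Exit Code: 0x00
"""

BACKUP2001_EXCERPT = """\
Adapter #0

Enclosure Device ID: 65
Slot Number: 0
Device Id: 4
PD Type: SAS
"""

if __name__ == "__main__":
    for host, text in (("backup1001", BACKUP1001),
                       ("backup2001 (excerpt)", BACKUP2001_EXCERPT)):
        print(host, count_physical_drives(text))
```

A count of zero on one host while an identical host reports drives points at the path to the shelf rather than at the controller itself.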

Event Timeline

akosiaris created this task. Jul 5 2019, 3:11 PM
Restricted Application added a subscriber: Aklapper. Jul 5 2019, 3:11 PM
MoritzMuehlenhoff triaged this task as High priority. Jul 5 2019, 6:22 PM
wiki_willy added a project: ops-eqdfw.
wiki_willy edited projects, added ops-eqiad; removed ops-eqdfw.
wiki_willy moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.

@Cmjohnson - not sure if there's a loose connection somewhere on backup1001, but can you check it out when you have a few cycles? This one needs to be up and running before data can be migrated over from helium (which is slated to be decom'd). Thanks, Willy

This is odd, I am not getting a link light on the raid controller connections.

@Cmjohnson @wiki_willy What can we do to help get this unstuck? I am not at all sure why something like this would happen.

Output of

sudo megacli -AdpAllInfo -a0

from the machine does not show any obvious reason for this, and a diff against backup2001 (where everything is working fine) is minimal, due only to that host finding its drives.
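A comparison like the one described can be reproduced from two saved `-AdpAllInfo` outputs. The sketch below uses Python's standard `difflib`. The helper name `differing_lines` and the two input strings are hypothetical placeholders, not real controller output, since the actual output was not posted in this task.

```python
import difflib


def differing_lines(text_a: str, text_b: str) -> list[str]:
    """Return only the lines that differ between two saved outputs.

    Lines prefixed '-' appear only in text_a, '+' only in text_b.
    """
    diff = list(difflib.unified_diff(
        text_a.splitlines(), text_b.splitlines(), lineterm="", n=0))
    # The first two entries are the ---/+++ file headers; hunk markers start with @@.
    return [line for line in diff[2:] if not line.startswith("@@")]


# Placeholder inputs for illustration only (NOT real megacli output).
HOST_A = "field_one : same\nfield_two : 0"
HOST_B = "field_one : same\nfield_two : 24"

if __name__ == "__main__":
    for line in differing_lines(HOST_A, HOST_B):
        print(line)
```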

All I can think of is a damaged power or SAS cable.

Cmjohnson reassigned this task from Cmjohnson to Jclark-ctr. Sep 9 2019, 3:49 PM
Cmjohnson added a subscriber: Jclark-ctr.

This got lost in the shuffle... will work on it this week. @Jclark-ctr, can you contact HPE support and open a ticket, please?

jcrespo added a subscriber: jcrespo. Edited · Sep 12 2019, 3:09 PM

I was asked specifically about this by mark. He asked me to track the progress of this, as it blocks an important goal and a general service (backups). The old backup hardware is getting older and has a higher chance of failure. We more than understand that hardware doesn't do precisely what we want all the time :-P, but otherwise we would need to work around any blocker we had.

(BTW, I know power and network maintenance is taking most of your time)

Opened Dell Support Ticket

Reseated the cables at the back of the array, and the controller link light appeared. Will wait for verification that it is working before closing.

@jcrespo

jcrespo added a comment. Edited · Sep 13 2019, 7:22 AM

The raid is back \o/

root@backup1001:~$ sudo megacli -PDList -aALL | grep 'Firmware state'
Firmware state: Unconfigured(good), Spun Up
Firmware state: Unconfigured(good), Spun Up
Firmware state: Unconfigured(good), Spun Up
Firmware state: Unconfigured(good), Spun Up
Firmware state: Unconfigured(good), Spun Up
Firmware state: Unconfigured(good), Spun Up
Firmware state: Unconfigured(good), Spun Up
Firmware state: Unconfigured(good), Spun Up
Firmware state: Unconfigured(good), Spun Up
Firmware state: Unconfigured(good), Spun Up
Firmware state: Unconfigured(good), Spun Up
Firmware state: Unconfigured(good), Spun Up
Firmware state: Unconfigured(bad)
Firmware state: Unconfigured(good), Spun Up
Firmware state: Unconfigured(good), Spun Up
Firmware state: Unconfigured(good), Spun Up
Firmware state: Unconfigured(good), Spun Up
Firmware state: Unconfigured(good), Spun Up
Firmware state: Unconfigured(good), Spun Up
Firmware state: Unconfigured(good), Spun Up
Firmware state: Unconfigured(good), Spun Up
Firmware state: Unconfigured(good), Spun Up
Firmware state: Unconfigured(good), Spun Up
Firmware state: Unconfigured(good), Spun Up
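With two dozen lines of output, the single bad drive is easy to miss by eye. The sketch below tallies the `Firmware state` lines and reports which position in the listing is unhealthy. The function names are hypothetical, the input is saved text rather than a live megacli call, and the health check (flag any state containing "bad" or "failed") is a simplifying assumption, not an exhaustive list of megacli states.

```python
from collections import Counter

PREFIX = "Firmware state:"


def firmware_states(pdlist_output: str) -> list[str]:
    """Extract the state text from each 'Firmware state:' line, in listing order."""
    return [line.strip()[len(PREFIX):].strip()
            for line in pdlist_output.splitlines()
            if line.strip().startswith(PREFIX)]


def unhealthy(states: list[str]) -> list[tuple[int, str]]:
    """Return (zero-based position, state) for suspicious drives.

    Assumption: a state containing 'bad' or 'failed' needs attention.
    """
    return [(i, s) for i, s in enumerate(states)
            if "bad" in s.lower() or "failed" in s.lower()]


# Same sequence as the output above: 12 good, 1 bad, 11 good.
GOOD = "Firmware state: Unconfigured(good), Spun Up"
BAD = "Firmware state: Unconfigured(bad)"
SAMPLE = "\n".join([GOOD] * 12 + [BAD] + [GOOD] * 11)

if __name__ == "__main__":
    states = firmware_states(SAMPLE)
    print(Counter(states))
    print(unhealthy(states))
```

The position is only an index into the listing; mapping it to a physical slot would need the `Slot Number` lines from the full `-PDList` output.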

Was the Dell support ticket about a disk, cable, or controller issue? Did you open a separate Phabricator ticket about the failed disk? If not, we should close this only after the replacement arrives and the issue is fully solved.

Thanks for the work!

@jcrespo The support ticket did not include the disk; it was only a cable issue. No other tickets are open.

Actually, we need to close this task and open a separate task about the
disk. Different issues should get different tasks.

jcrespo closed this task as Resolved. Sep 13 2019, 6:15 PM

I am OK with that (I actually said to open a new one if this was to be closed). I just didn't know whether it had to stay open to track whatever it was you sent to Dell. I have opened T232882 now.