
labvirt1009 HP Raid alert
Closed, Resolved (Public)

Description

WARNING: Slot 0: Predictive Failure: 1I:1:9 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:10, 1I:1:11, 1I:1:12, 1I:1:13, 2I:1:14, 2I:1:15, 2I:1:16, 2I:1:17, 2I:1:18 - Controller: OK - Battery/Capacitor: OK
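For anyone scripting against this check, here is a minimal Python sketch of splitting an alert line of this shape into its parts. The function name `parse_raid_alert` is hypothetical, and the field layout (severity, slot, then ` - `-separated `key: value` groups) is inferred only from the single line above, so treat it as an assumption rather than the check's documented output format.

```python
import re

# The alert line from this task, copied verbatim.
ALERT = (
    "WARNING: Slot 0: Predictive Failure: 1I:1:9 - OK: 1I:1:1, 1I:1:2, 1I:1:3, "
    "1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:10, 1I:1:11, 1I:1:12, 1I:1:13, "
    "2I:1:14, 2I:1:15, 2I:1:16, 2I:1:17, 2I:1:18 - Controller: OK - "
    "Battery/Capacitor: OK"
)


def parse_raid_alert(line):
    """Return (severity, slot, fields) for an alert line shaped like ALERT.

    Hypothetical helper: the layout is assumed from one sample line.
    """
    severity, _, rest = line.partition(": ")
    match = re.match(r"Slot (\d+): (.*)", rest)
    if match is None:
        raise ValueError("unexpected alert format: %r" % line)
    slot = int(match.group(1))
    fields = {}
    # Groups are separated by " - "; drive IDs such as 1I:1:9 contain ":"
    # but never ": ", so partitioning on ": " isolates the group name.
    for segment in match.group(2).split(" - "):
        key, _, value = segment.partition(": ")
        fields[key] = [item.strip() for item in value.split(",")]
    return severity, slot, fields


severity, slot, fields = parse_raid_alert(ALERT)
failing_drives = fields.get("Predictive Failure", [])
```

Here `failing_drives` holds the bays flagged for predictive failure, which is the list a swap request would need.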

Can that drive be swapped without downtime?

Event Timeline

Andrew triaged this task as High priority. Jun 29 2018, 1:52 PM

Should be. It's an HP Smart Array P420i with a RAID 10 logical disk, and this is the only failure. Unless the disk itself isn't a hot-swap form factor, it should be good, right? I'm, of course, presuming that it is a hot-swap form factor, which might be silly.

@Bstorm this is a hot-swap disk, but the server is now out of warranty. @RobH should we order spare disks?

So, this has a 300GB SFF SAS disk. We don't have any of those spare, but we do have a ton of 300GB Intel 710 SSDs, according to the spares tracking:

Intel 710 Series SSDSA2BZ300G3 2.5" 300GB QTY:42

I've chatted with @Bstorm about this via IRC. My thoughts are as follows:

  • Pull the predicted-failure SAS disk out of the system and replace it with one of the 42 Intel 710 300GB SSDs.
  • Attempt to rebuild the RAID with the SSD. It should be a simple pass/fail. If it rebuilds successfully, we should be good to use these SSDs to replace the dying SAS disks in these systems.
  • If it fails, put the predicted (but not quite yet wholly) failed SAS disk back into place, and update this task, assigning it back to me.
  • If SAS disks are needed, I'll make a procurement S4 task for the purchase approvals.
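The pass/fail plan above can be written down as a small decision function. This is only an illustrative Python sketch: `next_actions` and its two boolean parameters are hypothetical names, and the rebuild result is assumed to be observed by hand, not read from the controller.

```python
from typing import List


def next_actions(rebuild_succeeded: bool, sas_disks_needed: bool = False) -> List[str]:
    """Map the outcome of the SSD rebuild test to the follow-up steps in the plan.

    Hypothetical helper; both inputs are assumed to be determined manually.
    """
    if rebuild_succeeded:
        actions = ["use the spare Intel 710 SSDs to replace dying SAS disks"]
    else:
        actions = [
            "put the predicted-failure SAS disk back into place",
            "update the task and assign it back to RobH",
        ]
    if sas_disks_needed:
        actions.append("open a procurement S4 task for purchase approvals")
    return actions


# Example: the rebuild fails and replacement SAS disks turn out to be needed.
failed_path = next_actions(rebuild_succeeded=False, sas_disks_needed=True)
```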

So I'm assigning this to @Bstorm for their input/approval (seems rather nicer than saying @Bstorm said this was OK via IRC!), and then, if approved for testing, it can be assigned to @Cmjohnson.

I think this is a lovely idea as long as no other disks die in the meantime :)

So far so good on that end.

@Cmjohnson Do the new spares we got fit this machine?

@Bstorm yes the disk will work for labvirt1009. Do you want me to swap it?

@Cmjohnson Were you able to get around to this? It looks to be in the same state (which will be frustrating if it was already swapped, lol).

@Bstorm the disk has been swapped. Please resolve once it's back to normal.

Looks great to me! Sorry this got buried.