
db2038 two disks with predictive failure
Closed, Resolved · Public

Description

Hi,

db2038 has two disks with predictive failure, can we get them replaced asap?

root@db2038:~# hpssacli controller all show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 001438031205310)


   Port Name: 1I

   Port Name: 2I

   Gen8 ServBP 12+2 at Port 1I, Box 1, OK
   array A (SAS, Unused Space: 0  MB)


      logicaldrive 1 (3.3 TB, RAID 1+0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)
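
For anyone wanting to spot drives like these without reading the whole listing by eye, here is a minimal Python sketch that pulls the non-OK drives out of pasted `hpssacli controller all show config` text. The regular expression is based only on the line format visible in this task, and `unhealthy_drives` is a made-up helper name, so treat it as an illustration rather than a general parser.

```python
import re

# Matches lines in the format seen in this task, for example:
#   physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, Predictive Failure)
# Assumption: other controllers or tool versions may print extra fields.
DRIVE_RE = re.compile(
    r"physicaldrive\s+(?P<id>\S+)\s+"
    r"\(port [^,]+,\s*(?P<iface>[^,]+),\s*(?P<size>[^,]+),\s*(?P<status>[^)]+)\)"
)


def unhealthy_drives(config_text):
    """Return (drive_id, status) for every physical drive whose status is not OK."""
    found = []
    for match in DRIVE_RE.finditer(config_text):
        status = match.group("status").strip()
        if status != "OK":
            found.append((match.group("id"), status))
    return found


sample = """\
      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, Predictive Failure)
"""

for drive_id, status in unhealthy_drives(sample):
    print(drive_id, status)
```

On the three sample lines this reports 1I:1:2 and 1I:1:8, both with Predictive Failure.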

Event Timeline

Restricted Application added a project: Operations. · Oct 8 2017, 3:25 PM
Restricted Application added a subscriber: Aklapper.
Marostegui moved this task from Triage to In progress on the DBA board. · Oct 8 2017, 3:25 PM

Correct me if I am misunderstanding something, but on RAID 10, we can lose a whole mirror group and we would be ok, what we cannot lose is the same disk on the two mirrors which may be what we are seeing here? https://serverfault.com/questions/381593/6-disk-raid-10#comment391895_381601

You are totally right!
I just looked at the :8 and :2 and didn't pay much attention to the rest.
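
The mirror question above can be written down as a small check. This is only a sketch: both layouts below are hypothetical, since nothing in this task shows how the controller actually pairs the twelve bays, and `pairs_at_risk` is a name invented here. The real pairing has to come from the controller itself.

```python
def pairs_at_risk(mirror_pairs, unhealthy):
    """Return the mirror pairs in which both members are flagged as unhealthy."""
    bad = set(unhealthy)
    return [pair for pair in mirror_pairs if pair[0] in bad and pair[1] in bad]


# The two drives reported with Predictive Failure in this task.
flagged = ["1I:1:2", "1I:1:8"]

# HYPOTHETICAL layout A: neighbouring bays mirror each other (1-2, 3-4, ...).
adjacent = [(f"1I:1:{i}", f"1I:1:{i + 1}") for i in range(1, 12, 2)]

# HYPOTHETICAL layout B: bay N mirrors bay N+6 (1-7, 2-8, ...).
split = [(f"1I:1:{i}", f"1I:1:{i + 6}") for i in range(1, 7)]

print("layout A at risk:", pairs_at_risk(adjacent, flagged))
print("layout B at risk:", pairs_at_risk(split, flagged))
```

Under layout A the two flagged drives sit in different pairs, while under layout B they form one pair, which is why the actual pairing matters before deciding how urgent the swap is.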

Marostegui renamed this task from db2038 two disks with predictive failure to db2038 disk with predictive failure. · Oct 9 2017, 7:20 AM
Marostegui triaged this task as High priority.
Marostegui updated the task description.
Marostegui added a comment. · Edited · Oct 9 2017, 7:29 AM

Looking at the serials, they are indeed two different disks.

Marostegui renamed this task from db2038 disk with predictive failure to db2038 two disks with predictive failure. · Oct 9 2017, 7:30 AM
Marostegui updated the task description.
Papaul added a comment. · Oct 9 2017, 7:45 AM

@Marostegui this server has been out of warranty since 2017-07-10. We need to find out if any of the decommissioned servers have the same disks that we can use.

@Marostegui Should we do a master failover? We had planned it anyway, so this is a good excuse. I know the answer is "yes, if we find the time" :-) Maybe I can take care of it, if there are no alters running on s6.

This is not a master :-)
But yes, s6 needs a master failover anyways to decommission db2028 (s6 master) and to finish T169501.
There are no alters running on s6 or scheduled for it, so it could be done if you'd like to take care of it. I am happy to assist.

Oh! Easier, then: just pooling one new server in preparation for the failover.

But this doesn't solve our issue, db2038 is not supposed to go away :-/

No, only hosts <2030 are supposed to go away. That is why I was saying that the failover is needed anyways :-)

db2028 RAID claims it has 558.911 GB disks. db2038 RAID claims it has 600GB, maybe the actual size is the same? In that case the failover could actually help.
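
One possible explanation for the two numbers, offered as a guess rather than something confirmed in this task, is that one tool reports binary units (GiB) while the other reports the nominal decimal size (GB). A quick Python check of that arithmetic, with `gb_to_gib` being a helper named here for illustration:

```python
GB = 10**9    # decimal gigabyte, as used on drive labels
GIB = 2**30   # binary gibibyte, which some tools also print as "GB"


def gb_to_gib(size_gb):
    """Convert a size in decimal gigabytes to binary gibibytes."""
    return size_gb * GB / GIB


nominal_gib = gb_to_gib(600)
print(f"600 GB is about {nominal_gib:.2f} GiB")
print(f"difference from 558.911: {558.911 - nominal_gib:.3f}")
```

A nominal 600 GB drive comes out at roughly 558.79 GiB, which is close to the 558.911 reported on db2028 but not identical, so this only suggests the disks are in the same size class. It does not prove they are interchangeable.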

Going by the serial numbers of the disks on both hosts, they are all 600GB 15k SAS 3.5", so maybe we can exchange them.
@Papaul probably knows better if we can exchange those two

And one of them failed already: T177844

physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, Failed)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, Predictive Failure)
physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)

Marostegui reopened this task as Open. · Oct 10 2017, 2:48 PM
Marostegui closed this task as a duplicate of T177844: Degraded RAID on db2038.
Marostegui added a subscriber: ops-monitoring-bot.

@Papaul let us know if you were able to find disks to replace the (now) broken one and the one that will soon fail.
Thanks!

@Papaul db2010 which is scheduled for decommissioning (T175685) has the same chassis, so maybe it also has the same disks?

@Marostegui I have some 600GB 15k disks that I can pull out of db2025. Just keep in mind that those are Dell disks.

If db2025 is decommissioned, I would say let's go ahead...

Btw, let's change just one disk at a time.

OK, I will replace slot 1 first.

Sounds good - thank you

Complete. Let me know when ready for slot 7

Thanks, RAID rebuilding now:

logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 1% complete)

physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, Rebuilding)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, Predictive Failure)
physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)

Might take a few hours to rebuild.
I will ping you anyways when done
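
To follow a rebuild like this without rereading the full listing each time, the `logicaldrive` line can be reduced to its status and percentage. A minimal Python sketch, assuming the line format shown in this task; `logical_drive_state` is a name invented here:

```python
import re

# Matches lines in the format seen in this task, for example:
#   logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 1% complete)
#   logicaldrive 1 (3.3 TB, RAID 1+0, OK)
LD_RE = re.compile(r"logicaldrive\s+(\d+)\s+\(([^)]*)\)")
PERCENT_RE = re.compile(r"(\d+)% complete")


def logical_drive_state(line):
    """Return a dict with id, size, raid, status and percent (None when absent)."""
    match = LD_RE.search(line)
    if match is None:
        return None
    fields = [field.strip() for field in match.group(2).split(",")]
    if len(fields) < 3:
        return None  # not the size / RAID level / status layout assumed here
    percent = None
    if len(fields) > 3:
        percent_match = PERCENT_RE.fullmatch(fields[3])
        if percent_match:
            percent = int(percent_match.group(1))
    return {
        "id": int(match.group(1)),
        "size": fields[0],
        "raid": fields[1],
        "status": fields[2],
        "percent": percent,
    }


print(logical_drive_state("logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 1% complete)"))
print(logical_drive_state("logicaldrive 1 (3.3 TB, RAID 1+0, OK)"))
```

A status of Recovering with a percentage means the rebuild is still running; OK with no percentage means the logical drive is healthy again.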

@Papaul the rebuild for that disk has failed - can we try another spare disk maybe?

Here we go again:

logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 0% complete)

physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, Rebuilding)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, Predictive Failure)
physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)

Cross your fingers!

Marostegui added a comment. · Edited · Oct 10 2017, 5:20 PM

@Papaul the first disk went fine, can you now change the other pending one?

logicaldrive 1 (3.3 TB, RAID 1+0, OK)

physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, Predictive Failure)
physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)

I can see it is rebuilding now - thanks!

logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 0% complete)

physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, Rebuilding)
physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)

Marostegui closed this task as Resolved. · Oct 11 2017, 5:15 AM

And all good now! Thanks a lot @Papaul

root@db2038:~# hpssacli controller all show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 001438031205310)


   Port Name: 1I

   Port Name: 2I

   Gen8 ServBP 12+2 at Port 1I, Box 1, OK
   array A (SAS, Unused Space: 0  MB)


      logicaldrive 1 (3.3 TB, RAID 1+0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)