Predictive failures on disk S.M.A.R.T. status
Closed, Resolved, Public

Description

We have a number of predictive failures that should be taken care of; however, it is not worth replacing those disks until they actually fail.
I keep this list updated.

  • db2035 s2 master T224456 to be decommissioned
  • db2037 m5 codfw master T221512 to be decommissioned
  • db2043 s3 master to be decommissioned
  • db2044 m2 codfw master T217755 T227829 to be decommissioned
  • db2047 s7 master T212966 to be decommissioned
  • db2049 s2 T227107 to be decommissioned
  • db2050 s3 to be decommissioned
  • db2051 s4 to be decommissioned
  • db2052 s5 master T218776 to be decommissioned
  • db2053 s6 to be decommissioned T231407
  • db2055 to be decommissioned T233186
  • db2061 s7 to be decommissioned T238526
  • db2063 s2 to be decommissioned
  • db2067 m2 to be decommissioned T233185
  • db2070 s1 T219852
  • db1070 s5 master decommissioned
  • db1072 m3 master to be decommissioned
  • db1073 m5 master T215050 to be decommissioned
  • db1065 m2 master to be decommissioned
  • db1063 m1 master T211537 to be decommissioned
  • db1069 x1 master to be decommissioned
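
A list in this shape can be read mechanically, for example to collect the Phabricator task IDs per host. The following is only a sketch: the `parse_entry` helper and the field names are hypothetical, and it assumes each entry contains a `dbNNNN` hostname, optional `T<number>` task references, and the literal phrases "master" and "to be decommissioned" as used above.

```python
import re

# Assumed entry shape, taken from the list above:
#   "  • db2044 m2 codfw master T217755 T227829 to be decommissioned"
HOST_RE = re.compile(r"\b(db\d{4})\b")
TASK_RE = re.compile(r"\bT\d+\b")


def parse_entry(line):
    """Parse one list entry into a dict, or return None if no host is found."""
    host = HOST_RE.search(line)
    if host is None:
        return None
    return {
        "host": host.group(1),
        # All Phabricator task references on the line, in order.
        "tasks": TASK_RE.findall(line),
        "is_master": " master" in line,
        # True only for hosts still waiting to be decommissioned.
        "pending_decommission": "to be decommissioned" in line,
    }


SAMPLE_LINES = [
    "  • db2044 m2 codfw master T217755 T227829 to be decommissioned",
    "  • db2070 s1 T219852",
    "  • db1070 s5 master decommissioned",
]

if __name__ == "__main__":
    for entry_line in SAMPLE_LINES:
        print(parse_entry(entry_line))
```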

Event Timeline

Marostegui updated the task description. Feb 8 2019, 6:11 AM
Marostegui updated the task description. Feb 12 2019, 6:40 AM
jcrespo updated the task description. Mar 6 2019, 12:03 PM

db2052:

root@db2052:~# hpssacli controller all show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 001438033746C30)


   Port Name: 1I

   Port Name: 2I

   Gen8 ServBP 12+2 at Port 1I, Box 1, OK
   array A (SAS, Unused Space: 0  MB)


      logicaldrive 1 (3.3 TB, RAID 1+0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)
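
Spotting the one non-OK drive in output like this is easy to automate. The sketch below is not the tooling used on these hosts; it only assumes the `physicaldrive ... (..., <status>)` line format shown above, with the status as the last comma-separated field. The helper names `parse_drives` and `drives_not_ok` are hypothetical.

```python
import re

# Matches lines such as:
#   physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Predictive Failure)
DRIVE_RE = re.compile(r"physicaldrive\s+(?P<id>\S+)\s+\((?P<fields>[^)]*)\)")


def parse_drives(output):
    """Return a list of (drive_id, status) tuples found in the controller output."""
    drives = []
    for line in output.splitlines():
        match = DRIVE_RE.search(line)
        if match:
            # The status is the last comma-separated field inside the parentheses.
            status = match.group("fields").split(",")[-1].strip()
            drives.append((match.group("id"), status))
    return drives


def drives_not_ok(output):
    """Return only the drives whose status is anything other than OK."""
    return [(drive_id, status) for drive_id, status in parse_drives(output) if status != "OK"]


# Abbreviated copy of the db2052 output above.
SAMPLE = """\
      logicaldrive 1 (3.3 TB, RAID 1+0, OK)

      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
"""

if __name__ == "__main__":
    for drive_id, status in drives_not_ok(SAMPLE):
        print(f"{drive_id}: {status}")
```

In practice the text would come from running `hpssacli controller all show config` on the host; the sample string here stands in for that output.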
Marostegui updated the task description. Mar 19 2019, 6:22 AM
Marostegui updated the task description. Mar 20 2019, 2:17 PM
Marostegui updated the task description. Mar 21 2019, 4:05 PM
Marostegui updated the task description. Edited Apr 1 2019, 5:10 AM
root@db2070:~# hpssacli controller all show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 0014380337FADD0)


   Port Name: 1I

   Port Name: 2I

   Gen8 ServBP 12+2 at Port 1I, Box 1, OK
   array A (SAS, Unused Space: 0  MB)


      logicaldrive 1 (3.3 TB, RAID 1+0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)
Marostegui updated the task description. Apr 2 2019, 6:28 AM
Marostegui updated the task description.
Marostegui updated the task description. Apr 9 2019, 9:05 AM

db2037, m5 codfw master:

root@db2037:~# hpssacli controller all show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 0014380312088E0)


   Port Name: 1I

   Port Name: 2I

   Gen8 ServBP 12+2 at Port 1I, Box 1, OK
   array A (SAS, Unused Space: 0  MB)


      logicaldrive 1 (3.3 TB, RAID 1+0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)
Marostegui updated the task description. Apr 12 2019, 4:57 AM

db2044 again:

root@db2044:~# hpssacli controller all show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 0014380264FFFB0)


   Port Name: 1I

   Port Name: 2I

   Gen8 ServBP 12+2 at Port 1I, Box 1, OK
   array A (SAS, Unused Space: 0  MB)


      logicaldrive 1 (3.3 TB, RAID 1+0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)
Marostegui updated the task description. Apr 12 2019, 4:58 AM
Marostegui added a comment. Edited Apr 19 2019, 8:07 AM

db2047 has another disk in predictive failure:

logicaldrive 1 (3.3 TB, RAID 1+0, OK)

physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Predictive Failure)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, Predictive Failure)

They are on different spans (mirror groups):

root@db2047:~# hpssacli controller all show config detail
<snip>
      Logical Drive: 1
         Size: 3.3 TB
         Fault Tolerance: 1+0
         Heads: 255
         Sectors Per Track: 32
         Cylinders: 65535
         Strip Size: 256 KB
         Full Stripe Size: 1536 KB
         Status: OK
         Caching:  Enabled
         Unique Identifier: 600508B1001CD41C53362A4E633F9D52
         Disk Name: /dev/sda
         Mount Points: / 37.3 GB Partition Number 2
         OS Status: LOCKED
         Logical Drive Label: A41E281B0014380337E0DB0F072
         Mirror Group 1:
            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Predictive Failure)
            physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
            physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
            physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
            physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
            physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
         Mirror Group 2:
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
            physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
            physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
            physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
            physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
            physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, Predictive Failure)
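
The check above, seeing which mirror group each failing drive belongs to, can be sketched in a few lines. This assumes only the `Mirror Group N:` headers and `physicaldrive` lines shown in the detail output; the function name `failing_by_mirror_group` is hypothetical, and the sketch makes no claim about which drives mirror each other within the array.

```python
import re
from collections import defaultdict

GROUP_RE = re.compile(r"^\s*Mirror Group (\d+):\s*$")
DRIVE_RE = re.compile(r"^\s*physicaldrive\s+(\S+)\s+\(([^)]*)\)")


def failing_by_mirror_group(detail_output):
    """Map mirror group number -> list of drive IDs whose status is not OK."""
    groups = defaultdict(list)
    current = None
    for line in detail_output.splitlines():
        group = GROUP_RE.match(line)
        if group:
            current = int(group.group(1))
            continue
        drive = DRIVE_RE.match(line)
        if drive is None:
            # Any other line ends the current mirror group listing.
            current = None
            continue
        if current is not None:
            # The status is the last comma-separated field in the parentheses.
            status = drive.group(2).split(",")[-1].strip()
            if status != "OK":
                groups[current].append(drive.group(1))
    return dict(groups)


# Abbreviated copy of the db2047 detail output above.
SAMPLE = """\
         Logical Drive Label: A41E281B0014380337E0DB0F072
         Mirror Group 1:
            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Predictive Failure)
            physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
         Mirror Group 2:
            physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
            physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, Predictive Failure)
"""

if __name__ == "__main__":
    for group_number, drive_ids in sorted(failing_by_mirror_group(SAMPLE).items()):
        print(f"Mirror Group {group_number}: {', '.join(drive_ids)}")
```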
Marostegui updated the task description. Apr 21 2019, 7:02 AM

T222526 db2049 (again?)

Marostegui updated the task description. May 6 2019, 5:08 AM

You might be confusing it with db2047; I don't recall db2049 having a disk replaced lately.

Marostegui updated the task description. Feb 12 2019, 07:40:

https://phabricator.wikimedia.org/transactions/detail/PHID-XACT-TASK-xk55krwzcenljvw/

That's almost 3 months ago, that's why I mentioned "lately" :-)

Marostegui updated the task description. May 31 2019, 3:31 PM
Marostegui updated the task description. Jun 3 2019, 9:56 AM
Marostegui updated the task description. Jun 16 2019, 3:04 PM
Marostegui updated the task description. Jun 16 2019, 3:50 PM
Marostegui updated the task description. Jun 17 2019, 10:25 AM
Marostegui updated the task description. Jun 23 2019, 5:43 AM
Marostegui updated the task description. Jun 24 2019, 1:12 PM
Marostegui updated the task description. Jun 24 2019, 5:58 PM
Marostegui updated the task description. Jul 1 2019, 4:40 AM
Marostegui updated the task description. Jul 3 2019, 6:25 AM
Marostegui updated the task description. Jul 4 2019, 5:02 AM
Marostegui updated the task description. Jul 9 2019, 10:04 AM
Marostegui updated the task description.
jcrespo updated the task description. Jul 12 2019, 6:37 AM
Marostegui updated the task description. Jul 18 2019, 5:42 AM
Marostegui updated the task description. Jul 30 2019, 7:16 AM
Marostegui updated the task description. Jul 30 2019, 7:22 AM
Marostegui updated the task description. Jul 30 2019, 7:26 AM
Marostegui updated the task description. Aug 9 2019, 8:37 AM
Marostegui updated the task description. Aug 12 2019, 9:34 AM

db2044 now has a second disk in predictive failure:

# hpssacli controller all show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 0014380264FFFB0)

   Port Name: 1I

   Port Name: 2I

   Gen8 ServBP 12+2 at Port 1I, Box 1, OK
   array A (SAS, Unused Space: 0  MB)


      logicaldrive 1 (3.3 TB, RAID 1+0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)

   Enclosure SEP (Vendor ID HP, Model Gen8 ServBP 12+2) 378  (WWID: 50014380324D4EB9, Port: 1I, Box: 1)

   Expander 380  (WWID: 50014380324D4EA0, Port: 1I, Box: 1)

   SEP (Vendor ID PMCSIERA, Model SRCv8x6G) 379  (WWID: 50014380264FFFBF)

Yeah, I am replacing that host today hopefully

Marostegui updated the task description. Aug 20 2019, 10:54 AM
Marostegui updated the task description. Aug 21 2019, 10:09 AM
Marostegui updated the task description. Aug 28 2019, 6:28 AM
Marostegui updated the task description. Aug 29 2019, 7:48 AM
Marostegui updated the task description. Sep 3 2019, 6:29 AM
Marostegui updated the task description. Sep 4 2019, 8:48 AM
Marostegui updated the task description. Sep 11 2019, 5:14 AM
jijiki removed a subscriber: jijiki. Sep 11 2019, 10:06 AM
jcrespo updated the task description. Sep 17 2019, 4:59 AM
Marostegui updated the task description. Sep 18 2019, 5:28 AM
Marostegui updated the task description. Sep 28 2019, 5:00 AM
Marostegui updated the task description. Nov 12 2019, 11:14 AM
Marostegui updated the task description. Nov 20 2019, 8:19 AM
Marostegui updated the task description. Nov 21 2019, 8:28 AM
Marostegui closed this task as Resolved. Dec 3 2019, 5:56 AM
Marostegui updated the task description.

All these hosts have been sent for decommissioning.
Going to close this for now.