Predictive failures on disk S.M.A.R.T. status
Open, Low, Public

Description

We have a bunch of predictive failures that should be taken care of; however, it is not worth replacing those disks until they actually fail.
I keep this list updated.

  • db2035 s2 master T224456 to be decommissioned
  • db2037 m5 codfw master T221512 to be decommissioned
  • db2043 s3 master to be decommissioned
  • db2044 m2 codfw master T217755 T227829 to be decommissioned
  • db2047 s7 master T212966 # to be decommissioned
  • db2049 s2 T227107 to be decommissioned
  • db2050 s3 to be decommissioned
  • db2051 s4 to be decommissioned
  • db2052 s5 master T218776 to be decommissioned
  • db2053 s6 T231407 to be decommissioned
  • db2061 s7
  • db2063 s2 to be decommissioned
  • db2067 m2
  • db2070 s1 T219852
  • db1072 m3 master to be decommissioned
  • db1073 m5 master T215050 to be decommissioned
  • db1065 m2 master to be decommissioned
  • db1063 m1 master T211537 to be decommissioned
  • db1069 x1 master to be decommissioned

Event Timeline

Marostegui updated the task description. Nov 21 2018, 7:37 AM
Marostegui updated the task description.

db2044 got its disk replaced but came up with predictive failure (T210049#4767169)

Banyek updated the task description. Dec 3 2018, 9:16 AM

db1063

name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: Optimal
	Number Of Drives per span: 2
	Number of Spans: 6
	Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 3 - Number of PDs: 2

			PD: 1 Information
			Enclosure Device ID: 32
			Slot Number: 7
			Drive's position: DiskGroup: 0, Span: 3, Arm: 1
			Media Error Count: 2
			Other Error Count: 0
			Predictive Failure Count: =====> 1 <=====
			Last Predictive Failure Event Seq Number: 2776

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: Online, Spun Up
				Media Type: Hard Disk Device
				Drive Temperature: 34C (93.20 F)
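
A controller dump like the one above can be scanned for non-zero counters instead of being read by eye. The following is a minimal sketch, assuming the output has been saved as text and that each drive section contains a `Slot Number:` line followed by its counters, as in the paste. The function name `pd_error_counts` is hypothetical, and the regex deliberately tolerates the `=====>` highlight that was added by hand.

```python
import re

def pd_error_counts(text):
    """Return one dict per physical drive with its slot and error counters."""
    drives = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"Slot Number:\s*(\d+)", line)
        if m:
            # A new "Slot Number" line starts a new drive record.
            current = {"slot": int(m.group(1)),
                       "media_errors": 0,
                       "predictive_failures": 0}
            drives.append(current)
            continue
        if current is None:
            continue
        m = re.match(r"Media Error Count:\D*(\d+)", line)
        if m:
            current["media_errors"] = int(m.group(1))
        # \D* skips any hand-added markers such as "=====>" before the number.
        m = re.match(r"Predictive Failure Count:\D*(\d+)", line)
        if m:
            current["predictive_failures"] = int(m.group(1))
    return drives

# Sample taken from the db1063 paste above.
SAMPLE = """\
PD: 1 Information
Enclosure Device ID: 32
Slot Number: 7
Drive's position: DiskGroup: 0, Span: 3, Arm: 1
Media Error Count: 2
Other Error Count: 0
Predictive Failure Count: =====> 1 <=====
Last Predictive Failure Event Seq Number: 2776
"""

for d in pd_error_counts(SAMPLE):
    if d["predictive_failures"] > 0:
        print(f"slot {d['slot']}: predictive failures={d['predictive_failures']}, "
              f"media errors={d['media_errors']}")
```
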
Marostegui updated the task description. Dec 10 2018, 6:35 AM
Marostegui updated the task description. Jan 1 2019, 12:49 PM
Marostegui updated the task description. Jan 4 2019, 7:44 PM
Marostegui updated the task description. Jan 8 2019, 2:30 PM
Marostegui updated the task description. Jan 21 2019, 4:54 PM
Marostegui updated the task description.
Marostegui updated the task description. Feb 8 2019, 6:11 AM
Marostegui updated the task description. Feb 12 2019, 6:40 AM
jcrespo updated the task description. Mar 6 2019, 12:03 PM

db2052:

root@db2052:~# hpssacli controller all show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 001438033746C30)


   Port Name: 1I

   Port Name: 2I

   Gen8 ServBP 12+2 at Port 1I, Box 1, OK
   array A (SAS, Unused Space: 0  MB)


      logicaldrive 1 (3.3 TB, RAID 1+0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)
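
Since the same `show config` listing is pasted for several hosts, a small parser can pull out just the drives that are not `OK`. This is a rough sketch, assuming the output is available as text and that the drive status is the last comma-separated field inside the parentheses, which is inferred from the pastes in this task rather than from the tool's documentation. `non_ok_drives` is a hypothetical helper name.

```python
import re

# Matches lines such as:
#   physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Predictive Failure)
PD_LINE = re.compile(r"physicaldrive\s+(?P<id>\S+)\s+\((?P<attrs>.*)\)\s*$")

def non_ok_drives(text):
    """Return (drive_id, status) for every physical drive whose status is not OK."""
    result = []
    for line in text.splitlines():
        m = PD_LINE.search(line)
        if not m:
            continue
        # Assumption: the status is the last comma-separated attribute.
        status = m.group("attrs").split(",")[-1].strip()
        if status != "OK":
            result.append((m.group("id"), status))
    return result

# Three lines copied from the db2052 output above.
SAMPLE = """\
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
"""

for drive, status in non_ok_drives(SAMPLE):
    print(f"{drive}: {status}")
```
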
Marostegui updated the task description. Mar 19 2019, 6:22 AM
Marostegui updated the task description. Mar 20 2019, 2:17 PM
Marostegui updated the task description. Mar 21 2019, 4:05 PM
Marostegui updated the task description. Edited Apr 1 2019, 5:10 AM
root@db2070:~# hpssacli controller all show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 0014380337FADD0)


   Port Name: 1I

   Port Name: 2I

   Gen8 ServBP 12+2 at Port 1I, Box 1, OK
   array A (SAS, Unused Space: 0  MB)


      logicaldrive 1 (3.3 TB, RAID 1+0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)
Marostegui updated the task description. Apr 2 2019, 6:28 AM
Marostegui updated the task description.
Marostegui updated the task description. Apr 9 2019, 9:05 AM

db2037, m5 codfw master:

root@db2037:~# hpssacli controller all show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 0014380312088E0)


   Port Name: 1I

   Port Name: 2I

   Gen8 ServBP 12+2 at Port 1I, Box 1, OK
   array A (SAS, Unused Space: 0  MB)


      logicaldrive 1 (3.3 TB, RAID 1+0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)
Marostegui updated the task description. Apr 12 2019, 4:57 AM

db2044 again:

root@db2044:~# hpssacli controller all show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 0014380264FFFB0)


   Port Name: 1I

   Port Name: 2I

   Gen8 ServBP 12+2 at Port 1I, Box 1, OK
   array A (SAS, Unused Space: 0  MB)


      logicaldrive 1 (3.3 TB, RAID 1+0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)
Marostegui updated the task description. Apr 12 2019, 4:58 AM
Marostegui added a comment. Edited Apr 19 2019, 8:07 AM

db2047 has another disk in predictive failure:

logicaldrive 1 (3.3 TB, RAID 1+0, OK)

physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Predictive Failure)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, Predictive Failure)

They are on different SPANs:

root@db2047:~# hpssacli controller all show config detail
<snip>
      Logical Drive: 1
         Size: 3.3 TB
         Fault Tolerance: 1+0
         Heads: 255
         Sectors Per Track: 32
         Cylinders: 65535
         Strip Size: 256 KB
         Full Stripe Size: 1536 KB
         Status: OK
         Caching:  Enabled
         Unique Identifier: 600508B1001CD41C53362A4E633F9D52
         Disk Name: /dev/sda
         Mount Points: / 37.3 GB Partition Number 2
         OS Status: LOCKED
         Logical Drive Label: A41E281B0014380337E0DB0F072
         Mirror Group 1:
            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Predictive Failure)
            physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
            physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
            physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
            physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
            physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
         Mirror Group 2:
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
            physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
            physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
            physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
            physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
            physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, Predictive Failure)
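
The check done by eye above (which mirror group each flagged drive belongs to) can be sketched in code. This is only an illustration: it assumes the `show config detail` layout shown in the paste, and `share_position` additionally assumes that drives at the same position in each mirror group form a mirror pair, which is not confirmed anywhere in this task. Both function names are hypothetical.

```python
import re

def flagged_by_mirror_group(text):
    """Return (group, position, drive_id) for each non-OK drive in a mirror group."""
    groups = {}
    current = None
    for line in text.splitlines():
        m = re.match(r"\s*Mirror Group (\d+):", line)
        if m:
            current = int(m.group(1))
            groups[current] = []
            continue
        m = re.match(r"\s*physicaldrive\s+(\S+)\s+\((.*)\)\s*$", line)
        if m and current is not None:
            # Assumption: status is the last comma-separated attribute.
            status = m.group(2).split(",")[-1].strip()
            groups[current].append((m.group(1), status))
    flagged = []
    for group, drives in groups.items():
        for position, (drive, status) in enumerate(drives):
            if status != "OK":
                flagged.append((group, position, drive))
    return flagged

def share_position(flagged):
    """True if two flagged drives sit at the same position in different groups.

    Assumption (unverified): same position in each group means a mirror pair.
    """
    seen = {}
    for group, position, _ in flagged:
        if position in seen and seen[position] != group:
            return True
        seen.setdefault(position, group)
    return False

# Mirror group section copied from the db2047 output above.
SAMPLE = """\
         Mirror Group 1:
            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Predictive Failure)
            physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
            physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
            physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
            physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
            physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
         Mirror Group 2:
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
            physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
            physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
            physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
            physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
            physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, Predictive Failure)
"""

flagged = flagged_by_mirror_group(SAMPLE)
for group, position, drive in flagged:
    print(f"{drive}: mirror group {group}, position {position}")
print("possible mirror pair affected:", share_position(flagged))
```
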
Marostegui updated the task description. Apr 21 2019, 7:02 AM

T222526 db2049 (again?)

Marostegui updated the task description. May 6 2019, 5:08 AM

You might be confused with db2047, I don't recall db2049 having a disk replaced lately

Marostegui updated the task description. Feb 12 2019, 07:40:

https://phabricator.wikimedia.org/transactions/detail/PHID-XACT-TASK-xk55krwzcenljvw/

That's almost 3 months ago, that's why I mentioned "lately" :-)

Marostegui updated the task description. May 31 2019, 3:31 PM
Marostegui updated the task description. Jun 3 2019, 9:56 AM
Marostegui updated the task description. Jun 16 2019, 3:04 PM
Marostegui updated the task description. Jun 16 2019, 3:50 PM
Marostegui updated the task description. Jun 17 2019, 10:25 AM
Marostegui updated the task description. Jun 23 2019, 5:43 AM
Marostegui updated the task description. Jun 24 2019, 1:12 PM
Marostegui updated the task description. Jun 24 2019, 5:58 PM
Marostegui updated the task description. Jul 1 2019, 4:40 AM
Marostegui updated the task description. Jul 3 2019, 6:25 AM
Marostegui updated the task description. Jul 4 2019, 5:02 AM
Marostegui updated the task description. Jul 9 2019, 10:04 AM
Marostegui updated the task description.
jcrespo updated the task description. Jul 12 2019, 6:37 AM
Marostegui updated the task description. Jul 18 2019, 5:42 AM
Marostegui updated the task description. Jul 30 2019, 7:16 AM
Marostegui updated the task description. Jul 30 2019, 7:22 AM
Marostegui updated the task description. Jul 30 2019, 7:26 AM
Marostegui updated the task description. Aug 9 2019, 8:37 AM
Marostegui updated the task description. Aug 12 2019, 9:34 AM

db2044 now has a second disk in predictive failure:

# hpssacli controller all show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 0014380264FFFB0)

   Port Name: 1I

   Port Name: 2I

   Gen8 ServBP 12+2 at Port 1I, Box 1, OK
   array A (SAS, Unused Space: 0  MB)


      logicaldrive 1 (3.3 TB, RAID 1+0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)

   Enclosure SEP (Vendor ID HP, Model Gen8 ServBP 12+2) 378  (WWID: 50014380324D4EB9, Port: 1I, Box: 1)

   Expander 380  (WWID: 50014380324D4EA0, Port: 1I, Box: 1)

   SEP (Vendor ID PMCSIERA, Model SRCv8x6G) 379  (WWID: 50014380264FFFBF)

Yeah, I am replacing that host today hopefully

Marostegui updated the task description. Tue, Aug 20, 10:54 AM
Marostegui updated the task description. Wed, Aug 21, 10:09 AM
Marostegui updated the task description. Wed, Aug 28, 6:28 AM
Marostegui updated the task description. Thu, Aug 29, 7:48 AM
Marostegui updated the task description. Tue, Sep 3, 6:29 AM
Marostegui updated the task description. Wed, Sep 4, 8:48 AM
Marostegui updated the task description. Wed, Sep 11, 5:14 AM
jijiki removed a subscriber: jijiki. Wed, Sep 11, 10:06 AM