Predictive failures on disk S.M.A.R.T. status
Open, Low, Public

Description

We have a number of disks reporting predictive failures that should be taken care of. However, it is not worth replacing those disks until they actually fail.
I keep this list updated.

  • db2037 m5 codfw master
  • db2044 m2 codfw master T217755
  • db2047 s7 master T212966
  • db2049 s2
  • db2050 s3
  • db2051 s4 master
  • db2052 s5 master T218776
  • db2053 s6
  • db2061 s7
  • db2070 s1 T219852
  • db1073 m5 master T215050
  • db1065 m2 master
  • db1063 m1 master T211537

Event Timeline

Banyek created this task. Oct 30 2018, 3:02 PM
Restricted Application added a subscriber: Aklapper. Oct 30 2018, 3:02 PM
Banyek triaged this task as Low priority. Oct 30 2018, 3:02 PM
jcrespo moved this task from Triage to Backlog on the DBA board. Oct 30 2018, 3:02 PM
Banyek moved this task from Backlog to In progress on the DBA board. Oct 30 2018, 3:17 PM
Banyek updated the task description. Nov 6 2018, 9:23 AM
Banyek added a subscriber: Papaul. Nov 7 2018, 3:54 PM

Once a disk has failed, we will get an automatic ticket to get that disk replaced. I don't think we need this tracking task.

Already caught up with Jaime about why this ticket exists. All good here.

db2044 came up with a predictive failure today:

root@db2044:~# hpssacli controller all show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 0014380264FFFB0)


   Port Name: 1I

   Port Name: 2I

   Gen8 ServBP 12+2 at Port 1I, Box 1, OK
   array A (SAS, Unused Space: 0  MB)


      logicaldrive 1 (3.3 TB, RAID 1+0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)
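
Listings like the one above can be scanned by script rather than by eye. The following is a minimal sketch, assuming the `physicaldrive ... (..., STATUS)` line layout shown in this ticket; the function name `predictive_failures` and the regular expression are illustrative and not part of any existing tooling.

```python
import re

# Matches lines such as:
#   physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, Predictive Failure)
# The layout is assumed from the output pasted in this task.
PHYSICALDRIVE_RE = re.compile(
    r"physicaldrive\s+(?P<id>\S+)\s+\((?P<details>[^)]*)\)"
)


def predictive_failures(config_output):
    """Return the IDs of drives whose last status field is 'Predictive Failure'."""
    failing = []
    for match in PHYSICALDRIVE_RE.finditer(config_output):
        # The status is the last comma-separated field inside the parentheses.
        status = match.group("details").split(",")[-1].strip()
        if status == "Predictive Failure":
            failing.append(match.group("id"))
    return failing


sample = """\
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
"""
print(predictive_failures(sample))  # ['1I:1:3']
```

The text passed in would be the captured output of `hpssacli controller all show config`; how it is captured (SSH, cron, monitoring check) is left open here.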
Marostegui updated the task description. Nov 21 2018, 7:25 AM
Marostegui updated the task description. Nov 21 2018, 7:37 AM
Marostegui updated the task description.

db2044 got its disk replaced but came up with a predictive failure (T210049#4767169)

Banyek updated the task description. Dec 3 2018, 9:16 AM

db1063

name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: Optimal
	Number Of Drives per span: 2
	Number of Spans: 6
	Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 3 - Number of PDs: 2

			PD: 1 Information
			Enclosure Device ID: 32
			Slot Number: 7
			Drive's position: DiskGroup: 0, Span: 3, Arm: 1
			Media Error Count: 2
			Other Error Count: 0
			Predictive Failure Count: =====> 1 <=====
			Last Predictive Failure Event Seq Number: 2776

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: Online, Spun Up
				Media Type: Hard Disk Device
				Drive Temperature: 34C (93.20 F)
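
The per-drive report above uses `Key: value` lines, so the predictive failure counter can be extracted per slot. This is a rough sketch under that assumption; the function name is hypothetical, and the `=====> 1 <=====` markers are treated as hand-added emphasis, so only the first number in the value is read. Slot 6 in the sample is made up for illustration.

```python
import re


def slots_with_predictive_failures(pd_output):
    """Map slot number -> predictive failure count, for counts above zero."""
    result = {}
    slot = None
    for line in pd_output.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # not a "Key: value" line
        if key == "Slot Number":
            slot = int(value)
        elif key == "Predictive Failure Count" and slot is not None:
            # Tolerate decoration around the number, e.g. "=====> 1 <=====".
            digits = re.search(r"\d+", value)
            count = int(digits.group()) if digits else 0
            if count > 0:
                result[slot] = count
    return result


sample = """\
Slot Number: 6
Predictive Failure Count: 0
Slot Number: 7
Media Error Count: 2
Predictive Failure Count: =====> 1 <=====
Last Predictive Failure Event Seq Number: 2776
"""
print(slots_with_predictive_failures(sample))  # {7: 1}
```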
Marostegui updated the task description. Dec 10 2018, 6:35 AM
Marostegui updated the task description. Jan 1 2019, 12:49 PM
Marostegui updated the task description. Jan 4 2019, 7:44 PM
Marostegui updated the task description. Jan 8 2019, 2:30 PM
Marostegui updated the task description. Jan 21 2019, 4:54 PM
Marostegui updated the task description.
Marostegui updated the task description. Feb 8 2019, 6:11 AM
Marostegui updated the task description. Feb 12 2019, 6:40 AM
jcrespo updated the task description. Mar 6 2019, 12:03 PM

db2052:

root@db2052:~# hpssacli controller all show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 001438033746C30)


   Port Name: 1I

   Port Name: 2I

   Gen8 ServBP 12+2 at Port 1I, Box 1, OK
   array A (SAS, Unused Space: 0  MB)


      logicaldrive 1 (3.3 TB, RAID 1+0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)
Marostegui updated the task description. Mar 19 2019, 6:22 AM
Marostegui updated the task description. Wed, Mar 20, 2:17 PM
Marostegui updated the task description. Thu, Mar 21, 4:05 PM
Marostegui updated the task description. Edited Mon, Apr 1, 5:10 AM
root@db2070:~# hpssacli controller all show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 0014380337FADD0)


   Port Name: 1I

   Port Name: 2I

   Gen8 ServBP 12+2 at Port 1I, Box 1, OK
   array A (SAS, Unused Space: 0  MB)


      logicaldrive 1 (3.3 TB, RAID 1+0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)
Marostegui updated the task description. Tue, Apr 2, 6:28 AM
Marostegui updated the task description.
Marostegui updated the task description. Tue, Apr 9, 9:05 AM

db2037, m5 codfw master:

root@db2037:~# hpssacli controller all show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 0014380312088E0)


   Port Name: 1I

   Port Name: 2I

   Gen8 ServBP 12+2 at Port 1I, Box 1, OK
   array A (SAS, Unused Space: 0  MB)


      logicaldrive 1 (3.3 TB, RAID 1+0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)
Marostegui updated the task description. Fri, Apr 12, 4:57 AM

db2044 again:

root@db2044:~# hpssacli controller all show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 0014380264FFFB0)


   Port Name: 1I

   Port Name: 2I

   Gen8 ServBP 12+2 at Port 1I, Box 1, OK
   array A (SAS, Unused Space: 0  MB)


      logicaldrive 1 (3.3 TB, RAID 1+0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)
Marostegui updated the task description. Fri, Apr 12, 4:58 AM
Marostegui added a comment. Edited Fri, Apr 19, 8:07 AM

db2047 has another disk with a predictive failure:

logicaldrive 1 (3.3 TB, RAID 1+0, OK)

physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Predictive Failure)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, Predictive Failure)

They are in different mirror groups (spans):

root@db2047:~# hpssacli controller all show config detail
<snip>
      Logical Drive: 1
         Size: 3.3 TB
         Fault Tolerance: 1+0
         Heads: 255
         Sectors Per Track: 32
         Cylinders: 65535
         Strip Size: 256 KB
         Full Stripe Size: 1536 KB
         Status: OK
         Caching:  Enabled
         Unique Identifier: 600508B1001CD41C53362A4E633F9D52
         Disk Name: /dev/sda
         Mount Points: / 37.3 GB Partition Number 2
         OS Status: LOCKED
         Logical Drive Label: A41E281B0014380337E0DB0F072
         Mirror Group 1:
            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Predictive Failure)
            physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
            physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
            physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
            physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
            physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
         Mirror Group 2:
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
            physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
            physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
            physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
            physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
            physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, Predictive Failure)
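
The mirror group check above can be sketched as a small parser over the `show config detail` output. It only groups drives in predictive failure by the `Mirror Group N:` heading they appear under, assuming the layout pasted in this task. It does not determine which drives are mirror partners of each other. The function name is illustrative.

```python
import re

GROUP_RE = re.compile(r"Mirror Group (\d+):")
DRIVE_RE = re.compile(r"physicaldrive\s+(\S+)\s+\(([^)]*)\)")


def failing_by_mirror_group(detail_output):
    """Map mirror group number -> IDs of its drives in 'Predictive Failure'."""
    groups = {}
    current = None
    for line in detail_output.splitlines():
        group = GROUP_RE.search(line)
        if group:
            current = int(group.group(1))
            groups.setdefault(current, [])
            continue
        drive = DRIVE_RE.search(line)
        # Drives listed before any "Mirror Group" heading are ignored.
        if drive and current is not None:
            status = drive.group(2).split(",")[-1].strip()
            if status == "Predictive Failure":
                groups[current].append(drive.group(1))
    return groups


sample = """\
         Mirror Group 1:
            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Predictive Failure)
            physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
         Mirror Group 2:
            physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
            physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, Predictive Failure)
"""
print(failing_by_mirror_group(sample))  # {1: ['1I:1:1'], 2: ['1I:1:12']}
```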