Predictive failures on disk S.M.A.R.T. status
Open, Low, Public

Description

We have a number of disks in predictive failure that should be taken care of; however, it is not worth replacing those disks until they actually fail.
I keep this list updated.

  • db2035 s2 master T224456 to be decommissioned
  • db2037 m5 codfw master T221512 to be decommissioned
  • db2043 s3 master to be decommissioned
  • db2044 m2 codfw master T217755 T227829 to be decommissioned
  • db2047 s7 master T212966 # to be decommissioned
  • db2049 s2 T227107 to be decommissioned
  • db2050 s3 to be decommissioned
  • db2051 s4 to be decommissioned
  • db2052 s5 master T218776 to be decommissioned
  • db2053 s6 to be decommissioned T231407
  • db2055 to be decommissioned T233186
  • db2061 s7
  • db2063 s2 to be decommissioned
  • db2067 m2
  • db2070 s1 T219852
  • db1070 s5 master decommissioned
  • db1072 m3 master to be decommissioned
  • db1073 m5 master T215050 to be decommissioned
  • db1065 m2 master to be decommissioned
  • db1063 m1 master T211537 to be decommissioned
  • db1069 x1 master to be decommissioned

Related Objects

StatusAssignedTask
OpenNone
DeclinedCmjohnson
OpenMarostegui
DeclinedNone
ResolvedPapaul
ResolvedPapaul
ResolvedPapaul
ResolvedPapaul
ResolvedPapaul
ResolvedPapaul
ResolvedPapaul
ResolvedPapaul
ResolvedPapaul
ResolvedPapaul
ResolvedPapaul
ResolvedPapaul
ResolvedPapaul
ResolvedPapaul
ResolvedPapaul
ResolvedPapaul
ResolvedPapaul
ResolvedPapaul
ResolvedPapaul
StalledMarostegui
ResolvedPapaul
ResolvedPapaul
OpenPapaul
ResolvedMarostegui
OpenJclark-ctr
ResolvedMarostegui

Event Timeline

Banyek added a comment. Dec 3 2018, 9:16 AM

db1063

name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: Optimal
	Number Of Drives per span: 2
	Number of Spans: 6
	Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 3 - Number of PDs: 2

			PD: 1 Information
			Enclosure Device ID: 32
			Slot Number: 7
			Drive's position: DiskGroup: 0, Span: 3, Arm: 1
			Media Error Count: 2
			Other Error Count: 0
			Predictive Failure Count: =====> 1 <=====
			Last Predictive Failure Event Seq Number: 2776

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: Online, Spun Up
				Media Type: Hard Disk Device
				Drive Temperature: 34C (93.20 F)
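
Reports like the one above can be scanned automatically for drives with a non-zero predictive failure count. The following is a minimal sketch, assuming the report keeps the "Slot Number:" and "Predictive Failure Count:" field names shown above; the function name and the sample text are illustrative and not part of any existing tooling.

```python
import re
from typing import List, Tuple


def predictive_failures(report: str) -> List[Tuple[int, int]]:
    """Return (slot, count) for each drive with a non-zero predictive failure count."""
    results = []
    slot = None
    for line in report.splitlines():
        slot_match = re.search(r"Slot Number:\s*(\d+)", line)
        if slot_match:
            # Remember which slot the following counters belong to.
            slot = int(slot_match.group(1))
            continue
        # \D* skips decoration such as "=====> " around the number.
        count_match = re.search(r"Predictive Failure Count:\D*(\d+)", line)
        if count_match and slot is not None:
            count = int(count_match.group(1))
            if count > 0:
                results.append((slot, count))
    return results


# Sample trimmed from the db1063 report above.
SAMPLE = """\
PD: 1 Information
Enclosure Device ID: 32
Slot Number: 7
Media Error Count: 2
Other Error Count: 0
Predictive Failure Count: =====> 1 <=====
Last Predictive Failure Event Seq Number: 2776
"""

print(predictive_failures(SAMPLE))  # [(7, 1)]
```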
Marostegui updated the task description. Dec 10 2018, 6:35 AM
Marostegui updated the task description. Jan 1 2019, 12:49 PM
Marostegui updated the task description. Jan 4 2019, 7:44 PM
Marostegui updated the task description. Jan 8 2019, 2:30 PM
Marostegui updated the task description. Jan 21 2019, 4:54 PM
Marostegui updated the task description.
Marostegui updated the task description. Feb 8 2019, 6:11 AM
Marostegui updated the task description. Feb 12 2019, 6:40 AM
jcrespo updated the task description. Mar 6 2019, 12:03 PM

db2052:

root@db2052:~# hpssacli controller all show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 001438033746C30)


   Port Name: 1I

   Port Name: 2I

   Gen8 ServBP 12+2 at Port 1I, Box 1, OK
   array A (SAS, Unused Space: 0  MB)


      logicaldrive 1 (3.3 TB, RAID 1+0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)
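
The "physicaldrive" lines in this output have a regular shape, so drives that are not OK can be picked out with a small parser. This is a rough sketch, assuming the line layout stays exactly as pasted above (id, port description, interface, size, status); the function name is hypothetical.

```python
import re
from typing import List, Tuple

# Matches e.g.:
# physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Predictive Failure)
DRIVE_RE = re.compile(
    r"physicaldrive\s+(\S+)\s+\(port [^,]+,\s*(\w+),\s*([^,]+),\s*([^)]+)\)"
)


def drives_not_ok(config_output: str) -> List[Tuple[str, str]]:
    """Return (drive_id, status) for every physical drive whose status is not OK."""
    bad = []
    for match in DRIVE_RE.finditer(config_output):
        drive_id, _interface, _size, status = match.groups()
        status = status.strip()
        if status != "OK":
            bad.append((drive_id, status))
    return bad


# Three lines copied from the db2052 output above.
SAMPLE = """\
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
"""

print(drives_not_ok(SAMPLE))  # [('1I:1:10', 'Predictive Failure')]
```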
Marostegui updated the task description. Mar 19 2019, 6:22 AM
Marostegui updated the task description. Mar 20 2019, 2:17 PM
Marostegui updated the task description. Mar 21 2019, 4:05 PM
Marostegui updated the task description. Edited Apr 1 2019, 5:10 AM
root@db2070:~# hpssacli controller all show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 0014380337FADD0)


   Port Name: 1I

   Port Name: 2I

   Gen8 ServBP 12+2 at Port 1I, Box 1, OK
   array A (SAS, Unused Space: 0  MB)


      logicaldrive 1 (3.3 TB, RAID 1+0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)
Marostegui updated the task description. Apr 2 2019, 6:28 AM
Marostegui updated the task description.
Marostegui updated the task description. Apr 9 2019, 9:05 AM

db2037, m5 codfw master:

root@db2037:~# hpssacli controller all show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 0014380312088E0)


   Port Name: 1I

   Port Name: 2I

   Gen8 ServBP 12+2 at Port 1I, Box 1, OK
   array A (SAS, Unused Space: 0  MB)


      logicaldrive 1 (3.3 TB, RAID 1+0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)
Marostegui updated the task description. Apr 12 2019, 4:57 AM

db2044 again:

root@db2044:~# hpssacli controller all show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 0014380264FFFB0)


   Port Name: 1I

   Port Name: 2I

   Gen8 ServBP 12+2 at Port 1I, Box 1, OK
   array A (SAS, Unused Space: 0  MB)


      logicaldrive 1 (3.3 TB, RAID 1+0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)
Marostegui updated the task description. Apr 12 2019, 4:58 AM
Marostegui added a comment. Edited Apr 19 2019, 8:07 AM

db2047 has another disk in predictive failure:

logicaldrive 1 (3.3 TB, RAID 1+0, OK)

physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Predictive Failure)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, Predictive Failure)

They are on different SPANs:

root@db2047:~# hpssacli controller all show config detail
<snip>
      Logical Drive: 1
         Size: 3.3 TB
         Fault Tolerance: 1+0
         Heads: 255
         Sectors Per Track: 32
         Cylinders: 65535
         Strip Size: 256 KB
         Full Stripe Size: 1536 KB
         Status: OK
         Caching:  Enabled
         Unique Identifier: 600508B1001CD41C53362A4E633F9D52
         Disk Name: /dev/sda
         Mount Points: / 37.3 GB Partition Number 2
         OS Status: LOCKED
         Logical Drive Label: A41E281B0014380337E0DB0F072
         Mirror Group 1:
            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Predictive Failure)
            physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
            physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
            physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
            physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
            physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
         Mirror Group 2:
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
            physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
            physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
            physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
            physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
            physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, Predictive Failure)
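
To see at a glance how the failing drives are spread across the mirror groups in the detail output, the listing can be grouped by its "Mirror Group N:" headers. This is only a sketch, assuming the headers and drive lines look exactly as pasted above; it groups by the headers as printed and makes no claim about which drives mirror each other. The function name is hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List


def failing_by_mirror_group(detail_output: str) -> Dict[int, List[str]]:
    """Map mirror group number to the ids of its drives whose status is not OK."""
    groups = defaultdict(list)
    current = None
    for line in detail_output.splitlines():
        header = re.match(r"\s*Mirror Group (\d+):", line)
        if header:
            current = int(header.group(1))
            continue
        # The status is the last comma-separated field inside the parentheses.
        drive = re.match(r"\s*physicaldrive (\S+) \(.*, ([^,)]+)\)\s*$", line)
        if drive and current is not None and drive.group(2).strip() != "OK":
            groups[current].append(drive.group(1))
    return dict(groups)


# Lines trimmed from the db2047 detail output above.
SAMPLE = """\
         Mirror Group 1:
            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Predictive Failure)
            physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
         Mirror Group 2:
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
            physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, Predictive Failure)
"""

print(failing_by_mirror_group(SAMPLE))  # {1: ['1I:1:1'], 2: ['1I:1:12']}
```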
Marostegui updated the task description. Apr 21 2019, 7:02 AM

T222526 db2049 (again?)

Marostegui updated the task description. May 6 2019, 5:08 AM

You might be confusing it with db2047; I don't recall db2049 having a disk replaced lately.

Marostegui updated the task description. Feb 12 2019, 07:40:

https://phabricator.wikimedia.org/transactions/detail/PHID-XACT-TASK-xk55krwzcenljvw/

That's almost 3 months ago, that's why I mentioned "lately" :-)

Marostegui updated the task description. May 31 2019, 3:31 PM
Marostegui updated the task description. Jun 3 2019, 9:56 AM
Marostegui updated the task description. Jun 16 2019, 3:04 PM
Marostegui updated the task description. Jun 16 2019, 3:50 PM
Marostegui updated the task description. Jun 17 2019, 10:25 AM
Marostegui updated the task description. Jun 23 2019, 5:43 AM
Marostegui updated the task description. Jun 24 2019, 1:12 PM
Marostegui updated the task description. Jun 24 2019, 5:58 PM
Marostegui updated the task description. Jul 1 2019, 4:40 AM
Marostegui updated the task description. Jul 3 2019, 6:25 AM
Marostegui updated the task description. Jul 4 2019, 5:02 AM
Marostegui updated the task description. Jul 9 2019, 10:04 AM
Marostegui updated the task description.
jcrespo updated the task description. Jul 12 2019, 6:37 AM
Marostegui updated the task description. Jul 18 2019, 5:42 AM
Marostegui updated the task description. Jul 30 2019, 7:16 AM
Marostegui updated the task description. Jul 30 2019, 7:22 AM
Marostegui updated the task description. Jul 30 2019, 7:26 AM
Marostegui updated the task description. Aug 9 2019, 8:37 AM
Marostegui updated the task description. Aug 12 2019, 9:34 AM

db2044 now has a second disk in predictive failure:

# hpssacli controller all show config

Smart Array P420i in Slot 0 (Embedded)    (sn: 0014380264FFFB0)

   Port Name: 1I

   Port Name: 2I

   Gen8 ServBP 12+2 at Port 1I, Box 1, OK
   array A (SAS, Unused Space: 0  MB)


      logicaldrive 1 (3.3 TB, RAID 1+0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, Predictive Failure)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)

   Enclosure SEP (Vendor ID HP, Model Gen8 ServBP 12+2) 378  (WWID: 50014380324D4EB9, Port: 1I, Box: 1)

   Expander 380  (WWID: 50014380324D4EA0, Port: 1I, Box: 1)

   SEP (Vendor ID PMCSIERA, Model SRCv8x6G) 379  (WWID: 50014380264FFFBF)

Yeah, I am replacing that host today hopefully

Marostegui updated the task description. Aug 20 2019, 10:54 AM
Marostegui updated the task description. Aug 21 2019, 10:09 AM
Marostegui updated the task description. Aug 28 2019, 6:28 AM
Marostegui updated the task description. Aug 29 2019, 7:48 AM
Marostegui updated the task description. Sep 3 2019, 6:29 AM
Marostegui updated the task description. Sep 4 2019, 8:48 AM
Marostegui updated the task description. Sep 11 2019, 5:14 AM
jijiki removed a subscriber: jijiki. Sep 11 2019, 10:06 AM
jcrespo updated the task description. Sep 17 2019, 4:59 AM
Marostegui updated the task description. Sep 18 2019, 5:28 AM
Marostegui updated the task description. Sep 28 2019, 5:00 AM
Marostegui updated the task description. Tue, Nov 12, 11:14 AM