
Alert in need of triage: SmartNotHealthy (instance an-worker1086:9100)
Closed, Resolved (Public)

Description

The alert SmartNotHealthy started firing 1 month ago.

Labels
alertname=SmartNotHealthy
cluster=analytics
device=sat+megaraid,9
instance=an-worker1086:9100
job=node
prometheus=ops
severity=warning
site=eqiad
source=prometheus
team=sre
Annotations
Name: Content
dashboard: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=an-worker1086
description: The disk SMART status is *not* healthy, this could be an early warning before the disk fails.
runbook: https://wikitech.wikimedia.org/wiki/SMART#Alerts
summary: Disk not healthy

Triage metadata. Do not delete.
fingerprint=95e1fe73b01db69c

Event Timeline

BTullis triaged this task as High priority. Nov 29 2023, 9:59 AM
BTullis moved this task from Incoming to Ready for Work on the Data-Platform-SRE board.

This is confirmed.

Enclosure Device ID: 32
Slot Number: 9
Drive's position: DiskGroup: 12, Span: 0, Arm: 0
Enclosure position: 1
Device Id: 9
WWN: 5000cca269e30cfe
Sequence Number: 2
Media Error Count: 630531
Other Error Count: 2
Predictive Failure Count: 56
Last Predictive Failure Event Seq Number: 1436661
PD Type: SATA

Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Non Coerced Size: 3.637 TB [0x1d1b0beb0 Sectors]
Coerced Size: 3.637 TB [0x1d1b00000 Sectors]
Sector Size:  512
Logical Sector Size:  512
Physical Sector Size:  512
Firmware state: Online, Spun Up
Device Firmware Level: KV35
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b37b0407c9
Connected Port Number: 0(path0) 
Inquiry Data:             K7JH39KTHGST HUS726040ALA610                    A5DEKV35
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Hard Disk Device
Drive Temperature :39C (102.20 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : N/A
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Drive has flagged a S.M.A.R.T alert : Yes
btullis@an-worker1086:~$ sudo smartctl --info --health -d megaraid,9 /dev/sdk
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-26-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Ultrastar 7K6000
Device Model:     HGST HUS726040ALA610
Serial Number:    K7JH39KT
LU WWN Device Id: 5 000cca 269e30cfe
Add. Product Id:  DELL(tm)
Firmware Version: A5DEKV35
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Dec  1 12:42:13 2023 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: ATA return descriptor not supported by controller firmware
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
Warning: This result is based on an Attribute check.
Failed Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 412
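
To make the failing attribute easier to spot in longer smartctl output, here is a minimal sketch that filters the attribute table for rows marked FAILING_NOW. The function name `failing_smart_attributes` is hypothetical, and it assumes the ten-column table layout shown above.

```shell
# Print SMART attributes that smartctl marks as FAILING_NOW.
# Reads smartctl output on stdin; assumes the 10-column attribute table
# layout shown above (ID#, NAME, FLAG, VALUE, WORST, THRESH, TYPE,
# UPDATED, WHEN_FAILED, RAW_VALUE).
failing_smart_attributes() {
  # Attribute rows start with a numeric ID; column 9 is WHEN_FAILED.
  awk '$1 ~ /^[0-9]+$/ && $9 == "FAILING_NOW" {
    print $2, "value=" $4, "thresh=" $6, "raw=" $10
  }'
}
```

It could be used as `sudo smartctl --info --health -d megaraid,9 /dev/sdk | failing_smart_attributes`. In the output above the normalized value (001) is below the threshold (005), which is why the attribute is reported as failing.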

I will request a replacement from DC Ops.

The disk has been replaced. Now following procedures outlined here: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoop/Administration#Swapping_broken_disk

We can see the new disk.

btullis@an-worker1086:~$ sudo megacli -PDList -aAll | grep Firm
Firmware state: Online, Spun Up
Device Firmware Level: KV35
Firmware state: Online, Spun Up
Device Firmware Level: KV35
Firmware state: Online, Spun Up
Device Firmware Level: KV35
Firmware state: Online, Spun Up
Device Firmware Level: KV35
Firmware state: Online, Spun Up
Device Firmware Level: KV35
Firmware state: Online, Spun Up
Device Firmware Level: KV35
Firmware state: Online, Spun Up
Device Firmware Level: KV35
Firmware state: Online, Spun Up
Device Firmware Level: KV35
Firmware state: Online, Spun Up
Device Firmware Level: KV35
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: FJ2D
Firmware state: Online, Spun Up
Device Firmware Level: KV35
Firmware state: Online, Spun Up
Device Firmware Level: KV35
Firmware state: Online, Spun Up
Device Firmware Level: DL43
Firmware state: Online, Spun Up
Device Firmware Level: DL43

The new disk has Enclosure Device ID 32 and is in slot number 9.

Enclosure Device ID: 32
Slot Number: 9
Enclosure position: 1
Device Id: 9
WWN: 500003978bd01cf2
Sequence Number: 7
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Non Coerced Size: 3.637 TB [0x1d1b0beb0 Sectors]
Coerced Size: 3.637 TB [0x1d1b00000 Sectors]
Sector Size:  512
Logical Sector Size:  512
Physical Sector Size:  512
Firmware state: Unconfigured(good), Spun Up
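
Rather than matching the grep output to a slot by eye, the enclosure and slot of any drive that is not Online can be pulled out in one pass. This is only a sketch: the function name `non_online_drives` is hypothetical and it assumes the field labels appear exactly as in the output above.

```shell
# List physical drives whose firmware state is not "Online".
# Reads `megacli -PDList -aAll` output on stdin and prints
# "[enclosure:slot] state" for each matching drive.
non_online_drives() {
  awk -F': *' '
    /^Enclosure Device ID:/ { enc = $2 }    # remember the current enclosure
    /^Slot Number:/         { slot = $2 }   # remember the current slot
    /^Firmware state:/ && $2 !~ /^Online/ { print "[" enc ":" slot "] " $2 }
  '
}
```

For example: `sudo megacli -PDList -aAll | non_online_drives`. The `[enclosure:slot]` form is the same one used when creating the logical drive later.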

We can see that virtual drive number 10 is missing.

btullis@an-worker1086:~$ sudo megacli -LDInfo -LAll -aAll|grep Virtual
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Virtual Drive: 1 (Target Id: 1)
Virtual Drive: 2 (Target Id: 2)
Virtual Drive: 3 (Target Id: 3)
Virtual Drive: 4 (Target Id: 4)
Virtual Drive: 5 (Target Id: 5)
Virtual Drive: 6 (Target Id: 6)
Virtual Drive: 7 (Target Id: 7)
Virtual Drive: 8 (Target Id: 8)
Virtual Drive: 9 (Target Id: 9)
Virtual Drive: 11 (Target Id: 11)
Virtual Drive: 12 (Target Id: 12)
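
The gap in the numbering can also be found mechanically. A minimal sketch, with a hypothetical function name, that prints any virtual drive number missing between 0 and the highest number seen:

```shell
# Print virtual drive numbers missing from the 0..max sequence.
# Reads `megacli -LDInfo -LAll -aAll` output on stdin.
# Note: a drive missing beyond the highest listed number cannot be detected.
missing_virtual_drives() {
  awk '
    /^Virtual Drive: [0-9]+/ {
      n = $3 + 0
      seen[n] = 1
      found = 1
      if (n > max) max = n
    }
    END {
      if (!found) exit          # no virtual drives listed at all
      for (i = 0; i <= max; i++) if (!(i in seen)) print i
    }
  '
}
```

For example: `sudo megacli -LDInfo -LAll -aAll | missing_virtual_drives`.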

There is a foreign configuration remaining.

btullis@an-worker1086:~$ sudo megacli -CfgForeign -Scan -a0
                                     
There are 1 foreign configuration(s) on controller 0.

Exit Code: 0x00
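
If this check were scripted, the count could be read from the scan output before deciding whether a clear is needed. A sketch under the assumption that the message keeps the wording shown above; the function name is hypothetical, and the output for the "no foreign configuration" case is not covered by this log, so the function simply prints 0 when no count line is found.

```shell
# Print the number of foreign configurations reported by
# `megacli -CfgForeign -Scan -a0` (read from stdin).
# Prints 0 if no "There are N foreign configuration(s)" line is present.
foreign_config_count() {
  awk '$1 == "There" && $3 ~ /^[0-9]+$/ && $4 == "foreign" { n = $3 }
       END { print n + 0 }'
}
```

For example: `sudo megacli -CfgForeign -Scan -a0 | foreign_config_count`.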

Cleared that foreign configuration.

btullis@an-worker1086:~$ sudo megacli -CfgForeign -Clear -a0
                                     
Foreign configuration 0 is cleared on controller 0.

Exit Code: 0x00

Added the new configuration:

btullis@an-worker1086:~$ sudo megacli -CfgLdAdd -r0 [32:9] -AfterLd9 -a0
                                     
Adapter 0: Created VD 10

Adapter 0: Configured the Adapter!!

Exit Code: 0x00
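
For reference, this is how I read the arguments of that command, written as a small helper that only prints the command line and does not run it. The helper name is hypothetical; the flags are the ones used above.

```shell
# Print (do not run) the megacli command that creates a single-disk
# RAID 0 logical drive, as used above.
#   -r0            RAID level 0
#   [ENC:SLOT]     enclosure device ID and slot number of the new disk
#   -AfterLdN      create the new virtual drive after VD N, so it takes
#                  the missing number (VD 10 after VD 9 in this case)
#   -aN            adapter (controller) number
cfg_ld_add_cmd() {
  enclosure=$1
  slot=$2
  after_ld=$3
  adapter=${4:-0}
  echo "megacli -CfgLdAdd -r0 [${enclosure}:${slot}] -AfterLd${after_ld} -a${adapter}"
}
```

Calling `cfg_ld_add_cmd 32 9 9 0` prints the command that was run on this host.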

The new disk has been detected as /dev/sdk

btullis@an-worker1086:~$ sudo dmesg -T |tail
[Mon Dec  4 19:55:39 2023] megaraid_sas 0000:03:00.0: 1507185 (755034896s/0x0001/FATAL) - VD 0a/c is now OFFLINE
[Tue Dec  5 06:24:37 2023] Process accounting resumed
[Wed Dec  6 06:24:36 2023] Process accounting resumed
[Wed Dec  6 18:17:45 2023] scsi 0:2:10:0: Direct-Access     DELL     PERC H730 Mini   4.29 PQ: 0 ANSI: 5
[Wed Dec  6 18:17:45 2023] sd 0:2:10:0: Attached scsi generic sg10 type 0
[Wed Dec  6 18:17:45 2023] sd 0:2:10:0: [sdk] 7812939776 512-byte logical blocks: (4.00 TB/3.64 TiB)
[Wed Dec  6 18:17:45 2023] sd 0:2:10:0: [sdk] Write Protect is off
[Wed Dec  6 18:17:45 2023] sd 0:2:10:0: [sdk] Mode Sense: 1f 00 00 08
[Wed Dec  6 18:17:45 2023] sd 0:2:10:0: [sdk] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[Wed Dec  6 18:17:45 2023] sd 0:2:10:0: [sdk] Attached SCSI disk
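
Two small checks on that output, as a sketch with hypothetical function names. The first pulls the device name out of the most recent "Attached SCSI disk" line. The second converts a sector count to bytes, which shows that the 7812939776 blocks reported by the kernel match the coerced size of 0x1d1b00000 sectors reported by megacli.

```shell
# Print the device name from the last "Attached SCSI disk" line.
# Reads dmesg output on stdin.
last_attached_disk() {
  sed -n 's/.*\[\(sd[a-z]*\)\] Attached SCSI disk.*/\1/p' | tail -n 1
}

# Convert a sector count (decimal or 0x-prefixed hex) to bytes.
# The sector size defaults to 512, as reported for this drive.
sectors_to_bytes() {
  echo $(( $1 * ${2:-512} ))
}
```

For example, `sudo dmesg -T | last_attached_disk` should print `sdk` here, and `sectors_to_bytes 0x1d1b00000` gives 4000225165312 bytes, the 4.00 TB shown by the kernel.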

Executed the following:

sudo apt install parted
sudo parted /dev/sdk --script mklabel gpt
sudo parted /dev/sdk --script mkpart primary ext4 0% 100%
sudo mkfs.ext4 -L hadoop-k /dev/sdk1
sudo tune2fs -m 0 /dev/sdk1
sudo lsblk -i -fs
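
Since these commands are destructive if pointed at the wrong device, a dry-run wrapper can help when repeating them on another host. This is a sketch only: the `run` and `prepare_disk` names and the `DRY_RUN` variable are hypothetical, and it assumes the first partition is named by appending `1` to the device name, as with /dev/sdk1 above.

```shell
# Print commands instead of running them unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# prepare_disk DEVICE LABEL
# Same partitioning and formatting steps as above, for one device.
prepare_disk() {
  dev=$1
  label=$2
  run sudo parted "$dev" --script mklabel gpt
  run sudo parted "$dev" --script mkpart primary ext4 0% 100%
  run sudo mkfs.ext4 -L "$label" "${dev}1"
  run sudo tune2fs -m 0 "${dev}1"   # no reserved blocks on a data disk
}
```

Running `prepare_disk /dev/sdk hadoop-k` prints the four commands for review; setting `DRY_RUN=0` first would execute them.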

Obtained the UUID value for /dev/sdk1 and added it to the /etc/fstab file.
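
A sketch of how the fstab line could be generated instead of typed by hand. The function name is hypothetical, and the mount options and fsck pass number are assumptions, not values taken from this host; the existing entries in /etc/fstab should be used as the template. The UUID itself can be read with `sudo blkid -s UUID -o value /dev/sdk1`.

```shell
# fstab_entry UUID MOUNTPOINT
# Print a tab-separated /etc/fstab line for an ext4 filesystem.
# ASSUMPTION: "defaults,noatime" and the "0 2" dump/pass fields are
# placeholders; copy the options from the host's existing entries.
fstab_entry() {
  printf 'UUID=%s\t%s\text4\tdefaults,noatime\t0\t2\n' "$1" "$2"
}
```

For example, `fstab_entry "$(sudo blkid -s UUID -o value /dev/sdk1)" /mnt/example` prints a line to review, where `/mnt/example` stands in for the real mount point.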

Executed sudo mount -a

Restarted hadoop-yarn-nodemanager and hadoop-hdfs-datanode services.

Mentioned in SAL (#wikimedia-analytics) [2023-12-06T18:27:03Z] <btullis> restarted hadoop-yarn-nodemanager and hadoop-hdfs-datanode services on an-worker1086 for T352168