cloudvirt1009: Device not healthy -SMART-
Closed, Resolved · Public

Description

@MoritzMuehlenhoff reported we have an icinga alert for cloudvirt1009 related to the disk.

However, I checked this:

aborrero@cloudvirt1009:~ $ sudo smartctl -H /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-11-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

aborrero@cloudvirt1009:~ $ sudo smartctl -H /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-11-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Related Objects

Status: Resolved · Assigned: Jclark-ctr

Event Timeline

Restricted Application added a project: Operations. · Feb 12 2020, 11:22 AM
aborrero closed this task as Resolved. · Feb 12 2020, 11:41 AM
aborrero claimed this task.

Apparently the issue resolved itself:

Volans reopened this task as Open. · Feb 12 2020, 2:29 PM
Volans added a subscriber: Volans.

Re-opening as it's currently alerting.

It's all documented in the link that is associated with the alert itself:
https://wikitech.wikimedia.org/wiki/SMART#Alerts

From the Icinga alert, pick the affected disk (cciss,17):

cluster=wmcs device=cciss,17 instance=cloudvirt1009:9100 job=node site=eqiad
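The device label can also be pulled out of that label line mechanically. This is a minimal sketch, assuming the labels always arrive as space-separated key=value pairs as shown above; `label_value` is a hypothetical helper name, not part of any existing tooling.

```shell
#!/bin/sh
# Print the value of one key from a space-separated key=value label line.
# label_value is a hypothetical helper introduced for illustration only.
label_value() {
    key=$1
    line=$2
    # Rely on word splitting to walk the key=value pairs.
    for pair in $line; do
        case $pair in
            "$key"=*) printf '%s\n' "${pair#*=}"; return 0 ;;
        esac
    done
    return 1   # key not present
}

# Label line copied from the alert in this task.
labels='cluster=wmcs device=cciss,17 instance=cloudvirt1009:9100 job=node site=eqiad'

device=$(label_value device "$labels")       # cciss,17
instance=$(label_value instance "$labels")   # cloudvirt1009:9100
host=${instance%%:*}                         # strip the exporter port

printf 'host=%s device=%s\n' "$host" "$device"
```

The resulting device value is what gets passed to smartctl's -d option in the next step.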

Then run smart-data-dump and see how the affected disk is referred to:

DEBUG:__main__:Running: /usr/bin/timeout 60 /usr/sbin/smartctl --info --health -d cciss,17 /dev/sda
DEBUG:__main__:Running: /usr/bin/timeout 60 /usr/sbin/smartctl --attributes -d cciss,17 /dev/sda

You can then run /usr/sbin/smartctl --info --health -d cciss,17 /dev/sda, whose output ends with:

=== START OF READ SMART DATA SECTION ===
SMART Health Status: SPINDLE IMPENDING FAILURE GENERAL HARD DRIVE FAILURE [asc=5d, ascq=50]

You can also see the attributes with the other command if you're interested.
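The check above can be sketched as a small script that keeps only the health line. This is an illustration under assumptions: `health_status` and `check_disk` are hypothetical helper names, the defaults are simply the values from this task, and the parsing only covers the "SMART Health Status:" line format seen in the output quoted here.

```shell
#!/bin/sh
# Extract the text after "SMART Health Status:" from smartctl output on stdin.
# health_status is a hypothetical helper name.
health_status() {
    sed -n 's/^SMART Health Status: *//p'
}

# Hypothetical wrapper: query one physical disk behind the controller.
# Needs root and smartmontools; the defaults are the values from this task.
# Defined here but not called, so the script is safe to run anywhere.
check_disk() {
    sudo /usr/sbin/smartctl --info --health -d "${1:-cciss,17}" "${2:-/dev/sda}" | health_status
}

# Offline demonstration using the two status lines quoted in this task.
printf '%s\n' '=== START OF READ SMART DATA SECTION ===' 'SMART Health Status: OK' | health_status
printf '%s\n' 'SMART Health Status: SPINDLE IMPENDING FAILURE GENERAL HARD DRIVE FAILURE [asc=5d, ascq=50]' | health_status
```

On the host itself, `check_disk cciss,17 /dev/sda` would run the same query as above. Note the contrast with the check in the description: that one ran smartctl -H /dev/sda without the -d option and reported OK.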

aborrero reassigned this task from aborrero to Jclark-ctr. · Feb 13 2020, 5:16 PM
aborrero added subscribers: Jclark-ctr, Cmjohnson.

General hard drive failure doesn't sound good. @Jclark-ctr @Cmjohnson, please advise how to proceed.

wiki_willy moved this task from Backlog to Cloud Tasks on the ops-eqiad board. · Feb 18 2020, 9:24 PM

@wiki_willy I have checked our storage room and we have no spares. The host is 5 years old at this point. The drive needed is a 300 GB 15k SAS; the part number of the currently installed drive is 748385.

@aborrero (and @Jclark-ctr for visibility) - it looks like this was purchased back in 2014 and is past the 5yr server life cycle. Would it be possible to decommission this host instead? Let us know if there's a corresponding install task that we can help with to get this one refreshed. Thanks, Willy

bd808 added a comment. · Feb 27 2020, 2:20 PM

@aborrero (and @Jclark-ctr for visibility) - it looks like this was purchased back in 2014 and is past the 5yr server life cycle. Would it be possible to decommission this host instead? Let us know if there's a corresponding install task that we can help with to get this one refreshed. Thanks, Willy

@wiki_willy T243471 is related to replacement. The cloudvirt1001 - cloudvirt1009 hosts are all due for refresh in Q4 according to the hardware procurement spreadsheet. The task I linked is us trying to get a few of those done sooner, precisely because we have so many active hardware issues right now. We are in a "fun" spot where these disk tickets are getting pushback because the hosts are old, while at the same time the procurement for replacements is de-prioritized because it is considered an out-of-order request.

@bd808 - thanks for providing the background context around these. I hit up Rob to prioritize T243471 more. (quotes being submitted soon) Also, we'll go ahead and order a replacement disk for cloudvirt1009 as well. Thanks, Willy

wiki_willy added a subtask: Unknown Object (Task). · Feb 27 2020, 5:24 PM

T246365 created for ordering the replacement drive. Thanks, Willy

Mentioned in SAL (#wikimedia-cloud) [2020-02-29T16:32:12Z] <bstorm_> downtimed the smart alert on cloudvirt1009 until Monday since apparently predictive failures flap T244986

Jclark-ctr closed subtask Unknown Object (Task) as Resolved. · Mar 10 2020, 8:57 PM

Replaced the drive in slot 17 and removed a 2nd failed drive in slot 18.

Cleaned up the RAID config with hpssacli ctrl slot=0 array b remove spares=2I:1:18

JHedden closed this task as Resolved. · Mar 11 2020, 8:53 PM