
Degraded RAID on relforge1001
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host relforge1001. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

Personalities : [linear] [multipath] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
md1 : active raid10 sdd3[3] sdb3[1] sdc3[2](F) sda3[0]
      5762608128 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U]
      
md0 : active raid10 sdd2[3] sdb2[1] sdc2[2](F) sda2[0]
      97590272 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U]
      
unused devices: <none>
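
For reference, the snapshot above can be read mechanically. The following is a minimal sketch, not the actual Icinga event handler: the function name `degraded_arrays` and the `SAMPLE` constant are hypothetical, and the parsing assumes the `/proc/mdstat` layout shown in this task (members flagged `(F)`, a `[expected/active]` counter, and a `[U_]` status map).

```python
import re

# Snapshot copied from this task; a live check would read /proc/mdstat instead.
SAMPLE = """\
Personalities : [linear] [multipath] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md1 : active raid10 sdd3[3] sdb3[1] sdc3[2](F) sda3[0]
      5762608128 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U]

md0 : active raid10 sdd2[3] sdb2[1] sdc2[2](F) sda2[0]
      97590272 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Map each degraded md array to the member devices flagged as faulty."""
    arrays = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:\s*(.*)$", line)
        if header:
            current = header.group(1)
            # Members followed by (F) have been marked faulty by the kernel.
            failed = re.findall(r"(\w+)\[\d+\]\(F\)", header.group(2))
            arrays[current] = {"failed": failed, "degraded": False}
            continue
        # Status line: [expected/active] followed by a map such as [UU_U].
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            arrays[current]["degraded"] = (
                active < expected or "_" in status.group(3)
            )
    return {name: a["failed"] for name, a in arrays.items() if a["degraded"]}


if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

Both arrays report `[4/3]`, and in each the faulty member is a partition of `sdc`, which points to a single failing disk rather than two independent problems.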

Event Timeline

MoritzMuehlenhoff triaged this task as Medium priority.
MoritzMuehlenhoff added a subscriber: Gehel.

disk info:

gehel@relforge1001:~$ sudo smartctl -i /dev/sdc
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-100-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Ultrastar 7K3000
Device Model:     Hitachi HUA723030ALA640
Serial Number:    MK0331YHGG7N7A
LU WWN Device Id: 5 000cca 225c679ec
Firmware Version: MKAOA580
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Feb  2 17:47:59 2017 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

dmesg: P4873
smartctl -a: P4872

@Cmjohnson: @Volans suggested physically removing and re-inserting /dev/sdc. Could you do that?

It looks like we have a 3 TB Seagate drive in spares that might work as a replacement if removing and re-inserting the disk does not work. This server is planned for reimaging anyway (T151326).

There is no high value data or traffic on those servers, so playing a bit with this disk is low risk.

Change 335776 had a related patch set uploaded (by Gehel):
relforge - switch master to relforge1002

https://gerrit.wikimedia.org/r/335776

Change 335776 merged by Gehel:
relforge - switch master to relforge1002

https://gerrit.wikimedia.org/r/335776

Note to @Cmjohnson: before shutting down relforge1001 for maintenance, shards should be drained from it with es-tool ban-node 10.64.4.13. This can take a few hours. Ping me so that we can plan this operation if needed.
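
For readers without access to es-tool, here is a rough sketch of what draining a node usually involves. It is an assumption, not something this task confirms, that `es-tool ban-node` relies on Elasticsearch shard allocation filtering. The helper names below are hypothetical, and the code only builds and inspects data; it does not contact a cluster.

```python
import json


def exclude_ip_settings(ip):
    """Body for PUT /_cluster/settings that moves shards off the given IP.

    Uses the standard allocation filtering setting
    cluster.routing.allocation.exclude._ip (assumed equivalent of ban-node).
    """
    return {"transient": {"cluster.routing.allocation.exclude._ip": ip}}


def shards_remaining(cat_shards_rows, ip):
    """Count shards still located on `ip`.

    `cat_shards_rows` is assumed to be the parsed JSON returned by
    GET /_cat/shards?format=json (a list of dicts with an "ip" key).
    """
    return sum(1 for row in cat_shards_rows if row.get("ip") == ip)


if __name__ == "__main__":
    print(json.dumps(exclude_ip_settings("10.64.4.13"), indent=2))
    # Hypothetical rows, for illustration only.
    rows = [
        {"index": "example", "shard": "0", "ip": "10.64.4.13"},
        {"index": "example", "shard": "1", "ip": "10.64.4.14"},
    ]
    print(shards_remaining(rows, "10.64.4.13"))
```

The node is safe to take down once the count for its IP reaches zero, which is why the drain can take a few hours on a node holding a lot of data.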

A ticket has been created with HP support. I will update this task as more information becomes available.

Case ID: 5317039408
Case title: Failed Hard Drive
Severity 3-Normal
Product serial number: MXQ5030543
Product number: 661190-B21
Submitted: 2/6/2017 1:30:35 PM
Last updated: 2/6/2017 1:30:35 PM
Source: Web
Case status: Received by HP

@Cmjohnson most probably a stupid question, but why doesn't the serial number shown by smartctl match the one listed in the ticket? (/me obviously does not know enough about disks and support, but is willing to learn.)

@Gehel not sure why it doesn't show in smartctl, but I ran hdparm and it shows there. I need it for the ticket to HP.

cmjohnson@relforge1001:~$ sudo hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
Model Number: Hitachi HUA723030ALA640
Serial Number: MK0333YHGBK47C
Firmware Revision: MKAOA580
Transport: Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6; Revision: ATA8-AST T13 Project D1697 Revision 0b
Standards:
Used: unknown (minor revision code 0x0029)
Supported: 8 7 6 5
Likely used: 8
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63

CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 5860533168
Logical Sector size: 512 bytes
Physical Sector size: 512 bytes
device size with M = 1024*1024: 2861588 MBytes
device size with M = 1000*1000: 3000592 MBytes (3000 GB)
cache/buffer size = unknown
Form Factor: 3.5 inch
Nominal Media Rotation Rate: 7200
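
As a sanity check on the two "device size" lines in the hdparm output above, the figures follow directly from the LBA48 sector count and the 512-byte logical sector size. This is a small sketch; the function name `device_size` is made up for illustration.

```python
SECTOR_BYTES = 512          # "Logical Sector size" from the hdparm output
LBA48_SECTORS = 5860533168  # "LBA48 user addressable sectors"


def device_size(sectors, sector_bytes=SECTOR_BYTES):
    """Return the capacity in bytes, in units of 1024*1024, and in units of 1000*1000."""
    total = sectors * sector_bytes
    return {
        "bytes": total,
        "MiB": total // (1024 * 1024),  # hdparm: "M = 1024*1024"
        "MB": total // (1000 * 1000),   # hdparm: "M = 1000*1000"
    }


if __name__ == "__main__":
    print(device_size(LBA48_SECTORS))
```

The byte total, 3,000,592,982,016, is the same figure smartctl reports as User Capacity for /dev/sdc earlier in this task, which is expected since both disks are the same model.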

The disk should hopefully ship today or first thing tomorrow.

The new disk is on-site. Please let me know when you are ready for me to swap it.

Relforge1001 is being drained right now and should be ready in a few hours. Do you need to shut it down, or is the disk hot-swappable? In any case, just ping me before doing it so that I can keep an eye on things... but all should be good!

It's a hot-swap disk. I will update the task once it has been swapped so you can rebuild the RAID.

I'll actually just reimage the machine (it is due for a reimage), but same result.

The disk has been swapped and the host is ready for re-install. Resolving this task.