
Degraded RAID on restbase-dev1004
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host restbase-dev1004. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid0] [raid1] [linear] [multipath] [raid6] [raid5] [raid4] [raid10] 
md1 : active (auto-read-only) raid1 sdd2[3] sdc2[2] sda2[0] sdb2[1]
      976320 blocks super 1.2 [4/4] [UUUU]
      
md2 : active raid0 sdc3[2] sdb3[1] sda3[0] sdd3[3]
      3004026880 blocks super 1.2 512k chunks
      
md0 : active raid1 sda1[0] sdc1[2](F) sdb1[1] sdd1[3]
      29279232 blocks super 1.2 [4/3] [UU_U]
      
unused devices: <none>
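
For readers unfamiliar with the snapshot format, here is a minimal sketch of how the mdstat text above could be checked for degraded arrays. It is not the actual get-raid-status-md plugin; the function names (parse_mdstat, failed_members) are hypothetical, and the parser only handles the layout shown in this snapshot.

```python
import re

# Snapshot text copied from the task description.
MDSTAT = """\
Personalities : [raid0] [raid1] [linear] [multipath] [raid6] [raid5] [raid4] [raid10]
md1 : active (auto-read-only) raid1 sdd2[3] sdc2[2] sda2[0] sdb2[1]
      976320 blocks super 1.2 [4/4] [UUUU]

md2 : active raid0 sdc3[2] sdb3[1] sda3[0] sdd3[3]
      3004026880 blocks super 1.2 512k chunks

md0 : active raid1 sda1[0] sdc1[2](F) sdb1[1] sdd1[3]
      29279232 blocks super 1.2 [4/3] [UU_U]

unused devices: <none>
"""

ARRAY_RE = re.compile(r"^(md\d+) : (.*)$")
# A member looks like "sdc1[2]" with an optional "(F)" marker for a failed device.
MEMBER_RE = re.compile(r"^(\w+)\[\d+\](\(F\))?$")
# The status line carries "[expected/active] [UU_U]" for redundant levels.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")


def parse_mdstat(text):
    """Return {array: {"level", "members", "degraded"}} for the text above."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            name, rest = header.groups()
            level, members = None, {}
            for token in rest.split():
                member = MEMBER_RE.match(token)
                if member:
                    members[member.group(1)] = member.group(2) is not None
                elif token.startswith("raid"):
                    level = token
            current = {"level": level, "members": members, "degraded": False}
            arrays[name] = current
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            expected, active = int(status.group(1)), int(status.group(2))
            current["degraded"] = active < expected or "_" in status.group(3)
    return arrays


def failed_members(array):
    """List the member partitions flagged (F) in one parsed array."""
    return [dev for dev, is_failed in array["members"].items() if is_failed]


arrays = parse_mdstat(MDSTAT)
for name in sorted(arrays):
    info = arrays[name]
    print(name, info["level"], "degraded" if info["degraded"] else "ok",
          failed_members(info))
```

Note that the raid0 array (md2) has no [n/m] status field in this snapshot, so a check of this kind only reports degradation for the redundant arrays.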

Related Objects

Status: Resolved
Assigned: hnowlan

Event Timeline

This host isn't using JBOD so this bad disk can be replaced at any point.

wiki_willy added subscribers: Cmjohnson, wiki_willy.

@Cmjohnson - looks like we're right on the border with the warranty for this one. Netbox shows May 12, 2017 as the install date. Can you see if the HP site allows us to RMA it? Thanks, Willy

@wiki_willy I submitted a ticket with HPE; we'll see what they say.

Your case was successfully submitted. Please note your Case ID: 5347610050 for future reference.

The AHS log has been uploaded to HP per their request. It looks like we're going to be okay on the warranty.

@wiki_willy I completely forgot, but the restbase hosts have SSDs that were purchased separately from the servers. I believe T158795 is the task for the original purchase. There is also a past task for ordering SSDs for Restbase-dev1006: https://phabricator.wikimedia.org/T224260.

We need to order another disk.

wiki_willy mentioned this in Unknown Object (Task). Jun 12 2020, 5:22 PM
wiki_willy added a subtask: Unknown Object (Task).

Thanks @Cmjohnson - T255293 created for ordering the new disk. Thanks, Willy

Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Jul 21 2020, 10:22 PM

@hnowlan The replacement drive arrived today. Can you confirm the drive can just be replaced? I will take care of it tomorrow.

Hi @Jclark-ctr, if the replacement can be done with no downtime, go for it. If downtime is required, let me know when you'll be doing the replacement and I'll take the system down.

Just for reference, I can't see the dead disk attached to the system at the moment but the serials for the good disks are:

  • BTHC632208VZ800NGN
  • BTHC6322095B800NGN
  • BTHC63220959800NGN

@hnowlan
Replaced the failed drive.

Failed drive ICN BTHC62300066800NGN

Thanks @Jclark-ctr! We'll need to rebuild the raid0 that the Cassandra storage is located on.
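
To illustrate why the raid0 needs rebuilding while the raid1 arrays do not, here is a rough sketch based on the array membership in the snapshot from the task description. It assumes the replaced disk is sdc (inferred from the sdc1 (F) marker, not confirmed in this task), and the helper names and the "resync"/"recreate" labels are hypothetical.

```python
# Member partitions per array, transcribed from the snapshot in the task description.
ARRAYS = {
    "md0": ("raid1", ["sda1", "sdb1", "sdc1", "sdd1"]),
    "md1": ("raid1", ["sda2", "sdb2", "sdc2", "sdd2"]),
    "md2": ("raid0", ["sda3", "sdb3", "sdc3", "sdd3"]),
}

# Assumption: the replaced disk is sdc, inferred from the sdc1(F) marker.
REPLACED_DISK = "sdc"


def disk_of(partition):
    """Strip the trailing partition number, e.g. 'sdc3' -> 'sdc' (sdX naming only)."""
    return partition.rstrip("0123456789")


def needs_recreate(level):
    """raid0 stripes data with no redundancy, so losing one member loses the array."""
    return level == "raid0"


plan = {}
for name, (level, members) in sorted(ARRAYS.items()):
    if any(disk_of(part) == REPLACED_DISK for part in members):
        plan[name] = "recreate" if needs_recreate(level) else "resync"

print(plan)
```

Under these assumptions, the mirrored arrays can be restored from their surviving members, while the striped array holding the Cassandra data has to be created again from scratch.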

Script wmf-auto-reimage was launched by hnowlan on cumin1001.eqiad.wmnet for hosts:

restbase-dev1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007241034_hnowlan_3641_restbase-dev1004_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['restbase-dev1004.eqiad.wmnet']

and were ALL successful.