Degraded RAID on restbase-dev1004
Closed, Resolved (Public)

Assigned To

Authored By

	ops-monitoring-bot
	May 26 2020, 3:23 AM

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host restbase-dev1004. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid0] [raid1] [linear] [multipath] [raid6] [raid5] [raid4] [raid10] 
md1 : active (auto-read-only) raid1 sdd2[3] sdc2[2] sda2[0] sdb2[1]
      976320 blocks super 1.2 [4/4] [UUUU]
      
md2 : active raid0 sdc3[2] sdb3[1] sda3[0] sdd3[3]
      3004026880 blocks super 1.2 512k chunks
      
md0 : active raid1 sda1[0] sdc1[2](F) sdb1[1] sdd1[3]
      29279232 blocks super 1.2 [4/3] [UU_U]
      
unused devices: <none>
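
As an illustration of how the snapshot above can be read programmatically, here is a minimal sketch that parses mdstat-style text and reports degraded arrays. It is not the `get-raid-status-md` plugin itself; the names `parse_mdstat` and `degraded` and the result layout are hypothetical, and the parser only covers the line shapes seen in this snapshot.

```python
import re

# Snapshot text copied from the task description above.
MDSTAT = """\
Personalities : [raid0] [raid1] [linear] [multipath] [raid6] [raid5] [raid4] [raid10]
md1 : active (auto-read-only) raid1 sdd2[3] sdc2[2] sda2[0] sdb2[1]
      976320 blocks super 1.2 [4/4] [UUUU]

md2 : active raid0 sdc3[2] sdb3[1] sda3[0] sdd3[3]
      3004026880 blocks super 1.2 512k chunks

md0 : active raid1 sda1[0] sdc1[2](F) sdb1[1] sdd1[3]
      29279232 blocks super 1.2 [4/3] [UU_U]

unused devices: <none>
"""

ARRAY_RE = re.compile(r"^(md\d+) : (.*)$")            # e.g. "md0 : active raid1 ..."
MEMBER_RE = re.compile(r"^(\w+)\[(\d+)\](\(F\))?$")   # e.g. "sdc1[2](F)"
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")  # e.g. "[4/3] [UU_U]"


def parse_mdstat(text):
    """Return {array_name: info} for every md array found in the text."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            name, rest = header.groups()
            current = {"level": None, "members": [], "failed": [],
                       "expected": None, "active": None}
            for token in rest.split():
                member = MEMBER_RE.match(token)
                if member:
                    current["members"].append(member.group(1))
                    if member.group(3):  # the "(F)" marker flags a failed member
                        current["failed"].append(member.group(1))
                elif token.startswith("raid"):
                    current["level"] = token
            arrays[name] = current
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            # "[4/3]" means 4 devices expected, 3 currently active.
            current["expected"] = int(status.group(1))
            current["active"] = int(status.group(2))
    return arrays


def degraded(arrays):
    """Keep arrays with a failed member or fewer active devices than expected."""
    return {
        name: info for name, info in arrays.items()
        if info["failed"]
        or (info["expected"] is not None and info["active"] < info["expected"])
    }


if __name__ == "__main__":
    for name, info in degraded(parse_mdstat(MDSTAT)).items():
        print(name, info["level"], "failed:", info["failed"])
```

On this snapshot the sketch flags only md0, with sdc1 as the failed member. The three arrays have 12 members in total and one is marked failed, which is consistent with the "Active: 11, Working: 11, Failed: 1" summary line.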

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		hnowlan	T253607 Degraded RAID on restbase-dev1004
					Unknown Object (Task)

Event Timeline

ops-monitoring-bot created this task. May 26 2020, 3:23 AM

MoritzMuehlenhoff added a subscriber: hnowlan. May 26 2020, 8:59 AM

This host isn't using JBOD so this bad disk can be replaced at any point.

@Cmjohnson - looks like we're right on the border with the warranty for this one. Netbox shows May 12, 2017 as the install date. Can you see if the HP site allows us to RMA it? Thanks, Willy

@wiki_willy I submitted a ticket with HPE, we'll see what they say

Your case was successfully submitted. Please note your Case ID: 5347610050 for future reference.

Cmjohnson moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board. May 29 2020, 12:31 PM

the AHS log has been uploaded to HP per their request. Looks like we're going to be okay with the warranty thing

@wiki_willy I completely forgot, but the restbase hosts have SSDs that were purchased separately from the servers. I believe T158795 is the task for the original purchase. Also, this is a past task for ordering SSDs for restbase-dev1006: https://phabricator.wikimedia.org/T224260.

We need to order another disk

Thanks @Cmjohnson - T255293 created for ordering the new disk. Thanks, Willy

Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Jul 21 2020, 10:22 PM

@hnowlan The replacement drive arrived today. Can you confirm the drive can just be replaced? I will take care of it tomorrow.

Hi @Jclark-ctr, if the replacement can be done with no downtime, go for it. If downtime is required, let me know when you'll be doing the replacement and I'll take the system down.

Just for reference, I can't see the dead disk attached to the system at the moment but the serials for the good disks are:

BTHC632208VZ800NGN
BTHC6322095B800NGN
BTHC63220959800NGN

@hnowlan
Replaced failed drive

Failed drive ICN BTHC62300066800NGN

Thanks @Jclark-ctr! We'll need to rebuild the raid0 that the Cassandra storage is located on.
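
One way to see why the raid0 needs rebuilding while the raid1 arrays do not: raid0 stripes data with no redundancy, so losing any member loses the array, whereas a raid1 mirror can resync from its surviving members. The sketch below applies that rule to the layout from the mdstat snapshot in the description. The helper names `parent_disk` and `impact_of_losing` are hypothetical, and mapping a partition to its disk by stripping trailing digits is an assumption that only holds for sdX-style names.

```python
import re

# Array layout copied from the mdstat snapshot in the task description.
ARRAYS = {
    "md0": {"level": "raid1", "members": ["sda1", "sdb1", "sdc1", "sdd1"]},
    "md1": {"level": "raid1", "members": ["sda2", "sdb2", "sdc2", "sdd2"]},
    "md2": {"level": "raid0", "members": ["sda3", "sdb3", "sdc3", "sdd3"]},
}


def parent_disk(partition):
    """Map 'sdc3' to 'sdc'. Assumption: sdX-style names only (not NVMe)."""
    return re.sub(r"\d+$", "", partition)


def impact_of_losing(disk, arrays):
    """Classify each array by what happens when `disk` is physically replaced."""
    result = {}
    for name, info in arrays.items():
        on_disk = [p for p in info["members"] if parent_disk(p) == disk]
        if not on_disk:
            result[name] = "unaffected"
        elif info["level"] == "raid0":
            # Striping without redundancy: one missing member loses the array.
            result[name] = "lost: recreate array and restore data"
        elif info["level"] == "raid1" and len(info["members"]) > len(on_disk):
            # A mirror with surviving members can resync onto the new disk.
            result[name] = "degraded: re-add member and resync"
        else:
            result[name] = "unknown: check manually"
    return result


if __name__ == "__main__":
    for name, verdict in sorted(impact_of_losing("sdc", ARRAYS).items()):
        print(name, "->", verdict)
```

For sdc this classifies md0 and md1 as recoverable by resync and md2 as lost, which matches the need to rebuild the raid0 mentioned above.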

Script wmf-auto-reimage was launched by hnowlan on cumin1001.eqiad.wmnet for hosts:

restbase-dev1004.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202007241034_hnowlan_3641_restbase-dev1004_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['restbase-dev1004.eqiad.wmnet']

and were ALL successful.

hnowlan added a project: Platform Team Workboards (Green). Jul 24 2020, 11:33 AM

hnowlan closed this task as Resolved. Jul 24 2020, 11:48 AM

wiki_willy added a project: DC-Ops. Jul 24 2020, 4:46 PM
