Degraded RAID on logstash1006
Closed, ResolvedPublic

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host logstash1006. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0
Personalities : [raid0] [raid1] 
md0 : active raid1 sda2[0] sdb2[1](F)
      249869312 blocks super 1.2 [2/1] [U_]
      bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid0 sdd3[3] sda3[0] sdc3[2] sdb3[1]
      2906513408 blocks super 1.2 512k chunks
      
unused devices: <none>
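
For readers who want to check a snapshot like the one above programmatically, here is a minimal Python sketch that parses mdstat-style text and flags degraded arrays. It is not the Icinga RAID event handler's actual code. The function name `parse_mdstat` and the regular expressions are illustrative assumptions that cover only the line shapes seen in this snapshot (for example, they do not handle inactive arrays or spare devices).

```python
import re

# Snapshot copied from this task; in practice the text would come from /proc/mdstat.
MDSTAT_SNAPSHOT = """\
Personalities : [raid0] [raid1]
md0 : active raid1 sda2[0] sdb2[1](F)
      249869312 blocks super 1.2 [2/1] [U_]
      bitmap: 2/2 pages [8KB], 65536KB chunk

md1 : active raid0 sdd3[3] sda3[0] sdc3[2] sdb3[1]
      2906513408 blocks super 1.2 512k chunks

unused devices: <none>
"""

# Assumed line shapes (hypothetical patterns, sufficient for the snapshot above).
ARRAY_RE = re.compile(r"^(md\d+) : (\w+) (\w+) (.+)$")
MEMBER_RE = re.compile(r"^(\w+)\[\d+\](\(F\))?$")
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")


def parse_mdstat(text):
    """Return a dict of array name -> state, level, member lists, degraded flag."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            name, state, level, members = header.groups()
            working, failed = [], []
            for token in members.split():
                member = MEMBER_RE.match(token)
                if not member:
                    continue
                # A trailing "(F)" marks a member the kernel considers failed.
                (failed if member.group(2) else working).append(member.group(1))
            current = {
                "state": state,
                "level": level,
                "working": working,
                "failed": failed,
                "degraded": bool(failed),
            }
            arrays[name] = current
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            expected, up = int(status.group(1)), int(status.group(2))
            # "[2/1] [U_]" means two members expected, only one up.
            if up < expected:
                current["degraded"] = True
    return arrays


if __name__ == "__main__":
    for name, info in parse_mdstat(MDSTAT_SNAPSHOT).items():
        print(name, info["level"], "degraded" if info["degraded"] else "ok", info["failed"])
```

On this snapshot the sketch reports md0 as degraded with sdb2 failed and md1 as healthy. The totals across both arrays (5 working members, 1 failed) agree with the CRITICAL summary line above.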

Event Timeline

Restricted Application added a subscriber: Aklapper. Aug 20 2017, 11:39 PM
Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board. Aug 21 2017, 10:16 PM
Volans triaged this task as Normal priority. Aug 22 2017, 8:00 AM

A self dispatch has been ordered with Dell. Work Order: SR952745470

@Gehel The disk has arrived. The disks are internal, so the server will need to be taken down for the replacement. Let me know when you're ready for me to swap the disk.

Gehel added a comment. Aug 23 2017, 3:09 PM

I'll take down the server right now. We should be able to live with only 2 elasticsearch backends without any issue. Let me know when the server is up again and I'll do a check before putting it back in the cluster.

Mentioned in SAL (#wikimedia-operations) [2017-08-23T15:10:03Z] <gehel> shutdown logstash1006 for disk replacement - T173679

@Gehel the disk has been swapped. I will re-install later this afternoon.

Return shipping information:
USPS 9202 3946 5301 2436 4467 08
FDX 9611918 2393026 73196722

Mentioned in SAL (#wikimedia-operations) [2017-08-25T07:55:09Z] <gehel> reimaging logstash1006 after change of failed disk - T173679

Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts:

['logstash1006.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201708250905_gehel_722.log.

Completed auto-reimage of hosts:

['logstash1006.eqiad.wmnet']

and were ALL successful.

Reimage completed, logstash1006 back in rotation.

@Cmjohnson: do you need to keep this task open?

Cmjohnson closed this task as Resolved. Aug 28 2017, 8:40 PM
Cmjohnson claimed this task.

@Gehel no, resolving....thx