
Degraded RAID on elastic1046
Open, NormalPublic

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host elastic1046. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid1] [raid0] [linear] [multipath] [raid6] [raid5] [raid4] [raid10] 
md1 : active raid0 sda2[0] sdb2[1]
      1503967232 blocks super 1.2 512k chunks
      
md0 : active raid1 sda1[0] sdb1[1](F)
      29279232 blocks super 1.2 [2/1] [U_]
      
unused devices: <none>
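For reference, the degraded state of `md0` above is visible directly in `/proc/mdstat`: the `[2/1]` counter and the `[U_]` status string show that one of the two RAID1 members is missing, while `md1` (raid0) has no redundancy line at all. A minimal standalone sketch of how such output can be parsed to flag degraded arrays (this is an illustration, not the actual `get-raid-status-md` plugin), using the snapshot above as sample input:

```python
import re

def degraded_arrays(mdstat: str) -> list:
    """Return names of md arrays whose member-status string ([UU], [U_], ...)
    contains an underscore, i.e. at least one missing/failed device."""
    degraded = []
    current = None
    for line in mdstat.splitlines():
        m = re.match(r"^(md\d+)\s*:", line)
        if m:
            current = m.group(1)
        # Redundant arrays have a status line like:
        #   "29279232 blocks super 1.2 [2/1] [U_]"
        s = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if s and current and "_" in s.group(3):
            degraded.append(current)
    return degraded

SAMPLE = """\
Personalities : [raid1] [raid0]
md1 : active raid0 sda2[0] sdb2[1]
      1503967232 blocks super 1.2 512k chunks

md0 : active raid1 sda1[0] sdb1[1](F)
      29279232 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))  # ['md0']
```

Note that `md1` is correctly skipped: a raid0 array has no `[n/m] [U...]` status line, so only the degraded raid1 mirror is reported.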

Related Objects

Status: Open, Assigned to: Cmjohnson

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2019-07-22T07:54:48Z] <elukey> sudo -i depool on elastic1046 - broken disk (srv partition not available) - T228606

elukey added a subscriber: elukey. Jul 22 2019, 7:55 AM
elukey@elastic1046:~$ sudo -i depool
Depooling all services on elastic1046.eqiad.wmnet
eqiad/elasticsearch/elasticsearch/elastic1046.eqiad.wmnet: pooled changed yes => no
eqiad/elasticsearch/elasticsearch-psi-ssl/elastic1046.eqiad.wmnet: pooled changed yes => no
eqiad/elasticsearch/elasticsearch-ssl/elastic1046.eqiad.wmnet: pooled changed yes => no
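The `depool` wrapper above drives conftool, and each output line records a pooled-state transition for one service on the host. A small hypothetical helper (not the real conftool code) that verifies from output like the above that every service actually reported a `yes => no` transition:

```python
def all_depooled(output: str) -> bool:
    """True if the depool output contains at least one 'pooled changed'
    line and every such line reports a transition to 'no'."""
    changes = [l for l in output.splitlines() if "pooled changed" in l]
    return bool(changes) and all(l.rstrip().endswith("=> no") for l in changes)

OUTPUT = """\
Depooling all services on elastic1046.eqiad.wmnet
eqiad/elasticsearch/elasticsearch/elastic1046.eqiad.wmnet: pooled changed yes => no
eqiad/elasticsearch/elasticsearch-psi-ssl/elastic1046.eqiad.wmnet: pooled changed yes => no
eqiad/elasticsearch/elasticsearch-ssl/elastic1046.eqiad.wmnet: pooled changed yes => no
"""

print(all_depooled(OUTPUT))  # True
```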
elukey triaged this task as Normal priority. Jul 22 2019, 8:05 AM
This comment was removed by Cmjohnson.
Volans added a subscriber: Gehel. Jul 23 2019, 6:43 AM
Cmjohnson reassigned this task from Cmjohnson to wiki_willy. Jul 24 2019, 5:39 PM
Cmjohnson added a subscriber: wiki_willy.

This server is out of warranty; the warranty ended April 2019. @wiki_willy, escalating to you to decide on disks.

@elukey - since elastic1046 is just barely out of warranty (only by a few months), we'll still have to purchase a new disk for this server. Just double-checking that's the route you want to go, before we place the order.

Thanks,
Willy

elukey added a subscriber: dcausse. Jul 25 2019, 8:25 AM

Adding @dcausse to the conversation since @Gehel is on holiday. I would simply buy the disk now, but I'm not sure whether elastic1046 is scheduled to be refreshed soon.

The host is not scheduled for replacement, @wiki_willy please proceed with the order of the disk :)

jijiki added a subscriber: jijiki. Jul 25 2019, 11:56 AM

Thanks @elukey, subtask #T229017 has been opened to order the replacement drive with procurement. Assigning this task back to @Cmjohnson, for when the disk arrives onsite.

Thanks,
Willy

Confirmed by Chris that the drive arrived on August 8.

I replaced the failed disk.

Gehel claimed this task. Aug 23 2019, 3:38 PM

@Cmjohnson thanks! I'll take it over and reimage.

Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts:

['elastic1046.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201908271218_gehel_128010.log.

Completed auto-reimage of hosts:

['elastic1046.eqiad.wmnet']

Of which those FAILED:

['elastic1046.eqiad.wmnet']

@Cmjohnson: it looks like the installer only sees a single disk, and thus can't partition. Could you check? Thanks!

Gehel reassigned this task from Gehel to Cmjohnson. Aug 27 2019, 12:56 PM

@Cmjohnson - it could be that the drive isn't seated securely, or possibly a loose cable/connection.

jijiki removed a subscriber: jijiki. Sep 11 2019, 12:44 PM

@wiki_willy not really, but I reseated it anyway. As far as I can tell in the BIOS, everything looks normal. I did swap the two disks. @Gehel, please try again.

I did notice that the SSDs are different types: the new SSD is a DC3320 series, while the old SSD is a DC3610 series.

Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts:

['elastic1046.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201909130742_gehel_73868.log.

Completed auto-reimage of hosts:

['elastic1046.eqiad.wmnet']

Of which those FAILED:

['elastic1046.eqiad.wmnet']

Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts:

['elastic1046.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201909180937_gehel_148677.log.

Completed auto-reimage of hosts:

['elastic1046.eqiad.wmnet']

Of which those FAILED:

['elastic1046.eqiad.wmnet']

Tried again; it seems that sda has issues (see log below). Did the second disk also fail? Or was the wrong disk replaced? Or something else?

@Cmjohnson any idea what to do at this point?

Sep 18 09:41:13 kernel: [   50.172541] ata1: EH complete
Sep 18 09:41:13 kernel: [   50.172569] Dev sda: unable to read RDB block 0
Sep 18 09:41:13 kernel: [   50.183837] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 18 09:41:13 kernel: [   50.183840] ata1.00: irq_stat 0x40000001
Sep 18 09:41:13 kernel: [   50.183843] ata1.00: failed command: READ DMA
Sep 18 09:41:13 kernel: [   50.183853] ata1.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 13 dma 4096 in
Sep 18 09:41:13 kernel: [   50.183853]          res 53/10:00:00:00:00/00:00:00:00:00/00 Emask 0x81 (invalid argument)
Sep 18 09:41:13 kernel: [   50.183855] ata1.00: status: { DRDY SENSE ERR }
Sep 18 09:41:13 kernel: [   50.183857] ata1.00: error: { IDNF }
Sep 18 09:41:13 kernel: [   50.184508] ata1.00: configured for UDMA/100
Sep 18 09:41:13 kernel: [   50.184520] ata1: EH complete
Sep 18 09:41:13 kernel: [   50.195837] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 18 09:41:13 kernel: [   50.195840] ata1.00: irq_stat 0x40000001
Sep 18 09:41:13 kernel: [   50.195844] ata1.00: failed command: READ DMA
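The kernel log above shows `ata1.00` repeatedly failing `READ DMA` with an `IDNF` error (sector ID not found), and the device being unable to read even block 0 of sda. One quick way to tally such failures from a captured dmesg excerpt is to count `failed command` lines per ATA device. A sketch, using two of the lines above as sample input:

```python
import re
from collections import Counter

def ata_error_summary(log: str) -> Counter:
    """Count occurrences of each (ata port, failed command) pair
    in a dmesg/kernel log excerpt."""
    counts = Counter()
    for line in log.splitlines():
        m = re.search(r"(ata\d+\.\d+): failed command: (.+)$", line)
        if m:
            counts[(m.group(1), m.group(2).strip())] += 1
    return counts

LOG = """\
kernel: [   50.183843] ata1.00: failed command: READ DMA
kernel: [   50.195844] ata1.00: failed command: READ DMA
"""

print(ata_error_summary(LOG))  # Counter({('ata1.00', 'READ DMA'): 2})
```

A repeating pattern like this against the same port, immediately after a disk swap, is consistent with either a second failing drive or the wrong drive having been replaced, which is exactly the question raised above.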

@Cmjohnson - let me know if we need to order a replacement drive (along with what type of disk), since it's out of warranty. Thanks, Willy