
Degraded RAID on elastic1046
Closed, ResolvedPublic

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host elastic1046. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid1] [raid0] [linear] [multipath] [raid6] [raid5] [raid4] [raid10] 
md1 : active raid0 sda2[0] sdb2[1]
      1503967232 blocks super 1.2 512k chunks
      
md0 : active raid1 sda1[0] sdb1[1](F)
      29279232 blocks super 1.2 [2/1] [U_]
      
unused devices: <none>
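
(A minimal sketch, not taken from this ticket, of the usual mdadm cleanup before a physical swap; it assumes /dev/sdb1 is the failed member, as the (F) flag in the output above suggests.)

$ sudo mdadm --detail /dev/md0                      # confirm which member is marked faulty
$ sudo mdadm --manage /dev/md0 --fail /dev/sdb1     # mark it failed (if not already)
$ sudo mdadm --manage /dev/md0 --remove /dev/sdb1   # remove it so the disk can be pulled
# after the replacement disk is partitioned, the new member would be re-added with:
# sudo mdadm --manage /dev/md0 --add /dev/sdb1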

Related Objects

Status: Resolved
Assigned: Jclark-ctr

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2019-07-22T07:54:48Z] <elukey> sudo -i depool on elastic1046 - broken disk (srv partition not available) - T228606

elukey@elastic1046:~$ sudo -i depool
Depooling all services on elastic1046.eqiad.wmnet
eqiad/elasticsearch/elasticsearch/elastic1046.eqiad.wmnet: pooled changed yes => no
eqiad/elasticsearch/elasticsearch-psi-ssl/elastic1046.eqiad.wmnet: pooled changed yes => no
eqiad/elasticsearch/elasticsearch-ssl/elastic1046.eqiad.wmnet: pooled changed yes => no
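
(For reference, a hedged sketch of how the depooled state could be double-checked with conftool; the exact confctl invocation below is an assumption, not something run in this ticket.)

$ sudo confctl select 'name=elastic1046.eqiad.wmnet' get
# each service entry for the host should now report "pooled": "no"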
elukey triaged this task as Medium priority. Jul 22 2019, 8:05 AM
This comment was removed by Cmjohnson.
Cmjohnson added a subscriber: wiki_willy.

This server is out of warranty (it ended April 2019). @wiki_willy, escalating to you to decide on the disks.

@elukey - since elastic1046 is just barely out of warranty (only by a few months), we'll still have to purchase a new disk for this server. Just double-checking that's the route you want to go, before we place the order.

Thanks,
Willy

Adding @dcausse to the conversation since @Gehel is on holiday. I would simply buy the disk now, but I'm not sure if elastic1046 is scheduled to be refreshed soon.

The host is not scheduled for replacement, @wiki_willy please proceed with the order of the disk :)

Thanks @elukey, subtask #T229017 has been opened to order the replacement drive with procurement. Assigning this task back to @Cmjohnson, for when the disk arrives onsite.

Thanks,
Willy

Confirmed by Chris that the drive arrived on August 8

@Cmjohnson thanks! I'll take it over and reimage

Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts:

['elastic1046.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201908271218_gehel_128010.log.

Completed auto-reimage of hosts:

['elastic1046.eqiad.wmnet']

Of which those FAILED:

['elastic1046.eqiad.wmnet']

@Cmjohnson: it looks like the installer only sees a single disk, and thus can't partition. Could you check? Thanks!

@Cmjohnson - could be the drive isn't seated securely, or possibly a loose cable/connection

@wiki_willy not really, but I reseated it anyway. As far as I can tell in the BIOS everything looks normal. I did swap the 2 disks. @Gehel, please try again.

I did notice that the SSDs are of different types:
the new SSD is a DC3320 series,
the old SSD is a DC3610 series.
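
(An aside, not from the ticket: one way to confirm from a shell that both SSDs are visible to the OS and to compare their models, assuming the host can still be booted into a rescue or installer environment.)

$ lsblk -d -o NAME,MODEL,SIZE      # both sda and sdb should be listed with their models
$ sudo smartctl -i /dev/sda        # model/serial of the first drive
$ sudo smartctl -i /dev/sdb        # model/serial of the second drive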

Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts:

['elastic1046.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201909130742_gehel_73868.log.

Completed auto-reimage of hosts:

['elastic1046.eqiad.wmnet']

Of which those FAILED:

['elastic1046.eqiad.wmnet']

Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts:

['elastic1046.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201909180937_gehel_148677.log.

Completed auto-reimage of hosts:

['elastic1046.eqiad.wmnet']

Of which those FAILED:

['elastic1046.eqiad.wmnet']

Tried again; it seems that sda has issues (see log below). Did the second disk also fail? Or was the wrong disk replaced? Or is it something else?

@Cmjohnson any idea what to do at this point?

Sep 18 09:41:13 kernel: [   50.172541] ata1: EH complete
Sep 18 09:41:13 kernel: [   50.172569] Dev sda: unable to read RDB block 0
Sep 18 09:41:13 kernel: [   50.183837] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 18 09:41:13 kernel: [   50.183840] ata1.00: irq_stat 0x40000001
Sep 18 09:41:13 kernel: [   50.183843] ata1.00: failed command: READ DMA
Sep 18 09:41:13 kernel: [   50.183853] ata1.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 13 dma 4096 in
Sep 18 09:41:13 kernel: [   50.183853]          res 53/10:00:00:00:00/00:00:00:00:00/00 Emask 0x81 (invalid argument)
Sep 18 09:41:13 kernel: [   50.183855] ata1.00: status: { DRDY SENSE ERR }
Sep 18 09:41:13 kernel: [   50.183857] ata1.00: error: { IDNF }
Sep 18 09:41:13 kernel: [   50.184508] ata1.00: configured for UDMA/100
Sep 18 09:41:13 kernel: [   50.184520] ata1: EH complete
Sep 18 09:41:13 kernel: [   50.195837] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Sep 18 09:41:13 kernel: [   50.195840] ata1.00: irq_stat 0x40000001
Sep 18 09:41:13 kernel: [   50.195844] ata1.00: failed command: READ DMA
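
(Not from the ticket: the IDNF / failed READ DMA errors above usually warrant a SMART check on the drive itself; a minimal sketch, assuming the drive still responds to SMART queries.)

$ sudo smartctl -H /dev/sda          # overall health self-assessment
$ sudo smartctl -A /dev/sda          # attributes: reallocated / pending sectors, etc.
$ sudo smartctl -l error /dev/sda    # the drive's own error log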

@Cmjohnson - let me know if we need to order a replacement drive (along with what type of disk), since it's out of warranty. Thanks, Willy

@wiki_willy can you order a new disk? I tried logging in but I'm being prompted for a password, so I cannot get the disk info.

wiki_willy added a subtask: Unknown Object (Task). Oct 22 2019, 4:10 PM

Procurement task created for Rob to order replacement drive. Thanks, Willy

Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts:

['elastic1046.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201911042249_gehel_234323.log.

Completed auto-reimage of hosts:

['elastic1046.eqiad.wmnet']

and were ALL successful.

Host reimaged, joined the cluster, repooled, no errors.

Netbox updated to reflect the host being active again.

It looks like we're all good!
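
(For the record, a hedged sketch of the post-reimage checks implied here; the pool command mirrors the depool shown earlier, and the Elasticsearch port is an assumption.)

$ cat /proc/mdstat                        # both arrays should list their members without (F), e.g. [UU] on md0
$ curl -s localhost:9200/_cat/health?v    # cluster status should be green before and after repooling
$ sudo -i pool                            # repool all services on the host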

Thanks @Gehel, thanks @Jclark-ctr - I'll go ahead and resolve this task. Thanks, Willy

wiki_willy closed subtask Unknown Object (Task) as Resolved. Nov 4 2019, 11:21 PM
wiki_willy mentioned this in Unknown Object (Task).

Change 548548 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] elasticsearch: ensure python prometheus client is installed

https://gerrit.wikimedia.org/r/548548

Change 548548 merged by Gehel:
[operations/puppet@production] elasticsearch: ensure python prometheus client is installed

https://gerrit.wikimedia.org/r/548548

Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Nov 13 2019, 4:01 PM