Degraded RAID on bast3001
Closed, Declined · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID was detected on host bast3001. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

Personalities : [raid1] 
md2 : active raid1 sda3[0] sdb3[1](F)
      438449152 blocks super 1.2 [2/1] [U_]
      bitmap: 2/4 pages [8KB], 65536KB chunk

md1 : active raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]
      
md0 : active raid1 sda1[2] sdb1[1](F)
      48794624 blocks super 1.2 [2/1] [U_]
      
unused devices: <none>
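
A snapshot like the one above can also be read mechanically. The following is a minimal sketch, not the event handler's actual code: it assumes the plain `/proc/mdstat` layout shown here (active raid1 arrays only), and the names `SAMPLE`, `parse_mdstat` and `degraded_arrays` are made up for illustration.

```python
import re

# Snapshot copied from the task description above.
SAMPLE = """\
Personalities : [raid1]
md2 : active raid1 sda3[0] sdb3[1](F)
      438449152 blocks super 1.2 [2/1] [U_]
      bitmap: 2/4 pages [8KB], 65536KB chunk

md1 : active raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[2] sdb1[1](F)
      48794624 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""


def parse_mdstat(text):
    """Map each md array to its members, failed members and [expected/active] counts."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"^(md\d+) : (\w+) (\w+) (.*)$", line)
        if header:
            current = header.group(1)
            # Members look like "sdb3[1]" with an optional "(F)" faulty flag.
            members = re.findall(r"(\w+)\[\d+\](\(F\))?", header.group(4))
            arrays[current] = {
                "members": [name for name, _ in members],
                "failed": [name for name, flag in members if flag],
                "expected": 0,
                "active": 0,
            }
            continue
        # Status looks like "[2/1] [U_]": expected devices / active devices.
        status = re.search(r"\[(\d+)/(\d+)\] \[([U_]+)\]", line)
        if status and current:
            arrays[current]["expected"] = int(status.group(1))
            arrays[current]["active"] = int(status.group(2))
    return arrays


def degraded_arrays(arrays):
    """Return the sorted names of arrays running with fewer devices than expected."""
    return sorted(n for n, a in arrays.items() if a["active"] < a["expected"])


if __name__ == "__main__":
    parsed = parse_mdstat(SAMPLE)
    for name in degraded_arrays(parsed):
        print(name, "degraded, failed members:", parsed[name]["failed"])
```

On this snapshot the sketch flags md0 and md2, with sdb1 and sdb3 as the failed members, while md1 stays healthy.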
Restricted Application added a subscriber: Aklapper. Jan 4 2017, 6:48 PM

We've seen this in the past with T152339: Degraded RAID on bast3001, and it looked like a controller failure.

akosiaris added a subscriber: akosiaris. Jan 5 2017, 11:10 AM

This time around sda does not exhibit any errors. I am the only one currently using this box, so it's a good time to reboot. I'll try to do the same dance as for T152339 and see what happens.

Mentioned in SAL (#wikimedia-operations) [2017-01-05T11:13:21Z] <akosiaris> rebooting bast3001, T154603

So, sda indeed did not exhibit any errors. sdb was kicked out of 2 of the 3 arrays (it was kept in the swap array as there was no read/write activity there), and I was unable to get any info from it using smartctl, while being able to get info from sda. After the reboot, smartctl is again able to get information from sdb. smartctl reports

SMART overall-health self-assessment test result: PASSED

which is not really definitive. In this case, both a short and an extended offline test succeeded without error. The disk does have some errors logged, but they date back 32 days, which is around the time of T152339.

I re-added (mdadm --add) the partitions to the 2 arrays. md0 has already resynced and the md2 resync is ongoing. Will monitor it, but things look fine currently.
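
For reference, the re-add step can be sketched as below. This is only an illustration under assumptions: the tool is taken to be mdadm, the array-to-partition mapping (md0/sdb1, md2/sdb3) is read off the mdstat snapshot in the description, and the helper name `readd_commands` is hypothetical. The sketch only builds the command lines; it does not run them.

```python
def readd_commands(partition_by_array):
    """Build one 'mdadm <array> --add <partition>' command per degraded array.

    partition_by_array maps an md array name to the partition that was
    kicked out of it, e.g. {"md0": "sdb1"}. Commands are returned as
    argument lists so they could be passed to subprocess.run() as root.
    """
    return [
        ["mdadm", f"/dev/{array}", "--add", f"/dev/{partition}"]
        for array, partition in sorted(partition_by_array.items())
    ]


if __name__ == "__main__":
    # Mapping assumed from the snapshot: sdb1 failed in md0, sdb3 in md2.
    for cmd in readd_commands({"md0": "sdb1", "md2": "sdb3"}):
        print(" ".join(cmd))
```

Resync progress can then be followed in /proc/mdstat.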

The resync is done and neither sda nor sdb logged any kind of errors during the resync process, which further reinforces the controller issue theory. That being said, I don't see any hardware RAID controller on that box; it looks like the disks are attached directly to the motherboard via the standard ICH10 SATA controllers. The box seems to have a 4 port SATA controller and a 2 port SATA controller, so I doubt this is something we can change without changing the motherboard.

akosiaris closed this task as Resolved. Jan 5 2017, 2:39 PM
akosiaris claimed this task.

I think I am gonna resolve this for now and decide how to act on this if it happens again.

akosiaris reopened this task as Open. Jan 11 2017, 3:21 PM

Reopening, smartd just sent the following for bast3001:

Device: /dev/sdb [SAT], 1 Currently unreadable (pending) sectors

Device info:
WDC WD5002ABYS-18B1B0, S/N:WD-WCASYA596674, WWN:5-0014ee-2041e23af, FW:02.03B04, 500 GB

md has not yet kicked the drive out of the array, but I think this will happen soon. I'll be monitoring.

Dzahn added a subscriber: Dzahn. Jan 23 2017, 9:37 PM
This message was generated by the smartd daemon running on:

   host name:  bast3001
   DNS domain: wikimedia.org

The following warning/error was logged by the smartd daemon:

Device: /dev/sdb [SAT], ATA error count increased from 2737 to 2741

Device info:
WDC WD5002ABYS-18B1B0, S/N:WD-WCASYA596674, WWN:5-0014ee-2041e23af, FW:02.03B04, 500 GB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Sat Jan 14 04:15:05 2017 UTC

Yeah, this has been happening for days. The disk is not yet kicked out of the array, which baffles me since dmesg has many

[1636325.780704] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[1636325.787315] ata2.00: BMDMA stat 0x24
[1636325.791063] ata2.00: failed command: READ DMA
[1636325.795590] ata2.00: cmd c8/00:08:98:0d:81/00:00:00:00:00/e4 tag 0 dma 4096 in
         res 51/40:00:98:0d:81/40:00:0e:00:00/e4 Emask 0x9 (media error)
[1636325.810858] ata2.00: status: { DRDY ERR }
[1636325.815033] ata2.00: error: { UNC }
[1636325.993242] ata2.00: configured for UDMA/133
[1636325.993263] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1636325.993268] sd 1:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current] [descriptor] 
[1636325.993272] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
[1636325.993277] sd 1:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 04 81 0d 98 00 00 08 00
[1636325.993279] blk_update_request: I/O error, dev sdb, sector 75566488
[1636325.999861] ata2: EH complete
[1636326.071504] md/raid1:md0: read error corrected (8 sectors at 75564440 on sdb1)

log entries.
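
As a cross-check of the numbers in that excerpt, the Read(10) CDB can be decoded by hand. The sketch below assumes the standard Read(10) layout (opcode 0x28, big-endian LBA in bytes 2-5, transfer length in bytes 7-8) and 512-byte sectors; `decode_read10` is a name made up for this note. The 2048-sector gap between the sdb and sdb1 addresses would be consistent with sdb1 starting at sector 2048, but that is an assumption, since the partition table is not shown in this task.

```python
def decode_read10(cdb_hex):
    """Decode a SCSI Read(10) CDB given as space-separated hex bytes.

    Returns (lba, sector_count). Bytes 2-5 hold the big-endian logical
    block address, bytes 7-8 the big-endian transfer length.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a Read(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")
    count = int.from_bytes(cdb[7:9], "big")
    return lba, count


if __name__ == "__main__":
    # CDB from the "[sdb] tag#0 CDB: Read(10) ..." line above.
    lba, count = decode_read10("28 00 04 81 0d 98 00 00 08 00")
    print("LBA on sdb:", lba)              # same sector as the blk_update_request line
    print("bytes requested:", count * 512)  # assuming 512-byte sectors

    md_sector = 75564440                    # from the md/raid1 "read error corrected" line
    print("offset between sdb and sdb1 addresses:", lba - md_sector)
```

The decoded LBA is 75566488, which matches the sector in the blk_update_request line, and the 8 sectors requested match both the "dma 4096 in" of the failed command and the 8 sectors md reports as corrected.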

Anyway, this time around it seems like the problem is the disk, not the controller. We could force-kick the disk out of the array; it's probably useless anyway.

Mentioned in SAL (#wikimedia-operations) [2017-01-24T09:53:17Z] <akosiaris> mark /dev/sdb as faulty on md devices on bast3001 T154603

Forced the disk into the failed state. I suppose we should schedule a replacement. In the meantime bast3001 will work at reduced redundancy, which is fine given that we have another 3 bast boxes.

jcrespo removed akosiaris as the assignee of this task. Feb 21 2017, 8:42 AM
jcrespo added a subscriber: jcrespo.

I arrived here through cronspam and added T156506 as a subtask (meaning this task depends on it; it is obviously not a real subtask) so others do not lose time next time.

ema triaged this task as Normal priority. Feb 27 2017, 1:41 PM
Dzahn raised the priority of this task from Normal to High. Mar 2 2017, 9:39 PM

This has been replaced by bast3002 (T156506) and there is now a decom task at T159480. After decom is finished this can be closed too.

Dzahn lowered the priority of this task from High to Low. Mar 2 2017, 9:39 PM
Dzahn closed this task as Declined. Mar 6 2017, 10:25 PM

Declined. We shut down bast3001 and replaced it with bast3002, and this hardware will be removed eventually.