Degraded RAID on bast3001
Closed, Declined · Public
Assigned To

None

Authored By

	ops-monitoring-bot
	Jan 4 2017, 6:48 PM

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID was detected on host bast3001. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

Personalities : [raid1] 
md2 : active raid1 sda3[0] sdb3[1](F)
      438449152 blocks super 1.2 [2/1] [U_]
      bitmap: 2/4 pages [8KB], 65536KB chunk

md1 : active raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]
      
md0 : active raid1 sda1[2] sdb1[1](F)
      48794624 blocks super 1.2 [2/1] [U_]
      
unused devices: <none>
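As an illustration of how to read the snapshot above: each array's status line carries an `[expected/active]` device count, and failed members are tagged `(F)`. The following is a minimal sketch, not the event handler's actual code; the function name `degraded_arrays` and the regular expressions are assumptions made for this example.

```python
import re

# Snapshot copied from this task's description.
MDSTAT = """\
Personalities : [raid1]
md2 : active raid1 sda3[0] sdb3[1](F)
      438449152 blocks super 1.2 [2/1] [U_]
      bitmap: 2/4 pages [8KB], 65536KB chunk

md1 : active raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[2] sdb1[1](F)
      48794624 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Map each degraded array to the members flagged (F) on its header line."""
    result = {}
    current = None
    members = []
    for line in mdstat_text.splitlines():
        # Header line, e.g. "md2 : active raid1 sda3[0] sdb3[1](F)"
        header = re.match(r"^(md\d+) : \S+ \S+ (.*)$", line)
        if header:
            current = header.group(1)
            members = header.group(2).split()
            continue
        # Status line, e.g. "... [2/1] [U_]" means 2 expected, 1 active
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if current and counts:
            expected, active = int(counts.group(1)), int(counts.group(2))
            if active < expected:
                result[current] = [
                    re.sub(r"\[\d+\]\(F\)$", "", m)
                    for m in members
                    if m.endswith("(F)")
                ]
            current = None
    return result


print(degraded_arrays(MDSTAT))  # {'md2': ['sdb3'], 'md0': ['sdb1']}
```

For this snapshot the sketch reports md2 and md0 as degraded with sdb3 and sdb1 failed, while md1 (the swap array) still shows both members.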

Related Objects

Status	Assigned	Task
Declined	None	T154603 Degraded RAID on bast3001
Resolved	Dzahn	T156506 Replace bast3001
Resolved	Papaul	T159480 Decommission bast3001

Event Timeline

ops-monitoring-bot added projects: SRE, ops-esams. Jan 4 2017, 6:48 PM

ops-monitoring-bot subscribed.

Restricted Application added a subscriber: Aklapper. Jan 4 2017, 6:48 PM

We've seen this in the past with T152339: Degraded RAID on bast3001, and it looked like a controller failure.

This time around sda does not exhibit any errors. I am the only one currently using this box, so it's a good time to reboot. I'll try to do the same dance as for T152339 and see what happens.

Mentioned in SAL (#wikimedia-operations) [2017-01-05T11:13:21Z] <akosiaris> rebooting bast3001, T154603

So, sda indeed did not exhibit any errors. sdb was kicked out of 2 of the 3 arrays (it was kept in the swap array, as there was no read/write activity there), and I was unable to get any info from it using smartctl, while being able to get info from sda. After the reboot, smartctl is again able to get information from sdb. smartctl reports:

SMART overall-health self-assessment test result: PASSED

which is not really definitive. In this case, however, both a short and an extended offline test succeeded without error. The disk does have some errors logged, but they date back 32 days, which is around the time of T152339.

I re-added (mdadm --add) the partitions to the 2 arrays. md0 has already resynced and the md2 resync is ongoing. Will monitor it, but things look fine currently.
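For reference, the re-add step can be sketched as a dry run that only builds the command lines and executes nothing. The array-to-partition mapping comes from the mdstat snapshot in the description (sdb1 in md0, sdb3 in md2); the exact invocation used on the host is not recorded in this task, so the `--manage ... --add` form and the helper name `readd_commands` are assumptions.

```python
def readd_commands(array_to_partition):
    """Build (but do not run) one 'mdadm --manage ... --add' argv per array."""
    return [
        ["mdadm", "--manage", f"/dev/{array}", "--add", f"/dev/{part}"]
        for array, part in array_to_partition.items()
    ]


# Partitions that were kicked out, per the snapshot in the description.
for argv in readd_commands({"md0": "sdb1", "md2": "sdb3"}):
    print(" ".join(argv))
```

Keeping the commands as argv lists makes it easy to review them before handing them to something like `subprocess.run` as root.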

The resync is done, and neither sda nor sdb logged any kind of error during the resync process, which reinforces the controller issue theory. That being said, I don't see any hardware RAID controller on that box; it looks like the disks are attached directly to the motherboard via the standard ICH10 SATA controllers. The box seems to have a 4 port SATA controller and a 2 port SATA controller, so I doubt this is something we can change without changing the motherboard.

I think I am gonna resolve this for now and decide how to act on this if it happens again.

Reopening, smartd just sent the following for bast3001

Device: /dev/sdb [SAT], 1 Currently unreadable (pending) sectors

Device info:
WDC WD5002ABYS-18B1B0, S/N:WD-WCASYA596674, WWN:5-0014ee-2041e23af, FW:02.03B04, 500 GB

md has not yet kicked the drive out of the array, but I am thinking this will happen soon. I'll be monitoring.

This message was generated by the smartd daemon running on:

   host name:  bast3001
   DNS domain: wikimedia.org

The following warning/error was logged by the smartd daemon:

Device: /dev/sdb [SAT], ATA error count increased from 2737 to 2741

Device info:
WDC WD5002ABYS-18B1B0, S/N:WD-WCASYA596674, WWN:5-0014ee-2041e23af, FW:02.03B04, 500 GB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Sat Jan 14 04:15:05 2017 UTC

Yeah, this has been happening for days. The disk is not yet kicked out of the array, which baffles me, since dmesg has many

[1636325.780704] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[1636325.787315] ata2.00: BMDMA stat 0x24
[1636325.791063] ata2.00: failed command: READ DMA
[1636325.795590] ata2.00: cmd c8/00:08:98:0d:81/00:00:00:00:00/e4 tag 0 dma 4096 in
         res 51/40:00:98:0d:81/40:00:0e:00:00/e4 Emask 0x9 (media error)
[1636325.810858] ata2.00: status: { DRDY ERR }
[1636325.815033] ata2.00: error: { UNC }
[1636325.993242] ata2.00: configured for UDMA/133
[1636325.993263] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1636325.993268] sd 1:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current] [descriptor] 
[1636325.993272] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
[1636325.993277] sd 1:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 04 81 0d 98 00 00 08 00
[1636325.993279] blk_update_request: I/O error, dev sdb, sector 75566488
[1636325.999861] ata2: EH complete
[1636326.071504] md/raid1:md0: read error corrected (8 sectors at 75564440 on sdb1)

log entries.
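As a cross-check on the log excerpt above, the SCSI CDB in the `Read(10)` line can be decoded by hand: bytes 2 to 5 hold the big-endian logical block address and bytes 7 to 8 the transfer length. The sketch below does that decoding; the helper name `decode_read10` is made up for this example.

```python
def decode_read10(cdb_hex):
    """Decode a SCSI READ(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2..5: logical block address
    length = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length in blocks
    return lba, length


print(decode_read10("28 00 04 81 0d 98 00 00 08 00"))  # (75566488, 8)
```

The decoded address matches the `blk_update_request` sector 75566488, and the length matches the 8 sectors md reports as corrected. The md line gives 75564440 on sdb1, which is 2048 sectors lower; that would fit sdb1 starting at sector 2048, though the partition table is not shown in this task.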

Anyway, this time around it seems like the problem is the disk, not the controller. We could force-kick the disk out of the array; it's probably useless anyway.

Mentioned in SAL (#wikimedia-operations) [2017-01-24T09:53:17Z] <akosiaris> mark /dev/sdb as faulty on md devices on bast3001 T154603

Forced the disk to failed state. I suppose we should schedule a replacement. In the meantime bast3001 will work at reduced redundancy, which is fine given we have another 3 bast boxes.

akosiaris merged a task: T156116: Degraded RAID on bast3001. Jan 24 2017, 9:56 AM

faidon mentioned this in T156506: Replace bast3001. Jan 27 2017, 5:45 PM

I arrived here through cronspam and added T156506 as a subtask (meaning this task depends on it; it is obviously not a real subtask) so others do not lose time next time.

ema triaged this task as Medium priority. Feb 27 2017, 1:41 PM

Dzahn added a subtask: T159480: Decommission bast3001. Mar 2 2017, 9:37 PM