bast3002 sdb broken
Closed, ResolvedPublic

Description

Spotted this today on bast3002: sdb is extremely slow and broken, though it hasn't been kicked out of the array by mdadm yet

[10660197.262622] ata2: EH complete
[10660199.576662] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[10660199.583362] ata2.00: BMDMA stat 0x24
[10660199.587195] ata2.00: failed command: READ DMA EXT
[10660199.592156] ata2.00: cmd 25/00:08:20:0a:1d/00:00:13:00:00/e0 tag 0 dma 4096 in
         res 51/40:00:20:0a:1d/40:00:13:00:00/e0 Emask 0x9 (media error)
[10660199.607631] ata2.00: status: { DRDY ERR }
[10660199.611893] ata2.00: error: { UNC }
[10660199.789271] ata2.00: configured for UDMA/133
[10660199.789293] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[10660199.789298] sd 1:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current] [descriptor] 
[10660199.789302] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
[10660199.789307] sd 1:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 13 1d 0a 20 00 00 08 00
[10660199.789310] blk_update_request: I/O error, dev sdb, sector 320670240
[10660199.795934] ata2: EH complete
[10660199.819545] md/raid1:md2: read error corrected (8 sectors at 221059616 on sdb3)
[10660202.044363] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[10660202.051057] ata2.00: BMDMA stat 0x24
[10660202.054893] ata2.00: failed command: READ DMA EXT
[10660202.059857] ata2.00: cmd 25/00:08:28:0a:1d/00:00:13:00:00/e0 tag 0 dma 4096 in
         res 51/40:00:28:0a:1d/40:00:13:00:00/e0 Emask 0x9 (media error)
[10660202.075303] ata2.00: status: { DRDY ERR }
[10660202.079571] ata2.00: error: { UNC }
[10660202.256640] ata2.00: configured for UDMA/133
[10660202.256659] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[10660202.256664] sd 1:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current] [descriptor] 
[10660202.256668] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
[10660202.256673] sd 1:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 13 1d 0a 28 00 00 08 00
[10660202.256675] blk_update_request: I/O error, dev sdb, sector 320670248
[10660202.263329] ata2: EH complete
[10660202.304770] md/raid1:md2: read error corrected (8 sectors at 221059624 on sdb3)
[10660202.304785] md/raid1:md2: redirecting sector 220797440 to other mirror: sda3
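
The array still reports healthy even though the kernel is correcting reads via the other mirror. A minimal sketch of how one might confirm that mdadm hasn't flagged the disk yet (device names taken from the log above; the exact array layout is an assumption):

    # overall software-RAID state; a failed member would be marked (F)
    cat /proc/mdstat
    # per-array detail for the array named in the log
    mdadm --detail /dev/md2
    # the drive's own error counters (assumes smartmontools is installed)
    smartctl -a /dev/sdb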

Event Timeline

ema triaged this task as High priority. Jun 29 2017, 7:02 AM

Today, Jun 29th at 07:34 AM, bast3002 was entirely unreachable for about 3 minutes. During that time I logged in on the console and found kernel logs such as those posted by @fgiunchedi above. @Volans suggested that we mark the hard drive as failed, which seems like a good idea given that mdadm hasn't noticed anything yet.

Mentioned in SAL (#wikimedia-operations) [2017-06-29T13:54:15Z] <godog> kick sdb out of mdadm arrays on bast3002 - T169035
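
For reference, kicking a member out of an array typically means failing it and then removing it, repeated for each array the disk participates in. The sdb3/md2 pairing below is taken from the log above; the rest of the mapping is an assumption:

    # mark the member as failed, then remove it from the array
    mdadm /dev/md2 --fail /dev/sdb3
    mdadm /dev/md2 --remove /dev/sdb3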

I've commented out the MAILADDR line to avoid getting one email per day. Given that we also have the Icinga check, we could consider commenting it out broadly across the fleet. The file is currently not managed by puppet.
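
The change amounts to commenting out a single line in /etc/mdadm/mdadm.conf (Debian path; the root recipient shown here is an assumption):

    # MAILADDR tells mdadm --monitor where to send alert mail; commented out
    # to stop the daily message
    #MAILADDR root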

And of course that was not enough: I also had to add an exit 0 to /etc/cron.daily/mdadm to prevent it from running, since without the MAILADDR setting the report check refuses to run and generates cronspam.
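
A sketch of what the top of the neutered cron script might look like after that workaround (the remainder of the distro-shipped script is left in place below the early exit):

    #!/bin/sh
    # temporary: skip the daily mdadm report; with MAILADDR commented out it
    # refuses to run and generates cronspam instead - see T169564
    exit 0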

Opened T169564 for the mdadm configuration.

These systems have 3.5" drives, not hot-swap. We have a lot of SFF spares (mostly SSDs), but no LFF, and these are well out of warranty. I could steal a drive from one of the (many) other decom'ed servers there - but then isn't it easier to simply install/designate one of the other machines as bastion until we refresh the misc cluster soon?

> isn't it easier to simply install/designate one of the other machines as bastion until we refresh the misc cluster soon?

See subtask, I would just take the first of the recently decom'ed swift boxes as a temporary fix, ok?

bast3002 (aka hooft) sdb has been swapped with amslvs3's sdb (both LFF, non-hotswap). The server has booted back up, and sdb is being repartitioned and added to the RAID arrays.
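
Repartitioning and re-adding a replacement disk to the mirrors typically goes something like this (the sfdisk table copy assumes MBR partitioning, and the partition-to-array mapping is an assumption):

    # clone the partition table from the healthy disk onto the replacement
    sfdisk -d /dev/sda | sfdisk /dev/sdb
    # re-add each partition to its array, then watch the resync
    mdadm /dev/md0 --add /dev/sdb1
    mdadm /dev/md1 --add /dev/sdb2
    mdadm /dev/md2 --add /dev/sdb3
    cat /proc/mdstat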

mark claimed this task.