
bast3002 sdb broken
Closed, Resolved · Public

Description

Spotted this today on bast3002: sdb is extremely slow and throwing media errors, but mdadm hasn't kicked it out of the array yet.

[10660197.262622] ata2: EH complete
[10660199.576662] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[10660199.583362] ata2.00: BMDMA stat 0x24
[10660199.587195] ata2.00: failed command: READ DMA EXT
[10660199.592156] ata2.00: cmd 25/00:08:20:0a:1d/00:00:13:00:00/e0 tag 0 dma 4096 in
         res 51/40:00:20:0a:1d/40:00:13:00:00/e0 Emask 0x9 (media error)
[10660199.607631] ata2.00: status: { DRDY ERR }
[10660199.611893] ata2.00: error: { UNC }
[10660199.789271] ata2.00: configured for UDMA/133
[10660199.789293] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[10660199.789298] sd 1:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current] [descriptor] 
[10660199.789302] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
[10660199.789307] sd 1:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 13 1d 0a 20 00 00 08 00
[10660199.789310] blk_update_request: I/O error, dev sdb, sector 320670240
[10660199.795934] ata2: EH complete
[10660199.819545] md/raid1:md2: read error corrected (8 sectors at 221059616 on sdb3)
[10660202.044363] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[10660202.051057] ata2.00: BMDMA stat 0x24
[10660202.054893] ata2.00: failed command: READ DMA EXT
[10660202.059857] ata2.00: cmd 25/00:08:28:0a:1d/00:00:13:00:00/e0 tag 0 dma 4096 in
         res 51/40:00:28:0a:1d/40:00:13:00:00/e0 Emask 0x9 (media error)
[10660202.075303] ata2.00: status: { DRDY ERR }
[10660202.079571] ata2.00: error: { UNC }
[10660202.256640] ata2.00: configured for UDMA/133
[10660202.256659] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[10660202.256664] sd 1:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current] [descriptor] 
[10660202.256668] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
[10660202.256673] sd 1:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 13 1d 0a 28 00 00 08 00
[10660202.256675] blk_update_request: I/O error, dev sdb, sector 320670248
[10660202.263329] ata2: EH complete
[10660202.304770] md/raid1:md2: read error corrected (8 sectors at 221059624 on sdb3)
[10660202.304785] md/raid1:md2: redirecting sector 220797440 to other mirror: sda3
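
(For reference: a quick way to confirm these media errors from the drive side is a SMART query. A minimal sketch, assuming smartmontools is installed; /dev/sdb is the device from the log above.)

# Dump SMART health, attributes and the ATA error log for the failing drive;
# look at Reallocated_Sector_Ct, Current_Pending_Sector and recent UNC errors.
smartctl -H -A -l error /dev/sdb

# Optionally run a short self-test and read the result a few minutes later.
smartctl -t short /dev/sdb
smartctl -l selftest /dev/sdb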

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jun 28 2017, 8:13 AM
ema triaged this task as High priority. · Jun 29 2017, 7:02 AM
ema added subscribers: Volans, ema. · Jun 29 2017, 7:06 AM

Today, Jun 29th at 07:34 AM, bast3002 was entirely unreachable for about 3 minutes. During that time I logged in on the console and found kernel logs like those posted by @fgiunchedi above. @Volans suggested we mark the hard drive as failed, which seems like a good idea given that mdadm hasn't noticed anything yet.

Mentioned in SAL (#wikimedia-operations) [2017-06-29T13:54:15Z] <godog> kick sdb out of mdadm arrays on bast3002 - T169035
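
Kicking the disk out amounts to failing and then removing its member from each array; a minimal sketch of what that looks like (md2/sdb3 is taken from the log above, the other array/partition pairs would be analogous):

# Mark the sdb member as failed, then remove it from the array.
mdadm /dev/md2 --fail /dev/sdb3
mdadm /dev/md2 --remove /dev/sdb3

# Repeat for the remaining arrays and their sdb partitions, then verify:
cat /proc/mdstat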

Volans added a comment. · Jul 3 2017, 7:16 AM

I've commented out the MAILADDR line to avoid getting one email per day. Given that we also have the Icinga check, we could consider commenting it out broadly across the fleet. The file is currently not managed by Puppet.
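
For context, the change is a one-line comment in the mdadm config, sketched below; the Debian-default path /etc/mdadm/mdadm.conf is an assumption:

# In /etc/mdadm/mdadm.conf: disable mdadm's own mail notifications,
# since Icinga already alerts on array state.
# MAILADDR root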

Volans added a comment. · Jul 3 2017, 7:33 AM

And of course that was not enough: I also had to add an exit 0 to /etc/cron.daily/mdadm to prevent it from running, because without the MAILADDR setting the report check refuses to run and generates cronspam.
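
The workaround amounts to short-circuiting the cron script near the top, roughly as sketched below (the rest of the distro-shipped script is left in place under the early exit):

#!/bin/sh
# /etc/cron.daily/mdadm -- disabled locally: without a MAILADDR setting the
# daily report refuses to run and cronspams root. See T169564 for the real fix.
exit 0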

Volans added a comment. · Jul 3 2017, 6:17 PM

Opened T169564 for the mdadm configuration.

Volans added a subscriber: mark. · Aug 28 2017, 4:31 PM
mark added a comment. · Aug 30 2017, 1:06 PM

These systems use 3.5" drives, not hot-swap. We have a lot of SFF spares (mostly SSDs), but no LFF, and these are well out of warranty. I could steal a drive from one of the (many) other decom'ed servers there, but then isn't it easier to simply install/designate one of the other machines as bastion until we refresh the misc cluster soon?

mark moved this task from Backlog to Break/Fix on the ops-esams board. · Jan 3 2018, 1:24 PM
Dzahn added a subscriber: Dzahn. · Jan 15 2018, 5:06 PM
Dzahn added a comment. · Jan 19 2018, 1:19 AM

> isn't it easier to simply install/designate one of the other machines as bastion until we refresh the misc cluster soon?

See subtask; I would just take the first of the recently decom'ed Swift boxes as a temp fix, ok?

mark moved this task from Break/Fix to Next visit on the ops-esams board. · Jul 3 2018, 1:04 PM
mark added a comment. · Jul 4 2018, 12:45 PM

bast3002 (aka hooft) sdb has been swapped with amslvs3's sdb (both LFF, non-hotswap). The server has booted back up, and sdb is being repartitioned and added to the RAID arrays.
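
For the record, repartitioning and re-adding a replacement mirror member typically looks like the sketch below. The md0/sdb1 and md1/sdb2 pairings are assumptions (only md2/sdb3 appears in the logs above), and an MBR/DOS partition table is assumed for the sfdisk copy:

# Copy the partition table from the healthy disk to the replacement.
sfdisk -d /dev/sda | sfdisk /dev/sdb

# Add the new partitions back into their arrays; md resyncs in the background.
mdadm /dev/md0 --add /dev/sdb1
mdadm /dev/md1 --add /dev/sdb2
mdadm /dev/md2 --add /dev/sdb3

# Watch the rebuild:
cat /proc/mdstat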

mark closed this task as Resolved. · Jul 4 2018, 1:30 PM
mark claimed this task.
Dzahn awarded a token. · Jul 24 2018, 4:39 PM