
bast3002 sdb broken
Closed, Resolved · Public

Description

Spotted this today on bast3002: sdb is extremely slow and throwing media errors, but mdadm hasn't kicked it out of the array yet.

[10660197.262622] ata2: EH complete
[10660199.576662] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[10660199.583362] ata2.00: BMDMA stat 0x24
[10660199.587195] ata2.00: failed command: READ DMA EXT
[10660199.592156] ata2.00: cmd 25/00:08:20:0a:1d/00:00:13:00:00/e0 tag 0 dma 4096 in
         res 51/40:00:20:0a:1d/40:00:13:00:00/e0 Emask 0x9 (media error)
[10660199.607631] ata2.00: status: { DRDY ERR }
[10660199.611893] ata2.00: error: { UNC }
[10660199.789271] ata2.00: configured for UDMA/133
[10660199.789293] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[10660199.789298] sd 1:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current] [descriptor] 
[10660199.789302] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
[10660199.789307] sd 1:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 13 1d 0a 20 00 00 08 00
[10660199.789310] blk_update_request: I/O error, dev sdb, sector 320670240
[10660199.795934] ata2: EH complete
[10660199.819545] md/raid1:md2: read error corrected (8 sectors at 221059616 on sdb3)
[10660202.044363] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[10660202.051057] ata2.00: BMDMA stat 0x24
[10660202.054893] ata2.00: failed command: READ DMA EXT
[10660202.059857] ata2.00: cmd 25/00:08:28:0a:1d/00:00:13:00:00/e0 tag 0 dma 4096 in
         res 51/40:00:28:0a:1d/40:00:13:00:00/e0 Emask 0x9 (media error)
[10660202.075303] ata2.00: status: { DRDY ERR }
[10660202.079571] ata2.00: error: { UNC }
[10660202.256640] ata2.00: configured for UDMA/133
[10660202.256659] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[10660202.256664] sd 1:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current] [descriptor] 
[10660202.256668] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
[10660202.256673] sd 1:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 13 1d 0a 28 00 00 08 00
[10660202.256675] blk_update_request: I/O error, dev sdb, sector 320670248
[10660202.263329] ata2: EH complete
[10660202.304770] md/raid1:md2: read error corrected (8 sectors at 221059624 on sdb3)
[10660202.304785] md/raid1:md2: redirecting sector 220797440 to other mirror: sda3
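
(For reference: a quick way to confirm these media errors from the drive side is a SMART query. A minimal sketch, assuming smartmontools is installed; /dev/sdb is the device from the log above.)

# Dump SMART health, attributes and the ATA error log for the failing drive;
# look at Reallocated_Sector_Ct, Current_Pending_Sector and recent UNC errors.
smartctl -H -A -l error /dev/sdb

# Optionally run a short self-test and read the result a few minutes later.
smartctl -t short /dev/sdb
smartctl -l selftest /dev/sdb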

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jun 28 2017, 8:13 AM
ema triaged this task as High priority. · Jun 29 2017, 7:02 AM
ema added subscribers: Volans, ema. · Jun 29 2017, 7:06 AM

Today, Jun 29th at 07:34 AM, bast3002 was entirely unreachable for about 3 minutes. During that time I logged in on the console and found kernel logs like those posted by @fgiunchedi above. @Volans suggested we mark the hard drive as failed, which seems like a good idea given that mdadm hasn't noticed anything yet.

Mentioned in SAL (#wikimedia-operations) [2017-06-29T13:54:15Z] <godog> kick sdb out of mdadm arrays on bast3002 - T169035
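
Kicking the disk out amounts to failing and then removing its member from each array; a minimal sketch of what that looks like (md2/sdb3 is taken from the log above, the other array/partition pairs would be analogous):

# Mark the sdb member as failed, then remove it from the array.
mdadm /dev/md2 --fail /dev/sdb3
mdadm /dev/md2 --remove /dev/sdb3

# Repeat for the remaining arrays and their sdb partitions, then verify:
cat /proc/mdstat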

Volans added a comment. · Jul 3 2017, 7:16 AM

I've commented out the MAILADDR line to avoid getting one email per day. Given that we also have the Icinga check, we could consider commenting it out broadly across the fleet. The file is currently not managed by Puppet.
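
For context, the change is a one-line comment in the mdadm config, sketched below; the Debian-default path /etc/mdadm/mdadm.conf is an assumption:

# In /etc/mdadm/mdadm.conf: disable mdadm's own mail notifications,
# since Icinga already alerts on array state.
# MAILADDR root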

Volans added a comment. · Jul 3 2017, 7:33 AM

And of course that was not enough: I also had to add an exit 0 to /etc/cron.daily/mdadm to prevent it from running, because without the MAILADDR setting the report check refuses to run and generates cronspam.
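
The workaround amounts to short-circuiting the cron script near the top, roughly as sketched below (the rest of the distro-shipped script is left in place under the early exit):

#!/bin/sh
# /etc/cron.daily/mdadm -- disabled locally: without a MAILADDR setting the
# daily report refuses to run and cronspams root. See T169564 for the real fix.
exit 0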

Volans added a comment. · Jul 3 2017, 6:17 PM

Opened T169564 for the mdadm configuration.

Volans added a subscriber: mark. · Aug 28 2017, 4:31 PM
mark added a comment. · Aug 30 2017, 1:06 PM

These systems use 3.5" drives, not hot-swap. We have a lot of SFF spares (mostly SSDs), but no LFF, and these are well out of warranty. I could steal a drive from one of the (many) other decom'ed servers there, but then isn't it easier to simply install/designate one of the other machines as bastion until we refresh the misc cluster soon?

mark moved this task from Backlog to Break/Fix on the ops-esams board. · Jan 3 2018, 1:24 PM
Dzahn added a subscriber: Dzahn. · Jan 15 2018, 5:06 PM
Dzahn added a comment. · Jan 19 2018, 1:19 AM

> isn't it easier to simply install/designate one of the other machines as bastion until we refresh the misc cluster soon?

See subtask; I would just take the first of the recently decom'ed Swift boxes as a temp fix, ok?

mark moved this task from Break/Fix to Next visit on the ops-esams board. · Jul 3 2018, 1:04 PM
mark added a comment. · Jul 4 2018, 12:45 PM

bast3002 (aka hooft) sdb has been swapped with amslvs3's sdb (both LFF, non-hotswap). The server has booted back up, and sdb is being repartitioned and added to the RAID arrays.
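
For the record, repartitioning and re-adding a replacement mirror member typically looks like the sketch below. The md0/sdb1 and md1/sdb2 pairings are assumptions (only md2/sdb3 appears in the logs above), and an MBR/DOS partition table is assumed for the sfdisk copy:

# Copy the partition table from the healthy disk to the replacement.
sfdisk -d /dev/sda | sfdisk /dev/sdb

# Add the new partitions back into their arrays; md resyncs in the background.
mdadm /dev/md0 --add /dev/sdb1
mdadm /dev/md1 --add /dev/sdb2
mdadm /dev/md2 --add /dev/sdb3

# Watch the rebuild:
cat /proc/mdstat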

mark closed this task as Resolved. · Jul 4 2018, 1:30 PM
mark claimed this task.
Dzahn awarded a token. · Jul 24 2018, 4:39 PM