Medium error reported for sda on elastic2045
Closed, Resolved · Public

Description

Hi!

I just rebooted elastic2045 because of some issues, and when it came back up I noticed the following in dmesg:

[Mon Feb 22 07:16:33 2021] ata3.00: status: { DRDY }
[Mon Feb 22 07:16:33 2021] ata3: hard resetting link
[Mon Feb 22 07:16:34 2021] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Mon Feb 22 07:16:34 2021] ata3.00: configured for UDMA/133
[Mon Feb 22 07:16:34 2021] ata3: EH complete
[Mon Feb 22 07:16:34 2021] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[Mon Feb 22 07:16:34 2021] ata3.00: irq_stat 0x40000001
[Mon Feb 22 07:16:34 2021] ata3.00: failed command: READ DMA
[Mon Feb 22 07:16:34 2021] ata3.00: cmd c8/00:08:08:08:00/00:00:00:00:00/e0 tag 3 dma 4096 in
                                    res 51/40:08:09:08:00/00:00:00:00:00/e0 Emask 0x9 (media error)
[Mon Feb 22 07:16:34 2021] ata3.00: status: { DRDY ERR }
[Mon Feb 22 07:16:34 2021] ata3.00: error: { UNC }
[Mon Feb 22 07:16:34 2021] ata3.00: configured for UDMA/133
[Mon Feb 22 07:16:34 2021] sd 2:0:0:0: [sda] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Mon Feb 22 07:16:34 2021] sd 2:0:0:0: [sda] tag#3 Sense Key : Medium Error [current] 
[Mon Feb 22 07:16:34 2021] sd 2:0:0:0: [sda] tag#3 Add. Sense: Unrecovered read error - auto reallocate failed
[Mon Feb 22 07:16:34 2021] sd 2:0:0:0: [sda] tag#3 CDB: Read(10) 28 00 00 00 08 08 00 00 08 00
[Mon Feb 22 07:16:34 2021] blk_update_request: I/O error, dev sda, sector 2057
[Mon Feb 22 07:16:34 2021] ata3: EH complete
[Mon Feb 22 07:16:34 2021] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[Mon Feb 22 07:16:34 2021] ata3.00: irq_stat 0x40000001
[Mon Feb 22 07:16:34 2021] ata3.00: failed command: READ DMA
[Mon Feb 22 07:16:34 2021] ata3.00: cmd c8/00:08:08:08:00/00:00:00:00:00/e0 tag 7 dma 4096 in
                                    res 51/40:08:09:08:00/00:00:00:00:00/e0 Emask 0x9 (media error)
[Mon Feb 22 07:16:34 2021] ata3.00: status: { DRDY ERR }
[Mon Feb 22 07:16:34 2021] ata3.00: error: { UNC }
[Mon Feb 22 07:16:34 2021] ata3.00: configured for UDMA/133
[Mon Feb 22 07:16:34 2021] sd 2:0:0:0: [sda] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Mon Feb 22 07:16:34 2021] sd 2:0:0:0: [sda] tag#7 Sense Key : Medium Error [current] 
[Mon Feb 22 07:16:34 2021] sd 2:0:0:0: [sda] tag#7 Add. Sense: Unrecovered read error - auto reallocate failed
[Mon Feb 22 07:16:34 2021] sd 2:0:0:0: [sda] tag#7 CDB: Read(10) 28 00 00 00 08 08 00 00 08 00
[Mon Feb 22 07:16:34 2021] blk_update_request: I/O error, dev sda, sector 2057
[Mon Feb 22 07:16:34 2021] Buffer I/O error on dev sda1, logical block 1, async page read
[Mon Feb 22 07:16:34 2021] ata3: EH complete

The disk needs to be replaced, although I am not sure whether the host is out of warranty (OOW) :)
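
To make the log easier to read, here is a minimal sketch (not part of the original report) that decodes the Read(10) CDB printed by the kernel and checks that the failing sector falls inside the requested range. The helper name `decode_read10` is hypothetical, and the sketch assumes 512-byte logical sectors, which matches the 8-block read being reported as "dma 4096".

```python
from typing import Tuple


def decode_read10(cdb_hex: str) -> Tuple[int, int]:
    """Return (start_lba, block_count) for a SCSI READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between byte pairs is ignored
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    # Bytes 2..5 hold the big-endian logical block address.
    start_lba = int.from_bytes(cdb[2:6], "big")
    # Bytes 7..8 hold the big-endian transfer length in blocks.
    block_count = int.from_bytes(cdb[7:9], "big")
    return start_lba, block_count


# CDB copied from the dmesg output above.
start, count = decode_read10("28 00 00 00 08 08 00 00 08 00")

# Sector reported by blk_update_request in the same log.
failed_sector = 2057
in_range = start <= failed_sector < start + count

print(f"read of {count} sectors starting at LBA {start}")
print(f"failed sector {failed_sector} inside request: {in_range}")
```

The read covers sectors 2056 to 2063, and the drive reports the unrecoverable sector as 2057 (the `08:09:08:00` LBA bytes in the `res` line). If sda1 starts at sector 2048, which is a common default but is not shown in the log, that range corresponds to 4 KiB logical block 1 of sda1, matching the "Buffer I/O error" line.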

Event Timeline

elukey created this task. · Mon, Feb 22, 7:25 AM
Restricted Application added a subscriber: Aklapper. · Mon, Feb 22, 7:25 AM

Mentioned in SAL (#wikimedia-operations) [2021-02-22T08:39:04Z] <gehel> depool elastic2045 and ban from clusters - T275345

Gehel added a subscriber: Papaul. · Mon, Feb 22, 8:43 AM

@Papaul this server is depooled and banned from the cluster. Can you replace sda? This should still be under warranty.

@RKemper: once the new disk is in place, can you reimage and un-ban?

MoritzMuehlenhoff triaged this task as Medium priority. · Mon, Feb 22, 8:45 AM
Papaul claimed this task. · Mon, Feb 22, 1:41 PM
Papaul moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board.
RKemper added a comment. · Wed, Feb 24, 3:25 AM

@Gehel Yup, I can get elastic2045 re-imaged and unbanned once we get sda replaced.

Papaul reassigned this task from Papaul to RKemper. · Thu, Feb 25, 6:59 PM

@Gehel @RKemper disk replaced. Please resolve task when re-image is done.

Thanks

Papaul closed this task as Resolved. · Thu, Feb 25, 9:24 PM

Script wmf-auto-reimage was launched by ryankemper on cumin2001.codfw.wmnet for hosts:

elastic2045.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102260504_ryankemper_27160_elastic2045_codfw_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2021-02-26T05:07:51Z] <ryankemper> T275345 sudo -i wmf-auto-reimage-host --conftool -p T275345 elastic2045.codfw.wmnet on ryankemper@cumin2001 tmux session elastic_reimage_elastic1065

Side note: I just noticed that I named the tmux session after elastic1065. Fortunately, as can be seen above, we're reimaging the proper host, elastic2045 :P

Completed auto-reimage of hosts:

['elastic2045.codfw.wmnet']

and were ALL successful.