
Medium error reported for sda on elastic2045
Closed, Resolved, Public

Description

Hi!

I just rebooted elastic2045 to address some issues, and when it came back up I noticed the following in dmesg:

[Mon Feb 22 07:16:33 2021] ata3.00: status: { DRDY }
[Mon Feb 22 07:16:33 2021] ata3: hard resetting link
[Mon Feb 22 07:16:34 2021] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Mon Feb 22 07:16:34 2021] ata3.00: configured for UDMA/133
[Mon Feb 22 07:16:34 2021] ata3: EH complete
[Mon Feb 22 07:16:34 2021] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[Mon Feb 22 07:16:34 2021] ata3.00: irq_stat 0x40000001
[Mon Feb 22 07:16:34 2021] ata3.00: failed command: READ DMA
[Mon Feb 22 07:16:34 2021] ata3.00: cmd c8/00:08:08:08:00/00:00:00:00:00/e0 tag 3 dma 4096 in
                                    res 51/40:08:09:08:00/00:00:00:00:00/e0 Emask 0x9 (media error)
[Mon Feb 22 07:16:34 2021] ata3.00: status: { DRDY ERR }
[Mon Feb 22 07:16:34 2021] ata3.00: error: { UNC }
[Mon Feb 22 07:16:34 2021] ata3.00: configured for UDMA/133
[Mon Feb 22 07:16:34 2021] sd 2:0:0:0: [sda] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Mon Feb 22 07:16:34 2021] sd 2:0:0:0: [sda] tag#3 Sense Key : Medium Error [current] 
[Mon Feb 22 07:16:34 2021] sd 2:0:0:0: [sda] tag#3 Add. Sense: Unrecovered read error - auto reallocate failed
[Mon Feb 22 07:16:34 2021] sd 2:0:0:0: [sda] tag#3 CDB: Read(10) 28 00 00 00 08 08 00 00 08 00
[Mon Feb 22 07:16:34 2021] blk_update_request: I/O error, dev sda, sector 2057
[Mon Feb 22 07:16:34 2021] ata3: EH complete
[Mon Feb 22 07:16:34 2021] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[Mon Feb 22 07:16:34 2021] ata3.00: irq_stat 0x40000001
[Mon Feb 22 07:16:34 2021] ata3.00: failed command: READ DMA
[Mon Feb 22 07:16:34 2021] ata3.00: cmd c8/00:08:08:08:00/00:00:00:00:00/e0 tag 7 dma 4096 in
                                    res 51/40:08:09:08:00/00:00:00:00:00/e0 Emask 0x9 (media error)
[Mon Feb 22 07:16:34 2021] ata3.00: status: { DRDY ERR }
[Mon Feb 22 07:16:34 2021] ata3.00: error: { UNC }
[Mon Feb 22 07:16:34 2021] ata3.00: configured for UDMA/133
[Mon Feb 22 07:16:34 2021] sd 2:0:0:0: [sda] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Mon Feb 22 07:16:34 2021] sd 2:0:0:0: [sda] tag#7 Sense Key : Medium Error [current] 
[Mon Feb 22 07:16:34 2021] sd 2:0:0:0: [sda] tag#7 Add. Sense: Unrecovered read error - auto reallocate failed
[Mon Feb 22 07:16:34 2021] sd 2:0:0:0: [sda] tag#7 CDB: Read(10) 28 00 00 00 08 08 00 00 08 00
[Mon Feb 22 07:16:34 2021] blk_update_request: I/O error, dev sda, sector 2057
[Mon Feb 22 07:16:34 2021] Buffer I/O error on dev sda1, logical block 1, async page read
[Mon Feb 22 07:16:34 2021] ata3: EH complete
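
For triage, the repeated block-layer errors in a dump like the one above can be tallied per device and sector. This is only a minimal sketch: the function name `failing_sectors` and the regular expression are illustrative assumptions, and the sample text is just the two `blk_update_request` lines copied from the log.

```python
import re
from collections import Counter

# Matches lines such as:
#   blk_update_request: I/O error, dev sda, sector 2057
IO_ERROR = re.compile(r"blk_update_request: I/O error, dev (\w+), sector (\d+)")


def failing_sectors(dmesg_text):
    """Count I/O errors per (device, sector) pair in dmesg output."""
    hits = Counter()
    for line in dmesg_text.splitlines():
        match = IO_ERROR.search(line)
        if match:
            hits[(match.group(1), int(match.group(2)))] += 1
    return hits


# The two block-layer error lines from the log above.
sample = (
    "[Mon Feb 22 07:16:34 2021] blk_update_request: I/O error, dev sda, sector 2057\n"
    "[Mon Feb 22 07:16:34 2021] ata3: EH complete\n"
    "[Mon Feb 22 07:16:34 2021] blk_update_request: I/O error, dev sda, sector 2057\n"
)

for (device, sector), count in failing_sectors(sample).items():
    print(f"{device} sector {sector}: {count} error(s)")
```

The same sector failing on every retry, as here, points at a bad spot on the medium rather than a transient link problem.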

The disk needs to be replaced, although I am not sure whether the host is out of warranty (OOW) or not :)
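
The numbers in the log are consistent with each other, which a short calculation can show. The sketch below decodes the Read(10) CDB from the log and maps the failing sector to a block on sda1. Two things are assumptions not stated in this task: that sda1 starts at sector 2048 (a common default) and that sectors are 512 bytes. The 4096-byte block size is taken from the "dma 4096" transfer in the log. The helper names are hypothetical.

```python
from typing import Tuple


def decode_read10(cdb_hex: str) -> Tuple[int, int]:
    """Return (start_lba, sector_count) from a SCSI Read(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a Read(10) CDB")
    start_lba = int.from_bytes(cdb[2:6], "big")  # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")      # bytes 7-8: transfer length in sectors
    return start_lba, count


def logical_block(sector: int, partition_start: int,
                  sector_size: int = 512, block_size: int = 4096) -> int:
    """Map an absolute disk sector to a block index within a partition."""
    return (sector - partition_start) * sector_size // block_size


start, count = decode_read10("28 00 00 00 08 08 00 00 08 00")
print(start, count)  # request covers sectors start .. start + count - 1

# ASSUMPTION: sda1 begins at sector 2048; this is not confirmed anywhere in the task.
print(logical_block(2057, partition_start=2048))
```

Under those assumptions the read covers sectors 2056 to 2063, the reported bad sector 2057 falls inside it, and it maps to logical block 1 of sda1, matching the "Buffer I/O error" line.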

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2021-02-22T08:39:04Z] <gehel> depool elastic2045 and ban from clusters - T275345

@Papaul this server is depooled and banned from the cluster. Can you replace sda? This should still be under warranty.

@RKemper: once the new disk is in place, can you reimage and un-ban?

@Gehel Yup, I can get elastic2045 re-imaged and unbanned once sda is replaced.

@Gehel @RKemper disk replaced. Please resolve task when re-image is done.

Thanks

Script wmf-auto-reimage was launched by ryankemper on cumin2001.codfw.wmnet for hosts:

elastic2045.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102260504_ryankemper_27160_elastic2045_codfw_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2021-02-26T05:07:51Z] <ryankemper> T275345 sudo -i wmf-auto-reimage-host --conftool -p T275345 elastic2045.codfw.wmnet on ryankemper@cumin2001 tmux session elastic_reimage_elastic1065

Side note: I just noticed I named the tmux session elastic1065. Fortunately, as can be seen above, we're reimaging the proper host, elastic2045 :P

Completed auto-reimage of hosts:

['elastic2045.codfw.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2021-03-03T05:16:44Z] <ryankemper> T275345 ryankemper@elastic2045:~$ sudo apt-get upgrade wmf-elasticsearch-search-plugins

Mentioned in SAL (#wikimedia-operations) [2021-03-03T06:17:13Z] <ryankemper> T275345 T274555 Unbanning elastic2045 and elastic2054 from our cluster now that both hosts have been re-imaged and are running without errors (commands follow)

Mentioned in SAL (#wikimedia-operations) [2021-03-03T06:18:19Z] <ryankemper> T275345 T274555 curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_name": null,"_ip": null}}}' => {"acknowledged":true,"persistent":{},"transient":{}}

Mentioned in SAL (#wikimedia-operations) [2021-03-03T06:20:27Z] <ryankemper> T275345 T274555 curl -H 'Content-Type: application/json' -XPUT http://localhost:9400/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_name": null,"_ip": null}}}' => {"acknowledged":true,"persistent":{},"transient":{}}
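
The two curl commands above send the same body to ports 9200 and 9400. As a rough sketch of what they do, the snippet below builds equivalent PUT requests with the Python standard library without sending them. The function name `build_unban_request` is hypothetical. Setting a transient setting to null resets it, which would explain the empty "transient" object in the responses.

```python
import json
import urllib.request

# Same body as in the curl commands: null clears the name and IP exclusion lists.
UNBAN_BODY = {
    "transient": {
        "cluster.routing.allocation.exclude": {"_name": None, "_ip": None}
    }
}


def build_unban_request(port: int) -> urllib.request.Request:
    """Build (but do not send) the PUT request that clears allocation exclusions."""
    return urllib.request.Request(
        url=f"http://localhost:{port}/_cluster/settings",
        data=json.dumps(UNBAN_BODY).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# The ports used in the log entries above.
requests = [build_unban_request(port) for port in (9200, 9400)]
for req in requests:
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Passing a request to `urllib.request.urlopen` would actually send it, so that step is left out here on purpose.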

Mentioned in SAL (#wikimedia-operations) [2021-03-03T06:21:13Z] <ryankemper> T275345 T274555 Re-pooling elastic2045 and elastic2054 (commands follow)

Mentioned in SAL (#wikimedia-operations) [2021-03-03T06:26:54Z] <ryankemper> T275345 T274555 sudo confctl select 'name=elastic2045.codfw.wmnet' set/pooled=yes on ryankemper@puppetmaster1001

Mentioned in SAL (#wikimedia-operations) [2021-03-03T06:27:02Z] <ryankemper> T275345 T274555 sudo confctl select 'name=elastic2054.codfw.wmnet' set/pooled=yes on ryankemper@puppetmaster1001