elastic2054 unresponsive
Closed, Resolved (Public)

Description

elastic2054 is unresponsive, most icinga checks are in error (but not all). The system is up, but login via serial console is taking forever.

Note that this node had memory issues before: T227298

Event Timeline

Gehel triaged this task as High priority. (Thu, Feb 11, 3:38 PM)
Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.

A bunch of I/O errors in the kernel log:

Feb 11 15:42:43 elastic2054 kernel: [    7.256609] md: bind<sdb1>
Feb 11 15:42:43 elastic2054 kernel: [    7.273796] ata3.00: exception Emask 0x0 SAct 0x2400 SErr 0x0 action 0x0
Feb 11 15:42:43 elastic2054 kernel: [    7.280494] ata3.00: irq_stat 0x40000009
Feb 11 15:42:43 elastic2054 kernel: [    7.284427] ata3.00: failed command: READ FPDMA QUEUED
Feb 11 15:42:43 elastic2054 kernel: [    7.289565] ata3.00: cmd 60/08:50:08:08:00/00:00:00:00:00/40 tag 10 ncq dma 4096 in
Feb 11 15:42:43 elastic2054 kernel: [    7.289565]          res 51/40:08:08:08:00/00:00:00:00:00/40 Emask 0x409 (media error) <F>
Feb 11 15:42:43 elastic2054 kernel: [    7.305434] ata3.00: status: { DRDY ERR }
Feb 11 15:42:43 elastic2054 kernel: [    7.309436] ata3.00: error: { UNC }
Feb 11 15:42:43 elastic2054 kernel: [    7.313487] ata3.00: configured for UDMA/133
Feb 11 15:42:43 elastic2054 kernel: [    7.317759] sd 2:0:0:0: [sda] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 11 15:42:43 elastic2054 kernel: [    7.326091] sd 2:0:0:0: [sda] tag#10 Sense Key : Medium Error [current] 
Feb 11 15:42:43 elastic2054 kernel: [    7.332788] sd 2:0:0:0: [sda] tag#10 Add. Sense: Unrecovered read error - auto reallocate failed
Feb 11 15:42:43 elastic2054 kernel: [    7.341548] sd 2:0:0:0: [sda] tag#10 CDB: Read(10) 28 00 00 00 08 08 00 00 08 00
Feb 11 15:42:43 elastic2054 kernel: [    7.348921] blk_update_request: I/O error, dev sda, sector 2056
Feb 11 15:42:43 elastic2054 kernel: [    7.354833] ata3: EH complete
Feb 11 15:42:43 elastic2054 kernel: [    7.381346] ata3.00: exception Emask 0x0 SAct 0xa0000 SErr 0x0 action 0x0
Feb 11 15:42:43 elastic2054 kernel: [    7.388129] ata3.00: irq_stat 0x40000008
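
For anyone triaging a similar log, here is a minimal Python sketch that pulls the failing device and sector out of `blk_update_request` lines and decodes the `Read(10)` CDB to cross-check them. The function names `failing_sectors` and `decode_read10` are hypothetical helpers written for illustration, not part of any WMF tooling, and the regex only assumes the line format shown in the log above.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   blk_update_request: I/O error, dev sda, sector 2056
IO_ERROR = re.compile(r"blk_update_request: I/O error, dev (\w+), sector (\d+)")


def failing_sectors(lines):
    """Map device name -> sorted list of sectors reported in I/O errors."""
    found = defaultdict(set)
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            found[match.group(1)].add(int(match.group(2)))
    return {dev: sorted(sectors) for dev, sectors in found.items()}


def decode_read10(cdb_hex):
    """Return (lba, block_count) from a SCSI Read(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a Read(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, blocks


sample = [
    "Feb 11 15:42:43 elastic2054 kernel: [    7.348921] "
    "blk_update_request: I/O error, dev sda, sector 2056",
    "Feb 11 15:42:43 elastic2054 kernel: [    7.354833] ata3: EH complete",
]
print(failing_sectors(sample))                         # {'sda': [2056]}
print(decode_read10("28 00 00 00 08 08 00 00 08 00"))  # (2056, 8)
```

The decoded CDB agrees with the rest of the log: the read starts at LBA 2056, the sector named in the I/O error, and covers 8 blocks, which at 512 bytes each is the 4096-byte transfer shown in the `ncq dma 4096 in` line.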

Mentioned in SAL (#wikimedia-operations) [2021-02-11T15:50:49Z] <gehel> ban elastic2054 from shard allocation - T274555

Gehel added a project: ops-codfw.
Gehel added a subscriber: Papaul.

@Papaul: it looks like sda is failing, as confirmed by T274556. The server is depooled and banned from the cluster. Could you do your magic to find a new SSD?

Thanks!

@Gehel the server is under warranty, I can request a replacement disk for sda.

Yes, please request that new disk!

Create Dispatch: Success
You have successfully submitted request SR1051498857.

@Gehel disk replaced. Please resolve the task once the reimage is complete.

Thanks

Script wmf-auto-reimage was launched by ryankemper on cumin2001.codfw.wmnet for hosts:

elastic2054.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202103030529_ryankemper_12406_elastic2054_codfw_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2021-03-03T05:31:30Z] <ryankemper> T274555 sudo -i wmf-auto-reimage-host --conftool -p T274555 elastic2054.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2021-03-03T05:32:36Z] <ryankemper> T274555 sudo -i wmf-auto-reimage-host --conftool -p T274555 elastic2054.codfw.wmnet on ryankemper@cumin2001 tmux session elastic_reimage_elastic2054

Completed auto-reimage of hosts:

['elastic2054.codfw.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2021-03-03T06:15:07Z] <ryankemper> T274555 Removed downtime for elastic2054

Mentioned in SAL (#wikimedia-operations) [2021-03-03T06:17:13Z] <ryankemper> T275345 T274555 Unbanning elastic2045 and elastic2054 from our cluster now that both hosts have been re-imaged and are running without errors (commands follow)

Mentioned in SAL (#wikimedia-operations) [2021-03-03T06:18:19Z] <ryankemper> T275345 T274555 curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_name": null,"_ip": null}}}' => {"acknowledged":true,"persistent":{},"transient":{}}

Mentioned in SAL (#wikimedia-operations) [2021-03-03T06:20:27Z] <ryankemper> T275345 T274555 curl -H 'Content-Type: application/json' -XPUT http://localhost:9400/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_name": null,"_ip": null}}}' => {"acknowledged":true,"persistent":{},"transient":{}}
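
The two unban calls above differ only in the port, so they could be scripted. Below is a rough Python sketch that builds the same PUT request using only the standard library. The payload and the ports 9200 and 9400 are taken from the SAL entries; the helper names `build_unban_request` and `send` are hypothetical, and the sketch assumes it runs on a host where the cluster answers on localhost.

```python
import json
import urllib.request

# Same body as the curl commands: setting the exclusion keys to null
# clears any transient allocation ban by node name or IP.
UNBAN_PAYLOAD = {
    "transient": {
        "cluster.routing.allocation.exclude": {"_name": None, "_ip": None}
    }
}


def build_unban_request(port, host="localhost"):
    """Build (but do not send) the PUT to _cluster/settings for one cluster."""
    return urllib.request.Request(
        url=f"http://{host}:{port}/_cluster/settings",
        data=json.dumps(UNBAN_PAYLOAD).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def send(request, timeout=10):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the requests has no side effects; call send() explicitly to apply.
requests_to_send = [build_unban_request(port) for port in (9200, 9400)]
for req in requests_to_send:
    print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes it possible to inspect the exact URL and body before touching the cluster.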

Mentioned in SAL (#wikimedia-operations) [2021-03-03T06:21:13Z] <ryankemper> T275345 T274555 Re-pooling elastic2045 and elastic2054 (commands follow)

Mentioned in SAL (#wikimedia-operations) [2021-03-03T06:26:54Z] <ryankemper> T275345 T274555 sudo confctl select 'name=elastic2045.codfw.wmnet' set/pooled=yes on ryankemper@puppetmaster1001

Mentioned in SAL (#wikimedia-operations) [2021-03-03T06:27:02Z] <ryankemper> T275345 T274555 sudo confctl select 'name=elastic2054.codfw.wmnet' set/pooled=yes on ryankemper@puppetmaster1001