elastic2054 is unresponsive and most Icinga checks are in error (but not all). The system is up, but login via the serial console is taking forever.
Note that this node had memory issues before: T227298
Mentioned in SAL (#wikimedia-operations) [2021-02-11T15:39:53Z] <gehel> powercycle elastic2054 - T274555
Mentioned in SAL (#wikimedia-operations) [2021-02-11T15:46:14Z] <gehel> depooling elastic2054 - T274555
Bunch of IO errors:
Feb 11 15:42:43 elastic2054 kernel: [ 7.256609] md: bind<sdb1>
Feb 11 15:42:43 elastic2054 kernel: [ 7.273796] ata3.00: exception Emask 0x0 SAct 0x2400 SErr 0x0 action 0x0
Feb 11 15:42:43 elastic2054 kernel: [ 7.280494] ata3.00: irq_stat 0x40000009
Feb 11 15:42:43 elastic2054 kernel: [ 7.284427] ata3.00: failed command: READ FPDMA QUEUED
Feb 11 15:42:43 elastic2054 kernel: [ 7.289565] ata3.00: cmd 60/08:50:08:08:00/00:00:00:00:00/40 tag 10 ncq dma 4096 in
Feb 11 15:42:43 elastic2054 kernel: [ 7.289565] res 51/40:08:08:08:00/00:00:00:00:00/40 Emask 0x409 (media error) <F>
Feb 11 15:42:43 elastic2054 kernel: [ 7.305434] ata3.00: status: { DRDY ERR }
Feb 11 15:42:43 elastic2054 kernel: [ 7.309436] ata3.00: error: { UNC }
Feb 11 15:42:43 elastic2054 kernel: [ 7.313487] ata3.00: configured for UDMA/133
Feb 11 15:42:43 elastic2054 kernel: [ 7.317759] sd 2:0:0:0: [sda] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 11 15:42:43 elastic2054 kernel: [ 7.326091] sd 2:0:0:0: [sda] tag#10 Sense Key : Medium Error [current]
Feb 11 15:42:43 elastic2054 kernel: [ 7.332788] sd 2:0:0:0: [sda] tag#10 Add. Sense: Unrecovered read error - auto reallocate failed
Feb 11 15:42:43 elastic2054 kernel: [ 7.341548] sd 2:0:0:0: [sda] tag#10 CDB: Read(10) 28 00 00 00 08 08 00 00 08 00
Feb 11 15:42:43 elastic2054 kernel: [ 7.348921] blk_update_request: I/O error, dev sda, sector 2056
Feb 11 15:42:43 elastic2054 kernel: [ 7.354833] ata3: EH complete
Feb 11 15:42:43 elastic2054 kernel: [ 7.381346] ata3.00: exception Emask 0x0 SAct 0xa0000 SErr 0x0 action 0x0
Feb 11 15:42:43 elastic2054 kernel: [ 7.388129] ata3.00: irq_stat 0x40000008
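To summarize which device and sectors are affected in a log like the one above, something along these lines could work. This is only a rough sketch: the function name `failing_sectors` is made up for illustration, and it only matches the `blk_update_request: I/O error` line format seen in this excerpt.

```python
import re
from collections import defaultdict

# Matches kernel lines such as:
#   blk_update_request: I/O error, dev sda, sector 2056
IO_ERROR_RE = re.compile(r"blk_update_request: I/O error, dev (\w+), sector (\d+)")


def failing_sectors(log_text):
    """Map each device name to the sorted list of distinct sectors with I/O errors.

    Hypothetical helper, not part of any existing tooling.
    """
    sectors = defaultdict(set)
    for match in IO_ERROR_RE.finditer(log_text):
        sectors[match.group(1)].add(int(match.group(2)))
    return {dev: sorted(found) for dev, found in sectors.items()}


# One line copied from the excerpt above.
sample = (
    "Feb 11 15:42:43 elastic2054 kernel: [ 7.348921] "
    "blk_update_request: I/O error, dev sda, sector 2056"
)
print(failing_sectors(sample))
```

Feeding it the full kernel log (for example the output of `journalctl -k`) would give a per-device view of how widespread the read errors are.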
Mentioned in SAL (#wikimedia-operations) [2021-02-11T15:50:49Z] <gehel> ban elastic2054 from shard allocation - T274555
Script wmf-auto-reimage was launched by ryankemper on cumin2001.codfw.wmnet for hosts:
elastic2054.codfw.wmnet
The log can be found in /var/log/wmf-auto-reimage/202103030529_ryankemper_12406_elastic2054_codfw_wmnet.log.
Mentioned in SAL (#wikimedia-operations) [2021-03-03T05:31:30Z] <ryankemper> T274555 sudo -i wmf-auto-reimage-host --conftool -p T274555 elastic2054.codfw.wmnet
Mentioned in SAL (#wikimedia-operations) [2021-03-03T05:32:36Z] <ryankemper> T274555 sudo -i wmf-auto-reimage-host --conftool -p T274555 elastic2054.codfw.wmnet on ryankemper@cumin2001 tmux session elastic_reimage_elastic2054
Completed auto-reimage of hosts:
['elastic2054.codfw.wmnet']
and were ALL successful.
Mentioned in SAL (#wikimedia-operations) [2021-03-03T06:15:07Z] <ryankemper> T274555 Removed downtime for elastic2054
Mentioned in SAL (#wikimedia-operations) [2021-03-03T06:17:13Z] <ryankemper> T275345 T274555 Unbanning elastic2045 and elastic2054 from our cluster now that both hosts have been re-imaged and are running without errors (commands follow)
Mentioned in SAL (#wikimedia-operations) [2021-03-03T06:18:19Z] <ryankemper> T275345 T274555 curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_name": null,"_ip": null}}}' => {"acknowledged":true,"persistent":{},"transient":{}}
Mentioned in SAL (#wikimedia-operations) [2021-03-03T06:20:27Z] <ryankemper> T275345 T274555 curl -H 'Content-Type: application/json' -XPUT http://localhost:9400/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_name": null,"_ip": null}}}' => {"acknowledged":true,"persistent":{},"transient":{}}
Mentioned in SAL (#wikimedia-operations) [2021-03-03T06:21:13Z] <ryankemper> T275345 T274555 Re-pooling elastic2045 and elastic2054 (commands follow)
Mentioned in SAL (#wikimedia-operations) [2021-03-03T06:26:54Z] <ryankemper> T275345 T274555 sudo confctl select 'name=elastic2045.codfw.wmnet' set/pooled=yes on ryankemper@puppetmaster1001
Mentioned in SAL (#wikimedia-operations) [2021-03-03T06:27:02Z] <ryankemper> T275345 T274555 sudo confctl select 'name=elastic2054.codfw.wmnet' set/pooled=yes on ryankemper@puppetmaster1001
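For reference, the two unban curl calls logged above (ports 9200 and 9400) could be expressed as a small Python sketch. The helper names `build_unban_request` and `unban_all` are hypothetical; the URL, header and JSON body mirror the logged commands, and the sketch assumes it runs on a host where those ports answer on localhost.

```python
import json
import urllib.request

# Ports taken from the curl commands in the log.
CLUSTER_PORTS = (9200, 9400)

# Setting both exclusion filters to null clears any transient shard allocation ban.
UNBAN_BODY = {
    "transient": {
        "cluster.routing.allocation.exclude": {"_name": None, "_ip": None}
    }
}


def build_unban_request(port):
    """Build (but do not send) the PUT request that clears allocation exclusions."""
    return urllib.request.Request(
        f"http://localhost:{port}/_cluster/settings",
        data=json.dumps(UNBAN_BODY).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def unban_all(ports=CLUSTER_PORTS):
    """Send the unban request to each port and return the decoded JSON replies."""
    replies = {}
    for port in ports:
        with urllib.request.urlopen(build_unban_request(port), timeout=10) as resp:
            replies[port] = json.loads(resp.read().decode("utf-8"))
    return replies
```

Calling `unban_all()` should return one reply per port; in the log above each cluster answered with `{"acknowledged":true,"persistent":{},"transient":{}}`.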