Change Details

[x] Provide FQDN of system.
[x] If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
[x] Put system into a failed state in Netbox.
[x] Provide urgency of request, along with justification (redundancy, dependencies, etc)
[x] Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
[x] Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

----

restbase1030.eqiad.wmnet has a failing SSD (see dmesg output below). The errors are causing one of the three Cassandra instances/nodes hosted there to exit with SIGSEGV. So long as the other two instances continue to function, we're better off leaving them in service than shutting the host down until repairs can be made. Errors are accumulating, however, and the longer the host stays in a degraded state like this, the greater the possibility of inconsistency (so urgency is medium(ish)?).

The SSD can be replaced at any time, but all three instances should be shut down and prevented from restarting before doing so. I am happy to coordinate (and drop everything to do so), or someone else can do this with the following:

```lang=sh-session
$ sudo rm /etc/cassandra-{a,b,c}/service-enabled
$ sudo systemctl stop cassandra-a
$ sudo systemctl stop cassandra-b
$ sudo systemctl stop cassandra-c
```

Once the SSD is replaced, I can take it from there.

NOTE: Normally we'd decommission the entire host and wait with minimal urgency for a repair, but [[ https://phabricator.wikimedia.org/T342148 | we are over-capacity ]] and no longer able to do so. 🙁

----

#### dmesg output

```
[ 2042.187996] ata5.00: exception Emask 0x0 SAct 0xf880440f SErr 0x0 action 0x0
[ 2042.195050] ata5.00: irq_stat 0x40000008
[ 2042.198995] ata5.00: failed command: READ FPDMA QUEUED
[ 2042.204147] ata5.00: cmd 60/00:b8:a8:72:c1/01:00:9b:00:00/40 tag 23 ncq dma 131072 in
                        res 51/40:00:a8:72:c1/00:01:9b:00:00/40 Emask 0x409 (media error) <F>
[ 2042.220222] ata5.00: status: { DRDY ERR }
[ 2042.224240] ata5.00: error: { UNC }
[ 2042.228286] ata5.00: configured for UDMA/133
[ 2042.228360] sd 4:0:0:0: [sdc] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 2042.228365] sd 4:0:0:0: [sdc] tag#23 Sense Key : Medium Error [current]
[ 2042.228371] sd 4:0:0:0: [sdc] tag#23 Add. Sense: Unrecovered read error - auto reallocate failed
[ 2042.228376] sd 4:0:0:0: [sdc] tag#23 CDB: Read(10) 28 00 9b c1 72 a8 00 01 00 00
[ 2042.228379] print_req_error: I/O error, dev sdc, sector 2613146280
[ 2042.234633] ata5: EH complete
[ 2042.363986] ata5.00: exception Emask 0x0 SAct 0x1202fff8 SErr 0x0 action 0x0
[ 2042.371045] ata5.00: irq_stat 0x40000008
[ 2042.374992] ata5.00: failed command: READ FPDMA QUEUED
[ 2042.380151] ata5.00: cmd 60/08:e0:60:73:c1/00:00:9b:00:00/40 tag 28 ncq dma 4096 in
                        res 51/40:08:60:73:c1/00:00:9b:00:00/40 Emask 0x409 (media error) <F>
[ 2042.396050] ata5.00: status: { DRDY ERR }
[ 2042.400068] ata5.00: error: { UNC }
[ 2042.404049] ata5.00: configured for UDMA/133
[ 2042.404078] sd 4:0:0:0: [sdc] tag#28 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 2042.404080] sd 4:0:0:0: [sdc] tag#28 Sense Key : Medium Error [current]
[ 2042.404082] sd 4:0:0:0: [sdc] tag#28 Add. Sense: Unrecovered read error - auto reallocate failed
[ 2042.404084] sd 4:0:0:0: [sdc] tag#28 CDB: Read(10) 28 00 9b c1 73 60 00 00 08 00
[ 2042.404086] print_req_error: I/O error, dev sdc, sector 2613146464
[ 2042.410275] ata5: EH complete
```

See also: {T344210}
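
Since the description notes that errors are accumulating, it may help to list the distinct failing sectors over time. The following is only a rough sketch, assuming the kernel keeps logging `print_req_error: I/O error, dev sdc, sector N` lines in the format shown in the dmesg output; the function name `failed_sectors` is hypothetical and not part of any existing tooling.

```lang=sh
#!/bin/sh
# failed_sectors DEVICE
# Hypothetical helper: reads kernel log lines on stdin and prints each
# distinct sector number reported in an "I/O error" line for DEVICE,
# one per line, in ascending order. Assumes the print_req_error format
# seen in the dmesg output of this task.
failed_sectors() {
    dev="$1"
    sed -n "s/.*I\/O error, dev ${dev}, sector \([0-9][0-9]*\).*/\1/p" | sort -n -u
}

# Example invocation on the affected host (not run here):
#   sudo dmesg | failed_sectors sdc
```

Comparing the output of two runs taken some time apart would give a crude indication of whether new sectors are failing. It does not replace the drive's own SMART data.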
