- - Provide FQDN of system.
- - If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
- - Put system into a failed state in Netbox.
- - Provide urgency of request, along with justification (redundancy, dependencies, etc.).
- - Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
- - Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.
On 2025-07-12, the drive sdj in cloudcephosd1013 failed. It no longer appears in lsblk. The server is due for replacement in a few months, so we can leave it running with one fewer drive until it is decommissioned. I would still pull the failed drive to make sure it doesn't trigger additional errors.
Below are some logs recorded at the moment of failure:
Jul 11 14:12:32 cloudcephosd1013 kernel: INFO: task md2_raid1:668 blocked for more than 120 seconds.
Jul 11 14:12:32 cloudcephosd1013 kernel: Not tainted 5.10.0-35-amd64 #1 Debian 5.10.237-1
Jul 11 14:12:32 cloudcephosd1013 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[...]
Jul 11 14:12:32 cloudcephosd1013 kernel: megaraid_sas 0000:18:00.0: pending commands remain after waiting, will reset adapter scsi0.
Jul 11 14:12:32 cloudcephosd1013 kernel: blk_update_request: I/O error, dev sdj, sector 427105432 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
Jul 11 14:12:32 cloudcephosd1013 kernel: scsi 0:0:8:0: rejecting I/O to dead device
[...]
Jul 11 14:12:32 cloudcephosd1013 kernel: Buffer I/O error on dev dm-4, logical block 226257573, async page read
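In case it helps with triage, here is a rough sketch of how the affected devices could be pulled out of a kernel log excerpt like the one above. The function name `count_io_errors` and the two regular expressions are my own assumptions. They only cover the two message shapes seen in this excerpt (`blk_update_request: I/O error` and `Buffer I/O error`), so other kernel versions or drivers may need different patterns.

```python
import re
from collections import Counter

# Assumed patterns, matching only the two error shapes in the excerpt above.
BLK_ERROR = re.compile(r"blk_update_request: I/O error, dev (\w+), sector (\d+)")
BUF_ERROR = re.compile(r"Buffer I/O error on dev ([\w-]+), logical block (\d+)")


def count_io_errors(lines):
    """Return a Counter mapping device name to the number of I/O error lines."""
    counts = Counter()
    for line in lines:
        for pattern in (BLK_ERROR, BUF_ERROR):
            match = pattern.search(line)
            if match:
                counts[match.group(1)] += 1
                break  # count each log line at most once
    return counts


# Sample input copied from the log excerpt in this task.
sample = [
    "Jul 11 14:12:32 cloudcephosd1013 kernel: megaraid_sas 0000:18:00.0: "
    "pending commands remain after waiting, will reset adapter scsi0.",
    "Jul 11 14:12:32 cloudcephosd1013 kernel: blk_update_request: I/O error, "
    "dev sdj, sector 427105432 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0",
    "Jul 11 14:12:32 cloudcephosd1013 kernel: scsi 0:0:8:0: rejecting I/O to dead device",
    "Jul 11 14:12:32 cloudcephosd1013 kernel: Buffer I/O error on dev dm-4, "
    "logical block 226257573, async page read",
]

if __name__ == "__main__":
    for device, errors in count_io_errors(sample).items():
        print(device, errors)
```

The same function could be fed the full kernel log for the host (for example, saved to a file and read line by line) to see whether any device other than sdj and the dm-4 mapping reported errors.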