- - Provide FQDN of system.
cloudcephosd1017.eqiad.wmnet
- - If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
- - Put system into a failed state in Netbox.
- - Provide urgency of request, along with justification (redundancy, dependencies, etc)
Not very urgent, we have replication and no degradation is happening.
- - Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
root@cloudcephosd1017:~# sudo ipmi-sel
ID | Date | Time | Name | Type | Event
1 | Feb-25-2021 | 10:23:29 | SEL | Event Logging Disabled | Log Area Reset/Cleared
- - Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.
On the 1st of March, the drive (sdg in the kernel log below) started failing and became unusable:
root@cloudcephosd1017:~# dmesg -T
...
[Fri Mar 1 20:56:26 2024] INFO: task bstore_kv_sync:28671 blocked for more than 120 seconds.
[Fri Mar 1 20:56:26 2024] Not tainted 4.19.0-22-amd64 #1 Debian 4.19.260-1
[Fri Mar 1 20:56:26 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Mar 1 20:56:26 2024] bstore_kv_sync D 0 28671 1 0x00000320
[Fri Mar 1 20:56:26 2024] Call Trace:
[Fri Mar 1 20:56:26 2024] __schedule+0x29f/0x840
[Fri Mar 1 20:56:26 2024] schedule+0x28/0x80
[Fri Mar 1 20:56:26 2024] io_schedule+0x12/0x40
[Fri Mar 1 20:56:26 2024] wait_on_page_bit_common+0xfd/0x180
[Fri Mar 1 20:56:26 2024] ? page_cache_tree_insert+0xe0/0xe0
[Fri Mar 1 20:56:26 2024] __filemap_fdatawait_range+0xf3/0x150
[Fri Mar 1 20:56:26 2024] ? __filemap_fdatawrite_range+0xdd/0x110
[Fri Mar 1 20:56:26 2024] file_fdatawait_range+0x15/0x20
[Fri Mar 1 20:56:26 2024] ksys_sync_file_range+0x106/0x130
[Fri Mar 1 20:56:26 2024] __x64_sys_sync_file_range+0x1a/0x20
[Fri Mar 1 20:56:26 2024] do_syscall_64+0x53/0x110
[Fri Mar 1 20:56:26 2024] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Fri Mar 1 20:56:26 2024] RIP: 0033:0x7fe861493f6f
[Fri Mar 1 20:56:26 2024] Code: Bad RIP value.
[Fri Mar 1 20:56:26 2024] RSP: 002b:00007fe850b8e0f0 EFLAGS: 00000293 ORIG_RAX: 0000000000000115
[Fri Mar 1 20:56:26 2024] RAX: ffffffffffffffda RBX: 000000000000002b RCX: 00007fe861493f6f
[Fri Mar 1 20:56:26 2024] RDX: 0000000000002000 RSI: 000000d76cba3000 RDI: 000000000000002b
[Fri Mar 1 20:56:26 2024] RBP: 000000d76cba3000 R08: 0000000000000000 R09: 0000000000000000
[Fri Mar 1 20:56:26 2024] R10: 0000000000000007 R11: 0000000000000293 R12: 0000000000002000
[Fri Mar 1 20:56:26 2024] R13: 0000000000000007 R14: 0000000000000000 R15: 0000560f4b7dc380
[Fri Mar 1 20:56:50 2024] sd 0:0:6:0: [sdg] tag#729 task abort called for scmd(00000000811e2acb)
[Fri Mar 1 20:56:50 2024] sd 0:0:6:0: [sdg] tag#729 CDB: Write(10) 2a 00 0a 7c e8 78 00 00 08 00
[Fri Mar 1 20:56:50 2024] sd 0:0:6:0: task abort: FAILED scmd(00000000811e2acb)
[Fri Mar 1 20:56:50 2024] sd 0:0:6:0: [sdg] tag#728 task abort called for scmd(00000000eddfe714)
[Fri Mar 1 20:56:50 2024] sd 0:0:6:0: [sdg] tag#728 CDB: Write(10) 2a 00 0a 7c e8 50 00 00 08 00
[Fri Mar 1 20:56:50 2024] sd 0:0:6:0: task abort: FAILED scmd(00000000eddfe714)
[Fri Mar 1 20:56:50 2024] sd 0:0:6:0: [sdg] tag#726 task abort called for scmd(000000003bc16d05)
[Fri Mar 1 20:56:50 2024] sd 0:0:6:0: [sdg] tag#726 CDB: Write(10) 2a 00 0a 7c e7 78 00 00 18 00
[Fri Mar 1 20:56:50 2024] sd 0:0:6:0: task abort: FAILED scmd(000000003bc16d05)
[Fri Mar 1 20:56:50 2024] sd 0:0:6:0: [sdg] tag#725 task abort called for scmd(00000000a34213af)
[Fri Mar 1 20:56:50 2024] sd 0:0:6:0: [sdg] tag#725 CDB: Write(10) 2a 00 0a 7c e7 28 00 00 08 00
[Fri Mar 1 20:56:50 2024] sd 0:0:6:0: task abort: FAILED scmd(00000000a34213af)
...
[Mon Mar 4 09:18:09 2024] Buffer I/O error on dev dm-7, logical block 468842480, async page read
[Mon Mar 4 09:47:36 2024] Buffer I/O error on dev dm-7, logical block 468842480, async page read
[Mon Mar 4 10:18:17 2024] Buffer I/O error on dev dm-7, logical block 468842480, async page read
[Mon Mar 4 10:48:09 2024] Buffer I/O error on dev dm-7, logical block 468842480, async page read
[Mon Mar 4 11:17:25 2024] Buffer I/O error on dev dm-7, logical block 468842480, async page read
[Mon Mar 4 11:47:42 2024] Buffer I/O error on dev dm-7, logical block 468842480, async page read
[Mon Mar 4 12:18:03 2024] Buffer I/O error on dev dm-7, logical block 468842480, async page read
We will probably have to replace it. I'm running some checks first to gather logs and try a couple of things.
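For the log-gathering step, a rough sketch of how the kernel messages could be summarized per device. The function name `summarize` and both regular expressions are my own, written only against the two message shapes quoted in this task (SCSI "task abort called" and "Buffer I/O error"); other kernel versions may word these differently.

```python
import re
from collections import Counter

# Assumed patterns, derived only from the dmesg lines pasted in this task.
ABORT_RE = re.compile(r"\[(sd[a-z]+)\] tag#\d+ task abort called")
BUFFER_RE = re.compile(r"Buffer I/O error on dev ([^,\s]+), logical block (\d+)")


def summarize(dmesg_text):
    """Count task aborts per disk and buffer I/O errors per (device, block)."""
    aborts = Counter()
    buffer_errors = Counter()
    for line in dmesg_text.splitlines():
        match = ABORT_RE.search(line)
        if match:
            aborts[match.group(1)] += 1
        match = BUFFER_RE.search(line)
        if match:
            key = (match.group(1), int(match.group(2)))
            buffer_errors[key] += 1
    return aborts, buffer_errors
```

It can be fed the saved output of `dmesg -T` as a single string; the two returned counters show which disk is accumulating aborts and whether the buffer errors keep hitting the same logical block, as they appear to here.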