
hw troubleshooting: /dev/sdg disk not working properly in cloudcephosd1017.eqiad.wmnet
Closed, Resolved · Public

Description

- Provide FQDN of system.

cloudcephosd1017.eqiad.wmnet

- If other than a hard drive issue, please depool the machine (and confirm that it's been depooled) for us to work on it. If not, please provide a time frame for us to take the machine down.
- Put system into a failed state in Netbox.
- Provide urgency of request, along with justification (redundancy, dependencies, etc.)

Not very urgent: we have replication and no degradation is happening (see the health-check sketch after this checklist).

root@cloudcephosd1017:~# sudo ipmi-sel
ID  | Date        | Time     | Name             | Type                        | Event
1   | Feb-25-2021 | 10:23:29 | SEL              | Event Logging Disabled      | Log Area Reset/Cleared
- Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.
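
For reference, the "no degradation" claim above can be double-checked from any Ceph node; a minimal sketch of the kind of health check meant (illustrative commands, output from this cluster not shown):

# Sketch only: confirm the cluster is healthy and data is fully replicated.
ceph status                      # overall health and any ongoing recovery/backfill
ceph health detail               # details on degraded or undersized placement groups, if any
ceph osd tree | grep -i down     # any OSDs currently marked down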

On the 1st of March, the drive started failing and became unusable:

root@cloudcephosd1017:~# dmesg -T
...
[Fri Mar  1 20:56:26 2024] INFO: task bstore_kv_sync:28671 blocked for more than 120 seconds.
[Fri Mar  1 20:56:26 2024]       Not tainted 4.19.0-22-amd64 #1 Debian 4.19.260-1
[Fri Mar  1 20:56:26 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Mar  1 20:56:26 2024] bstore_kv_sync  D    0 28671      1 0x00000320
[Fri Mar  1 20:56:26 2024] Call Trace:
[Fri Mar  1 20:56:26 2024]  __schedule+0x29f/0x840
[Fri Mar  1 20:56:26 2024]  schedule+0x28/0x80
[Fri Mar  1 20:56:26 2024]  io_schedule+0x12/0x40
[Fri Mar  1 20:56:26 2024]  wait_on_page_bit_common+0xfd/0x180
[Fri Mar  1 20:56:26 2024]  ? page_cache_tree_insert+0xe0/0xe0
[Fri Mar  1 20:56:26 2024]  __filemap_fdatawait_range+0xf3/0x150
[Fri Mar  1 20:56:26 2024]  ? __filemap_fdatawrite_range+0xdd/0x110
[Fri Mar  1 20:56:26 2024]  file_fdatawait_range+0x15/0x20
[Fri Mar  1 20:56:26 2024]  ksys_sync_file_range+0x106/0x130
[Fri Mar  1 20:56:26 2024]  __x64_sys_sync_file_range+0x1a/0x20
[Fri Mar  1 20:56:26 2024]  do_syscall_64+0x53/0x110
[Fri Mar  1 20:56:26 2024]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Fri Mar  1 20:56:26 2024] RIP: 0033:0x7fe861493f6f
[Fri Mar  1 20:56:26 2024] Code: Bad RIP value.
[Fri Mar  1 20:56:26 2024] RSP: 002b:00007fe850b8e0f0 EFLAGS: 00000293 ORIG_RAX: 0000000000000115
[Fri Mar  1 20:56:26 2024] RAX: ffffffffffffffda RBX: 000000000000002b RCX: 00007fe861493f6f
[Fri Mar  1 20:56:26 2024] RDX: 0000000000002000 RSI: 000000d76cba3000 RDI: 000000000000002b
[Fri Mar  1 20:56:26 2024] RBP: 000000d76cba3000 R08: 0000000000000000 R09: 0000000000000000
[Fri Mar  1 20:56:26 2024] R10: 0000000000000007 R11: 0000000000000293 R12: 0000000000002000
[Fri Mar  1 20:56:26 2024] R13: 0000000000000007 R14: 0000000000000000 R15: 0000560f4b7dc380
[Fri Mar  1 20:56:50 2024] sd 0:0:6:0: [sdg] tag#729 task abort called for scmd(00000000811e2acb)
[Fri Mar  1 20:56:50 2024] sd 0:0:6:0: [sdg] tag#729 CDB: Write(10) 2a 00 0a 7c e8 78 00 00 08 00
[Fri Mar  1 20:56:50 2024] sd 0:0:6:0: task abort: FAILED scmd(00000000811e2acb)
[Fri Mar  1 20:56:50 2024] sd 0:0:6:0: [sdg] tag#728 task abort called for scmd(00000000eddfe714)
[Fri Mar  1 20:56:50 2024] sd 0:0:6:0: [sdg] tag#728 CDB: Write(10) 2a 00 0a 7c e8 50 00 00 08 00
[Fri Mar  1 20:56:50 2024] sd 0:0:6:0: task abort: FAILED scmd(00000000eddfe714)
[Fri Mar  1 20:56:50 2024] sd 0:0:6:0: [sdg] tag#726 task abort called for scmd(000000003bc16d05)
[Fri Mar  1 20:56:50 2024] sd 0:0:6:0: [sdg] tag#726 CDB: Write(10) 2a 00 0a 7c e7 78 00 00 18 00
[Fri Mar  1 20:56:50 2024] sd 0:0:6:0: task abort: FAILED scmd(000000003bc16d05)
[Fri Mar  1 20:56:50 2024] sd 0:0:6:0: [sdg] tag#725 task abort called for scmd(00000000a34213af)
[Fri Mar  1 20:56:50 2024] sd 0:0:6:0: [sdg] tag#725 CDB: Write(10) 2a 00 0a 7c e7 28 00 00 08 00
[Fri Mar  1 20:56:50 2024] sd 0:0:6:0: task abort: FAILED scmd(00000000a34213af)
...
[Mon Mar  4 09:18:09 2024] Buffer I/O error on dev dm-7, logical block 468842480, async page read
[Mon Mar  4 09:47:36 2024] Buffer I/O error on dev dm-7, logical block 468842480, async page read
[Mon Mar  4 10:18:17 2024] Buffer I/O error on dev dm-7, logical block 468842480, async page read
[Mon Mar  4 10:48:09 2024] Buffer I/O error on dev dm-7, logical block 468842480, async page read
[Mon Mar  4 11:17:25 2024] Buffer I/O error on dev dm-7, logical block 468842480, async page read
[Mon Mar  4 11:47:42 2024] Buffer I/O error on dev dm-7, logical block 468842480, async page read
[Mon Mar  4 12:18:03 2024] Buffer I/O error on dev dm-7, logical block 468842480, async page read

We will probably have to replace it. I am running some checks first to gather logs and try a couple of things.
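
For context, the checks alluded to above would typically look something like this (a sketch, not the exact commands run on this host; the megaraid device ID is an assumption):

# Sketch: gather SMART data and kernel-side errors for the failing drive.
smartctl -a /dev/sdg                  # SMART attributes and error log, if the drive still answers
smartctl -a -d megaraid,6 /dev/sdg    # same, addressed through the RAID controller (device ID 6 is a guess)
dmesg -T | grep -i 'sd 0:0:6:0'       # collect the kernel I/O errors for that SCSI target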

Event Timeline

Mentioned in SAL (#wikimedia-cloud-feed) [2024-03-04T12:48:43Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.reboot_node (T359049)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-03-04T12:54:43Z] <wmbot~dcaro@urcuchillay> END (PASS) - Cookbook wmcs.ceph.reboot_node (exit_code=0) (T359049)

After rebooting, the hard drive came back online; I will try to repartition it and see if it keeps failing.

Wait, no: the hard drive did not show up again (just the sdX letters got reshuffled).

The drive is not appearing at all anymore.

The last log in dmesg is:

[Mon Mar  4 12:53:17 2024] megaraid_sas 0000:18:00.0: 1636 (762871997s/0x0004/CRIT) - Enclosure PD 20(c None/p1) phy bad for slot 6

Maybe this is related?
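
For reference, the controller side can be inspected with the MegaRAID CLI; a minimal sketch, assuming MegaCli is installed and the adapter index is 0:

# Sketch: query the RAID controller about its physical drives and recent events.
megacli -PDList -a0                                   # state of every physical drive on adapter 0
megacli -AdpEventLog -GetLatest 50 -f events.log -a0  # dump the 50 most recent controller events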

@dcaro The server is out of warranty, so I replaced the disk with a spare we had on hand in eqiad. Please confirm this fixed the issue and close the ticket.

@Jclark-ctr the disk does not show up:

root@cloudcephosd1017:~# lsblk
NAME                                                                                                  MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                                                                                                     8:0    0 223.6G  0 disk  
├─sda1                                                                                                  8:1    0   285M  0 part  
└─sda2                                                                                                  8:2    0 223.3G  0 part  
  └─md0                                                                                                 9:0    0 223.2G  0 raid1 
    ├─vg0-root                                                                                        253:0    0  74.5G  0 lvm   /
    ├─vg0-swap                                                                                        253:1    0   976M  0 lvm   [SWAP]
    └─vg0-srv                                                                                         253:9    0 103.1G  0 lvm   /srv
sdb                                                                                                     8:16   0 223.6G  0 disk  
├─sdb1                                                                                                  8:17   0   285M  0 part  
└─sdb2                                                                                                  8:18   0 223.3G  0 part  
  └─md0                                                                                                 9:0    0 223.2G  0 raid1 
    ├─vg0-root                                                                                        253:0    0  74.5G  0 lvm   /
    ├─vg0-swap                                                                                        253:1    0   976M  0 lvm   [SWAP]
    └─vg0-srv                                                                                         253:9    0 103.1G  0 lvm   /srv
sdc                                                                                                     8:32   0   1.8T  0 disk  
└─ceph--e2be6aeb--5322--46d1--bfab--4311bd82d700-osd--block--e41e0d5a--6de8--4514--be92--35f556688a21 253:4    0   1.8T  0 lvm   
sdd                                                                                                     8:48   0   1.8T  0 disk  
└─ceph--a4629599--a5e0--46c9--a3b5--629c5fa67cea-osd--block--aded60c5--058e--486f--b681--d193f100cd7d 253:8    0   1.8T  0 lvm   
sde                                                                                                     8:64   0   1.8T  0 disk  
└─ceph--5ac6969a--2e09--4a40--9590--5e6ce5d391a3-osd--block--9185ae65--08ea--4b09--91d5--921e78ce3a48 253:7    0   1.8T  0 lvm   
sdf                                                                                                     8:80   0   1.8T  0 disk  
└─ceph--cef3e76e--29ab--4727--9472--80c01446fdab-osd--block--d473469b--a18f--4641--8ee3--76137839db18 253:6    0   1.8T  0 lvm   
sdg                                                                                                     8:96   0   1.8T  0 disk  
└─ceph--85015f5b--c44b--4ed3--ac58--7d1f41ca887c-osd--block--b0c403f0--1043--4e32--a060--bedd130168aa 253:5    0   1.8T  0 lvm   
sdh                                                                                                     8:112  0   1.8T  0 disk  
└─ceph--9835625e--c798--4681--b167--90bf3c54d84c-osd--block--05917d6e--5a5c--437c--9e4a--8366c9922df0 253:3    0   1.8T  0 lvm   
sdi                                                                                                     8:128  0   1.8T  0 disk  
└─ceph--13635d88--e8cb--403b--97b3--5cb3706525a9-osd--block--0bb8ecb0--d0e4--467d--8c15--cef999192023 253:2    0   1.8T  0 lvm

(There should be an sdj in that list: the 8th data drive is missing.)

Do you have any other hard drive? Can you check whether this drive works in another host? Is it a new drive?
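
One generic thing worth trying when a swapped drive does not enumerate is forcing a SCSI bus rescan; a minimal sketch (host adapter numbers differ per machine):

# Sketch: ask every SCSI host adapter to rescan for new devices.
for h in /sys/class/scsi_host/host*; do
    echo "- - -" > "$h/scan"
done
lsblk    # check whether the new drive shows up afterwards

Behind a hardware RAID controller this only helps once the controller actually exposes the disk, which, as it turned out below, the foreign status was preventing.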

On cloudcephosd1017 it looks like the drive was listed as foreign. I cleared the foreign status; can you verify it now?
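
For reference, inspecting and clearing a foreign configuration can be done either from the iDRAC/PERC setup or from the OS with the controller CLI; a minimal sketch with MegaCli (adapter index 0 is an assumption):

# Sketch: show and clear foreign drive metadata on adapter 0.
megacli -CfgForeign -Scan -a0     # how many foreign configurations the controller sees
megacli -CfgForeign -Clear -a0    # drop the foreign metadata so the drive becomes usable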

\o/ The drive is listed now. I will add it back to the cluster (it will take a bit) and close the task once it's in (most probably tomorrow). Thanks!
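
Re-adding the drive as an OSD normally boils down to a ceph-volume call (the wmcs cookbooks wrap this); a minimal sketch, assuming the new disk enumerated as /dev/sdj:

# Sketch: wipe leftover metadata and create a fresh BlueStore OSD on the new disk.
# /dev/sdj is an assumption based on the enumeration discussed above.
ceph-volume lvm zap /dev/sdj --destroy
ceph-volume lvm create --bluestore --data /dev/sdj
ceph osd tree | tail    # confirm the new OSD joined the CRUSH map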

The drive is back online and in the cluster 👍