
hw troubleshooting: disk failure for cloudvirt2004-dev.codfw.wmnet
Closed, Resolved (Public)

Description

From SSH:

root@cloudvirt2004-dev:~# less /var/log/puppet.log
-bash: /usr/bin/less: Input/output error

root@cloudvirt2004-dev:~# dmesg
-bash: /usr/bin/dmesg: Input/output error

fnegri@cloudvirt2004-dev:~$ cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb2[0](F) sda2[1]
      467894272 blocks super 1.2 [2/1] [_U]
      bitmap: 1/4 pages [4KB], 65536KB chunk

unused devices: <none>
fnegri@cloudvirt2004-dev:~$ sudo mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Fri Sep 15 13:27:22 2023
        Raid Level : raid1
        Array Size : 467894272 (446.22 GiB 479.12 GB)
     Used Dev Size : 467894272 (446.22 GiB 479.12 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Wed Oct  4 22:21:16 2023
             State : clean, degraded
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 1
     Spare Devices : 0

Consistency Policy : bitmap

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8        2        1      active sync   /dev/sda2

       0       8       18        -      faulty   /dev/sdb2
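
A minimal sketch of how the degraded state above could be detected from the `/proc/mdstat` text. The function name `degraded_arrays` and the `SAMPLE` constant are hypothetical and used only for illustration; the parsing assumes the layout shown in this task (an `mdN :` line followed by a status line such as `[_U]`) and is not a full parser for every mdstat variant.

```python
import re

# Sample copied from the /proc/mdstat output in this task.
SAMPLE = """\
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb2[0](F) sda2[1]
      467894272 blocks super 1.2 [2/1] [_U]
      bitmap: 1/4 pages [4KB], 65536KB chunk

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return arrays that have a faulty member or a missing slot.

    The result maps the array name to a dict with the members flagged (F)
    and the per-slot status string, where "_" marks a missing slot.
    """
    arrays = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:\s*(.*)$", line)
        if header:
            current = header.group(1)
            # Members look like "sdb2[0](F)"; (F) marks a faulty device.
            failed = re.findall(r"(\w+)\[\d+\]\(F\)", header.group(2))
            arrays[current] = {"failed": failed, "status": None}
            continue
        if current and arrays[current]["status"] is None:
            # The status field contains only "U" (up) and "_" (missing).
            status = re.search(r"\[([U_]+)\]", line)
            if status:
                arrays[current]["status"] = status.group(1)
    return {
        name: info
        for name, info in arrays.items()
        if info["failed"] or (info["status"] and "_" in info["status"])
    }


if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

On a live host the input could come from `open("/proc/mdstat").read()`, which only needs the proc filesystem and so may still work when binaries on the affected disk return I/O errors, as `cat /proc/mdstat` did here.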

In console com2:

[2149618.979982] I/O error, dev sda, sector 95684040 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[2149618.988936] I/O error, dev sda, sector 95684040 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[2149623.774999] megaraid_sas 0000:65:00.0: Controller in crit error
[2149668.758769] I/O error, dev sda, sector 95684040 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2

Event Timeline

fnegri renamed this task from "HDD failure in cloudvirt2004-dev" to "hw troubleshooting: disk failure for cloudvirt2004-dev.codfw.wmnet". Oct 10 2023, 12:48 PM

This is not urgent and can wait a few days if necessary.

@fnegri I'm also seeing a potentially failed DIMM. Is it safe to power down the server for troubleshooting?

fnegri added a parent task: Unknown Object (Task). Oct 10 2023, 3:40 PM

Re: DIMM
I've swapped B1 and B7. If the error recurs in B7, it is the stick. If it recurs in B1, it is possibly the system board.

Re: HDD
When I first logged into the server, I was not able to access any of the PERC functions, such as making a disk blink or seeing the status of the disks.
After the DIMM reset I can. I do see all the disks, but I cannot view the state of the software RAID at this time. I'm thinking that the error on the lead DIMM for CPU B may have caused the PERC to misbehave.

I've booted it back up and there are no hardware errors as of now. I can observe it for a few days and see if it fails again, or we can repool it while we observe.

Please let me know if you are still getting disk errors after this and I will open a ticket with Dell to have it replaced.

A new error popped up after rebooting:
T348550

This seems to have resolved on its own? /usr/local/lib/nagios/plugins/check_eth reports OK now.

The host is effectively already repooled because it rejoined the cluster as soon as it started. Not a big deal as it's a test cluster. I'd say let's keep it pooled for a few days and see whether any errors appear.

It's been 3 days. Good enough for me.

Mentioned in SAL (#wikimedia-cloud) [2024-01-16T09:24:18Z] <taavi> move cloudvirt2004-dev from 'failed' to 'active' in netbox - seems like that was for T348531 which is now resolved