hw troubleshooting: one disk not working properly in cloudcephosd1034.eqiad.wmnet
Closed, Resolved · Public · Bug Report

Description

Several attempts to format /dev/sde on cloudcephosd1030.eqiad.wmnet failed with I/O and corruption errors. I tried both with and without LVM. There seems to be something wrong with the physical disk.

The drive has since been moved to cloudcephosd1034 (see comments below).

Please note the drive ordering can sometimes change on a reboot, so the name could change from sdX to sdY. The serial number of the faulty drive (from smartctl -a /dev/sdX) is KSA8N4825I0208C2G.
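Since the sdX name is not stable, it can help to look the drive up by serial number instead. The following is a minimal sketch, assuming `lsblk` is available on the host and reports the same serial that smartctl shows; the helper names `parse_lsblk` and `device_for_serial` are made up for illustration.

```python
import subprocess


def parse_lsblk(text):
    """Map serial number -> device path from `lsblk -d -n -o NAME,SERIAL` output."""
    by_serial = {}
    for line in text.splitlines():
        fields = line.split()
        # Devices that report no serial produce a single-field line; skip them.
        # Assumption: serial numbers contain no whitespace.
        if len(fields) == 2:
            name, serial = fields
            by_serial[serial] = "/dev/" + name
    return by_serial


def device_for_serial(serial):
    """Return the current /dev/sdX path for a serial, or None if not found."""
    out = subprocess.run(
        ["lsblk", "-d", "-n", "-o", "NAME,SERIAL"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_lsblk(out).get(serial)
```

For example, `device_for_serial("KSA8N4825I0208C2G")` would return whichever device name the faulty drive currently has after the last reboot.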

# pvs
  /dev/sde: Checksum error at offset 7168
  WARNING: invalid metadata text from /dev/sde at 7168.
  WARNING: metadata on /dev/sde at 7168 has invalid summary for VG.
  WARNING: bad metadata text on /dev/sde in mda1
  WARNING: scanning /dev/sde mda1 failed to read metadata summary.
  WARNING: repair VG metadata on /dev/sde with vgck --updatemetadata.
  WARNING: scan failed to get metadata summary from /dev/sde PVID wMLZDQHfBNpei9wJm0l87w200VVNcVWf
  WARNING: PV /dev/sde is marked in use but no VG was found using it.
  WARNING: PV /dev/sde might need repairing.
OSD::mkfs: ObjectStore::mkfs failed with error (5) Input/output error
ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap

Event Timeline

I downloaded the support logs from the Dell "Support Assist" interface, but the zip file is too big to upload here. Let me know if you need it.

fnegri renamed this task from One disk in cloudcephosd1030 is not working properly to hw troubleshooting: one disk not working properly in cloudcephosd1030.eqiad.wmnet. Aug 31 2022, 8:07 AM
fnegri assigned this task to Cmjohnson.
fnegri added a project: ops-eqiad.

Confirmed: Service Request 151038642 was successfully submitted.
Created a Dell ticket.

Thanks @Jclark-ctr -- FYI I have temporarily removed this node from the Ceph cluster, so it can be safely rebooted/shut down if necessary.

@fnegri I reran the support log and did not notice any errors in it, but I still opened a support ticket.

Dell asked me to run Hardware Diagnostics after the support log showed no errors. I have run them multiple times and they produced no errors. @fnegri can you try again?

@Jclark-ctr right now I cannot connect to cloudcephosd1030.mgmt.eqiad.wmnet with SSH.

Icinga is also showing a bunch of "UNKNOWN" statuses: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=cloudcephosd1030

Sorry, I had left it in a screen for the hardware test. That is my mistake.

No worries, I am able to SSH now. I created a test partition /dev/sde1, and indeed mkfs.ext4 /dev/sde1 did not throw any error this time... Then I was kicked off the instance. When I SSH-ed in again, I could no longer see the partition I had created 5 minutes earlier. I'm confused 🤔

The "kicked off" part is explained by @Jclark-ctr rebooting the instance. The disappearing partition, however, is another symptom of some fault with the drive or the bay. It happened a second time: I created a partition with fdisk, formatted it with mkfs.ext4, and after a few minutes the partition was no longer there.
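To catch the moment a partition vanishes, one option is to poll the kernel's partition list. This is only a sketch, assuming the usual four-column layout of /proc/partitions (major, minor, #blocks, name); `listed_partitions` and `partition_present` are hypothetical helper names.

```python
def listed_partitions(proc_partitions_text):
    """Return the set of device/partition names listed in /proc/partitions text."""
    names = set()
    for line in proc_partitions_text.splitlines():
        fields = line.split()
        # Data rows have four fields and start with a numeric major number;
        # this skips the header row and the blank line after it.
        if len(fields) == 4 and fields[0].isdigit():
            names.add(fields[3])
    return names


def partition_present(name, path="/proc/partitions"):
    """Check whether the kernel currently lists the given partition, e.g. 'sde1'."""
    with open(path) as f:
        return name in listed_partitions(f.read())
```

Calling `partition_present("sde1")` in a loop with a short sleep, and logging the timestamp when it first returns False, would show how long the partition survives after mkfs.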

dmesg shows a good number of log lines similar to this one (please note the drive name changed from sde to sdf after a reboot):

blk_update_request: I/O error, dev sdf, sector 2184 op 0x0:(READ) flags 0x80700 phys_seg 15 prio class 0
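To get a rough count of these errors per device, the dmesg output can be filtered with a small script. This is a sketch that assumes every relevant line follows the same format as the one above; the regular expression and the `count_io_errors` name are illustrative, not an established tool.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   blk_update_request: I/O error, dev sdf, sector 2184 op 0x0:(READ) ...
IO_ERROR = re.compile(
    r"I/O error, dev (?P<dev>[^,\s]+), sector (?P<sector>\d+) "
    r"op 0x[0-9a-fA-F]+:\((?P<op>\w+)\)"
)


def count_io_errors(log_text):
    """Count I/O error lines, grouped by (device, operation)."""
    counts = Counter()
    for match in IO_ERROR.finditer(log_text):
        counts[(match.group("dev"), match.group("op"))] += 1
    return counts
```

The text passed in could be the saved output of `dmesg`; a drive that accumulates read errors on more than one host points at the drive rather than the controller.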

We verified that the problem is with the drive and not with the controller, because @Jclark-ctr moved the disk to cloudcephosd1034 and I could reproduce the issue there.

The faulty drive is currently sitting in cloudcephosd1034. I will proceed with adding 1030, 1031, 1032 and 1033 to the cluster.

fnegri renamed this task from hw troubleshooting: one disk not working properly in cloudcephosd1030.eqiad.wmnet to hw troubleshooting: one disk not working properly in cloudcephosd1034.eqiad.wmnet. Sep 14 2022, 9:07 AM

I am submitting a ticket with Dell for a new disk today. The h/w logs do not show an error but we will see what they say. @Jclark-ctr can you confirm it is the disk in slot 7?

Requested a disk: "You have successfully submitted request SR151635668."

@Cmjohnson the drive was already ordered and has just been replaced.