hw troubleshooting: disk failure (sdb) on cloudcephmon1004
Closed, Resolved | Public | Request

Description

  • Provide FQDN of system: cloudcephmon1004.eqiad.wmnet
  • If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
  • Put system into a failed state in Netbox.
  • Provide urgency of request, along with justification (redundancy, dependencies, etc.): It's not very urgent, but we have no redundancy while the disk is failed. Losing that redundancy should not cause any downtime, as there is a second level of redundancy, but if that one fails too there would be a full outage for all WMCS services.
  • Describe issue and/or attach hardware failure log (refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help): The sdb drive started failing yesterday around noon; see T392424: Degraded RAID on cloudcephmon1004 for the logs.
  • Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

Event Timeline

The host is still under warranty.

root@cloudcephmon1004:~# sudo mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Tue Nov 26 11:34:32 2024
        Raid Level : raid10
        Array Size : 1874534400 (1787.70 GiB 1919.52 GB)
     Used Dev Size : 937267200 (893.85 GiB 959.76 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Wed Apr 23 08:35:05 2025
             State : active, degraded 
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 1
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : cloudcephmon1004:0  (local to host cloudcephmon1004)
              UUID : 97219653:c33e3546:d9a2b57f:a2125219
            Events : 351460

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync set-A   /dev/sda2
       -       0        0        1      removed
       2       8       34        2      active sync set-A   /dev/sdc2
       3       8       50        3      active sync set-B   /dev/sdd2

       1       8       18        -      faulty   /dev/sdb2
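For anyone who wants to check the array state from a script rather than by eye, here is a minimal sketch that parses `mdadm --detail` text like the output above. It is an illustration only, not the alerting actually used on this host: the function names (`parse_mdadm_detail`, `is_degraded`, `faulty_devices`, `expected_array_kib`) are made up for this example, and the parser assumes the field layout shown in this ticket. The size check assumes a `near=N` layout in which usable capacity is the per-device size times the number of RAID devices, divided by N copies.

```python
import re

# Trimmed copy of the `mdadm --detail /dev/md0` output pasted in this task.
SAMPLE = """\
/dev/md0:
        Raid Level : raid10
        Array Size : 1874534400 (1787.70 GiB 1919.52 GB)
     Used Dev Size : 937267200 (893.85 GiB 959.76 GB)
      Raid Devices : 4
             State : active, degraded
    Failed Devices : 1
            Layout : near=2

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync set-A   /dev/sda2
       -       0        0        1      removed
       2       8       34        2      active sync set-A   /dev/sdc2
       3       8       50        3      active sync set-B   /dev/sdd2

       1       8       18        -      faulty   /dev/sdb2
"""

# Fields this sketch cares about; anything else in the output is ignored.
KEYS = r"State|Raid Devices|Failed Devices|Used Dev Size|Array Size|Layout"
KV_RE = re.compile(rf"^({KEYS}) : (.+)$")
# Device rows: Number Major Minor RaidDevice State [device path]
DEV_RE = re.compile(r"^(\d+|-)\s+(\d+)\s+(\d+)\s+(\d+|-)\s+(.*?)(?:\s+(/dev/\S+))?$")


def parse_mdadm_detail(text):
    """Return selected key/value fields plus one dict per device row."""
    info = {"devices": []}
    for raw in text.splitlines():
        line = raw.strip()
        kv = KV_RE.match(line)
        if kv:
            info[kv.group(1)] = kv.group(2).strip()
            continue
        dev = DEV_RE.match(line)
        if dev:
            info["devices"].append(
                {"slot": dev.group(4), "state": dev.group(5).strip(), "path": dev.group(6)}
            )
    return info


def is_degraded(info):
    """True when the State field mentions 'degraded'."""
    return "degraded" in info.get("State", "")


def faulty_devices(info):
    """Paths of member devices whose row state contains 'faulty'."""
    return [d["path"] for d in info["devices"] if "faulty" in d["state"]]


def expected_array_kib(info):
    """Usable size in KiB for a near=N layout (assumption: devices * size / N)."""
    copies = int(info["Layout"].split("=")[1])
    per_device = int(info["Used Dev Size"].split()[0])
    return per_device * int(info["Raid Devices"]) // copies


info = parse_mdadm_detail(SAMPLE)
print("degraded:", is_degraded(info))
print("faulty:", faulty_devices(info))
print("expected array size (KiB):", expected_array_kib(info))
```

On the output in this task, the computed size (4 × 937267200 / 2 KiB) matches the reported Array Size of 1874534400, and the only faulty member is /dev/sdb2.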

The SupportAssist log file is on Google Drive (too big for Phabricator): https://drive.google.com/file/d/14yhm9CZB0mjeFucYp-a4xnAZcJrZv7Rw/view?usp=sharing

Confirmed: Service Request 209219050 was successfully submitted.

andrea.denisse raised the priority of this task from Medium to Unbreak Now!.
andrea.denisse subscribed.

Hi, this doesn't seem to be resolved, as we're still getting email notifications as of today: DegradedArray event on /dev/md/0:cloudcephmon1004

Please take a look if you can; this alert has been sending an email every day for more than a month already. Thank you!

andrea.denisse renamed this task from "hw troubleshooting: disk failure (sdb) on coludcephmon1004" to "hw troubleshooting: disk failure (sdb) on cloudcephmon1004". May 26 2025, 3:27 PM
andrea.denisse added a subscriber: Jclark-ctr.