- Provide FQDN of system: cloudcephmon1004.eqiad.wmnet
- If other than a hard drive issue, please depool the machine (and confirm that it's been depooled) for us to work on it. If not, please provide a time frame for us to take the machine down.
- Put system into a failed state in Netbox.
- Provide urgency of request, along with justification (redundancy, dependencies, etc.): It's not very urgent, but we have no redundancy while the disk is failed. Losing that redundancy should not cause any downtime, as there is another redundancy level, but if that one fails as well there would be a full outage for all WMCS services.
- Describe issue and/or attach hardware failure log (refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help): The sdb drive started failing yesterday around noon; see T392424: Degraded RAID on cloudcephmon1004 for the logs, and the quick check sketched after this list.
- Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.
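For reference, the quickest way to confirm the failure from the host itself looks roughly like this; a minimal sketch, assuming smartmontools is installed (the exact SMART attributes reported vary by drive model):

# array status; the failed member is marked with (F)
cat /proc/mdstat

# SMART health summary and full report for the suspect drive
smartctl -H /dev/sdb
smartctl -a /dev/sdb

# kernel-side I/O errors logged for sdb
dmesg | grep -i sdb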
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | aborrero | T392423 KernelErrors Server cloudcephmon1004 logged kernel errors |
| Resolved | | VRiley-WMF | T392424 Degraded RAID on cloudcephmon1004 |
| Resolved | Request | Jclark-ctr | T392458 hw troubleshooting: disk failure (sdb) on cloudcephmon1004 |
Event Timeline
The host is still under warranty.
root@cloudcephmon1004:~# sudo mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Tue Nov 26 11:34:32 2024
        Raid Level : raid10
        Array Size : 1874534400 (1787.70 GiB 1919.52 GB)
     Used Dev Size : 937267200 (893.85 GiB 959.76 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Wed Apr 23 08:35:05 2025
             State : active, degraded
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 1
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : cloudcephmon1004:0  (local to host cloudcephmon1004)
              UUID : 97219653:c33e3546:d9a2b57f:a2125219
            Events : 351460

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync set-A   /dev/sda2
       -       0        0        1      removed
       2       8       34        2      active sync set-A   /dev/sdc2
       3       8       50        3      active sync set-B   /dev/sdd2

       1       8       18        -      faulty              /dev/sdb2

The SupportAssist logfile is in Drive: https://drive.google.com/file/d/14yhm9CZB0mjeFucYp-a4xnAZcJrZv7Rw/view?usp=sharing (too big for Phabricator)
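Once dc-ops is ready to pull the drive, the faulty member is normally dropped from the array first. A minimal sketch with plain mdadm (device names taken from the output above; not a prescribed procedure):

# sdb2 is already flagged faulty, so it only needs to be removed from md0
mdadm --manage /dev/md0 --remove /dev/sdb2

# optional, if the old drive will be reused elsewhere: clear its md superblock
mdadm --zero-superblock /dev/sdb2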
Hi, this doesn't seem to be resolved, as we're still getting email notifications as of today: DegradedArray event on /dev/md/0:cloudcephmon1004
Please take a look when you can; this has been firing and sending an email every day for more than a month already, thank you!
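For whoever picks this up: the email comes from mdadm's monitoring, and the degraded state is easy to confirm on the host. A minimal check, assuming the stock Debian md setup:

# a healthy raid10 here would show [4/4] [UUUU]; the degraded array shows [4/3] with the missing slot as _
cat /proc/mdstat

# detailed view of the array state and any failed members
mdadm --detail /dev/md0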
Thanks to Taavi for adding /dev/sdb back to the software RAID. https://grafana.wikimedia.org/d/000000343/load-balancers-lvs?orgId=1&from=now-12h&to=now&timezone=utc&var-site=eqiad
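For future reference, putting a replaced member back into an md array usually looks roughly like this; this is a sketch of the generic procedure (assuming the partition layout is copied from the surviving /dev/sda), not necessarily the exact commands run here:

# replicate the partition table from a healthy member onto the replaced disk
sfdisk -d /dev/sda | sfdisk /dev/sdb

# re-add the raid partition and let the array resync
mdadm --manage /dev/md0 --add /dev/sdb2

# follow the rebuild progress
cat /proc/mdstat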