[ceph] export number of bad sectors per-disk
Open, High, Public

Description

We discovered in {} that a lot of the hard drives are experiencing bad sectors, but this flew under the radar because the current smartd alerts don't catch it.

This task is to create a node exporter to keep track of the number of bad sectors, and as a side-bonus to create an alert in two cases:

  • when the number is relatively high (say >1k) -> warning
  • when the number increases -> critical

The runbook should specify that those hard drives will need replacement, and for the increasing one specifically, to hurry up as the disk is degrading.
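A minimal sketch of what such an exporter could look like, assuming the counts are read from `smartctl --json -A <device>` and published through the node exporter textfile collector. The metric name `disk_bad_sectors`, the helper names, the choice of SMART attributes (5, 197, 198) and the `SAMPLE` fixture are all illustrative assumptions, not something decided in this task. The `severity` helper only mirrors the two alert cases above; in practice that logic would more likely live in the alerting rules.

```python
import json

# SMART attribute IDs commonly associated with bad sectors on ATA drives
# (assumption: which attributes to track is not decided in this task).
BAD_SECTOR_ATTRS = {5: "reallocated", 197: "pending", 198: "offline_uncorrectable"}

# Hand-written fixture shaped like `smartctl --json -A` output, not real data.
SAMPLE = json.dumps({
    "ata_smart_attributes": {
        "table": [
            {"id": 5, "name": "Reallocated_Sector_Ct", "raw": {"value": 1200}},
            {"id": 9, "name": "Power_On_Hours", "raw": {"value": 40000}},
            {"id": 197, "name": "Current_Pending_Sector", "raw": {"value": 8}},
        ]
    }
})


def bad_sectors(smartctl_json):
    """Return {kind: raw_value} for the bad-sector attributes that are present."""
    data = json.loads(smartctl_json)
    table = data.get("ata_smart_attributes", {}).get("table", [])
    return {
        BAD_SECTOR_ATTRS[attr["id"]]: attr["raw"]["value"]
        for attr in table
        if attr["id"] in BAD_SECTOR_ATTRS
    }


def to_textfile(device, counts):
    """Render the counts in the Prometheus text exposition format."""
    lines = ["# TYPE disk_bad_sectors gauge"]  # hypothetical metric name
    for kind, value in sorted(counts.items()):
        lines.append(f'disk_bad_sectors{{device="{device}",kind="{kind}"}} {value}')
    return "\n".join(lines) + "\n"


def severity(previous, current, threshold=1000):
    """Mirror the two alert cases: any increase is critical, a high count is a warning."""
    if current > previous:
        return "critical"
    if current > threshold:
        return "warning"
    return None


if __name__ == "__main__":
    print(to_textfile("sda", bad_sectors(SAMPLE)), end="")
```

A cron job or systemd timer could run this per disk and write the result into the textfile collector directory. NVMe and SAS devices report health differently, so they would need their own parsing.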

Event Timeline

dcaro triaged this task as High priority.
dcaro changed the task status from Open to In Progress. Oct 20 2023, 8:11 AM
dcaro moved this task from To refine to Doing on the User-dcaro board.

Mentioned in SAL (#wikimedia-cloud) [2023-10-20T08:26:14Z] <dcaro> codfw ceph enabled diskprediction_local module, will take a bit to populate/start getting predictions (T348716)

It will take at least 6 days to get any predictions:
Oct 20 08:24:34 cloudcephmon2005-dev ceph-mgr[17446]: 2023-10-20T08:24:34.203+0000 7fe3ffdc3700 0 [diskprediction_local ERROR root] unable to predict device due to health data records less than 6 days

So it seems that Ceph does not export the disk metrics it collects to Prometheus in any way; we will have to gather them separately.

Are you sure that this is not just a codfw1dev specific thing? Looks like the Ceph prometheus config is in profile::prometheus::cloud which is currently in eqiad only?

Yep, I was looking at eqiad since, as you mention, it's the only one gathering stats, and there are no Ceph metrics related to the devices in Prometheus (there are metrics about RBDs etc., but no device health/stats or similar).

dcaro added a parent task: Restricted Task. Feb 22 2024, 8:53 AM
Aklapper changed the task status from In Progress to Open. Apr 11 2025, 10:14 PM
Aklapper subscribed.

Resetting task status from "In Progress" to "Open" as this task has been "in progress" for more than one and a half years (see T380300).

@dcaro Removing task assignee as this open task has been assigned for more than two years - See the email sent to task assignee on 2025-11-25.
Please assign this task to yourself again if you still realistically plan to work on it - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!