[ceph] export number of bad sectors per-disk
Open, In Progress, High, Public

Assigned To

Authored By

	dcaro
	Oct 12 2023, 7:54 AM

Description

We discovered in {} that a lot of the hard drives are experiencing bad sectors, but it flew under the radar because the current smartd alerts don't catch it.

This task is to create a node exporter that keeps track of the number of bad sectors per disk and, as a side bonus, to create an alert in two cases:

when the number is relatively high (say >1k) -> warning
when the number increases -> critical

The runbook should specify that those hard drives will need replacement and, for the increasing case specifically, that the replacement is urgent because the disk is actively degrading.
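The two alert cases above can be sketched as a small classification function. This is only an illustration of the intended logic, not the actual alert rule: the function name and the `WARNING_THRESHOLD` constant are hypothetical, and the 1000 value is just the "say >1k" figure from the description.

```python
from typing import Optional

# "say >1k" from the task description; an illustrative value, not a tuned one.
WARNING_THRESHOLD = 1000


def classify_bad_sectors(previous: Optional[int], current: int) -> Optional[str]:
    """Return 'critical', 'warning', or None for a single disk.

    previous: bad sector count from the last sample (None if there is none yet).
    current:  bad sector count from the latest sample.
    """
    # A growing count means the disk is degrading right now.
    if previous is not None and current > previous:
        return "critical"
    # A high but stable count still means the disk needs replacement.
    if current > WARNING_THRESHOLD:
        return "warning"
    return None
```

In this sketch the increase check runs first, so a disk that is both above the threshold and still growing is reported as critical rather than warning.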

Related Objects

Status	Assigned	Task
In Progress	dcaro	T334240 [cloudceph] Slow operations - tracking task
Resolved	taavi	T348634 ceph slow ops 2023-10-11
		Unknown Object (Task)
		Unknown Object (Task)
In Progress	dcaro	T348643 cloudcephosd1021-1034: hard drive sector errors increasing
In Progress	dcaro	T348716 [ceph] export number of bad sectors per-disk
Open	dcaro	T349694 [ceph] Enable disk failure prediction

Event Timeline

dcaro triaged this task as High priority. Oct 12 2023, 7:54 AM

dcaro created this task.

Restricted Application edited projects, added cloud-services-team; removed cloud-services-team (Kanban). Oct 12 2023, 7:54 AM

dcaro changed the task status from Open to In Progress. Oct 20 2023, 8:11 AM

dcaro moved this task from To refine to Doing on the User-dcaro board.

Mentioned in SAL (#wikimedia-cloud) [2023-10-20T08:26:14Z] <dcaro> codfw ceph enabled diskprediction_local module, will take a bit to populate/start getting predictions (T348716)

It will take at least 6 days to get any predictions:
Oct 20 08:24:34 cloudcephmon2005-dev ceph-mgr[17446]: 2023-10-20T08:24:34.203+0000 7fe3ffdc3700 0 [diskprediction_local ERROR root] unable to predict device due to health data records less than 6 days

So it seems that ceph does not export the disk metrics it collects to prometheus in any way, so we will have to gather them separately.
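As a rough sketch of what gathering them separately could look like, the snippet below turns the JSON that `smartctl --json -A <device>` prints into Prometheus text-format lines, which could then be written to a file for the node exporter textfile collector. Several things here are assumptions and not part of this task: the metric name `disk_bad_sectors`, the label names, the choice of SMART attributes 5, 197 and 198 as "bad sector" counters, and that the drives report an ATA attribute table at all (SAS/NVMe devices expose different fields).

```python
from typing import List

# SMART attribute IDs commonly used as bad-sector indicators (assumed selection).
BAD_SECTOR_ATTRIBUTES = {
    5: "reallocated",
    197: "pending",
    198: "offline_uncorrectable",
}


def bad_sector_metrics(device: str, smart: dict) -> List[str]:
    """Build Prometheus text-format lines from parsed smartctl JSON output.

    device: label value for the disk, e.g. "sda".
    smart:  the dict obtained from json.loads() on `smartctl --json -A` output.
    """
    lines = []
    # ATA drives list their attributes under ata_smart_attributes.table.
    table = smart.get("ata_smart_attributes", {}).get("table", [])
    for attr in table:
        kind = BAD_SECTOR_ATTRIBUTES.get(attr.get("id"))
        if kind is None:
            continue  # not one of the attributes we track
        value = attr.get("raw", {}).get("value", 0)
        # "disk_bad_sectors" is a hypothetical metric name.
        lines.append(
            f'disk_bad_sectors{{device="{device}",type="{kind}"}} {value}'
        )
    return lines
```

A caller would run `smartctl` per device, parse its stdout with `json.loads`, and write the returned lines to a `.prom` file in the directory the textfile collector reads.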

In T348716#9279100, @dcaro wrote:

So it seems that ceph does not export the disk metrics it collects to prometheus in any way, will have to gather them separately.

Are you sure that this is not just a codfw1dev specific thing? Looks like the Ceph prometheus config is in profile::prometheus::cloud which is currently in eqiad only?

In T348716#9279122, @taavi wrote:

In T348716#9279100, @dcaro wrote:

So it seems that ceph does not export the disk metrics it collects to prometheus in any way, will have to gather them separately.

Are you sure that this is not just a codfw1dev specific thing? Looks like the Ceph prometheus config is in profile::prometheus::cloud which is currently in eqiad only?

Yep, I was looking into eqiad since, as you mention, it's the only one gathering stats, and there are no ceph metrics related to the devices on prometheus (there are metrics about rbds, etc., but no device health/stats or similar metrics).

dcaro added a parent task: T348643: cloudcephosd1021-1034: hard drive sector errors increasing. Feb 22 2024, 8:53 AM

taavi added a project: Cloud-VPS. Sep 28 2024, 12:45 PM

taavi moved this task from Unsorted to Storage on the Cloud-VPS board. Nov 1 2024, 7:04 PM
