
Icinga RAID check: monitor rebuild status
Open, Medium, Public

Description

puppet/modules/base/files/monitoring/check-raid.py:
Currently this Icinga check returns OK as soon as a volume begins rebuilding.
It should instead keep returning WARNING until redundancy is restored.
Operators need to be aware of in-progress rebuilds because we may wish to
treat a host specially until the rebuild is complete, for example by taking it
out of a service pool. This could be to minimize risk to data integrity, or
for performance reasons, since the RAID rebuild process consumes disk IO.
It would be most helpful if the monitor also displayed the time remaining and
speed of in-progress rebuilds, as supplied by /proc/mdstat:
md1 : active raid10 sdg2[6] sdf2[5] sda2[0] sdh2[7] sdd2[3] sde2[4] sdc2[2] sdb2[1]
      1132249088 blocks super 1.2 512K chunks 2 near-copies [8/8] [UUUUUUUU]
      [===========>.........]  check = 56.7% (642031616/1132249088) finish=5981.7min speed=1365K/sec
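
A minimal sketch of how the requested behaviour might look, assuming the check can read /proc/mdstat as text and that each progress line follows the layout shown above. The function name `check_mdstat` and both regular expressions are hypothetical and are not taken from the existing check-raid.py. The sketch also assumes that every in-progress operation (resync, recovery, reshape, check) should produce a WARNING; whether a routine consistency check deserves one is a policy decision this task does not settle.

```python
import re

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING = 0, 1

# Hypothetical pattern for an array header line, e.g. "md1 : active raid10 ...".
DEVICE_RE = re.compile(r"^(?P<name>md\S*)\s*:")

# Hypothetical pattern for an mdstat progress line, e.g.
#   [=====>....]  check = 56.7% (642031616/1132249088) finish=5981.7min speed=1365K/sec
PROGRESS_RE = re.compile(
    r"(?P<action>resync|recovery|reshape|check)\s*=\s*(?P<percent>[\d.]+)%"
    r".*?finish=(?P<finish>[\d.]+)min\s+speed=(?P<speed>\d+)K/sec"
)


def check_mdstat(text):
    """Return (exit_code, message) for the given /proc/mdstat contents."""
    in_progress = []
    device = None
    for line in text.splitlines():
        header = DEVICE_RE.match(line)
        if header:
            # Remember which array the following lines belong to.
            device = header.group("name")
            continue
        progress = PROGRESS_RE.search(line)
        if progress and device:
            in_progress.append(
                "{} {} {}% done, finish={}min, speed={}K/sec".format(
                    device,
                    progress.group("action"),
                    progress.group("percent"),
                    progress.group("finish"),
                    progress.group("speed"),
                )
            )
    if in_progress:
        # Assumption: any in-progress operation keeps the check at WARNING.
        return WARNING, "WARNING: " + "; ".join(in_progress)
    return OK, "OK: no rebuild in progress"
```

A plugin wrapper would read /proc/mdstat, pass its contents to `check_mdstat`, print the message, and exit with the returned code. Detection of degraded arrays is deliberately left out; this only covers the progress reporting described in this task.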

Details

Reference
rt6796

Event Timeline

rtimport raised the priority of this task from to Medium. Dec 18 2014, 1:49 AM
rtimport added a project: ops-core.
rtimport set Reference to rt6796.
Gage created this task. Feb 7 2014, 8:59 PM
Gage set Security to None.
fgiunchedi changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)". Jul 13 2016, 3:59 PM
fgiunchedi changed the edit policy from "WMF-NDA (Project)" to "All Users".
fgiunchedi moved this task from Inbox to Radar on the observability board. Jul 20 2020, 1:14 PM