Maniphest T209829

Degraded RAID on labcontrol1001
Closed, Resolved (Public)

Assigned To

Authored By

	ops-monitoring-bot
	Nov 19 2018, 1:17 PM

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host labcontrol1001. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

connect to address 208.80.154.92 port 5666: No route to host

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_md
Failed to execute '['/usr/lib/nagios/plugins/check_nrpe', '-4', '-H', 'labcontrol1001', '-c', 'get_raid_status_md']': RETCODE: 2
STDOUT:

STDERR:
None

Event Timeline

ops-monitoring-bot added projects: SRE, ops-eqiad. (Nov 19 2018, 1:17 PM)

ops-monitoring-bot subscribed.

Manual check:

aborrero@icinga1001:~ $ /usr/lib/nagios/plugins/check_nrpe -4 -H labcontrol1001 -c get_raid_status_md
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md1 : active raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]
      
md2 : active raid1 sda3[0] sdb3[1]
      926825280 blocks super 1.2 [2/2] [UU]
      
md0 : active raid1 sda1[0] sdb1[1]
      48794496 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>

This server is coming up on 5 years old. I can replace a disk, but your manual check does not show any failed disk, and Icinga does not now show a degraded RAID.
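For reference, the "no failed disk" reading of the manual check can be reproduced mechanically. The following is only a minimal sketch, not the logic of the actual get_raid_status_md plugin: it assumes the usual /proc/mdstat layout, where `[2/2] [UU]` means two member devices expected, two active, and an underscore in the flags marks a missing member. The function name `find_degraded_arrays` is hypothetical.

```python
import re

# Array header line, e.g. "md1 : active raid1 sda2[0] sdb2[1]".
ARRAY_RE = re.compile(r"^(md\d+)\s*:")
# Status field on the following line, e.g. "[2/2] [UU]" (expected/active, flags).
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def find_degraded_arrays(mdstat_text):
    """Return the names of md arrays that report a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected = int(status.group(1))
            active = int(status.group(2))
            flags = status.group(3)
            # Fewer active devices than expected, or "_" in the flags,
            # indicates a degraded array.
            if active < expected or "_" in flags:
                degraded.append(current)
            current = None
    return degraded


# Output of the manual check quoted in this task.
MANUAL_CHECK = """\
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3[0] sdb3[1]
      926825280 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
      48794496 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""

if __name__ == "__main__":
    print(find_degraded_arrays(MANUAL_CHECK))
```

Run against the manual check output, the sketch reports no degraded arrays, which is consistent with the automatic report having been triggered by the failed NRPE connection rather than by a failed disk.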
