Degraded RAID on heze-array1
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host heze. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_megacli
Failed to execute '['/usr/lib/nagios/plugins/check_nrpe', '-4', '-H', 'heze', '-c', 'get_raid_status_megacli']': RETCODE: 2
STDOUT:
CHECK_NRPE: Socket timeout after 10 seconds.

STDERR:
None

Event Timeline

Restricted Application added a subscriber: Aklapper. Oct 13 2018, 10:43 AM
jijiki triaged this task as High priority. Oct 23 2018, 2:34 PM
jijiki assigned this task to Papaul.
jijiki added a subscriber: akosiaris.
jijiki added a subscriber: jijiki. Oct 23 2018, 2:41 PM
[Sat Oct 13 09:54:20 2018] megaraid_sas 0000:08:00.0: scanning for scsi7...
[Sat Oct 13 09:54:20 2018] megaraid_sas 0000:08:00.0: 6833 (592740763s/0x0001/CRIT) - VD 00/0 is now DEGRADED
[Sat Oct 13 09:54:20 2018] megaraid_sas 0000:08:00.0: 6834 (592740763s/0x0002/FATAL) - Patrol Read found an uncorrectable medium error on PD 0a(e0x0f/s9) at 1b4d76f8c
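For anyone triaging similar controller events, here is a minimal sketch of pulling the sequence number, severity and message out of `megaraid_sas` kernel log lines like the ones above. The regular expression and the `parse_events` helper are illustrative, not part of any existing tooling. The conversion of the `592740763s` field assumes the controller counts seconds from 2000-01-01; that is an assumption, although it lands on the same day as the kernel timestamp.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
# [Sat Oct 13 09:54:20 2018] megaraid_sas 0000:08:00.0: 6833 (592740763s/0x0001/CRIT) - VD 00/0 is now DEGRADED
EVENT_RE = re.compile(
    r"^\[(?P<kernel_ts>[^\]]+)\]\s+megaraid_sas\s+(?P<pci>\S+):\s+"
    r"(?P<seq>\d+)\s+\((?P<secs>\d+)s/(?P<cls>0x[0-9a-fA-F]+)/(?P<sev>[A-Z]+)\)"
    r"\s+-\s+(?P<msg>.*)$"
)

# Assumption: controller timestamps count seconds from 2000-01-01.
CONTROLLER_EPOCH = datetime(2000, 1, 1)


def parse_events(lines):
    """Return one dict per controller event line; other lines are skipped."""
    events = []
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if match is None:
            continue  # e.g. the "scanning for scsi7..." line carries no event
        events.append({
            "seq": int(match.group("seq")),
            "severity": match.group("sev"),
            "message": match.group("msg"),
            "kernel_time": match.group("kernel_ts"),
            "controller_time": CONTROLLER_EPOCH
            + timedelta(seconds=int(match.group("secs"))),
        })
    return events


SAMPLE = [
    "[Sat Oct 13 09:54:20 2018] megaraid_sas 0000:08:00.0: scanning for scsi7...",
    "[Sat Oct 13 09:54:20 2018] megaraid_sas 0000:08:00.0: 6833 (592740763s/0x0001/CRIT) - VD 00/0 is now DEGRADED",
    "[Sat Oct 13 09:54:20 2018] megaraid_sas 0000:08:00.0: 6834 (592740763s/0x0002/FATAL) - Patrol Read found an uncorrectable medium error on PD 0a(e0x0f/s9) at 1b4d76f8c",
]

if __name__ == "__main__":
    for event in parse_events(SAMPLE):
        print(event["seq"], event["severity"], event["message"])
```

Filtering the result on `severity` (for example `CRIT` and `FATAL`) gives a quick list of the events worth attaching to a task.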
Papaul renamed this task from Degraded RAID on heze to Degraded RAID on heze-array1. Oct 29 2018, 2:41 PM

There are 2 failed disks on the system: slot 7 and slot 9. This system is out of warranty and I have no 4TB SAS disks on site for replacement. I will have to open a procurement task for this.

Papaul mentioned this in Unknown Object (Task). Oct 29 2018, 9:09 PM

@Papaul I'd say ignore it. That system plus its disk shelf/array is scheduled for decommission, to be replaced with backup2001 (T196477). The data on it is a copy of the data from helium, so we ain't gonna lose anything if more disks fail. There is no point in maintaining it. After talking with @MoritzMuehlenhoff on IRC, it seems we can do a fresh reinstall of backup2001/backup1001 next week with the new stretch point release, set up the service on them, and then decommission this.

Papaul closed this task as Resolved. Nov 9 2018, 3:13 PM

@akosiaris thanks. Resolving this task.

JFTR, the disks are now in such a poor state that the OS is throwing errors. While installing a software update, it was displaying errors like:

/dev/bacula/baculasd2: read failed after 0 of 4096 at 1099511619584: Input/output error
/dev/bacula/baculasd2: read failed after 0 of 4096 at 0: Input/output error
/dev/bacula/baculasd2: read failed after 0 of 4096 at 4096: Input/output error
/dev/sdb: read failed after 0 of 4096 at 0: Input/output error
/dev/sdb: read failed after 0 of 4096 at 44002476752896: Input/output error
/dev/sdb: read failed after 0 of 4096 at 44002476810240: Input/output error
/dev/sdb: read failed after 0 of 4096 at 4096: Input/output error
/dev/sdb1: read failed after 0 of 512 at 44002475638784: Input/output error
/dev/sdb1: read failed after 0 of 512 at 44002475741184: Input/output error
/dev/sdb1: read failed after 0 of 512 at 0: Input/output error
/dev/sdb1: read failed after 0 of 512 at 4096: Input/output error
/dev/sdb1: read failed after 0 of 2048 at 0: Input/output error
/dev/bacula/baculasd1: read failed after 0 of 4096 at 42880953417728: Input/output error
/dev/bacula/baculasd1: read failed after 0 of 4096 at 42880953475072: Input/output error
/dev/bacula/baculasd1: read failed after 0 of 4096 at 0: Input/output error
/dev/bacula/baculasd1: read failed after 0 of 4096 at 4096: Input/output error
/dev/bacula/baculasd2: read failed after 0 of 4096 at 1099511562240: Input/output error
/dev/bacula/baculasd2: read failed after 0 of 4096 at 1099511619584: Input/output error
/dev/bacula/baculasd2: read failed after 0 of 4096 at 0: Input/output error
/dev/bacula/baculasd2: read failed after 0 of 4096 at 4096: Input/output error
/dev/sdb: read failed after 0 of 4096 at 0: Input/output error
/dev/sdb: read failed after 0 of 4096 at 44002476752896: Input/output error
/dev/sdb: read failed after 0 of 4096 at 44002476810240: Input/output error
/dev/sdb: read failed after 0 of 4096 at 4096: Input/output error
/dev/sdb1: read failed after 0 of 512 at 44002475638784: Input/output error
/dev/sdb1: read failed after 0 of 512 at 44002475741184: Input/output error
/dev/sdb1: read failed after 0 of 512 at 0: Input/output error
/dev/sdb1: read failed after 0 of 512 at 4096: Input/output error
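To make output like the above easier to read at a glance, here is a minimal sketch that counts the `read failed` lines per device and reports the highest failing byte offset. The `summarize` helper and the regular expression are illustrative only. The highest offset is just a rough lower bound on the device size: for `/dev/sdb` it works out to about 40 TiB, and for `/dev/bacula/baculasd2` to just under 1 TiB.

```python
import re
from collections import defaultdict

# Matches lines such as:
# /dev/sdb: read failed after 0 of 4096 at 44002476810240: Input/output error
READ_FAIL_RE = re.compile(
    r"^(?P<dev>/dev/\S+): read failed after \d+ of (?P<size>\d+) "
    r"at (?P<offset>\d+): Input/output error$"
)

TIB = 2 ** 40  # bytes per tebibyte


def summarize(lines):
    """Map each device to its failure count and highest failing offset."""
    stats = defaultdict(lambda: {"count": 0, "max_offset": 0})
    for line in lines:
        match = READ_FAIL_RE.match(line.strip())
        if match is None:
            continue  # ignore anything that is not a read-failure line
        entry = stats[match.group("dev")]
        entry["count"] += 1
        entry["max_offset"] = max(entry["max_offset"], int(match.group("offset")))
    return dict(stats)


SAMPLE = [
    "/dev/bacula/baculasd2: read failed after 0 of 4096 at 1099511619584: Input/output error",
    "/dev/bacula/baculasd2: read failed after 0 of 4096 at 0: Input/output error",
    "/dev/sdb: read failed after 0 of 4096 at 0: Input/output error",
    "/dev/sdb: read failed after 0 of 4096 at 44002476810240: Input/output error",
    "/dev/sdb1: read failed after 0 of 512 at 44002475741184: Input/output error",
]

if __name__ == "__main__":
    for dev, entry in sorted(summarize(SAMPLE).items()):
        print(f"{dev}: {entry['count']} failures, "
              f"highest offset {entry['max_offset'] / TIB:.2f} TiB")
```

Reads failing at offset 0 as well as near the end of each device suggest the whole block device is unreadable rather than a few bad sectors, which fits the state of the array described in this task.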
jijiki reopened this task as Open. Feb 22 2019, 1:03 PM
akosiaris closed this task as Resolved. Feb 22 2019, 2:47 PM

Re-closing per T206909#4734830

backup2001 is actually set up and working fine. I was waiting on T196478 to resume working on backup2001 and backup1001 together and then kill both heze and helium. Currently heze is the secondary backup server, so there isn't much harm done.

The reasoning behind the coupling is to have both the bacula-director and bacula-sd daemons on the same version to avoid protocol incompatibilities (the bacula-fds are actually fine on older versions). I'll try to get T196478 moving again.