Page MenuHomePhabricator

Degraded RAID on analytics1059
Closed, ResolvedPublic

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host analytics1059. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Offline)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
Failed to execute '['/usr/lib/nagios/plugins/check_nrpe', '-4', '-H', 'analytics1059', '-c', 'get_raid_status_megacli']': RETCODE: 2
STDOUT:
b'CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.\n'
STDERR:
None

Event Timeline

Mentioned in SAL (#wikimedia-analytics) [2021-03-07T07:49:32Z] <elukey> umount /var/lib/hadoop/data/e on analytics1059 and restart hadoop daemons to exclude failed disk - T276696

Milimetric moved this task from Airflow to Incoming on the Analytics board.
Cmjohnson subscribed.

I replaced the disk it's in an unconfigured state: Can you add it back to the raid?

Firmware state: Unconfigured(good), Spun Up

Disk is added to raid. Thanks @Cmjohnson for doing the replacement.