We are currently experiencing RAID battery failures on multiple Hadoop worker hosts:
- analytics1068
- an-worker1079
- an-worker1083
- an-worker1085
- an-worker1089
- an-worker1090
- an-worker1093
- an-worker1094
We believe that these servers (and their RAID controller batteries) are out of warranty, but it would be good to double-check.
The nature of the failure is that each battery's total charge capacity has dropped to a few percent of its original maximum.
As a result, the Icinga checks flap frequently: the batteries charge to just above the threshold at which the WriteBack cache policy can be applied, and then drop below it again. All of these flapping checks generate multiple emails, IRC pings, and Alertmanager alerts.
This is the Critical condition:
This is the OK condition:
Each controller is configured to switch to the WriteThrough cache policy whenever the battery health drops below this threshold, which reduces the write performance of every drive behind it (12 drives per host).
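The flapping can be illustrated with a minimal sketch. The threshold value and the charge readings below are hypothetical (the ticket does not state the real figures), and `cache_policy` and `count_policy_switches` are illustrative helpers, not part of any controller tooling. The point is only that a reading hovering around a single threshold changes policy on almost every sample, whereas a healthy battery never does.

```python
# Hypothetical threshold: the real value used by the controller is not given here.
THRESHOLD = 50  # assumed charge level below which WriteBack is not allowed


def cache_policy(charge, threshold=THRESHOLD):
    """Return the cache policy a controller would apply at this charge level."""
    return "WriteBack" if charge >= threshold else "WriteThrough"


def count_policy_switches(readings, threshold=THRESHOLD):
    """Count how often the policy changes between consecutive readings."""
    policies = [cache_policy(charge, threshold) for charge in readings]
    return sum(1 for prev, cur in zip(policies, policies[1:]) if prev != cur)


# Made-up readings: a degraded battery hovering around the threshold,
# and a healthy battery that stays well above it.
degraded = [52, 48, 51, 49, 53, 47, 52, 48]
healthy = [95, 94, 96, 95, 93, 95, 96, 94]

print("degraded battery switches:", count_policy_switches(degraded))
print("healthy battery switches:", count_policy_switches(healthy))
```

Each switch in this sketch corresponds to one state change of the monitoring check, which is why a single degraded battery can produce a steady stream of notifications.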
So the effect of the problem is a cumulative lowering of the performance of the whole Hadoop cluster. We have not correlated any data inconsistencies or cluster write failures with the cards switching policies, but we are actively checking for this kind of behaviour. There may be blips in the throughput to and from the disks when the RAID controllers switch policies that we just haven't observed yet.
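To give a rough sense of scale, the sketch below counts the drives that sit behind an affected controller. It assumes that every host listed in this ticket has the same 12-drive layout and that all of those drives are behind the controller with the failing battery; neither assumption has been verified per host.

```python
# Hosts listed in this ticket as having a failing RAID controller battery.
affected_hosts = [
    "analytics1068",
    "an-worker1079",
    "an-worker1083",
    "an-worker1085",
    "an-worker1089",
    "an-worker1090",
    "an-worker1093",
    "an-worker1094",
]

# Assumption: 12 drives per host, all behind the affected controller.
DRIVES_PER_HOST = 12

affected_drives = len(affected_hosts) * DRIVES_PER_HOST
print(f"{len(affected_hosts)} hosts, {affected_drives} drives affected")
```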
The purpose of this ticket is to investigate a long-term fix for this problem. There are several options, with different implications for different teams.
- Buy a stock of replacement RAID controller batteries and replace each one when it reaches this failure point.
- Change the RAID controller policy permanently (and the monitoring check) when each host's battery reaches this failure point.
- Change the alerting mechanism so that this failure is effectively muted across the Hadoop worker fleet.
Option 1 (replacing the batteries) is the only one where we do not compromise the performance of the cluster, but it would involve considerably more work for the DC-Ops team.
Option 2 (changing the policy permanently) would reduce the likelihood that frequent cache policy switches on the controller cards affect cluster stability.
Option 3 (muting the alerts) is probably the least work for all concerned, but it compromises cluster performance.