
Multiple RAID battery failures on hadoop worker hosts
Closed, ResolvedPublic

Description

We are currently experiencing RAID battery failures on multiple hadoop worker hosts.

  • analytics1068
  • an-worker1079
  • an-worker1083
  • an-worker1085
  • an-worker1089
  • an-worker1090
  • an-worker1093
  • an-worker1094

We believe that these servers (and their RAID controller batteries) are out of warranty, but it would be good to double-check.

The nature of the failure is that the batteries' total charge capacity has dropped to a few percent of its original maximum.

The behaviour we see as a result is that the Icinga checks flap frequently, as the batteries charge to just above and then drop back below the threshold at which the WriteBack cache policy can be applied. Multiple emails, IRC pings, and Alertmanager alerts are generated as a result of all of these flapping checks.

This is the Critical condition:

[screenshot: Icinga check in the CRITICAL state]

This is the OK condition:

[screenshot: Icinga check in the OK state]

Each controller is configured to switch to the WriteThrough cache policy whenever the battery health drops below this threshold, which reduces the write performance of each drive (12 drives per host).
The effect of the problem is therefore a cumulative lowering of the performance of the whole Hadoop cluster. We have not correlated any data inconsistencies or cluster write failures with the cards' switching policies, but we are actively checking for that kind of behaviour. There may be blips in the throughput to/from the disks when the RAID controllers switch policies that we simply haven't observed yet.
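For reference, the sketch below shows roughly how the battery charge and the current cache policy could be read on one of these hosts. It is illustrative only: it assumes a MegaCli-style utility (the binary name, flags, and output field names vary by controller and tool version, and the PERC H730 controllers may need perccli instead), and the 15% value is a stand-in for whatever threshold the production Icinga check actually uses.

```
#!/usr/bin/env python3
"""Illustrative sketch: read BBU relative charge and cache policy via a MegaCli-style tool.

Assumptions (not taken from this task): the binary is called 'megacli', the BBU
status output contains a 'Relative State of Charge' field, and 15% is a stand-in
for the real Icinga threshold.
"""
import re
import subprocess

THRESHOLD_PCT = 15  # illustrative; not the exact threshold used by the production check


def megacli(args):
    """Run a MegaCli subcommand and return its stdout as text."""
    return subprocess.run(
        ["megacli"] + args.split(), capture_output=True, text=True, check=True
    ).stdout


def relative_charge():
    """Parse the battery's relative state of charge (percent) from the BBU status."""
    out = megacli("-AdpBbuCmd -GetBbuStatus -aALL")
    match = re.search(r"Relative State of Charge:\s*(\d+)\s*%", out)
    if match is None:
        raise RuntimeError("no 'Relative State of Charge' field in BBU output")
    return int(match.group(1))


def cache_policy_lines():
    """Return the cache policy line for each logical drive on each adapter."""
    out = megacli("-LDGetProp -Cache -LAll -aAll")
    return [line.strip() for line in out.splitlines() if "Cache Policy" in line]


if __name__ == "__main__":
    charge = relative_charge()
    state = "OK" if charge >= THRESHOLD_PCT else "CRITICAL"
    print(f"BBU relative charge: {charge}% -> {state}")
    for line in cache_policy_lines():
        print(line)
```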

The purpose of this ticket is to investigate a long-term fix for this problem. There are several options, with different implications for different teams.

  1. Buy a stock of replacement RAID controller batteries and replace each one as it reaches this failure point.
  2. Change the RAID controller policy permanently (and the monitoring check) when each host's battery reaches this failure point.
  3. Change the alerting mechanism so that this failure is effectively muted across the Hadoop worker fleet.

Option 1 is the only one where we do not compromise the performance of the cluster, but it would involve considerably more work for the DC-Ops team.
Option 2 would reduce the likelihood that frequent cache policy switching on the controller cards affects cluster stability, but would permanently reduce write performance.
Option 3 is probably the least work for all concerned, but is a compromise on cluster performance.
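To make option 2 concrete: it would amount to pinning every logical drive to WriteThrough and then adjusting the monitoring check to treat that as the expected state. A minimal sketch of the policy change is below; it assumes MegaCli-style syntax and an illustrative binary name (megacli), so the exact commands on the PERC H730 controllers (e.g. via perccli) may differ:

```
# Sketch of option 2: pin all logical drives to the WriteThrough (WT) cache
# policy so it no longer flips when the battery charge crosses the threshold.
# Assumes a MegaCli-style utility; binary name and flags vary by tool/version.
import subprocess


def force_writethrough():
    """Set WriteThrough on every logical drive on every adapter."""
    subprocess.run(["megacli", "-LDSetProp", "WT", "-LAll", "-aAll"], check=True)


def current_policies():
    """Read back the cache policy lines so the change can be verified."""
    result = subprocess.run(
        ["megacli", "-LDGetProp", "-Cache", "-LAll", "-aAll"],
        capture_output=True, text=True, check=True,
    )
    return "\n".join(
        line.strip() for line in result.stdout.splitlines() if "Cache Policy" in line
    )


if __name__ == "__main__":
    force_writethrough()
    print(current_policies())
```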

Event Timeline

BTullis renamed this task from RAID battery alert in an-worker1085 to Multiple RAID battery failures on hadoop worker hosts. Nov 16 2022, 12:51 PM
BTullis added a project: DC-Ops.
BTullis updated the task description. (Show Details)
BTullis added a subscriber: RobH.

Hi @RobH - It looks like all seven of the currently affected an-worker nodes were purchased at the same time under this ticket {T204177} and had a 3-year warranty. Is that correct?
So this seems to be an end-of-life issue for the RAID batteries, where they all fail at roughly the same time. Do you happen to know when these particular hosts would be scheduled for decommissioning?

What's your feeling on whether it's feasible to buy a number of replacement batteries for these controllers?
On the surface of it, this would seem like the best option from our perspective, but it depends on the cost and how your team feels about sourcing the batteries and switching them out. Thanks.

One more server from the same batch has now started flapping, as of today. I have added an-worker1094 to the list.

So all of these hosts are out of warranty (3 years), but we run hardware for 5 years. Typically most high-value items (RAM, CPU) fail within warranty, but we do sometimes see RAID batteries fail like this at around 4 years, when there is still a year of use left.

Usually replacing the battery isn't too expensive, but I cannot put pricing in this public task. I'll create a procurement sub-task to price out 16 or so RAID batteries (there are 23 hosts on the order, but not all have failed, so we can order more later or scale up this order once we see the pricing).

RobH mentioned this in Unknown Object (Task). Nov 17 2022, 5:41 PM
RobH added a subtask: Unknown Object (Task).

@BTullis: I've gone ahead and requested a quotation for replacement batteries. In the future, be aware that we have a Hardware Failure Form linked from the DC Ops landing page that includes directions such as which projects to tag. I can see you created this initial task on Sept 27th, but as it did not have ops-eqiad or any specific site tag listed, it was not picked up and triaged by DC-Ops. (Inclusion of DC Ops is not enough to get review, as there are just too many sites included under that project.)

If a hardware failure isn't tagged with the site of the servers in question, it won't be picked up for processing until someone is directly tagged in (as you did with me today).

T323301 tracks the ordering of 23 new RAID controller batteries. While only 8 have failed so far, there are 23 hosts in total from this 2018 order, as well as an eqiad total of 56 R730s with H730 RAID controllers, so having some spares as more fail won't hurt!

@BTullis: Once you see the sub-task T323301 marked as received (ETA Nov 29th if ordered today), we can then use this task to coordinate the downtime required to replace the RAID controller batteries.

John,

When the shipment of replacement batteries arrives, please coordinate with @BTullis to schedule individual host downtimes for RAID battery replacements.

Thanks @RobH for all the pointers and rapid action. I'll try to be more efficient about this kind of thing in future.

@BTullis we just received the batteries. When would work best for you? I would like to do them this week if possible.

Great! Are you onsite today? What if I shut down all seven of the affected an-worker1* hosts now? Would that be convenient for you? You can simply power each of them back up once its battery has been replaced and it will rejoin the cluster.

I could do them in batches as well, if you'd rather. We can leave analytics1068, because that's about to be decommed anyway.

Yeah, I am on site right now. Let me know when they are ready for me.

Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Nov 28 2022, 4:08 PM

Icinga downtime and Alertmanager silence (ID=c74eeb70-b29b-4aff-94c9-af5dbbe99cbd) set by btullis@cumin1001 for 6:00:00 on 6 host(s) and their services with reason: replacing RAID controller battery

an-worker[1079,1083,1085,1090,1093-1094].eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=e8e1fd16-0d7d-47b2-8304-a9cb280e0cc5) set by btullis@cumin1001 for 6:00:00 on 1 host(s) and their services with reason: replacing RAID controller battery

an-worker1089.eqiad.wmnet

Thanks @Jclark-ctr - All hosts are shut down and ready for replacement.

Feel free to replace the batteries and restart them in any order. Downtime is set for 6 hours.

Finished replacing the RAID battery on:

  • an-worker1079
  • an-worker1083
  • an-worker1085
  • an-worker1089
  • an-worker1090
  • an-worker1093
  • an-worker1094