Replace RAID controller battery on an-worker1086
Closed, Resolved, Public

Description

We have experienced another RAID controller battery failure from a batch of hosts that have all been failing at similar times.

This time it is an-worker1086.

Please would you replace the RAID controller battery, when convenient?

The server is still functioning, but with reduced performance, as the controller cache is operating in WriteThrough mode.
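For readers unfamiliar with the term, here is a rough toy model of why WriteThrough mode costs performance. RAID controllers typically fall back from write-back to write-through caching when the battery fails, because cached writes could otherwise be lost on power failure. The latency figures and the function name are hypothetical, chosen only for illustration; they are not measurements from this host.

```python
# Toy model only: these latencies are made-up illustrative numbers,
# not measurements from an-worker1086 or its RAID controller.
DISK_WRITE_MS = 5.0    # hypothetical cost of committing one write to disk
CACHE_WRITE_MS = 0.05  # hypothetical cost of accepting one write into cache


def acknowledged_latency_ms(num_writes: int, write_back: bool) -> float:
    """Return the total time the host waits for write acknowledgements."""
    if write_back:
        # Write-back: a write is acknowledged once it is in the cache;
        # flushing to disk happens later, which needs a healthy battery
        # to protect the cached data across a power loss.
        return num_writes * CACHE_WRITE_MS
    # Write-through: a write is acknowledged only after it reaches disk.
    return num_writes * (CACHE_WRITE_MS + DISK_WRITE_MS)


write_through_ms = acknowledged_latency_ms(1000, write_back=False)
write_back_ms = acknowledged_latency_ms(1000, write_back=True)
print(f"write-through: {write_through_ms:.0f} ms, write-back: {write_back_ms:.0f} ms")
```

Under these assumed numbers the host waits far longer per write in write-through mode, which matches the reduced performance described above.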

image.png (316×720 px, 50 KB)

https://alerts.wikimedia.org/?q=%40state%3Dactive&q=team%3Dsre&q=instance%3Dan-worker1086
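The alert link above carries its filters as URL-encoded `q` parameters. As a small sketch, they can be decoded with the Python standard library to see which alerts it selects. The helper name `alert_filters` is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

ALERTS_URL = (
    "https://alerts.wikimedia.org/"
    "?q=%40state%3Dactive&q=team%3Dsre&q=instance%3Dan-worker1086"
)


def alert_filters(url: str) -> list:
    """Return the decoded values of every `q` parameter in the URL."""
    # parse_qs percent-decodes values and collects repeated keys into a list.
    return parse_qs(urlsplit(url).query).get("q", [])


filters = alert_filters(ALERTS_URL)
print(filters)  # ['@state=active', 'team=sre', 'instance=an-worker1086']
```

That is, the link shows active alerts for the SRE team on the instance an-worker1086.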

It can be shut down at any time, without any special procedure required. Similarly, once the battery has been replaced it will rejoin the Hadoop cluster on boot, without any special action.

Event Timeline

BTullis triaged this task as Medium priority. (Sep 25 2023, 12:22 PM)
BTullis created this task.
BTullis added a project: Data-Platform-SRE.
BTullis moved this task from Incoming to Blocked / Waiting on the Data-Platform-SRE board.

@BTullis I am on site today; otherwise we can do it tomorrow.

Icinga downtime and Alertmanager silence (ID=d9fc4cc1-c0d4-4a6d-83d0-127f1d08a401) set by btullis@cumin1001 for 3 days, 0:00:00 on 1 host(s) and their services with reason: Downtiming host for RAID controller battery replacement

an-worker1086.eqiad.wmnet

I have shut down the host, so it is ready for work. Feel free to boot it normally when finished. Thanks @Jclark-ctr.

@BTullis Replaced the RAID controller battery; the server is coming back up now.

This is the second battery replacement for this server; T326127 was the first. Although the battery did not physically look bad, I still replaced it. If the failure returns, we might need to look at other issues, possibly with the RAID card.

Just adding here, the server didn't boot successfully.
It first reported a foreign RAID config for one of the drives.
After this was imported it complained of disk errors.
{F37816784,width=60%}

Maybe I'll reopen this ticket to look at the disk issue.