
hw troubleshooting: RAID BBU for an-worker1078
Closed, Resolved | Public | Request

Description

Please would you replace the RAID battery in an-worker1078.eqiad.wmnet?

This is yet another Hadoop worker with a failed RAID controller battery. I'd be grateful if you could change it please.

If you let me know when would be a convenient time for you to replace it, I can shut down the machine ahead of time.

  • Provide FQDN of system.
  • If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
  • Put system into a failed state in Netbox.
  • Provide urgency of request, along with justification (redundancy, dependencies, etc.).
  • Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help.)
  • Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

Event Timeline

This server is out of warranty. I am not sure whether we have any spares or a battery we can swap from a decom host. I'll update the task with more info after talking with @Jclark-ctr and Willy.

@Cmjohnson we have a few batteries. @BTullis if you can shut down the server, we can take care of it.

Icinga downtime and Alertmanager silence (ID=d79d8e43-f7d6-4d5b-b758-f7be36ad2914) set by btullis@cumin1001 for 1 day, 12:00:00 on 1 host(s) and their services with reason: Replacing RAID BBU

an-worker1078.eqiad.wmnet

Mentioned in SAL (#wikimedia-analytics) [2023-03-09T19:47:30Z] <btullis> shutting down an-worker1078 for RAID BBU replacement T331544

Thanks @Cmjohnson and @Jclark-ctr - I've shut down the machine and given it 36 hours of downtime.
Please feel free to boot it whenever the battery is replaced; it should rejoin the Hadoop cluster cleanly on boot, without any further interaction.
If it's not convenient to do it right now, also feel free to extend the downtime as you see fit. Cheers.
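
For reference, the "1 day, 12:00:00" in the automated downtime message above is the same 36 hours mentioned here. A minimal sketch of the arithmetic, assuming the message prints a Python `datetime.timedelta` (the format matches, but the tooling's internals are not confirmed in this task):

```python
from datetime import timedelta

# Assumption: the downtime tooling renders its duration as a Python timedelta.
# The 36-hour figure comes from the comment above.
downtime = timedelta(hours=36)

# String form of the duration, as it would appear in a log message.
rendered = str(downtime)

# Convert back to hours to confirm the two notations agree.
hours = downtime.total_seconds() / 3600

print(rendered)
print(hours)
```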

Replaced failed BBU. Server is booting.