hw troubleshooting: Replace RAID controller battery for an-worker1088.eqiad.wmnet
Closed, Resolved · Public · Request

Description

  • Provide FQDN of system: an-worker1088.eqiad.wmnet
  • If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
  • Put system into a failed state in Netbox.
  • Provide urgency of request, along with justification (redundancy, dependencies, etc.)
  • Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
  • Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

Hello. Please could you replace the RAID controller battery in an-worker1088, when it's convenient for you to do so?

I can shut down the machine for you ahead of time. I haven't marked the machine as failed in Netbox, because it's still running, just a bit more slowly than it should.

It's not super-urgent, but the server will keep operating with reduced performance until the battery is replaced. I've done some troubleshooting in T336077: MegaRAID error on an-worker1088 and tried upgrading the firmware of the RAID controller. However, the server is one of those in the batch identified in T318659: Multiple RAID battery failures on hadoop worker hosts, so it is not unexpected that the battery should fail around this time.

Event Timeline

Jclark-ctr added a subscriber: Cmjohnson.

@BTullis Is the server shut down so that I can replace the battery?

Icinga downtime and Alertmanager silence (ID=907e830f-d99b-4d2e-8752-2c13c8385200) set by btullis@cumin1001 for 4:00:00 on 1 host(s) and their services with reason: Replacing RAID controller battery

an-worker1088.eqiad.wmnet

@Jclark-ctr Many thanks. I have shut down the machine now. Please feel free to boot it once you've finished, as it should rejoin the cluster without any further involvement.

@BTullis The RAID battery has been replaced and the server is booting up now.