
hw troubleshooting: Raid battery stuck in recharging for cloudvirt1012.eqiad.wmnet
Closed, Resolved, Public

Description

  • Provide FQDN of system.
  • If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
  • Put system into a failed state in Netbox.
  • Provide urgency of request, along with justification (redundancy, dependencies, etc.).
  • Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help.)
  • Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

The RAID battery on cloudvirt1012.eqiad.wmnet has been stuck in the Recharging state. The machine remains in use and RAID health is still OK at the moment. Either the battery is failing or something else is wrong.

$ sudo hpssacli ctrl slot=0 show detail | egrep 'Cache|Battery'
   Cache Serial Number: PDNLH0BRH9Y3T0
   Wait for Cache Room: Disabled
   Cache Board Present: True
   Cache Status: Not Configured
   Cache Ratio: 100% Read / 0% Write
   Read Cache Size: 0 MB
   Write Cache Size: 0 MB
   Drive Write Cache: Disabled
   Total Cache Size: 2.0 GB
   Total Cache Memory Available: 1.8 GB
   No-Battery Write Cache: Disabled
   Cache Backup Power Source: Batteries
   Battery/Capacitor Count: 1
   Battery/Capacitor Status: Recharging
   Cache Module Temperature (C): 38
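
A single reading of "Recharging" does not by itself show a fault, since a battery can legitimately recharge for a while; the problem here is that the state never clears. The following is a minimal sketch, not an existing tool, of how output like the above could be parsed and checked over time. It works on text already captured from the command, so it does not call hpssacli itself. The function names, the sample list format, and the 24-hour limit are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Excerpt of the controller detail output shown above, captured as text.
SAMPLE_OUTPUT = """\
   Cache Status: Not Configured
   Cache Ratio: 100% Read / 0% Write
   Battery/Capacitor Count: 1
   Battery/Capacitor Status: Recharging
   Cache Module Temperature (C): 38
"""


def parse_detail(text):
    """Turn 'Key: Value' lines into a dict; lines without ': ' are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep:
            fields[key] = value.strip()
    return fields


def recharging_too_long(samples, limit):
    """Return True if the most recent run of 'Recharging' readings spans
    at least `limit`.

    `samples` is a hypothetical list of (datetime, status) tuples, ordered
    oldest first, e.g. collected by a periodic check.
    """
    if not samples or samples[-1][1] != "Recharging":
        return False
    start = samples[-1][0]
    # Walk backwards to find where the current Recharging run began.
    for when, status in reversed(samples):
        if status != "Recharging":
            break
        start = when
    return samples[-1][0] - start >= limit


status = parse_detail(SAMPLE_OUTPUT)["Battery/Capacitor Status"]

# Illustrative readings only; the 24-hour limit is an assumed threshold.
t0 = datetime(2021, 7, 14, 12, 0)
history = [
    (t0, "OK"),
    (t0 + timedelta(hours=1), "Recharging"),
    (t0 + timedelta(hours=30), status),
]
stuck = recharging_too_long(history, timedelta(hours=24))
```

With the illustrative history above, the Recharging run lasts 29 hours, so `stuck` is true under the assumed 24-hour limit. A sensible limit for real hardware would need to come from the vendor's documentation or from operational experience.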

Event Timeline

nskaggs renamed this task from Raid battery cloudvirt1012 to hw troubleshooting: Raid battery stuck in recharging for cloudvirt1012.eqiad.wmnet. Jul 15 2021, 9:45 PM
nskaggs assigned this task to Cmjohnson.
nskaggs added a project: ops-eqiad.
nskaggs updated the task description.

This host is out of warranty, but we've regularly seen HP RAID controller batteries fail after 3+ years and require replacement. We tend to buy a couple at a time, so there may be a spare at eqiad; if not, they can create an order task for one and we'll buy it.

Mentioned in SAL (#wikimedia-cloud) [2021-07-27T20:52:05Z] <andrewbogott> draining VMs off of cloudvirt1012 so we can replace the battery for T286748

Replaced the failed battery with one from purchase T245697.

Mentioned in SAL (#wikimedia-cloud) [2021-07-27T21:32:31Z] <andrewbogott> putting cloudvirt1012 back into service T286748