
hw troubleshooting: PERC1 battery failure for an-worker1148
Closed, Resolved · Public · Request

Authored By: RKemper, Dec 6 2025, 2:36 AM

Description

FQDN: an-worker1148.eqiad.wmnet

Netbox: Marked as failed in netbox https://netbox.wikimedia.org/dcim/devices/3661/

Priority: Medium (the Hadoop cluster can tolerate some node loss, but we need to perform rolling reboots across the fleet, and those reboots aren't as safe if one host is already unreliable)

Machine can be worked on at will.

This host failed to start back up after a reboot to apply a new Linux kernel; upon investigating via the IPMI, there was an error message saying that the PERC1 battery has failed:

image.png (1×2 px, 566 KB)

The host has subsequently come back up, but we'd like to have you guys take a look.

There are also intermittent messages about a failure of drive 1 in disk bay 1; perhaps some re-seating is needed?

Record:      32
Date/Time:   11/27/2025 08:17:11
Source:      system
Severity:    Critical
Description: Fault detected on drive 1 in disk drive bay 1.
-------------------------------------------------------------------------------
Record:      33
Date/Time:   11/27/2025 08:25:41
Source:      system
Severity:    Ok
Description: Drive 1 in disk drive bay 1 is operating normally.
-------------------------------------------------------------------------------
Record:      34
Date/Time:   11/27/2025 08:26:26
Source:      system
Severity:    Critical
Description: Fault detected on drive 1 in disk drive bay 1.
-------------------------------------------------------------------------------
Record:      35
Date/Time:   12/03/2025 07:08:54
Source:      system
Severity:    Ok
Description: Drive 1 in disk drive bay 1 is operating normally.
-------------------------------------------------------------------------------
Record:      36
Date/Time:   12/03/2025 08:39:30
Source:      system
Severity:    Critical
Description: The PERC1 battery has failed.
-------------------------------------------------------------------------------
Record:      37
Date/Time:   12/04/2025 15:09:36
Source:      system
Severity:    Critical
Description: Fault detected on drive 1 in disk drive bay 1.
-------------------------------------------------------------------------------
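The SEL entries above were read from the iDRAC; a similar view is available from the host side with ipmitool. This is a sketch with assumed sensor names; it only prints the commands by default (set DRY_RUN=0 to execute):

```shell
#!/bin/sh
# Sketch: reading the System Event Log from the host instead of the iDRAC.
# Prints the commands by default; set DRY_RUN=0 to actually execute them
# (requires ipmitool and the ipmi kernel modules to be loaded).
run() { [ "${DRY_RUN:-1}" = 1 ] && echo "+ $*" || "$@"; }

run sudo ipmitool sel elist            # full SEL with human-readable timestamps
run sudo ipmitool sdr type 'Battery'   # battery sensor state, if the BMC exposes it
```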

Event Timeline

In the past I've had to assign the ticket to the relevant person, but I don't see those instructions in the template, so hopefully I didn't mess up by not putting an assignee :)

@RKemper This server is out of warranty. I believe we might have a spare battery in stock. Is there any chance we can schedule downtime to open it up and possibly replace it, or at least verify whether the one I have is correct?


Yes, absolutely. What time block are you thinking of? We'll just need to downtime the host so it doesn't make noise, but these hosts aren't in pybal, so there's no need to depool or anything.

I can do tomorrow between 3pm and 6pm EST, @RKemper.


Does the same time on Wednesday work?

@RKemper I am usually here early most mornings. What day next week would work best for you? Is there a chance you could downtime it the day before, so I can address it in the morning? Or should I maybe coordinate with @BTullis?


@Jclark-ctr now that I think about it, the only thing that needs to be run is the downtime cookbook, so it should be good to go whenever, provided you're able to run the sre.downtime cookbook on an-worker1148. There's no depooling or anything needed.

@RKemper I do not have access to run downtime

Mentioned in SAL (#wikimedia-operations) [2025-12-16T19:23:18Z] <ryankemper@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on an-worker1148.eqiad.wmnet with reason: T411919


Ah, didn't realize. Okay, I put a downtime on an-worker1148 for the next 3 days, so you're clear to go ahead whenever's convenient (I'll manually lift when procedure is done).
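For reference, the downtime recorded in the SAL entry above corresponds to a cookbook invocation roughly like the following. The flag names here are assumptions; check `cookbook sre.hosts.downtime --help` on a cumin host for the real syntax:

```shell
#!/bin/sh
# Sketch of the sre.hosts.downtime run from the SAL entry; flags are assumed.
# Prints the command by default; set DRY_RUN=0 on a cumin host to execute.
run() { [ "${DRY_RUN:-1}" = 1 ] && echo "+ $*" || "$@"; }

run sudo cookbook sre.hosts.downtime --days 3 --reason "T411919" 'an-worker1148.eqiad.wmnet'
```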

And just a reminder: in addition to the PERC1 battery issue, there are also intermittent faults on drive 1 in disk drive bay 1, so a re-seat may be needed there.

@RKemper I replaced the battery and that error has cleared. It still shows an error for Drive Slot 1. I’ve opened an RMA for the drive since it was purchased within 1 year. Even though the box is out of warranty, we still have roughly two months remaining on the drive’s warranty.

@RKemper the replacement drive should arrive today. Is there any way you or @BTullis can prep this to be replaced?

Thanks @Jclark-ctr - You can replace this whenever is convenient.

Jclark-ctr subscribed.

Thanks @BTullis, the drive has been swapped.

Created the new VD.

btullis@an-worker1148:~$ sudo megacli -CfgLdAdd -r0 [32:1] -a0 
                                     
Adapter 0: Created VD 2

Adapter 0: Configured the Adapter!!

Exit Code: 0x00
btullis@an-worker1148:~$

Observed that this is presented to the OS as /dev/sdm

btullis@an-worker1148:~$ sudo dmesg -T|tail
[Fri Jan  9 15:38:32 2026] scsi 0:2:2:0: Attached scsi generic sg12 type 0
[Fri Jan  9 15:38:32 2026] sd 0:2:2:0: [sdm] 15626928128 512-byte logical blocks: (8.00 TB/7.28 TiB)
[Fri Jan  9 15:38:32 2026] sd 0:2:2:0: [sdm] 4096-byte physical blocks
[Fri Jan  9 15:38:32 2026] sd 0:2:2:0: [sdm] Write Protect is off
[Fri Jan  9 15:38:32 2026] sd 0:2:2:0: [sdm] Mode Sense: 1f 00 00 08
[Fri Jan  9 15:38:32 2026] sd 0:2:2:0: [sdm] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[Fri Jan  9 15:38:32 2026] sdm: detected capacity change from 0 to 8000987201536
[Fri Jan  9 15:38:32 2026] sdm: detected capacity change from 0 to 8000987201536
[Fri Jan  9 15:38:32 2026] sdm: detected capacity change from 0 to 8000987201536
[Fri Jan  9 15:38:32 2026] sd 0:2:2:0: [sdm] Attached SCSI disk

Configured the filesystem.

btullis@an-worker1148:~$ sudo parted /dev/sdm --script mklabel gpt
btullis@an-worker1148:~$ sudo parted /dev/sdm --script mkpart primary ext4 0% 100%
btullis@an-worker1148:~$ sudo mkfs.ext4 -L hadoop-c /dev/sdm1
mke2fs 1.46.2 (28-Feb-2021)
Creating filesystem with 1953365504 4k blocks and 244170752 inodes
Filesystem UUID: bdb9c85b-ee38-486b-be85-bdfb209eaa0a
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
	102400000, 214990848, 512000000, 550731776, 644972544, 1934917632

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done       

btullis@an-worker1148:~$ sudo tune2fs -m 0 /dev/sdm1
tune2fs 1.46.2 (28-Feb-2021)
Setting reserved blocks percentage to 0% (0 blocks)

Uncommented the entry for hadoop-c in /etc/fstab
Mounted the volumes and restarted the relevant services.

btullis@an-worker1148:~$ sudo vi /etc/fstab 
btullis@an-worker1148:~$ sudo mount -a
btullis@an-worker1148:~$ sudo systemctl restart hadoop-yarn-nodemanager hadoop-hdfs-datanode
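Pulling the steps above together, the whole bring-up of a replaced data disk can be expressed as one dry-run script. The enclosure:slot, device name, and label are the values used on this host, and the commands just restate what was already run above; it prints the commands by default (set DRY_RUN=0 to execute for real):

```shell
#!/bin/sh
# Sketch: full bring-up of a replaced Hadoop data disk, combining the steps
# above. Prints the commands by default; set DRY_RUN=0 to execute for real.
DEV=/dev/sdm        # block device the new VD appeared as (from dmesg)
LABEL=hadoop-c      # filesystem label matching the fstab entry
run() { [ "${DRY_RUN:-1}" = 1 ] && echo "+ $*" || "$@"; }

run sudo megacli -CfgLdAdd -r0 '[32:1]' -a0    # new RAID-0 VD on enclosure 32, slot 1
run sudo parted "$DEV" --script mklabel gpt
run sudo parted "$DEV" --script mkpart primary ext4 0% 100%
run sudo mkfs.ext4 -L "$LABEL" "${DEV}1"
run sudo tune2fs -m 0 "${DEV}1"                # no reserved blocks on data-only disks
run sudo mount -a                              # assumes the fstab entry is uncommented
run sudo systemctl restart hadoop-yarn-nodemanager hadoop-hdfs-datanode
```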

When rebooting this server (as part of routine maintenance), it got stuck, unable to boot. After power-cycling and looking at the com2 console, it was reporting an error about offline virtual drives with preserved cache.

After some fumbling around in the PERC H730P Mini BIOS Configuration Utility 5.18-0702 menu, I manually cleared the preserved cache and cleared the foreign config.

That took us from this previous state:

││Disk ID  Type     Capacity   State     DG   Vendor  ││Secured:               │
││00:01:00 SATA     7.276 TB   Online    01   ATA     ││N/A                    │
││00:01:01 SATA     7.276 TB   Foreign   -    ATA     ││Encryption Capable:    │
││00:01:02 SATA     7.276 TB   Online    02   ATA     ││No                     │
││00:01:03 SATA     7.276 TB   Online    03   ATA     ││Logical Sector Size:   │
││00:01:04 SATA     7.276 TB   Online    04   ATA     ││512 B                  │
││00:01:05 SATA     7.276 TB   Online    05   ATA     ││Physical Sector Size:  │
││00:01:06 SATA     7.276 TB   Online    06   ATA     ││4 KB                   │
││00:01:07 SATA     7.276 TB   Online    07   ATA     ││Product ID:            │
││00:01:08 SATA     7.276 TB   Online    08   ATA     ││ST8000NM023B-2TJ       │
││00:01:09 SATA     7.276 TB   Online    09   ATA     ││Revision:              │
││00:01:10 SATA     7.276 TB   Online    10   ATA     ││LA0C                   │
││00:01:11 SATA     7.276 TB   Online    11   ATA     ││Disk Write Cache:      │
││00:01:12 SSD-SATA 446.625 GB Online    00   ATA     ││Disable                │
││00:01:13 SSD-SATA 446.625 GB Online    00   ATA     ││S.M.A.R.T state:       │
││                                                    ││No Error               │
│└────────────────────────────────────────────────────┘│Power State: ON

To a state in which the Foreign disk was instead listed as Ready.

That got the server booting again. Still looking into follow-up steps to get fully back to normal; perhaps I need to re-run Ben's megacli command.

(Working with @bking) We provisioned a virtual disk for the missing drive via the iDRAC web UI. Then we entered the rescue shell and commented out the hadoop-c /etc/fstab entry so the host could boot fully.

The block device ordering / labeling was still very wonky. We basically saw this:

ryankemper@an-worker1148:~$ lsblk
NAME                               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                                  8:0    0 446.6G  0 disk 
├─sda1                               8:1    0   953M  0 part 
├─sda2                               8:2    0     1K  0 part 
└─sda5                               8:5    0 445.7G  0 part 
  ├─an--worker1148--vg-swap        254:0    0   9.3G  0 lvm  [SWAP]
  ├─an--worker1148--vg-root        254:1    0  55.9G  0 lvm  /
  └─an--worker1148--vg-journalnode 254:2    0    10G  0 lvm  /var/lib/hadoop/journal
sdb                                  8:16   0   7.3T  0 disk 
└─sdb1                               8:17   0   7.3T  0 part /var/lib/hadoop/data/b
sdc                                  8:32   0   7.3T  0 disk 
sdd                                  8:48   0   7.3T  0 disk 
└─sdd1                               8:49   0   7.3T  0 part /var/lib/hadoop/data/d
sde                                  8:64   0   7.3T  0 disk 
└─sde1                               8:65   0   7.3T  0 part /var/lib/hadoop/data/e
sdf                                  8:80   0   7.3T  0 disk 
└─sdf1                               8:81   0   7.3T  0 part /var/lib/hadoop/data/f
sdg                                  8:96   0   7.3T  0 disk 
└─sdg1                               8:97   0   7.3T  0 part /var/lib/hadoop/data/g
sdh                                  8:112  0   7.3T  0 disk 
└─sdh1                               8:113  0   7.3T  0 part /var/lib/hadoop/data/h
sdi                                  8:128  0   7.3T  0 disk 
└─sdi1                               8:129  0   7.3T  0 part /var/lib/hadoop/data/k
sdj                                  8:144  0   7.3T  0 disk 
└─sdj1                               8:145  0   7.3T  0 part /var/lib/hadoop/data/j
sdk                                  8:160  0   7.3T  0 disk 
└─sdk1                               8:161  0   7.3T  0 part /var/lib/hadoop/data/m
sdl                                  8:176  0   7.3T  0 disk 
└─sdl1                               8:177  0   7.3T  0 part /var/lib/hadoop/data/i
sdm                                  8:192  0   7.3T  0 disk 
└─sdm1                               8:193  0   7.3T  0 part /var/lib/hadoop/data/l

So there wasn't an sdc1. We tried to make one like so:

ryankemper@an-worker1148:~$ sudo parted /dev/sdc --script mklabel gpt
ryankemper@an-worker1148:~$ sudo parted /dev/sdc --script mkpart primary ext4 0% 100%
ryankemper@an-worker1148:~$ sudo mkfs.ext4 -L hadoop-c /dev/sdc1
mke2fs 1.46.2 (28-Feb-2021)
Creating filesystem with 1953365504 4k blocks and 244170752 inodes
Filesystem UUID: 38b4f412-3ab0-4518-b6f1-176d200138c8
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
	102400000, 214990848, 512000000, 550731776, 644972544, 1934917632

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done

But this might have mucked things up more, since /dev/sdc disappeared completely:

NAME                               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT                                                                                                                                                              
sda                                  8:0    0 446.6G  0 disk                                                                                                                                                                         
├─sda1                               8:1    0   953M  0 part                                                                                                                                                                         
├─sda2                               8:2    0     1K  0 part                                                                                                                                                                         
└─sda5                               8:5    0 445.7G  0 part                                                                                                                                                                         
  ├─an--worker1148--vg-swap        254:0    0   9.3G  0 lvm  [SWAP]                                                                                                                                                                  
  ├─an--worker1148--vg-root        254:1    0  55.9G  0 lvm  /                                                                                                                                                                       
  └─an--worker1148--vg-journalnode 254:2    0    10G  0 lvm  /var/lib/hadoop/journal                                                                                                                                                 
sdb                                  8:16   0   7.3T  0 disk                                                                                                                                                                         
└─sdb1                               8:17   0   7.3T  0 part /var/lib/hadoop/data/b                                                                                                                                                  
sdd                                  8:48   0   7.3T  0 disk                                                                                                                                                                         
└─sdd1                               8:49   0   7.3T  0 part /var/lib/hadoop/data/d                                                                                                                                                  
sde                                  8:64   0   7.3T  0 disk                                                                                                                                                                         
└─sde1                               8:65   0   7.3T  0 part /var/lib/hadoop/data/e                                                                                                                                                  
sdf                                  8:80   0   7.3T  0 disk                                                                                                                                                                         
└─sdf1                               8:81   0   7.3T  0 part /var/lib/hadoop/data/f                                                                                                                                                  
sdg                                  8:96   0   7.3T  0 disk                                                                                                                                                                         
└─sdg1                               8:97   0   7.3T  0 part /var/lib/hadoop/data/g                                                                                                                                                  
sdh                                  8:112  0   7.3T  0 disk                                                                                                                                                                         
└─sdh1                               8:113  0   7.3T  0 part /var/lib/hadoop/data/h                                                                                                                                                  
sdi                                  8:128  0   7.3T  0 disk                                                                                                                                                                         
└─sdi1                               8:129  0   7.3T  0 part /var/lib/hadoop/data/k                                                                                                                                                  
sdj                                  8:144  0   7.3T  0 disk                                                                                                                                                                         
└─sdj1                               8:145  0   7.3T  0 part /var/lib/hadoop/data/j                                                                                                                                                  
sdk                                  8:160  0   7.3T  0 disk                                                                                                                                                                         
└─sdk1                               8:161  0   7.3T  0 part /var/lib/hadoop/data/m                                                                                                                                                  
sdl                                  8:176  0   7.3T  0 disk                                                                                                                                                                         
└─sdl1                               8:177  0   7.3T  0 part /var/lib/hadoop/data/i                                                                                                                                                  
sdm                                  8:192  0   7.3T  0 disk                                                                                                                                                                         
└─sdm1                               8:193  0   7.3T  0 part /var/lib/hadoop/data/l

and dmesg is showing stuff like:

[   19.108881] bnxt_en 0000:18:00.0 eno1np0: FEC autoneg off encoding: None
[   20.029938] Process accounting resumed
[ 1091.036006]  sdc:
[ 1099.967749]  sdc: sdc1
[ 1172.173786] megaraid_sas 0000:3b:00.0: scanning for scsi0...
[ 1172.174064] sd 0:2:2:0: SCSI device is removed
[ 1172.273429] megaraid_sas 0000:3b:00.0: 3633 (821664118s/0x0021/FATAL) - Controller cache pinned for missing or offline VD 02/c
[ 1172.273465] megaraid_sas 0000:3b:00.0: 3634 (821664118s/0x0001/FATAL) - VD 02/c is now OFFLINE

Current state: the host is booting, the hadoop-c entry is still commented out in fstab, and the rest of the state is kind of confusing.

bking reopened this task as In Progress.Wed, Jan 14, 2:27 PM
bking claimed this task.

Some more lines from dmesg:

[ 1172.174064] sd 0:2:2:0: SCSI device is removed
[ 1172.273429] megaraid_sas 0000:3b:00.0: 3633 (821664118s/0x0021/FATAL) - Controller cache pinned for missing or offline VD 02/c
[ 1172.273465] megaraid_sas 0000:3b:00.0: 3634 (821664118s/0x0001/FATAL) - VD 02/c is now OFFLINE
[ 1250.752497] megaraid_sas 0000:3b:00.0: 3651 (821664199s/0x0004/CRIT) - Enclosure PD 20(c None/p1) phy bad for slot 1

So either the vendor sent us a bad replacement drive, or (more likely) the slot 1 connection is dying. @BTullis, do you have a preference on what to do next? I'd probably just limp along with the broken drive slot, since the PERC is out of warranty.

Host an-worker1148.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting after fixing disks

We went through the process of:

  • Deleting a foreign config for VD 02
  • Deleting the preserved cache for VD 02
  • Creating a new raid 0 volume
  • Preparing the file system on the new drive
  • Updating the /etc/fstab to allow it to mount.
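For next time, the controller-BIOS steps above also have megacli equivalents that can be run from the OS. The exact syntax here is from memory, so cross-check with `megacli -h` before running; the script prints the commands by default (set DRY_RUN=0 to execute):

```shell
#!/bin/sh
# Sketch: CLI equivalents of the controller-BIOS recovery steps above.
# Syntax assumed from memory; prints commands by default (DRY_RUN=0 to run).
run() { [ "${DRY_RUN:-1}" = 1 ] && echo "+ $*" || "$@"; }

run sudo megacli -CfgForeign -Scan -a0              # list foreign configs, if any
run sudo megacli -CfgForeign -Clear -a0             # delete the foreign config
run sudo megacli -GetPreservedCacheList -a0         # shows pinned cache, e.g. for VD 02
run sudo megacli -DiscardPreservedCache -L2 -a0     # discard preserved cache for VD 02
run sudo megacli -CfgLdAdd -r0 '[32:1]' -a0         # recreate the RAID-0 VD on slot 1
```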

When rebooting the host, the RAID controller took ages to perform its scan, then returned with this:

image.png (573×809 px, 155 KB)

So it looks like it didn't retain the configuration again.

I'm going to try updating the firmware on the RAID controller, to see if this fixes the issue.

The RAID controller firmware is already the latest version.

image.png (625×1 px, 155 KB)

image.png (749×1 px, 227 KB)

I'm continuing to look at whether we can make the configuration stick. Maybe I will have to wipe it and set up all 12 data drives again.

This is back up and running with 12 data drives.

btullis@an-worker1148:~$ findmnt|grep /dev/sd
├─/var/lib/hadoop/data/d        /dev/sdd1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/f        /dev/sdf1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/b        /dev/sdb1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/i        /dev/sdi1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/h        /dev/sdh1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/g        /dev/sdg1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/l        /dev/sdl1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/m        /dev/sdj1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/j        /dev/sdk1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/k        /dev/sdm1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/e        /dev/sde1                                  ext4       rw,noatime
└─/var/lib/hadoop/data/c        /dev/sdc1                                  ext4       rw,noatime
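Note that the device letters no longer line up with the mountpoint letters (e.g. /var/lib/hadoop/data/m is on /dev/sdj1). That's harmless as long as /etc/fstab mounts by filesystem label rather than kernel device name, but it's worth being able to spot. Here's a small sketch that flags such mismatches from "mountpoint device" pairs, with sample rows taken from the output above:

```shell
#!/bin/sh
# Sketch: flag rows where the data-dir letter and the sdX device letter
# disagree. Harmless when fstab mounts by LABEL=, but useful to notice.
check() {
  awk '{
    mp = $1; dev = $2
    m = substr(mp, length(mp), 1)        # trailing letter of the mountpoint
    d = substr(dev, length(dev) - 1, 1)  # the X in /dev/sdX1
    if (m != d) print mp " is on " dev
  }'
}

printf '%s\n' \
  '/var/lib/hadoop/data/m /dev/sdj1' \
  '/var/lib/hadoop/data/c /dev/sdc1' | check
# prints: /var/lib/hadoop/data/m is on /dev/sdj1
```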

I'll resolve the ticket.