hw troubleshooting: PERC1 battery failure for an-worker1148
Closed, Resolved / Public / Request

Assigned To
Authored By
RKemper
Dec 6 2025, 2:36 AM
Referenced Files
F74685241: Screenshot 2026-03-31 at 15.28.11.png
Mar 31 2026, 1:28 PM
F74684595: Screenshot 2026-03-31 at 15.21.26.png
Mar 31 2026, 1:21 PM
F74684596: Screenshot 2026-03-31 at 15.21.41.png
Mar 31 2026, 1:21 PM
F74676138: Screenshot 2026-03-31 at 13.56.13.png
Mar 31 2026, 11:57 AM
F74676065: Screenshot 2026-03-31 at 13.55.43.png
Mar 31 2026, 11:57 AM
F71530265: image.png
Jan 15 2026, 10:42 AM
F71530261: image.png
Jan 15 2026, 10:42 AM
F71526432: image.png
Jan 14 2026, 5:41 PM

Description

FQDN: an-worker1148.eqiad.wmnet

Netbox: Marked as failed in netbox https://netbox.wikimedia.org/dcim/devices/3661/

Priority: Medium (the hadoop cluster can handle some node loss, however we need to perform rolling reboots on the fleet and the reboots aren't as safe if we have one host already unreliable)

Machine can be worked on at will.

This host failed to start back up after a reboot to apply a new Linux kernel. Upon investigating via IPMI, we found an error message stating that the PERC1 battery had failed.

image.png (1×2 px, 566 KB)

The host has subsequently come back up, but we'd like to have you guys take a look.

There are also intermittent messages about a failure of drive 1 in disk drive bay 1; perhaps some re-seating is needed?

Record:      32
Date/Time:   11/27/2025 08:17:11
Source:      system
Severity:    Critical
Description: Fault detected on drive 1 in disk drive bay 1.
-------------------------------------------------------------------------------
Record:      33
Date/Time:   11/27/2025 08:25:41
Source:      system
Severity:    Ok
Description: Drive 1 in disk drive bay 1 is operating normally.
-------------------------------------------------------------------------------
Record:      34
Date/Time:   11/27/2025 08:26:26
Source:      system
Severity:    Critical
Description: Fault detected on drive 1 in disk drive bay 1.
-------------------------------------------------------------------------------
Record:      35
Date/Time:   12/03/2025 07:08:54
Source:      system
Severity:    Ok
Description: Drive 1 in disk drive bay 1 is operating normally.
-------------------------------------------------------------------------------
Record:      36
Date/Time:   12/03/2025 08:39:30
Source:      system
Severity:    Critical
Description: The PERC1 battery has failed.
-------------------------------------------------------------------------------
Record:      37
Date/Time:   12/04/2025 15:09:36
Source:      system
Severity:    Critical
Description: Fault detected on drive 1 in disk drive bay 1.
-------------------------------------------------------------------------------
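The alternating Critical/Ok records above are easier to reason about once grouped. The following is a minimal sketch, assuming the event log is available as plain text in exactly the layout pasted above; the function names `parse_sel` and `critical_counts` are hypothetical helpers, not part of any existing tool, and the sample holds only the last two records.

```python
import re
from collections import Counter

# Two records copied from the log above; a real dump would hold all of them.
SAMPLE_SEL = """\
Record:      36
Date/Time:   12/03/2025 08:39:30
Source:      system
Severity:    Critical
Description: The PERC1 battery has failed.
-------------------------------------------------------------------------------
Record:      37
Date/Time:   12/04/2025 15:09:36
Source:      system
Severity:    Critical
Description: Fault detected on drive 1 in disk drive bay 1.
-------------------------------------------------------------------------------
"""


def parse_sel(text):
    """Split a dump into one dict per record, keyed by the field name."""
    records = []
    # Records are separated by lines made only of dashes.
    for chunk in re.split(r"^-{5,}[ \t]*$", text, flags=re.MULTILINE):
        fields = {}
        for line in chunk.splitlines():
            # Split on the first colon only, so timestamps stay intact.
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields:
            records.append(fields)
    return records


def critical_counts(records):
    """Count Critical records per description text."""
    return Counter(
        r.get("Description", "") for r in records if r.get("Severity") == "Critical"
    )


records = parse_sel(SAMPLE_SEL)
for description, count in critical_counts(records).items():
    print(count, description)
```

Run against the full log, a count like this makes it obvious that the drive 1 fault recurs while the battery failure was logged once.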

Event Timeline

In the past I've had to assign the ticket to the relevant person, but I don't see those instructions in the template so hopefully I didn't mess up by not putting an assignee :)

@RKemper This server is out of warranty. I believe we might have a spare battery in stock. Is there any chance we can schedule downtime to open it up and possibly replace it, or at least verify whether the one I have is correct?

Yes absolutely, what time block are you thinking of? We'll just need to downtime the host so it doesn't make noise but these hosts aren't in pybal so there's no need to depool or anything.

@RKemper I could do tomorrow between 3pm and 6pm EST.

Does the same time on Wednesday work?

@RKemper I am usually here early most mornings. What day next week would work best for you for the downtime? Is there a chance you could downtime the host the day before, so that I can address it in the morning? Or should I coordinate with @BTullis instead?

@Jclark-ctr now that I think about it, the only thing that needs to be run is the downtime cookbook, so it should be good to go whenever, provided you're able to run the sre.downtime cookbook on an-worker1148. There's no depooling or anything needed.

@RKemper I do not have access to run the downtime cookbook.

Mentioned in SAL (#wikimedia-operations) [2025-12-16T19:23:18Z] <ryankemper@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on an-worker1148.eqiad.wmnet with reason: T411919

Ah, didn't realize. Okay, I put a downtime on an-worker1148 for the next 3 days, so you're clear to go ahead whenever's convenient (I'll manually lift when procedure is done).

And just a reminder that, in addition to the PERC1 battery issue, there are also intermittent faults on drive 1 in disk drive bay 1, so a re-seat may be needed there.

@RKemper I replaced the battery and that error has cleared. It still shows an error for Drive Slot 1. I’ve opened an RMA for the drive since it was purchased within 1 year. Even though the box is out of warranty, we still have roughly two months remaining on the drive’s warranty.

@RKemper the replacement drive should arrive today. Is there any way you or @BTullis can prep the host for the drive replacement?

Thanks @Jclark-ctr - You can replace this whenever is convenient.

Jclark-ctr subscribed.

Thanks @BTullis, the drive has been swapped.

Created the new VD.

btullis@an-worker1148:~$ sudo megacli -CfgLdAdd -r0 [32:1] -a0 
                                     
Adapter 0: Created VD 2

Adapter 0: Configured the Adapter!!

Exit Code: 0x00
btullis@an-worker1148:~$

Observed that this is presented to the OS as /dev/sdm

btullis@an-worker1148:~$ sudo dmesg -T|tail
[Fri Jan  9 15:38:32 2026] scsi 0:2:2:0: Attached scsi generic sg12 type 0
[Fri Jan  9 15:38:32 2026] sd 0:2:2:0: [sdm] 15626928128 512-byte logical blocks: (8.00 TB/7.28 TiB)
[Fri Jan  9 15:38:32 2026] sd 0:2:2:0: [sdm] 4096-byte physical blocks
[Fri Jan  9 15:38:32 2026] sd 0:2:2:0: [sdm] Write Protect is off
[Fri Jan  9 15:38:32 2026] sd 0:2:2:0: [sdm] Mode Sense: 1f 00 00 08
[Fri Jan  9 15:38:32 2026] sd 0:2:2:0: [sdm] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[Fri Jan  9 15:38:32 2026] sdm: detected capacity change from 0 to 8000987201536
[Fri Jan  9 15:38:32 2026] sdm: detected capacity change from 0 to 8000987201536
[Fri Jan  9 15:38:32 2026] sdm: detected capacity change from 0 to 8000987201536
[Fri Jan  9 15:38:32 2026] sd 0:2:2:0: [sdm] Attached SCSI disk
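As a sanity check on the kernel output above, the reported byte capacity follows directly from the logical block count and the 512-byte logical block size. This is just a worked arithmetic sketch; the constant and function names are illustrative.

```python
# Values copied from the "[sdm] ... 512-byte logical blocks" dmesg line above.
LOGICAL_BLOCKS = 15626928128
LOGICAL_BLOCK_SIZE = 512  # bytes


def capacity_bytes(blocks, block_size):
    """Total capacity is simply block count times block size."""
    return blocks * block_size


def to_tb(n_bytes):
    """Decimal terabytes (10**12 bytes), the unit drive vendors use."""
    return n_bytes / 10**12


def to_tib(n_bytes):
    """Binary tebibytes (2**40 bytes), the unit lsblk and df -h report."""
    return n_bytes / 2**40


size = capacity_bytes(LOGICAL_BLOCKS, LOGICAL_BLOCK_SIZE)
print(size)  # 8000987201536, matching "detected capacity change"
print(f"{to_tb(size):.2f} TB / {to_tib(size):.2f} TiB")  # 8.00 TB / 7.28 TiB
```

The same drive therefore shows up as "8 TB" on the label, 7.28 TiB in dmesg, and 7.3T in lsblk.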

Configured the filesystem.

btullis@an-worker1148:~$ sudo parted /dev/sdm --script mklabel gpt
btullis@an-worker1148:~$ sudo parted /dev/sdm --script mkpart primary ext4 0% 100%
btullis@an-worker1148:~$ sudo mkfs.ext4 -L hadoop-c /dev/sdm1
mke2fs 1.46.2 (28-Feb-2021)
Creating filesystem with 1953365504 4k blocks and 244170752 inodes
Filesystem UUID: bdb9c85b-ee38-486b-be85-bdfb209eaa0a
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
	102400000, 214990848, 512000000, 550731776, 644972544, 1934917632

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done       

btullis@an-worker1148:~$ sudo tune2fs -m 0 /dev/sdm1
tune2fs 1.46.2 (28-Feb-2021)
Setting reserved blocks percentage to 0% (0 blocks)

Uncommented the entry for hadoop-c in /etc/fstab
Mounted the volumes and restarted the relevant services.

btullis@an-worker1148:~$ sudo vi /etc/fstab 
btullis@an-worker1148:~$ sudo mount -a
btullis@an-worker1148:~$ sudo systemctl restart hadoop-yarn-nodemanager hadoop-hdfs-datanode
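The partition and filesystem steps above can be collected into a dry-run plan, which is handy for review before touching a disk. This is a minimal sketch that only builds and prints the same commands used in this comment; `provisioning_plan` and `render` are hypothetical names, and appending "1" to the device path assumes sdX-style device naming.

```python
import shlex


def provisioning_plan(device, label):
    """Return the argv lists used above to prepare one data disk.

    Nothing is executed here; the caller decides whether to run the plan.
    """
    partition = f"{device}1"  # assumption: sdX naming, so partition 1 is sdX1
    return [
        ["parted", device, "--script", "mklabel", "gpt"],
        ["parted", device, "--script", "mkpart", "primary", "ext4", "0%", "100%"],
        ["mkfs.ext4", "-L", label, partition],
        # -m 0 removes the root-reserved blocks, as in the tune2fs call above.
        ["tune2fs", "-m", "0", partition],
    ]


def render(plan):
    """Render each argv list as a shell-safe command line for review."""
    return [shlex.join(argv) for argv in plan]


for command in render(provisioning_plan("/dev/sdm", "hadoop-c")):
    print("sudo", command)
```

Because the device letter changed between boots later in this task, passing the device explicitly (rather than hard-coding it) is the safer design.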

When rebooting this server (as part of routine maintenance), it got stuck unable to boot. After powercycling and looking at console com2, it was getting an error for offline virtual drives with preserved cache.

After some fumbling around in the PERC H730P Mini BIOS Configuration Utility 5.18-0702 menu, I manually cleared the preserved cache and cleared the foreign config.

That took us from this previous state:

││Disk ID  Type     Capacity   State     DG   Vendor  ││Secured:               │
││00:01:00 SATA     7.276 TB   Online    01   ATA     ││N/A                    │
││00:01:01 SATA     7.276 TB   Foreign   -    ATA     ││Encryption Capable:    │
││00:01:02 SATA     7.276 TB   Online    02   ATA     ││No                     │
││00:01:03 SATA     7.276 TB   Online    03   ATA     ││Logical Sector Size:   │
││00:01:04 SATA     7.276 TB   Online    04   ATA     ││512 B                  │
││00:01:05 SATA     7.276 TB   Online    05   ATA     ││Physical Sector Size:  │
││00:01:06 SATA     7.276 TB   Online    06   ATA     ││4 KB                   │
││00:01:07 SATA     7.276 TB   Online    07   ATA     ││Product ID:            │
││00:01:08 SATA     7.276 TB   Online    08   ATA     ││ST8000NM023B-2TJ       │
││00:01:09 SATA     7.276 TB   Online    09   ATA     ││Revision:              │
││00:01:10 SATA     7.276 TB   Online    10   ATA     ││LA0C                   │
││00:01:11 SATA     7.276 TB   Online    11   ATA     ││Disk Write Cache:      │
││00:01:12 SSD-SATA 446.625 GB Online    00   ATA     ││Disable                │
││00:01:13 SSD-SATA 446.625 GB Online    00   ATA     ││S.M.A.R.T state:       │
││                                                    ││No Error               │
│└────────────────────────────────────────────────────┘│Power State:           │
│          ┌────────_                                  │ON

To one where the Foreign disk switched to being listed as Ready.

That got the server booting again. Still looking into follow-up steps to get fully back; perhaps I need to rerun Ben's megacli command.

(Working with @bking) We provisioned a virtual disk for the missing drive via the DRAC web UI. Then we entered the rescue shell and commented out the hadoop-c entry in /etc/fstab so that the host could boot fully.

The block device ordering / labeling was still very wonky. We basically saw this:

ryankemper@an-worker1148:~$ lsblk
NAME                               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                                  8:0    0 446.6G  0 disk 
├─sda1                               8:1    0   953M  0 part 
├─sda2                               8:2    0     1K  0 part 
└─sda5                               8:5    0 445.7G  0 part 
  ├─an--worker1148--vg-swap        254:0    0   9.3G  0 lvm  [SWAP]
  ├─an--worker1148--vg-root        254:1    0  55.9G  0 lvm  /
  └─an--worker1148--vg-journalnode 254:2    0    10G  0 lvm  /var/lib/hadoop/journal
sdb                                  8:16   0   7.3T  0 disk 
└─sdb1                               8:17   0   7.3T  0 part /var/lib/hadoop/data/b
sdc                                  8:32   0   7.3T  0 disk 
sdd                                  8:48   0   7.3T  0 disk 
└─sdd1                               8:49   0   7.3T  0 part /var/lib/hadoop/data/d
sde                                  8:64   0   7.3T  0 disk 
└─sde1                               8:65   0   7.3T  0 part /var/lib/hadoop/data/e
sdf                                  8:80   0   7.3T  0 disk 
└─sdf1                               8:81   0   7.3T  0 part /var/lib/hadoop/data/f
sdg                                  8:96   0   7.3T  0 disk 
└─sdg1                               8:97   0   7.3T  0 part /var/lib/hadoop/data/g
sdh                                  8:112  0   7.3T  0 disk 
└─sdh1                               8:113  0   7.3T  0 part /var/lib/hadoop/data/h
sdi                                  8:128  0   7.3T  0 disk 
└─sdi1                               8:129  0   7.3T  0 part /var/lib/hadoop/data/k
sdj                                  8:144  0   7.3T  0 disk 
└─sdj1                               8:145  0   7.3T  0 part /var/lib/hadoop/data/j
sdk                                  8:160  0   7.3T  0 disk 
└─sdk1                               8:161  0   7.3T  0 part /var/lib/hadoop/data/m
sdl                                  8:176  0   7.3T  0 disk 
└─sdl1                               8:177  0   7.3T  0 part /var/lib/hadoop/data/i
sdm                                  8:192  0   7.3T  0 disk 
└─sdm1                               8:193  0   7.3T  0 part /var/lib/hadoop/data/l

So there wasn't an sdc1. We tried to make one like so:

ryankemper@an-worker1148:~$ sudo parted /dev/sdc --script mklabel gpt
ryankemper@an-worker1148:~$ sudo parted /dev/sdc --script mkpart primary ext4 0% 100%
ryankemper@an-worker1148:~$ sudo mkfs.ext4 -L hadoop-c /dev/sdc1
mke2fs 1.46.2 (28-Feb-2021)
Creating filesystem with 1953365504 4k blocks and 244170752 inodes
Filesystem UUID: 38b4f412-3ab0-4518-b6f1-176d200138c8
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
	102400000, 214990848, 512000000, 550731776, 644972544, 1934917632

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done

But this might have mucked things up more, since /dev/sdc disappeared completely:

NAME                               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                                  8:0    0 446.6G  0 disk
├─sda1                               8:1    0   953M  0 part
├─sda2                               8:2    0     1K  0 part
└─sda5                               8:5    0 445.7G  0 part
  ├─an--worker1148--vg-swap        254:0    0   9.3G  0 lvm  [SWAP]
  ├─an--worker1148--vg-root        254:1    0  55.9G  0 lvm  /
  └─an--worker1148--vg-journalnode 254:2    0    10G  0 lvm  /var/lib/hadoop/journal
sdb                                  8:16   0   7.3T  0 disk
└─sdb1                               8:17   0   7.3T  0 part /var/lib/hadoop/data/b
sdd                                  8:48   0   7.3T  0 disk
└─sdd1                               8:49   0   7.3T  0 part /var/lib/hadoop/data/d
sde                                  8:64   0   7.3T  0 disk
└─sde1                               8:65   0   7.3T  0 part /var/lib/hadoop/data/e
sdf                                  8:80   0   7.3T  0 disk
└─sdf1                               8:81   0   7.3T  0 part /var/lib/hadoop/data/f
sdg                                  8:96   0   7.3T  0 disk
└─sdg1                               8:97   0   7.3T  0 part /var/lib/hadoop/data/g
sdh                                  8:112  0   7.3T  0 disk
└─sdh1                               8:113  0   7.3T  0 part /var/lib/hadoop/data/h
sdi                                  8:128  0   7.3T  0 disk
└─sdi1                               8:129  0   7.3T  0 part /var/lib/hadoop/data/k
sdj                                  8:144  0   7.3T  0 disk
└─sdj1                               8:145  0   7.3T  0 part /var/lib/hadoop/data/j
sdk                                  8:160  0   7.3T  0 disk
└─sdk1                               8:161  0   7.3T  0 part /var/lib/hadoop/data/m
sdl                                  8:176  0   7.3T  0 disk
└─sdl1                               8:177  0   7.3T  0 part /var/lib/hadoop/data/i
sdm                                  8:192  0   7.3T  0 disk
└─sdm1                               8:193  0   7.3T  0 part /var/lib/hadoop/data/l

and dmesg is showing stuff like:

[   19.108881] bnxt_en 0000:18:00.0 eno1np0: FEC autoneg off encoding: None
[   20.029938] Process accounting resumed
[ 1091.036006]  sdc:
[ 1099.967749]  sdc: sdc1
[ 1172.173786] megaraid_sas 0000:3b:00.0: scanning for scsi0...
[ 1172.174064] sd 0:2:2:0: SCSI device is removed
[ 1172.273429] megaraid_sas 0000:3b:00.0: 3633 (821664118s/0x0021/FATAL) - Controller cache pinned for missing or offline VD 02/c
[ 1172.273465] megaraid_sas 0000:3b:00.0: 3634 (821664118s/0x0001/FATAL) - VD 02/c is now OFFLINE

Current state: Host is booting, we've got the hadoop-c entry still commented out in fstab, and the rest of the state is kind of confusing
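To make the confusing state a little more tractable, the partitionless /dev/sdc in the first lsblk listing can be spotted mechanically. This is a rough sketch that parses the default lsblk text layout shown above; `disks_without_children` is a hypothetical helper and the sample is a trimmed copy of that listing.

```python
# Trimmed from the first lsblk listing in this comment.
SAMPLE_LSBLK = """\
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sdb      8:16   0   7.3T  0 disk
└─sdb1   8:17   0   7.3T  0 part /var/lib/hadoop/data/b
sdc      8:32   0   7.3T  0 disk
sdd      8:48   0   7.3T  0 disk
└─sdd1   8:49   0   7.3T  0 part /var/lib/hadoop/data/d
"""


def disks_without_children(lsblk_text):
    """Return names of top-level disks that have no partition or LVM child."""
    # Skip the header row and ignore blank lines.
    lines = [l for l in lsblk_text.splitlines()[1:] if l.strip()]
    bare = []
    for i, line in enumerate(lines):
        cols = line.split()
        # Top-level devices start in column 0; children start with tree glyphs.
        top_level = line[0].isalnum()
        if not (top_level and len(cols) >= 6 and cols[5] == "disk"):
            continue
        has_child = i + 1 < len(lines) and not lines[i + 1][0].isalnum()
        if not has_child:
            bare.append(cols[0])
    return bare


print(disks_without_children(SAMPLE_LSBLK))  # ['sdc']
```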

bking reopened this task as In Progress. Jan 14 2026, 2:27 PM
bking claimed this task.

Some more lines from dmesg:

[ 1172.174064] sd 0:2:2:0: SCSI device is removed
[ 1172.273429] megaraid_sas 0000:3b:00.0: 3633 (821664118s/0x0021/FATAL) - Controller cache pinned for missing or offline VD 02/c
[ 1172.273465] megaraid_sas 0000:3b:00.0: 3634 (821664118s/0x0001/FATAL) - VD 02/c is now OFFLINE
[ 1250.752497] megaraid_sas 0000:3b:00.0: 3651 (821664199s/0x0004/CRIT) - Enclosure PD 20(c None/p1) phy bad for slot 1
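When there are many of these controller events, filtering by severity helps. The sketch below pulls the sequence number, severity, and message out of megaraid_sas lines in the layout shown above. The regular expression is inferred from these few sample lines only, so treat the field names (`seq`, `code`, and so on) as assumptions rather than the driver's documented format.

```python
import re

SAMPLE_DMESG = """\
[ 1172.173786] megaraid_sas 0000:3b:00.0: scanning for scsi0...
[ 1172.273429] megaraid_sas 0000:3b:00.0: 3633 (821664118s/0x0021/FATAL) - Controller cache pinned for missing or offline VD 02/c
[ 1172.273465] megaraid_sas 0000:3b:00.0: 3634 (821664118s/0x0001/FATAL) - VD 02/c is now OFFLINE
[ 1250.752497] megaraid_sas 0000:3b:00.0: 3651 (821664199s/0x0004/CRIT) - Enclosure PD 20(c None/p1) phy bad for slot 1
"""

# Pattern inferred from the sample lines; non-event lines simply do not match.
EVENT_RE = re.compile(
    r"megaraid_sas \S+ (?P<seq>\d+) "
    r"\((?P<ts>\d+)s/(?P<code>0x[0-9a-fA-F]+)/(?P<severity>[A-Z]+)\) - (?P<message>.*)"
)


def controller_events(dmesg_text):
    """Return one dict per controller event line found in the dmesg text."""
    events = []
    for line in dmesg_text.splitlines():
        match = EVENT_RE.search(line)
        if match:
            event = match.groupdict()
            event["seq"] = int(event["seq"])
            events.append(event)
    return events


for event in controller_events(SAMPLE_DMESG):
    if event["severity"] in {"FATAL", "CRIT"}:
        print(event["seq"], event["severity"], event["message"])
```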

So either the vendor sent us a bad replacement drive, or (more likely) the slot 1 connection is dying. @BTullis, do you have a preference on what to do next? I'd probably just limp along with the broken drive slot since the PERC is out of warranty.

Host an-worker1148.eqiad.wmnet rebooted by btullis@cumin1003 with reason: Rebooting after fixing disks

We went through the process of:

  • Deleting a foreign config for VD 02
  • Deleting the preserved cache for VD 02
  • Creating a new raid 0 volume
  • Preparing the file system on the new drive
  • Updating the /etc/fstab to allow it to mount.

When rebooting the host, the RAID controller took ages to perform its scan, then returned with this:

image.png (573×809 px, 155 KB)

So it looks like it didn't retain the configuration again.

I'm going to try updating the firmware on the RAID controller, to see if this fixes the issue.

The RAID controller firmware is already the latest version.

image.png (625×1 px, 155 KB)

image.png (749×1 px, 227 KB)

I'm continuing to look at whether we can make the configuration stick. Maybe I will have to wipe it and set up all 12 data drives again.

This is back up and running with 12 data drives.

btullis@an-worker1148:~$ findmnt|grep /dev/sd
├─/var/lib/hadoop/data/d        /dev/sdd1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/f        /dev/sdf1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/b        /dev/sdb1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/i        /dev/sdi1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/h        /dev/sdh1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/g        /dev/sdg1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/l        /dev/sdl1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/m        /dev/sdj1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/j        /dev/sdk1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/k        /dev/sdm1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/e        /dev/sde1                                  ext4       rw,noatime
└─/var/lib/hadoop/data/c        /dev/sdc1                                  ext4       rw,noatime

I'll resolve the ticket.

Upon rebooting the host, we're back to the same issue:

F2  = System Setup
F10 = Lifecycle Controller
F11 = Boot Manager
F12 = PXE Boot
IPMI: Boot to

Initializing Serial ATA devices...


Broadcom NetXtreme Ethernet Boot Agent
Copyright (C) 2000-2020 Broadcom Corporation
All rights reserved.
Press Ctrl-S to enter Configuration Menu


Broadcom NetXtreme (NXE) Ethernet Boot Agent
Copyright (C) 2000-2020 Broadcom Limited
All rights reserved.
Press Ctrl-S to enter Configuration Menu

PowerEdge Expandable RAID Controller BIOS
Copyright(c) 2016 Avago Technologies
F/W could not sync up config/prop changes for some of the VD's/PD's
Press any key to continue, or 'C' to load the configuration utility.

Seems like the controller itself might be hosed?

@RKemper T414948 is a pending decom for a few 740xd an-workers. If you’re able to speed any of those along, we could swap RAID cards.

Great, I'll ping you here when one is ready. Looks like we're a few days away from being drained, so the latest we'd have a host available to swap would be next monday, but there might be some magic we can do to terminate one earlier; I'll talk with Ben or somebody.

Yes, I think that we should be able to expedite a decom of one of these hosts.
We're really being a bit belt-and-braces by keeping the servers running until there are zero under-replicated blocks, so I feel that we can go ahead.

I'll just check and let you know for sure, tomorrow, if that's OK with you.

Mentioned in SAL (#wikimedia-operations) [2026-02-10T19:56:27Z] <ryankemper@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1132,1148].eqiad.wmnet with reason: T411919

Alright, an-worker1132 has been shut down.

We should be good to perform the swap @Jclark-ctr


(Hadoop-specific) I've verified that there are no missing blocks by running the curl command below through an SSH tunnel (ssh -L 50470:localhost:50470 an-master1003.eqiad.wmnet), implying we're safe to proceed:

% curl -sk 'https://localhost:50470/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' | grep -i missing
    "MissingBlocks" : 0,
    "MissingReplOneBlocks" : 0
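The grep above works, but the same check can be made slightly stricter by parsing the JSON. This is a hedged sketch: it assumes the JMX response has the usual top-level "beans" list, and the sample document is a trimmed stand-in holding only the two fields shown above, not a real response. `missing_block_counts` and `safe_to_proceed` are hypothetical helper names.

```python
import json

# Trimmed stand-in for the JMX response; only the two fields grep printed.
SAMPLE_JMX = """
{"beans": [{"name": "Hadoop:service=NameNode,name=FSNamesystem",
            "MissingBlocks": 0,
            "MissingReplOneBlocks": 0}]}
"""


def missing_block_counts(jmx_json):
    """Collect every numeric bean attribute whose name mentions 'missing'."""
    counts = {}
    for bean in json.loads(jmx_json).get("beans", []):
        for key, value in bean.items():
            if "missing" in key.lower() and isinstance(value, int):
                counts[key] = value
    return counts


def safe_to_proceed(counts):
    """Safe only if the metrics were actually found and all of them are zero."""
    return bool(counts) and all(value == 0 for value in counts.values())


counts = missing_block_counts(SAMPLE_JMX)
print(counts, safe_to_proceed(counts))
```

Requiring a non-empty result guards against the case where the query returns no matching bean, which a bare grep would silently treat the same as "nothing missing".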

Icinga downtime and Alertmanager silence (ID=87f62daf-f17b-4abe-ba08-9d4650f912a3) set by btullis@cumin1003 for 6:00:00 on 1 host(s) and their services with reason: Replacing RAID card

an-worker1148.eqiad.wmnet

@RKemper I have swapped the RAID cards between an-worker1148 and an-worker1132, and both hosts are powered on. FYI, 1132 is still listed as active in Netbox, so I believe the decom cookbook still needs to be run for that device.

Mentioned in SAL (#wikimedia-operations) [2026-02-11T20:19:11Z] <ryankemper> T411919 Rebooting an-worker1148 to see if it comes back up properly after having switched the RAID card

Mentioned in SAL (#wikimedia-operations) [2026-02-11T20:48:30Z] <ryankemper> T411919 Rebooting an-worker1148 again. First reboot went great, but just double-checking, and also I'd removed a duplicate fstab entry so this will sanity check my change there

RAID card swap verified on an-worker1148. All checks pass:

  • 13/13 virtual drives Optimal, config persisted through reboot
  • 14/14 physical disks Online, zero media/predictive errors
  • DataNode running and registered with NameNode (Decommission Status: Normal, 79.40 TB configured)
  • Zero failed systemd units
  • Fixed duplicate hadoop-b entry in /etc/fstab (this wasn't causing problems but it was emitting a line to dmesg)

One cosmetic note: VD 2 (Slot 1) reports Bad Blocks Exist: Yes, but the backing drive has zero error counters. I'm fairly confident that this is stale metadata inherited from the old controller, but just wanted to mention it.


wrt 1132, I'll run the decom cookbook with that original decom ticket, so updates will be posted there

an-worker1148 was missing a /boot entry in its fstab (the MBR was still able to find the GRUB files on /dev/sda1, so the host was still bootable; it just wouldn't upgrade its kernel upon reboot like I'd expected). I added that back in, and the kernel is properly upgraded now.
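Both fstab problems found on this host (a duplicated entry and a missing /boot) are the kind of thing a small lint can catch. The sketch below is illustrative only: the sample fstab is hypothetical (the real file is not shown in this task), the LABEL= form is an assumption based on the labels set with mkfs.ext4 -L, and `parse_fstab` and `fstab_problems` are made-up helper names.

```python
# Hypothetical fstab content, shaped after the problems described in this task.
SAMPLE_FSTAB = """\
/dev/mapper/vg-root / ext4 errors=remount-ro 0 1
LABEL=hadoop-b /var/lib/hadoop/data/b ext4 defaults,noatime 0 2
LABEL=hadoop-b /var/lib/hadoop/data/b ext4 defaults,noatime 0 2
# LABEL=hadoop-c /var/lib/hadoop/data/c ext4 defaults,noatime 0 2
"""


def parse_fstab(text):
    """Return (source, mountpoint) pairs, skipping comments and blank lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 2:
            entries.append((fields[0], fields[1]))
    return entries


def fstab_problems(text, required=("/", "/boot")):
    """Return (duplicated mountpoints, required mountpoints that are absent)."""
    mountpoints = [mountpoint for _, mountpoint in parse_fstab(text)]
    duplicates = sorted({m for m in mountpoints if mountpoints.count(m) > 1})
    missing = [m for m in required if m not in mountpoints]
    return duplicates, missing


duplicates, missing = fstab_problems(SAMPLE_FSTAB)
print("duplicate mountpoints:", duplicates)
print("missing required mountpoints:", missing)
```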

This is all done. Good work everybody, this one was a bit of a doozy

@RKemper I can't seem to run Puppet on this host:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Number of datanode mountpoints (9) below threshold: 10, please check. (file: /srv/puppet_code/environments/production/modules/profile/manifests/hadoop/common.pp, line: 418, column: 9) on node an-worker1148.eqiad.wmnet

Indeed, I'm seeing only 9 hadoop data drives

brouberol@an-worker1148:~$ df -h | grep hadoop/data | wc -l
9
brouberol@an-worker1148:~$ df -h | grep hadoop/data
/dev/sde1                                   7.3T  4.3T  3.0T  60% /var/lib/hadoop/data/e
/dev/sdi1                                   7.3T  4.3T  3.0T  60% /var/lib/hadoop/data/g
/dev/sdg1                                   7.3T  4.3T  3.0T  60% /var/lib/hadoop/data/h
/dev/sdd1                                   7.3T  4.3T  3.0T  60% /var/lib/hadoop/data/f
/dev/sdj1                                   7.3T  4.3T  3.0T  60% /var/lib/hadoop/data/k
/dev/sdf1                                   7.3T  4.3T  3.0T  60% /var/lib/hadoop/data/i
/dev/sdh1                                   7.3T  4.3T  3.0T  60% /var/lib/hadoop/data/j
/dev/sdk1                                   7.3T  4.3T  3.0T  60% /var/lib/hadoop/data/l
/dev/sdl1                                   7.3T  4.3T  3.0T  60% /var/lib/hadoop/data/m
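Comparing the df output above against the 12 data mountpoints (b through m) seen in the earlier findmnt listing shows which ones are gone. A minimal sketch, with `mounted_letters` as a hypothetical helper and the threshold of 10 taken from the Puppet error message:

```python
# The nine df lines pasted above.
SAMPLE_DF = """\
/dev/sde1                                   7.3T  4.3T  3.0T  60% /var/lib/hadoop/data/e
/dev/sdi1                                   7.3T  4.3T  3.0T  60% /var/lib/hadoop/data/g
/dev/sdg1                                   7.3T  4.3T  3.0T  60% /var/lib/hadoop/data/h
/dev/sdd1                                   7.3T  4.3T  3.0T  60% /var/lib/hadoop/data/f
/dev/sdj1                                   7.3T  4.3T  3.0T  60% /var/lib/hadoop/data/k
/dev/sdf1                                   7.3T  4.3T  3.0T  60% /var/lib/hadoop/data/i
/dev/sdh1                                   7.3T  4.3T  3.0T  60% /var/lib/hadoop/data/j
/dev/sdk1                                   7.3T  4.3T  3.0T  60% /var/lib/hadoop/data/l
/dev/sdl1                                   7.3T  4.3T  3.0T  60% /var/lib/hadoop/data/m
"""

EXPECTED = set("bcdefghijklm")  # the 12 data mounts from the earlier findmnt output
PUPPET_THRESHOLD = 10           # from the "below threshold: 10" catalog error


def mounted_letters(df_text, prefix="/var/lib/hadoop/data/"):
    """Return the set of data-directory suffixes that are currently mounted."""
    letters = set()
    for line in df_text.splitlines():
        fields = line.split()
        # The mountpoint is the last column of df output.
        if fields and fields[-1].startswith(prefix):
            letters.add(fields[-1][len(prefix):])
    return letters


mounted = mounted_letters(SAMPLE_DF)
print("mounted:", len(mounted), "below threshold:", len(mounted) < PUPPET_THRESHOLD)
print("missing:", sorted(EXPECTED - mounted))  # ['b', 'c', 'd']
```

Note that the device letters and mountpoint letters do not line up (for example /dev/sdk1 is mounted at data/l), so the mountpoint column is the one to trust.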

Seems like /dev/sdk is having some issues:

brouberol@an-worker1148:~$ sudo dmesg | grep sdk
[    9.359370] sd 0:2:11:0: [sdk] 15626928128 512-byte logical blocks: (8.00 TB/7.28 TiB)
[    9.359372] sd 0:2:11:0: [sdk] 4096-byte physical blocks
[    9.359399] sd 0:2:11:0: [sdk] Write Protect is off
[    9.359401] sd 0:2:11:0: [sdk] Mode Sense: 1f 00 00 08
[    9.359469] sd 0:2:11:0: [sdk] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[    9.359545] sdk: detected capacity change from 0 to 8000987201536
[    9.752614] sdk: detected capacity change from 0 to 8000987201536
[    9.754638]  sdk: sdk1
[    9.815601] sdk: detected capacity change from 0 to 8000987201536
[    9.827790] sd 0:2:11:0: [sdk] Attached SCSI disk
[   18.420794] EXT4-fs (sdk1): mounted filesystem with ordered data mode. Opts: (null)
[699069.233415] sd 0:2:11:0: [sdk] tag#7 BRCM Debug mfi stat 0x2d, data len requested/completed 0x40000/0x0
[699069.233431] sd 0:2:11:0: [sdk] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[699069.233435] sd 0:2:11:0: [sdk] tag#7 Sense Key : Medium Error [current]
[699069.233439] sd 0:2:11:0: [sdk] tag#7 Add. Sense: No additional sense information
[699069.233445] sd 0:2:11:0: [sdk] tag#7 CDB: Read(16) 88 00 00 00 00 00 19 d0 ba 00 00 00 02 00 00 00
[699069.233450] blk_update_request: I/O error, dev sdk, sector 433109504 op 0x0:(READ) flags 0x80700 phys_seg 59 prio class 0
[699069.628615] sd 0:2:11:0: [sdk] tag#746 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.645265] sd 0:2:11:0: [sdk] tag#748 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.661232] sd 0:2:11:0: [sdk] tag#749 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.677270] sd 0:2:11:0: [sdk] tag#750 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.693266] sd 0:2:11:0: [sdk] tag#753 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.709346] sd 0:2:11:0: [sdk] tag#754 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.709367] sd 0:2:11:0: [sdk] tag#754 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[699069.709380] sd 0:2:11:0: [sdk] tag#754 Sense Key : Medium Error [current]
[699069.709393] sd 0:2:11:0: [sdk] tag#754 Add. Sense: No additional sense information
[699069.709407] sd 0:2:11:0: [sdk] tag#754 CDB: Read(16) 88 00 00 00 00 00 19 d0 bb 90 00 00 00 08 00 00
[699069.709422] blk_update_request: I/O error, dev sdk, sector 433109904 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[699069.720970] sd 0:2:11:0: [sdk] tag#755 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.737263] sd 0:2:11:0: [sdk] tag#756 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.753261] sd 0:2:11:0: [sdk] tag#757 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.769257] sd 0:2:11:0: [sdk] tag#758 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.785342] sd 0:2:11:0: [sdk] tag#759 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.801228] sd 0:2:11:0: [sdk] tag#760 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[699069.801249] sd 0:2:11:0: [sdk] tag#760 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[699069.801269] sd 0:2:11:0: [sdk] tag#760 Sense Key : Medium Error [current]
[699069.801275] sd 0:2:11:0: [sdk] tag#760 Add. Sense: No additional sense information
[699069.801280] sd 0:2:11:0: [sdk] tag#760 CDB: Read(16) 88 00 00 00 00 00 19 d0 bb 90 00 00 00 08 00 00
[699069.801285] blk_update_request: I/O error, dev sdk, sector 433109904 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
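For reference, the failing sectors can be cross-checked against the SCSI commands in the log. The following is a minimal sketch, not part of the original investigation: `decode_read16` is a hypothetical helper name, and the 512-byte logical block size is an assumption (it matches what the kernel reports for the sibling disk sdm later in this task). It decodes the READ(16) CDB bytes into the starting LBA and transfer length.

```python
def decode_read16(cdb_hex: str, block_size: int = 512):
    """Decode a SCSI READ(16) CDB given as space-separated hex bytes.

    block_size is an assumption (512-byte logical blocks).
    Returns (lba, block_count, byte_count).
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2..9: logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10..13: transfer length in blocks
    return lba, blocks, blocks * block_size


# CDBs copied from the tag#7 and tag#754 entries above.
print(decode_read16("88 00 00 00 00 00 19 d0 ba 00 00 00 02 00 00 00"))
# (433109504, 512, 262144)
print(decode_read16("88 00 00 00 00 00 19 d0 bb 90 00 00 00 08 00 00"))
# (433109904, 8, 4096)
```

The decoded LBAs match the sectors in the `blk_update_request` lines (433109504 and 433109904), and the byte counts match the requested lengths in the `BRCM Debug` lines (0x40000 and 0x1000), so all of these errors point at the same small region of sdk.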

All disks are reported healthy by SMART:

brouberol@an-worker1148:~$ sudo smart-data-dump --debug 2>&1 | grep healthy
...
# HELP device_smart_healthy SMART health
# TYPE device_smart_healthy gauge
device_smart_healthy{device="sat+megaraid,0"} 1.0
device_smart_healthy{device="sat+megaraid,1"} 1.0
device_smart_healthy{device="sat+megaraid,2"} 1.0
device_smart_healthy{device="sat+megaraid,3"} 1.0
device_smart_healthy{device="sat+megaraid,4"} 1.0
device_smart_healthy{device="sat+megaraid,5"} 1.0
device_smart_healthy{device="sat+megaraid,6"} 1.0
device_smart_healthy{device="sat+megaraid,7"} 1.0
device_smart_healthy{device="sat+megaraid,8"} 1.0
device_smart_healthy{device="sat+megaraid,9"} 1.0
device_smart_healthy{device="sat+megaraid,10"} 1.0
device_smart_healthy{device="sat+megaraid,11"} 1.0
device_smart_healthy{device="sat+megaraid,12"} 1.0
device_smart_healthy{device="sat+megaraid,13"} 1.0
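Rather than reading the gauge lines by eye, the same check can be scripted. This is a rough sketch under the assumption that the metric lines look exactly like the output above; `unhealthy_devices` is a hypothetical helper, not part of `smart-data-dump`. Note that it only reflects each drive's own SMART self-assessment and says nothing about the state of the drive on the RAID controller.

```python
import re

# Matches lines such as: device_smart_healthy{device="sat+megaraid,0"} 1.0
METRIC = re.compile(r'^device_smart_healthy\{device="([^"]+)"\}\s+([0-9.]+)$')


def unhealthy_devices(text: str) -> list:
    """Return the device names whose device_smart_healthy gauge is not 1.0."""
    bad = []
    for line in text.splitlines():
        match = METRIC.match(line.strip())
        if match and float(match.group(2)) != 1.0:
            bad.append(match.group(1))
    return bad


sample = """\
# HELP device_smart_healthy SMART health
# TYPE device_smart_healthy gauge
device_smart_healthy{device="sat+megaraid,0"} 1.0
device_smart_healthy{device="sat+megaraid,1"} 1.0
"""
print(unhealthy_devices(sample))  # []
```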

Screenshot 2026-03-31 at 13.55.43.png (460×1 px, 105 KB)
Screenshot 2026-03-31 at 13.56.13.png (609×1 px, 93 KB)
Seems like all disks are healthy, but one of them isn't online.

Oh and something I overlooked in https://phabricator.wikimedia.org/T411919#11772073: we're back to having the device names and the mount points jumbled up.

brouberol@an-worker1148:~$ findmnt | grep hadoop
├─/var/lib/hadoop/journal       /dev/mapper/an--worker1148--vg-journalnode ext4       rw,relatime
├─/var/lib/hadoop/data/e        /dev/sde1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/g        /dev/sdi1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/h        /dev/sdg1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/f        /dev/sdd1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/k        /dev/sdj1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/i        /dev/sdf1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/j        /dev/sdh1                                  ext4       rw,noatime
├─/var/lib/hadoop/data/l        /dev/sdk1                                  ext4       rw,noatime
└─/var/lib/hadoop/data/m        /dev/sdl1                                  ext4       rw,noatime

The fstab seems to be correct though.

brouberol@an-worker1148:~$ cat /etc/fstab  | grep LABEL=hadoop | grep -v '#'
LABEL=hadoop-e	/var/lib/hadoop/data/e	ext4	defaults,noatime	0	2
LABEL=hadoop-f	/var/lib/hadoop/data/f	ext4	defaults,noatime	0	2
LABEL=hadoop-g	/var/lib/hadoop/data/g	ext4	defaults,noatime	0	2
LABEL=hadoop-h	/var/lib/hadoop/data/h	ext4	defaults,noatime	0	2
LABEL=hadoop-i	/var/lib/hadoop/data/i	ext4	defaults,noatime	0	2
LABEL=hadoop-j	/var/lib/hadoop/data/j	ext4	defaults,noatime	0	2
LABEL=hadoop-k	/var/lib/hadoop/data/k	ext4	defaults,noatime	0	2
LABEL=hadoop-l	/var/lib/hadoop/data/l	ext4	defaults,noatime	0	2
LABEL=hadoop-m	/var/lib/hadoop/data/m	ext4	defaults,noatime	0	2
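Since the fstab mounts by filesystem label, the kernel device letters (sdd, sde, ...) can change between boots without consequence; what matters is that the filesystem labelled hadoop-X ends up on /var/lib/hadoop/data/X. A minimal sketch of that check follows, assuming the (label, mountpoint) pairs have been collected beforehand, for example from `lsblk -o LABEL,MOUNTPOINT`. The function name and the default prefix and base path are illustrative choices based on the listings in this task.

```python
def mismatched_mounts(rows, prefix="hadoop-", base="/var/lib/hadoop/data/"):
    """Return (label, mountpoint, expected) for every mounted filesystem
    whose label suffix does not match its mount point.

    rows: iterable of (label, mountpoint) pairs; unmounted or unrelated
    filesystems (empty mountpoint, other labels) are skipped.
    """
    bad = []
    for label, mountpoint in rows:
        if not label.startswith(prefix) or not mountpoint:
            continue
        expected = base + label[len(prefix):]
        if mountpoint != expected:
            bad.append((label, mountpoint, expected))
    return bad


# A few pairs in the shape used on this host; an empty mountpoint means unmounted.
rows = [
    ("hadoop-e", "/var/lib/hadoop/data/e"),
    ("hadoop-g", "/var/lib/hadoop/data/g"),
    ("hadoop-l", "/var/lib/hadoop/data/l"),
    ("hadoop-b", ""),
]
print(mismatched_mounts(rows))  # []
```

An empty result means every mounted data volume sits on the mount point its label names, regardless of which /dev/sdX device it currently is.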

I'm going to follow https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Hadoop/Administration#Swapping_broken_disk to configure the missing disk:

brouberol@an-worker1148:~$ sudo megacli -PDList -aAll | grep Firm
Firmware state: Online, Spun Up
Device Firmware Level: LA0B
Firmware state: Unconfigured(good), Spun Up. <---
Device Firmware Level: LA0C 
Firmware state: Online, Spun Up
...
brouberol@an-worker1148:~$ sudo megacli -PDList -aAll | egrep "Adapter|Enclosure Device ID:|Slot Number:|Firmware state"
Adapter #0
Enclosure Device ID: 32
Slot Number: 0
Firmware state: Online, Spun Up
---
Enclosure Device ID: 32
Slot Number: 1
Firmware state: Unconfigured(good), Spun Up
---
Enclosure Device ID: 32
Slot Number: 2
...
brouberol@an-worker1148:~$ sudo megacli -CfgLdAdd -r0 [32:1] -a0

Adapter 0: Created VD 2

Adapter 0: Configured the Adapter!!

Exit Code: 0x00
[Tue Mar 31 12:43:45 2026] sd 0:2:2:0: [sdm] 15626928128 512-byte logical blocks: (8.00 TB/7.28 TiB)
[Tue Mar 31 12:43:45 2026] sd 0:2:2:0: [sdm] 4096-byte physical blocks
[Tue Mar 31 12:43:45 2026] sd 0:2:2:0: [sdm] Write Protect is off
[Tue Mar 31 12:43:45 2026] sd 0:2:2:0: [sdm] Mode Sense: 1f 00 00 08
[Tue Mar 31 12:43:45 2026] sd 0:2:2:0: [sdm] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[Tue Mar 31 12:43:45 2026] sdm: detected capacity change from 0 to 8000987201536
[Tue Mar 31 12:43:45 2026] sdm: detected capacity change from 0 to 8000987201536
[Tue Mar 31 12:43:45 2026]  sdm: sdm1
[Tue Mar 31 12:43:45 2026] sdm: detected capacity change from 0 to 8000987201536
[Tue Mar 31 12:43:45 2026] sd 0:2:2:0: [sdm] Attached SCSI disk
brouberol@an-worker1148:~$ lsblk -i -fs
NAME                           FSTYPE      FSVER    LABEL    UUID                                   FSAVAIL FSUSE% MOUNTPOINT
sda1                           ext4        1.0               2787d1ef-8015-4be9-9acc-5078abb7a0b6    654.6M    22% /boot
`-sda
sda2
`-sda
sdb1                           ext4        1.0      hadoop-d a468cf8c-f19a-4232-b242-7c7590a435f2
`-sdb
sdc1                           ext4        1.0      hadoop-b 9efd544f-4547-4117-b67d-abc71d58fa35
`-sdc
sdd1                           ext4        1.0      hadoop-f c00c95e1-d32a-4062-b002-1240605d0cb2        3T    59% /var/lib/hadoop/data/f
`-sdd
sde1                           ext4        1.0      hadoop-e 616f038e-5882-4037-96c0-ef5c2b7c92a3        3T    59% /var/lib/hadoop/data/e
`-sde
sdf1                           ext4        1.0      hadoop-i eb2eb8ed-5d57-4be4-a3ae-63f861214899        3T    59% /var/lib/hadoop/data/i
`-sdf
sdg1                           ext4        1.0      hadoop-h e7858780-fae8-418f-a11f-e6ff5a5e65b1        3T    59% /var/lib/hadoop/data/h
`-sdg
sdh1                           ext4        1.0      hadoop-j 4de0b883-44c6-4861-827c-4d575a89c849        3T    59% /var/lib/hadoop/data/j
`-sdh
sdi1                           ext4        1.0      hadoop-g 16140e92-282e-4482-8ca2-795e89291e84        3T    59% /var/lib/hadoop/data/g
`-sdi
sdj1                           ext4        1.0      hadoop-k 3a806639-d8c7-47cc-b55c-21e0e3455a63        3T    59% /var/lib/hadoop/data/k
`-sdj
sdk1                           ext4        1.0      hadoop-l c5190ccf-8a78-4570-b4bb-95464d01be6c      2.9T    59% /var/lib/hadoop/data/l
`-sdk
sdl1                           ext4        1.0      hadoop-m 5588775c-38ce-4b26-bce8-15a26f708530      2.9T    59% /var/lib/hadoop/data/m
`-sdl
sdm1                           ext4        1.0      hadoop-c 6c4f27e5-f9fc-4c6a-96ea-55290a67f937 <---
`-sdm
brouberol@an-worker1148:~$ sudo parted /dev/sdm --script mklabel gpt
brouberol@an-worker1148:~$ sudo parted /dev/sdm --script mkpart primary ext4 0% 100%
brouberol@an-worker1148:~$ sudo mkfs.ext4 -L hadoop-d /dev/sdm1
mke2fs 1.46.2 (28-Feb-2021)
/dev/sdm1 contains a ext4 file system labelled 'hadoop-c'
	last mounted on /var/lib/hadoop/data/c on Fri Feb 27 12:32:12 2026
Proceed anyway? (y,N) y
Creating filesystem with 1953365504 4k blocks and 244170752 inodes
Filesystem UUID: e27ebb6a-2e2c-48cd-8803-336f9b0e5661
Superblock backups stored on blocks:
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
	102400000, 214990848, 512000000, 550731776, 644972544, 1934917632

Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done

brouberol@an-worker1148:~$ sudo tune2fs -m 0 /dev/sdm1
tune2fs 1.46.2 (28-Feb-2021)
Setting reserved blocks percentage to 0% (0 blocks)
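Because /etc/fstab mounts these volumes by LABEL, it may be worth confirming that a label is unique on the host before mounting. In the lsblk output above, sdb1 already carries the label hadoop-d, so unless it was relabelled in between, two partitions would share that label after this mkfs. The sketch below is only an illustration of such a check (`duplicate_labels` is a hypothetical helper), and it makes no claim that a duplicate label played any role in the fault on this drive bay.

```python
from collections import defaultdict


def duplicate_labels(partitions):
    """Return {label: [devices]} for every label carried by more than one partition.

    partitions: iterable of (device, label) pairs; empty labels are ignored.
    """
    by_label = defaultdict(list)
    for device, label in partitions:
        if label:
            by_label[label].append(device)
    return {label: devs for label, devs in by_label.items() if len(devs) > 1}


# Labels taken from the lsblk output above, with sdm1 relabelled to hadoop-d
# as done by the mkfs.ext4 command.
partitions = [
    ("sdb1", "hadoop-d"),
    ("sdc1", "hadoop-b"),
    ("sdd1", "hadoop-f"),
    ("sdm1", "hadoop-d"),
]
print(duplicate_labels(partitions))  # {'hadoop-d': ['sdb1', 'sdm1']}
```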

I then re-enabled the following entry in /etc/fstab:

LABEL=hadoop-d	/var/lib/hadoop/data/d	ext4	defaults,noatime	0	2

Puppet failed with

Error: Failed to set group to '903': Read-only file system @ apply2files - /var/lib/hadoop/data/d
Error: /Stage[main]/Bigtop::Hadoop::Worker/Bigtop::Hadoop::Worker::Paths[/var/lib/hadoop/data/d]/File[/var/lib/hadoop/data/d]/group: change from 'root' to 'hdfs' failed: Failed to set group to '903': Read-only file system @ apply2files - /var/lib/hadoop/data/d (corrective)

The mount point was mounted read-only (ro), and the disk started reporting errors in the iDRAC again.

Screenshot 2026-03-31 at 15.21.41.png (777×1 px, 111 KB)

Screenshot 2026-03-31 at 15.21.26.png (556×707 px, 60 KB)

I'm seeing

Fault detected on drive 1 in disk drive bay 1. 	Tue Mar 31 2026 12:56:39

in the iDRAC UI, which is roughly 1 minute after I mounted the disk.

Screenshot 2026-03-31 at 15.28.11.png (821×1 px, 280 KB)

Given that the server is out of warranty, @BTullis and I agreed that we'd just decommission it.

Agreed. I'm happy to decom this server. As per the original description, this drive bay keeps connecting and dropping, and it costs us time whenever we try to re-add a data volume to Hadoop.

Change #1265421 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] hadoop/analytics: exclude an-worker1148.eqiad.wmnet

https://gerrit.wikimedia.org/r/1265421

Change #1265421 merged by Brouberol:

[operations/puppet@production] hadoop/analytics: exclude an-worker1148.eqiad.wmnet

https://gerrit.wikimedia.org/r/1265421

cookbooks.sre.hosts.decommission executed by brouberol@cumin1003 for hosts: an-worker1148.eqiad.wmnet

  • an-worker1148.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet server and PuppetDB

@Jclark-ctr an-worker1148 is now in decommissioning status (https://netbox.wikimedia.org/dcim/devices/3661/). Over to you, with many thanks!