
Frequent disk resets on ms-be2075
Closed, Resolved (Public)

Description

(Note that we have disabled this server, so this doesn't need to be acted on today, unless anyone happens to be in the DC anyway.)

Starting on Sunday we saw Swift errors (also reported at https://phabricator.wikimedia.org/T382705), which could ultimately be tracked down to frequent disk resets on ms-be2075:

[Dec23 12:11] sd 0:0:23:0: Power-on or device reset occurred
[ +16.749465] sd 0:0:23:0: Power-on or device reset occurred
[Dec23 12:12] sd 0:0:25:0: Power-on or device reset occurred
[ +14.749575] sd 0:0:2:0: Power-on or device reset occurred
[Dec23 12:13] sd 0:0:8:0: Power-on or device reset occurred
[ +22.702457] sd 0:0:17:0: Power-on or device reset occurred
[Dec23 12:14] sd 0:0:24:0: Power-on or device reset occurred
[Dec23 12:15] sd 0:0:24:0: Power-on or device reset occurred
[Dec23 12:16] sd 0:0:25:0: Power-on or device reset occurred
[ +25.749153] sd 0:0:8:0: Power-on or device reset occurred
[Dec23 12:17] sd 0:0:5:0: Power-on or device reset occurred
[  +9.679077] sd 0:0:5:0: Power-on or device reset occurred
[ +13.499560] sd 0:0:12:0: Power-on or device reset occurred
[ +16.749480] sd 0:0:12:0: Power-on or device reset occurred
[Dec23 12:18] sd 0:0:25:0: Power-on or device reset occurred
[ +51.651277] sd 0:0:24:0: Power-on or device reset occurred
[  +0.000003] sd 0:0:25:0: Power-on or device reset occurred
[Dec23 12:19] sd 0:0:24:0: Power-on or device reset occurred
[ +30.749023] sd 0:0:24:0: Power-on or device reset occurred
[ +16.249484] sd 0:0:24:0: Power-on or device reset occurred
[Dec23 12:20] sd 0:0:16:0: Power-on or device reset occurred
[  +0.000005] sd 0:0:3:0: Power-on or device reset occurred
[Dec23 12:21] sd 0:0:19:0: Power-on or device reset occurred
[ +11.499634] sd 0:0:25:0: Power-on or device reset occurred
[Dec23 12:22] sd 0:0:17:0: Power-on or device reset occurred
[ +45.998619] sd 0:0:18:0: Power-on or device reset occurred
[ +11.999585] sd 0:0:3:0: Power-on or device reset occurred
[Dec23 12:23] sd 0:0:24:0: Power-on or device reset occurred
[ +41.748701] sd 0:0:14:0: Power-on or device reset occurred
[Dec23 12:24] sd 0:0:21:0: Power-on or device reset occurred
[  +4.366527] sd 0:0:24:0: Power-on or device reset occurred
[  +2.249937] sd 0:0:2:0: Power-on or device reset occurred
[ +30.248988] sd 0:0:9:0: Power-on or device reset occurred
[ +16.499510] sd 0:0:17:0: Power-on or device reset occurred
[Dec23 12:26] sd 0:0:15:0: Power-on or device reset occurred
[ +16.499507] sd 0:0:17:0: Power-on or device reset occurred
[  +7.793269] sd 0:0:25:0: Power-on or device reset occurred
[Dec23 12:28] sd 0:0:0:0: Power-on or device reset occurred
[ +12.287019] sd 0:0:25:0: Power-on or device reset occurred
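
[For reference, the sd H:C:T:L addresses above can be mapped to block device names via sysfs; a minimal sketch, assuming the standard /sys layout on this host:]

for addr in 0:0:24:0 0:0:25:0; do
    # each SCSI disk exposes its block device name under .../block/
    echo "$addr -> $(ls /sys/bus/scsi/devices/$addr/block/ 2>/dev/null)"
done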

There are no errors flagged in SEL:

racadm>>racadm getsel
Record:      1
Date/Time:   11/22/2023 19:43:45
Source:      system
Severity:    Ok
Description: Log cleared.

So it seems likely that the disks are fine and this is caused by broken power-supply connections to the disks? Maybe we can start by reseating all connectors (or swapping them if we have sufficient replacements around)?

Event Timeline


Is it still safe to work on this machine today?

We didn't have any spares that would work. A lot of the power cables are directly connected to the control board of the internal PDU.
Powered off, drained the flea power, pulled the PSUs, and reseated all the internal power cables. Normalized and powered up. All indicators are green and it's pingable. Feel free to test this one out and @ me or papaul if the disk errors come back. If they do, I'll need to start troubleshooting/replacement with Dell.

Thanks for checking, I'm afraid the problem still exists:

[Tue Jan  7 16:49:18 2025] sd 0:0:25:0: Power-on or device reset occurred
[Tue Jan  7 16:49:19 2025] sd 0:0:25:0: Power-on or device reset occurred
[Tue Jan  7 16:49:20 2025] sd 0:0:25:0: Power-on or device reset occurred
[Tue Jan  7 16:49:21 2025] sd 0:0:24:0: Power-on or device reset occurred
[Tue Jan  7 16:49:21 2025] sd 0:0:25:0: Power-on or device reset occurred
[Tue Jan  7 16:49:22 2025] sd 0:0:25:0: Power-on or device reset occurred
[Tue Jan  7 16:49:23 2025] sd 0:0:25:0: Power-on or device reset occurred
[Tue Jan  7 16:49:24 2025] sd 0:0:25:0: Power-on or device reset occurred
[Tue Jan  7 16:49:30 2025] sd 0:0:4:0: Power-on or device reset occurred
[Tue Jan  7 16:49:31 2025] sd 0:0:4:0: Power-on or device reset occurred
[Tue Jan  7 16:49:32 2025] sd 0:0:4:0: Power-on or device reset occurred
[Tue Jan  7 16:49:33 2025] sd 0:0:25:0: Power-on or device reset occurred
[Tue Jan  7 16:49:33 2025] sd 0:0:4:0: Power-on or device reset occurred
[Tue Jan  7 16:49:34 2025] sd 0:0:4:0: Power-on or device reset occurred
[Tue Jan  7 16:49:36 2025] sd 0:0:4:0: Power-on or device reset occurred
[Tue Jan  7 16:49:38 2025] sd 0:0:4:0: Power-on or device reset occurred
[Tue Jan  7 16:49:39 2025] sd 0:0:4:0: Power-on or device reset occurred

@Jhancock.wm sorry, I failed to notice the request for a ping if it was still unhappy. See previous comment :)

Icinga downtime and Alertmanager silence (ID=207ed568-35e8-41d5-b367-bb9f043b91bf) set by mvernon@cumin1002 for 8 days, 0:00:00 on 1 host(s) and their services with reason: host is awaiting attention from Dell

ms-be2075.codfw.wmnet

Service Request Number: 203753434

@Jhancock.wm one of the SSDs in this host looks unhappy now too (T383530), could you get that looked at at the same time, please?

I updated the ticket with that info. It might be related. Still working with Dell.

@MatthewVernon can you get me a readout of the errors you are seeing on the SSDs? Dell is asking for it. Thanks.

Jan 13 01:04:10 ms-be2075 kernel: [462667.760590] megaraid_sas 0000:18:00.0: 18530 (790045449s/0x0020/DEAD) - Fatal firmware error: Line 977 in ../../dm/src/dm.c
Jan 13 01:04:10 ms-be2075 kernel: [462667.760590] 
Jan 13 01:04:10 ms-be2075 kernel: [462667.778200] megaraid_sas 0000:18:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
Jan 13 01:04:10 ms-be2075 kernel: [462667.778245] megaraid_sas 0000:18:00.0: FW in FAULT state Fault code:0x10000 subcode:0x0 func:megasas_wait_for_outstanding_fusion
Jan 13 01:04:10 ms-be2075 kernel: [462667.789911] megaraid_sas 0000:18:00.0: resetting fusion adapter scsi0.
Jan 13 01:04:10 ms-be2075 kernel: [462667.789957] megaraid_sas 0000:18:00.0: Outstanding fastpath IOs: 2
Jan 13 01:04:10 ms-be2075 kernel: [462667.789972] megaraid_sas 0000:18:00.0: Reset not supported, killing adapter scsi0.
Jan 13 01:04:10 ms-be2075 kernel: [462667.802199] sd 0:0:24:0: [sdy] tag#28 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
Jan 13 01:04:10 ms-be2075 kernel: [462667.802208] sd 0:0:24:0: [sdy] tag#28 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00
Jan 13 01:04:10 ms-be2075 kernel: [462667.802214] blk_update_request: I/O error, dev sdy, sector 20464176 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
Jan 13 01:04:10 ms-be2075 kernel: [462667.813205] sd 0:0:25:0: [sdz] tag#29 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
Jan 13 01:04:10 ms-be2075 kernel: [462667.813211] sd 0:0:25:0: [sdz] tag#29 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00
Jan 13 01:04:10 ms-be2075 kernel: [462667.813216] blk_update_request: I/O error, dev sdz, sector 20464176 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
Jan 13 01:04:10 ms-be2075 kernel: [462667.824372] sd 0:0:24:0: [sdy] tag#30 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
Jan 13 01:04:10 ms-be2075 kernel: [462667.824379] sd 0:0:24:0: [sdy] tag#30 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00
Jan 13 01:04:10 ms-be2075 kernel: [462667.824386] blk_update_request: I/O error, dev sdy, sector 20464576 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
Jan 13 01:04:10 ms-be2075 kernel: [462667.835377] sd 0:0:25:0: [sdz] tag#31 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
Jan 13 01:04:10 ms-be2075 kernel: [462667.835382] sd 0:0:25:0: [sdz] tag#31 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00
Jan 13 01:04:10 ms-be2075 kernel: [462667.835386] blk_update_request: I/O error, dev sdz, sector 20464576 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
Jan 13 01:04:10 ms-be2075 kernel: [462667.846522] sd 0:0:24:0: [sdy] tag#32 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
Jan 13 01:04:10 ms-be2075 kernel: [462667.846529] sd 0:0:24:0: [sdy] tag#32 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00
Jan 13 01:04:10 ms-be2075 kernel: [462667.846535] blk_update_request: I/O error, dev sdy, sector 20465176 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
Jan 13 01:04:10 ms-be2075 kernel: [462667.857524] sd 0:0:25:0: [sdz] tag#33 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
Jan 13 01:04:10 ms-be2075 kernel: [462667.857529] sd 0:0:25:0: [sdz] tag#33 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00
Jan 13 01:04:10 ms-be2075 kernel: [462667.857533] blk_update_request: I/O error, dev sdz, sector 20465176 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
Jan 13 01:04:10 ms-be2075 kernel: [462667.868684] sd 0:0:24:0: [sdy] tag#34 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
Jan 13 01:04:10 ms-be2075 kernel: [462667.868692] sd 0:0:24:0: [sdy] tag#34 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00
Jan 13 01:04:10 ms-be2075 kernel: [462667.868698] blk_update_request: I/O error, dev sdy, sector 20465576 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
Jan 13 01:04:10 ms-be2075 kernel: [462667.879679] sd 0:0:25:0: [sdz] tag#35 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
Jan 13 01:04:10 ms-be2075 kernel: [462667.879684] sd 0:0:25:0: [sdz] tag#35 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00
Jan 13 01:04:10 ms-be2075 kernel: [462667.879688] blk_update_request: I/O error, dev sdz, sector 20465576 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
Jan 13 01:04:10 ms-be2075 kernel: [462667.890851] sd 0:0:24:0: [sdy] tag#36 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
Jan 13 01:04:10 ms-be2075 kernel: [462667.890859] sd 0:0:24:0: [sdy] tag#36 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00
Jan 13 01:04:10 ms-be2075 kernel: [462667.890864] blk_update_request: I/O error, dev sdy, sector 20467344 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
Jan 13 01:04:10 ms-be2075 kernel: [462667.901847] sd 0:0:25:0: [sdz] tag#37 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
Jan 13 01:04:10 ms-be2075 kernel: [462667.901852] sd 0:0:25:0: [sdz] tag#37 CDB: Unmap/Read sub-channel 42 00 00 00 00 00 00 00 18 00
Jan 13 01:04:10 ms-be2075 kernel: [462667.901856] blk_update_request: I/O error, dev sdz, sector 20467344 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
Jan 13 01:04:10 ms-be2075 kernel: [462667.954967] md/raid1:md0: sdy2: rescheduling sector 54525960
Jan 13 01:04:10 ms-be2075 kernel: [462667.960873] md/raid1:md0: redirecting sector 54525960 to other mirror: sdz2
Jan 13 01:04:10 ms-be2075 kernel: [462667.960895] md: super_written gets error=-5
Jan 13 01:04:10 ms-be2075 kernel: [462667.965176] md/raid1:md0: Disk failure on sdy2, disabling device.
Jan 13 01:04:10 ms-be2075 kernel: [462667.965176] md/raid1:md0: Operation continuing on 1 devices.
Jan 13 01:04:10 ms-be2075 kernel: [462667.977100] md: super_written gets error=-5
Jan 13 01:04:10 ms-be2075 kernel: [462667.981390] md: super_written gets error=-5
Jan 13 01:04:10 ms-be2075 kernel: [462667.981449] EXT4-fs error (device md0): ext4_wait_block_bitmap:570: comm fstrim: Cannot read block bitmap - block_group = 209, block_bitmap = 6815745
Jan 13 01:04:10 ms-be2075 kernel: [462667.985771] md: super_written gets error=-5
Jan 13 01:04:10 ms-be2075 kernel: [462668.026500] md: super_written gets error=-5
Jan 13 01:04:10 ms-be2075 kernel: [462668.030794] md: super_written gets error=-5
Jan 13 01:04:10 ms-be2075 kernel: [462668.035478] md: super_written gets error=-5
Jan 13 01:04:10 ms-be2075 kernel: [462668.039768] md: super_written gets error=-5
Jan 13 01:04:10 ms-be2075 kernel: [462668.044362] md: super_written gets error=-5
Jan 13 01:04:10 ms-be2075 kernel: [462668.048647] md: super_written gets error=-5
Jan 13 01:04:10 ms-be2075 kernel: [462668.053400] md: super_written gets error=-5
Jan 13 01:04:10 ms-be2075 kernel: [462668.057683] md: super_written gets error=-5
Jan 13 01:04:10 ms-be2075 kernel: [462668.062258] md: super_written gets error=-5
[similar repeated many times]
Jan 13 01:04:12 ms-be2075 kernel: [462669.748789] EXT4-fs warning (device md0): ext4_end_bio:347: I/O error 10 writing to inode 524835 starting block 4683038)
Jan 13 01:04:12 ms-be2075 kernel: [462669.748793] Buffer I/O error on device md0, logical block 4683038
Jan 13 01:04:12 ms-be2075 kernel: [462669.755000] EXT4-fs warning (device md0): ext4_end_bio:347: I/O error 10 writing to inode 524835 starting block 4657860)
Jan 13 01:04:12 ms-be2075 kernel: [462669.755002] Buffer I/O error on device md0, logical block 4657860
Jan 13 01:04:12 ms-be2075 kernel: [462669.761187] EXT4-fs warning (device md0): ext4_end_bio:347: I/O error 10 writing to inode 528348 starting block 4590533)
Jan 13 01:04:12 ms-be2075 kernel: [462669.761193] EXT4-fs warning (device md0): ext4_end_bio:347: I/O error 10 writing to inode 528348 starting block 4590532)
Jan 13 01:04:12 ms-be2075 kernel: [462669.761195] Buffer I/O error on device md0, logical block 4590532
Jan 13 01:04:12 ms-be2075 kernel: [462669.761259] Buffer I/O error on device md0, logical block 4590533
Jan 13 01:04:12 ms-be2075 kernel: [462669.767386] Buffer I/O error on dev md0, logical block 0, lost sync page write
Jan 13 01:04:12 ms-be2075 kernel: [462669.773579] Buffer I/O error on device md0, logical block 4590534

md0 is where / is mounted, and that filesystem is now damaged, so a number of commands now fail outright.

@MatthewVernon Dell says they need an SOS report. They sent instructions for it. I can forward them to you or paste them here if you need them.

That's going to be difficult, I'm afraid:

mvernon@ms-be2075:~$ sudo apt install sosreport
sudo: unable to execute /usr/bin/apt: Input/output error

If there are particular extra logs they need, we might be able to extract them, but this system is increasingly broken.
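
[If Dell only needs specific logs, something along these lines might still work while sshd and tar are alive on the broken host; a sketch run from a working host, with the archive name purely illustrative:]

# stream the kernel/syslog files off the host without touching apt
ssh ms-be2075.codfw.wmnet 'tar -C /var/log -cf - kern.log* syslog*' > ms-be2075-logs.tar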

This is what they sent me:

Steps on how to generate the SOS report:

The 'sos' package provides the sos report command, which is typically installed by default in Red Hat Enterprise Linux.

To verify the package installation:

  1. rpm -q sos
     sos-4.5.1-3.el8.noarch

If for some reason the 'sos' package is not installed, it can be installed using the below command:

  1. yum install sos

To generate a sos report in interactive mode (run as root):

• Red Hat Enterprise Linux 8 and later:

# sos report


• For Red Hat Enterprise Linux 7 and earlier:

# sosreport

Optionally, include the --batch option to generate a sos report in noninteractive mode:

  1. sos report --batch  or  # sosreport --batch

The log bundle (and its associated checksum file) can typically be saved in /var/tmp/. Older versions of Red Hat Enterprise Linux may save to a different location, but it is specified in the command output.

Hi, yes, those are Red Hat-specific instructions. On Debian & Ubuntu one has to install the sosreport package. Unfortunately, the root filesystem is now so damaged that that's impossible (attempting to run apt gets EIO).
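
[For reference, on a healthy Debian host the equivalent would be roughly the following; a sketch, using the package name from the earlier apt attempt, with the subcommand depending on the sos version shipped:]

sudo apt install sosreport
sudo sos report --batch    # newer sos (4.x); older versions use: sudo sosreport --batch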

Thanks for the update and clarification. I've updated the ticket with the added info. Maybe they'll quit stalling.

Hey, I was out sick the last half of last week. Got this from Dell:

I understand the situation. Upon reviewing the details, I noticed that the disks installed in slots 0-23 (Part# J7W80) are not ideally suited for PowerEdge Servers; they are designed for products like Optiplex. As a result, these drives have lower endurance compared to those designed for PowerEdge Servers. While they may function without errors for a certain period, they will eventually start generating alerts, similar to the ones currently being experienced.

Additionally, there are two SSD drives installed in slots 24 and 25. Although these are designed for PowerEdge machines, they are also generating alerts due to a configuration mismatch.

How do you want to proceed?

Wait, didn't we buy this server and all of its drives, spinning and SSD, from Dell? And now they're saying they're all the wrong drives?!?

Also, I'm afraid I've no idea what they mean about a config mismatch on the two SSDs, but again this will be as supplied by Dell...?

Yeah, this isn't an acceptable answer. They need to be more specific; I suspect their vagueness comes from not wanting to spend time/money.

Yeah, found the original order; that doesn't seem to be the case. T348059
I'm asking them if it's possibly the backplane or cabling issues.

Icinga downtime and Alertmanager silence (ID=62b3cb8f-dcae-4290-af1d-2a50d3785cb2) set by mvernon@cumin2002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: hardware broken awaiting vendor action

ms-be2075.codfw.wmnet

Dell update: it's been escalated to the level 3 helpdesk. Might not hear back from them until Monday.

I did a BIOS and iDRAC upgrade, and generated a new report for Dell at their request. @MatthewVernon could you do this part for me? You can email it to me if needed. New help desk, new hoops.

  1. Collect Smartctl data from all drives within the OS. Ensure smartmontools is installed.

Run the following command to get the bus device ID for the PERC:

smartctl --scan

Use the letters for each drive in the command below, replacing "a" and "q" with the correct drive letters:

for drive in {a..q}; do smartctl -a /dev/sd$drive > /tmp/sd$drive.txt; done

This command creates a text file with the Smartctl info for each drive (from drive A to drive Q) and saves it in the /tmp folder. Feel free to change the output path if needed.

/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/sdd -d scsi # /dev/sdd, SCSI device
/dev/sde -d scsi # /dev/sde, SCSI device
/dev/sdf -d scsi # /dev/sdf, SCSI device
/dev/sdg -d scsi # /dev/sdg, SCSI device
/dev/sdh -d scsi # /dev/sdh, SCSI device
/dev/sdi -d scsi # /dev/sdi, SCSI device
/dev/sdj -d scsi # /dev/sdj, SCSI device
/dev/sdk -d scsi # /dev/sdk, SCSI device
/dev/sdl -d scsi # /dev/sdl, SCSI device
/dev/sdm -d scsi # /dev/sdm, SCSI device
/dev/sdn -d scsi # /dev/sdn, SCSI device
/dev/sdo -d scsi # /dev/sdo, SCSI device
/dev/sdp -d scsi # /dev/sdp, SCSI device
/dev/sdq -d scsi # /dev/sdq, SCSI device
/dev/sdr -d scsi # /dev/sdr, SCSI device
/dev/sds -d scsi # /dev/sds, SCSI device
/dev/sdt -d scsi # /dev/sdt, SCSI device
/dev/sdu -d scsi # /dev/sdu, SCSI device
/dev/sdv -d scsi # /dev/sdv, SCSI device
/dev/sdw -d scsi # /dev/sdw, SCSI device
/dev/sdx -d scsi # /dev/sdx, SCSI device
/dev/sdy -d scsi # /dev/sdy, SCSI device
/dev/sdz -d scsi # /dev/sdz, SCSI device
/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device
/dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], SCSI device
/dev/bus/0 -d megaraid,2 # /dev/bus/0 [megaraid_disk_02], SCSI device
/dev/bus/0 -d megaraid,3 # /dev/bus/0 [megaraid_disk_03], SCSI device
/dev/bus/0 -d megaraid,4 # /dev/bus/0 [megaraid_disk_04], SCSI device
/dev/bus/0 -d megaraid,5 # /dev/bus/0 [megaraid_disk_05], SCSI device
/dev/bus/0 -d megaraid,6 # /dev/bus/0 [megaraid_disk_06], SCSI device
/dev/bus/0 -d megaraid,7 # /dev/bus/0 [megaraid_disk_07], SCSI device
/dev/bus/0 -d megaraid,8 # /dev/bus/0 [megaraid_disk_08], SCSI device
/dev/bus/0 -d megaraid,9 # /dev/bus/0 [megaraid_disk_09], SCSI device
/dev/bus/0 -d megaraid,10 # /dev/bus/0 [megaraid_disk_10], SCSI device
/dev/bus/0 -d megaraid,11 # /dev/bus/0 [megaraid_disk_11], SCSI device
/dev/bus/0 -d megaraid,12 # /dev/bus/0 [megaraid_disk_12], SCSI device
/dev/bus/0 -d megaraid,13 # /dev/bus/0 [megaraid_disk_13], SCSI device
/dev/bus/0 -d megaraid,14 # /dev/bus/0 [megaraid_disk_14], SCSI device
/dev/bus/0 -d megaraid,15 # /dev/bus/0 [megaraid_disk_15], SCSI device
/dev/bus/0 -d megaraid,16 # /dev/bus/0 [megaraid_disk_16], SCSI device
/dev/bus/0 -d megaraid,17 # /dev/bus/0 [megaraid_disk_17], SCSI device
/dev/bus/0 -d megaraid,18 # /dev/bus/0 [megaraid_disk_18], SCSI device
/dev/bus/0 -d megaraid,19 # /dev/bus/0 [megaraid_disk_19], SCSI device
/dev/bus/0 -d megaraid,20 # /dev/bus/0 [megaraid_disk_20], SCSI device
/dev/bus/0 -d megaraid,21 # /dev/bus/0 [megaraid_disk_21], SCSI device
/dev/bus/0 -d megaraid,22 # /dev/bus/0 [megaraid_disk_22], SCSI device
/dev/bus/0 -d megaraid,23 # /dev/bus/0 [megaraid_disk_23], SCSI device
/dev/bus/0 -d megaraid,24 # /dev/bus/0 [megaraid_disk_24], SCSI device
/dev/bus/0 -d megaraid,25 # /dev/bus/0 [megaraid_disk_25], SCSI device
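
[A loop along these lines would produce the 26 per-drive files and a bundle to attach; a sketch adapting Dell's example to the a-z drives above, with output paths purely illustrative:]

for drive in {a..z}; do
    sudo smartctl -a /dev/sd$drive > /tmp/sd$drive.txt
done
tar -C /tmp -czf /tmp/ms-be2075-smartctl.tar.gz sd{a..z}.txt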

As to the output of smartctl -a, I've tarred up the 26 resulting text files:

Unproductive update: the level 3 helpdesk is still going over the files and the TSR report. Will update when I hear back from them.

I got some instructions from Dell; kind of similar to what we tried, with some extra cables to reset. Is this server still depooled?

Yeah, you can work on this server any time, but thanks for checking :)

Icinga downtime and Alertmanager silence (ID=a9517ffa-d053-4e3b-a7d0-6b08948ed456) set by mvernon@cumin2002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: hardware broken awaiting vendor action

ms-be2075.codfw.wmnet

I reset what they asked me to inside the server yesterday. When you get a chance, @MatthewVernon can you see if that fixed the errors? Thanks

Hi,
I'm afraid the answer is "no":

Feb  5 15:23:01 ms-be2075 kernel: [71988.739632] sd 0:0:25:0: Power-on or device reset occurred
Feb  5 15:23:02 ms-be2075 kernel: [71989.739584] sd 0:0:24:0: Power-on or device reset occurred
Feb  5 15:23:13 ms-be2075 kernel: [72000.795998] sd 0:0:24:0: Power-on or device reset occurred
Feb  5 15:23:14 ms-be2075 kernel: [72001.795963] sd 0:0:25:0: Power-on or device reset occurred
Feb  5 15:23:15 ms-be2075 kernel: [72002.795929] sd 0:0:25:0: Power-on or device reset occurred
Feb  5 15:23:30 ms-be2075 kernel: [72017.545500] sd 0:0:25:0: Power-on or device reset occurred
Feb  5 15:24:08 ms-be2075 kernel: [72055.064293] sd 0:0:24:0: Power-on or device reset occurred
Feb  5 15:24:17 ms-be2075 kernel: [72064.503290] sd 0:0:25:0: Power-on or device reset occurred
Feb  5 15:24:39 ms-be2075 kernel: [72086.502609] sd 0:0:24:0: Power-on or device reset occurred
Feb  5 15:24:45 ms-be2075 kernel: [72092.502432] sd 0:0:24:0: Power-on or device reset occurred
Feb  5 15:25:00 ms-be2075 kernel: [72107.501965] sd 0:0:25:0: Power-on or device reset occurred
Feb  5 15:25:38 ms-be2075 kernel: [72145.750799] sd 0:0:24:0: Power-on or device reset occurred
Feb  5 15:26:57 ms-be2075 kernel: [72224.785899] sd 0:0:25:0: Power-on or device reset occurred
Feb  5 15:27:14 ms-be2075 kernel: [72241.535387] sd 0:0:24:0: Power-on or device reset occurred
Feb  5 15:27:38 ms-be2075 kernel: [72265.784636] sd 0:0:25:0: Power-on or device reset occurred
Feb  5 15:27:41 ms-be2075 kernel: [72268.784518] sd 0:0:9:0: Power-on or device reset occurred
Feb  5 15:28:08 ms-be2075 kernel: [72295.033710] sd 0:0:25:0: Power-on or device reset occurred
Feb  5 15:28:12 ms-be2075 kernel: [72299.033576] sd 0:0:24:0: Power-on or device reset occurred
Feb  5 15:28:38 ms-be2075 kernel: [72325.782779] sd 0:0:25:0: Power-on or device reset occurred
Feb  5 15:29:12 ms-be2075 kernel: [72359.781737] sd 0:0:25:0: Power-on or device reset occurred
Feb  5 15:29:13 ms-be2075 kernel: [72360.781781] sd 0:0:25:0: Power-on or device reset occurred
Feb  5 15:29:13 ms-be2075 kernel: [72360.781785] sd 0:0:24:0: Power-on or device reset occurred
Feb  5 15:29:14 ms-be2075 kernel: [72361.781710] sd 0:0:24:0: Power-on or device reset occurred

Big sigh. Can I get another smartctl report to send to Dell?

OK; same commands as before:

mvernon@ms-be2075:~$ sudo smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/sdd -d scsi # /dev/sdd, SCSI device
/dev/sde -d scsi # /dev/sde, SCSI device
/dev/sdf -d scsi # /dev/sdf, SCSI device
/dev/sdg -d scsi # /dev/sdg, SCSI device
/dev/sdh -d scsi # /dev/sdh, SCSI device
/dev/sdi -d scsi # /dev/sdi, SCSI device
/dev/sdj -d scsi # /dev/sdj, SCSI device
/dev/sdk -d scsi # /dev/sdk, SCSI device
/dev/sdl -d scsi # /dev/sdl, SCSI device
/dev/sdm -d scsi # /dev/sdm, SCSI device
/dev/sdn -d scsi # /dev/sdn, SCSI device
/dev/sdo -d scsi # /dev/sdo, SCSI device
/dev/sdp -d scsi # /dev/sdp, SCSI device
/dev/sdq -d scsi # /dev/sdq, SCSI device
/dev/sdr -d scsi # /dev/sdr, SCSI device
/dev/sds -d scsi # /dev/sds, SCSI device
/dev/sdt -d scsi # /dev/sdt, SCSI device
/dev/sdu -d scsi # /dev/sdu, SCSI device
/dev/sdv -d scsi # /dev/sdv, SCSI device
/dev/sdw -d scsi # /dev/sdw, SCSI device
/dev/sdx -d scsi # /dev/sdx, SCSI device
/dev/sdy -d scsi # /dev/sdy, SCSI device
/dev/sdz -d scsi # /dev/sdz, SCSI device
/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device
/dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], SCSI device
/dev/bus/0 -d megaraid,2 # /dev/bus/0 [megaraid_disk_02], SCSI device
/dev/bus/0 -d megaraid,3 # /dev/bus/0 [megaraid_disk_03], SCSI device
/dev/bus/0 -d megaraid,4 # /dev/bus/0 [megaraid_disk_04], SCSI device
/dev/bus/0 -d megaraid,5 # /dev/bus/0 [megaraid_disk_05], SCSI device
/dev/bus/0 -d megaraid,6 # /dev/bus/0 [megaraid_disk_06], SCSI device
/dev/bus/0 -d megaraid,7 # /dev/bus/0 [megaraid_disk_07], SCSI device
/dev/bus/0 -d megaraid,8 # /dev/bus/0 [megaraid_disk_08], SCSI device
/dev/bus/0 -d megaraid,9 # /dev/bus/0 [megaraid_disk_09], SCSI device
/dev/bus/0 -d megaraid,10 # /dev/bus/0 [megaraid_disk_10], SCSI device
/dev/bus/0 -d megaraid,11 # /dev/bus/0 [megaraid_disk_11], SCSI device
/dev/bus/0 -d megaraid,12 # /dev/bus/0 [megaraid_disk_12], SCSI device
/dev/bus/0 -d megaraid,13 # /dev/bus/0 [megaraid_disk_13], SCSI device
/dev/bus/0 -d megaraid,14 # /dev/bus/0 [megaraid_disk_14], SCSI device
/dev/bus/0 -d megaraid,15 # /dev/bus/0 [megaraid_disk_15], SCSI device
/dev/bus/0 -d megaraid,16 # /dev/bus/0 [megaraid_disk_16], SCSI device
/dev/bus/0 -d megaraid,17 # /dev/bus/0 [megaraid_disk_17], SCSI device
/dev/bus/0 -d megaraid,18 # /dev/bus/0 [megaraid_disk_18], SCSI device
/dev/bus/0 -d megaraid,19 # /dev/bus/0 [megaraid_disk_19], SCSI device
/dev/bus/0 -d megaraid,20 # /dev/bus/0 [megaraid_disk_20], SCSI device
/dev/bus/0 -d megaraid,21 # /dev/bus/0 [megaraid_disk_21], SCSI device
/dev/bus/0 -d megaraid,22 # /dev/bus/0 [megaraid_disk_22], SCSI device
/dev/bus/0 -d megaraid,23 # /dev/bus/0 [megaraid_disk_23], SCSI device
/dev/bus/0 -d megaraid,24 # /dev/bus/0 [megaraid_disk_24], SCSI device
/dev/bus/0 -d megaraid,25 # /dev/bus/0 [megaraid_disk_25], SCSI device

Tarball of smartctl -a outputs attached again -

They're sending a new backplane and controller card to try and fix this. I'll update when these parts have been replaced.

I got some parts in: a disk controller card and two backplanes (one for each set of drives). I got the card installed first. I need to look up how to even get to the backplanes. If anyone is around to tell me whether replacing the controller card fixed the issue, I'd appreciate it. I'll be reachable this weekend over IRC. I know everyone is busy getting ready to travel, so no rush. Ran out of cycles this week. Thanks for your patience.

The host appears to be down, so I can't look (and I'm just home from the pub, so I'm not about to attempt anything more involved). If you power it up, I can have a look (probably now Sunday US time at the earliest).

@Jhancock.wm this server is still not reachable over ssh...

Had a weather event locally. Taking another look at it today.

@MatthewVernon the new controller card wasn't registering for some reason. I reseated it and it shows up now. BUT: the RAID config is gone; I'm assuming it's stored in the old card's memory. I still have the old one despite Dell's nagging. Do you want me to put the old one back in, or do you want to try and recover this one? The mgmt interface is accessible.

This system has been drained, so I think it's OK to set up the new card again and then reimage the node. The disks should all be JBOD.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2075.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2075.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2075 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be2075.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2075.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2075.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2075 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be2075.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2075.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2075.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2075 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be2075.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

@MatthewVernon it's hitting the wrong Puppet server, but the host has an OS installed and is SSHable if you want to see whether the drives are behaving now, while I get this hiccup with the reimage sorted out.

Much less frequent (and only two devices now), but still there :-/:

Feb 23 01:34:08 ms-be2075 kernel: [109197.342692] sd 0:0:24:0: Power-on or device reset occurred
Feb 23 04:41:57 ms-be2075 kernel: [120466.782205] sd 0:0:24:0: Power-on or device reset occurred
Feb 23 04:43:50 ms-be2075 kernel: [120579.614637] sd 0:0:25:0: Power-on or device reset occurred
Feb 23 16:36:24 ms-be2075 kernel: [163333.437404] sd 0:0:25:0: Power-on or device reset occurred
Feb 24 07:17:08 ms-be2075 kernel: [216177.407709] sd 0:0:24:0: Power-on or device reset occurred
Feb 24 07:39:00 ms-be2075 kernel: [217490.121365] sd 0:0:24:0: Power-on or device reset occurred
Feb 24 11:21:39 ms-be2075 kernel: [230849.010234] sd 0:0:25:0: Power-on or device reset occurred

(the other drives aren't currently mounted, which may or may not make a difference).

Those should be the boot disks, so we at least eliminated the errors on the others. Gonna try a few things and I'll get back to you.

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2075 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be2075.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2075 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be2075.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2075 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be2075.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

I tried multiple times to run the reimage but the host doesn't PXE boot, not sure why. I tried to follow the console on com2 as well, but no clear error was highlighted.

This comment was removed by Jhancock.wm.

@elukey try now. It got disabled on the NIC.

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye completed:

  • ms-be2075 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202502261621_elukey_3367441_ms-be2075.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

I was able to reimage the node correctly. I have narrowed down a use case where a race condition caused Puppet 5 to be deployed, but sadly it is not this use case. I'll try to do more research :(

@Jhancock.wm sorry, but despite all this, the errors remain:

Feb 27 02:35:13 ms-be2075 kernel: [35749.303700] sd 0:0:25:0: Power-on or device reset occurred
Feb 27 02:43:04 ms-be2075 kernel: [36220.419609] sd 0:0:25:0: Power-on or device reset occurred
Feb 27 02:43:34 ms-be2075 kernel: [36250.668847] sd 0:0:24:0: Power-on or device reset occurred
Feb 27 02:44:22 ms-be2075 kernel: [36298.228079] sd 0:0:24:0: Power-on or device reset occurred
Feb 27 02:44:23 ms-be2075 kernel: [36299.619405] sd 0:0:24:0: Power-on or device reset occurred
Feb 27 02:45:25 ms-be2075 kernel: [36361.495892] sd 0:0:24:0: Power-on or device reset occurred
Feb 27 02:47:05 ms-be2075 kernel: [36460.922206] sd 0:0:25:0: Power-on or device reset occurred
Feb 27 02:52:00 ms-be2075 kernel: [36756.163865] sd 0:0:25:0: Power-on or device reset occurred
Feb 27 03:00:14 ms-be2075 kernel: [37250.688808] sd 0:0:25:0: Power-on or device reset occurred
Feb 27 03:03:38 ms-be2075 kernel: [37453.913565] sd 0:0:24:0: Power-on or device reset occurred
Feb 27 03:14:07 ms-be2075 kernel: [38082.895866] sd 0:0:16:0: Power-on or device reset occurred
Feb 27 03:14:53 ms-be2075 kernel: [38129.245388] sd 0:0:24:0: Power-on or device reset occurred
Feb 27 03:15:04 ms-be2075 kernel: [38140.516195] sd 0:0:24:0: Power-on or device reset occurred
Feb 27 03:17:21 ms-be2075 kernel: [38277.691163] sd 0:0:24:0: Power-on or device reset occurred
Feb 27 03:19:16 ms-be2075 kernel: [38391.937950] sd 0:0:25:0: Power-on or device reset occurred
Feb 27 03:20:15 ms-be2075 kernel: [38451.030473] sd 0:0:25:0: Power-on or device reset occurred
Feb 27 03:34:03 ms-be2075 kernel: [39279.757550] sd 0:0:25:0: Power-on or device reset occurred
Feb 27 03:37:59 ms-be2075 kernel: [39515.250940] sd 0:0:24:0: Power-on or device reset occurred
Feb 27 03:43:30 ms-be2075 kernel: [39846.241606] sd 0:0:24:0: Power-on or device reset occurred
Feb 27 03:43:57 ms-be2075 kernel: [39873.021740] sd 0:0:24:0: Power-on or device reset occurred
Feb 27 03:47:43 ms-be2075 kernel: [40099.515384] sd 0:0:25:0: Power-on or device reset occurred
Feb 27 03:47:49 ms-be2075 kernel: [40105.515207] sd 0:0:25:0: Power-on or device reset occurred
Feb 27 03:57:10 ms-be2075 kernel: [40665.999389] sd 0:0:24:0: Power-on or device reset occurred
[...]

Ah man, that disk 16 coming back is no bueno.

I was going to suggest making 24 and 25 a RAID, but with that coming back, I'm not sure.

Another thing we can try is replacing the OS drives. I have some decommissioned drives I could swap in for them. I've got a set made by Micron and another by Samsung. Thoughts?

@Jhancock.wm So this system has had new backplane and controller cards fitted? From comments on this ticket it looks like maybe controller cards have been done but not backplane?

Over the last few days there have been lots of these errors, mostly on devices 0:0:24:0 and 0:0:25:0 but also a smattering of other drives, which leads me to conclude that this is still a problem with the system itself (unless the previous issues have killed all the drives in the system)...

mvernon@ms-be2075:~$ grep 'Power-on' /var/log/kern.log | cut -f 9 -d ' ' | sort | uniq -c
      2 0:0:17:0:
     67 0:0:24:0:
     43 0:0:25:0:
      1 0:0:9:0:
mvernon@ms-be2075:~$ grep 'Power-on' /var/log/kern.log.1 | cut -f 9 -d ' ' | sort | uniq -c
      2 0:0:13:0:
      1 0:0:17:0:
    140 0:0:24:0:
    155 0:0:25:0:
      1 0:0:3:0:
mvernon@ms-be2075:~$ zgrep 'Power-on' /var/log/kern.log.2.gz | cut -f 9 -d ' ' | sort | uniq -c
      1 0:0:20:0:
    109 0:0:24:0:
    139 0:0:25:0:
      1 0:0:3:0:
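
[The same tally can be taken across all the rotated logs in one command, matching the SCSI address directly rather than relying on a field number; a sketch, relying on zgrep passing uncompressed files through:]

zgrep -h 'Power-on or device reset' /var/log/kern.log* \
    | grep -oE 'sd [0-9]+:[0-9]+:[0-9]+:[0-9]+' | sort | uniq -c | sort -rn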

The controller card was replaced; the two backplanes were not, correct. I figured it was more likely the controller card since it was system-wide and not a specific set of drives. And since the storage drive errors are coming back, I'd have to agree it's the system itself. But Dell is going to try to claim again that we are using the drives improperly.

I still don't see that Dell can claim we're using the drives incorrectly given they sold us this setup?
I think I'd tend to try swapping the backplanes first, but if you have a full set of spare drives and want to swap them all out instead, I don't mind (I'm just not really expecting it to help; it seems unlikely that we've killed all the drives, and the odd distribution of resets is suspicious).

I agree. I'll get those backplanes replaced and we can try that. (Honestly, I've been trying to figure out how to do it, since they're behind a lot of other parts.)
And I agree, it seems like a bogus reply from Dell about the misuse of drives. When I get them replaced, I'll tag you again.

The backplanes have been replaced. It was more difficult than I anticipated. When you have a chance, please let me know if the errors have ceased. Not sure if we'll need to reconfigure anything after this hardware swap.

Sorry, still seeing errors about two of the drives:

mvernon@ms-be2075:~$ grep 'Power-on' /var/log/kern.log | cut -f 8 -d ' ' | sort | uniq -c
    648 0:0:24:0:
      4 0:0:25:0:

most recent log extract:

Mar 11 16:44:54 ms-be2075 kernel: [75721.962013] sd 0:0:24:0: Power-on or device reset occurred
Mar 11 16:44:55 ms-be2075 kernel: [75722.961990] sd 0:0:24:0: Power-on or device reset occurred
Mar 11 16:44:56 ms-be2075 kernel: [75723.961992] sd 0:0:24:0: Power-on or device reset occurred
Mar 11 16:44:57 ms-be2075 kernel: [75724.961925] sd 0:0:24:0: Power-on or device reset occurred
Mar 11 16:44:58 ms-be2075 kernel: [75725.961904] sd 0:0:24:0: Power-on or device reset occurred
Mar 11 16:46:29 ms-be2075 kernel: [75817.209412] sd 0:0:24:0: Power-on or device reset occurred
Mar 11 16:46:30 ms-be2075 kernel: [75818.209377] sd 0:0:24:0: Power-on or device reset occurred
Mar 11 16:46:32 ms-be2075 kernel: [75819.459277] sd 0:0:24:0: Power-on or device reset occurred
Mar 11 16:46:55 ms-be2075 kernel: [75842.708675] sd 0:0:24:0: Power-on or device reset occurred
Mar 11 16:46:56 ms-be2075 kernel: [75844.208650] sd 0:0:24:0: Power-on or device reset occurred
Mar 11 16:59:52 ms-be2075 kernel: [76620.210334] sd 0:0:24:0: Power-on or device reset occurred

[Whether it's now worth trying either reseating those two drives and/or trying replacements, I don't know; I don't know why this system seems cursed ATM]

Well, at least it's still just these two OS drives. I'm gonna replace them. I'll need to reimage again.

Please go ahead, and thanks for all your work on this!

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2075.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2075.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2075 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be2075.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2075.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2075.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2075 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be2075.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

So, fun update. After I replaced the drives I tried to reimage; it kept failing at partitioning the OS drive. Got Papaul involved. Found that the 1st OS drive (24, the one with the most errors) kept having issues. Replaced the drive with another and tried swapping 24 and 25, with the same results. Opened up the chassis and replaced the SATA cable connecting it to the system board (I'd already reseated it twice before now). This could have been the original cause all along, or at least heavily contributing to it. It's got an OS on it, so give it a go when you're ready and lmk if it needs more exorcism work.

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye

OK, so the reimage isn't working because the SSDs are both RAID-0 arrays rather than JBOD. I'm going to try and un-RAID them, JBOD them, and try another reimage.
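
[The ticket doesn't record which tool was used for this; with the PERC's perccli/storcli utility the steps would look roughly like the following. A sketch only: the binary name, controller number and VD numbers are assumptions, and the deletes destroy whatever is on those SSDs, which is acceptable here because the host is drained.]

sudo perccli64 /c0/vall show            # list the leftover RAID-0 virtual drives
sudo perccli64 /c0/v0 del force         # delete them (data-destructive)
sudo perccli64 /c0/v1 del force
sudo perccli64 /c0 set jbod=on          # enable JBOD mode on the controller
sudo perccli64 /c0/eall/sall set jbod   # expose the physical drives as JBOD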

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2075 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be2075.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2075 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be2075.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host ms-be2075.codfw.wmnet with OS bullseye completed:

  • ms-be2075 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202503171514_mvernon_493830_ms-be2075.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Finally got the reimage to work; I'll leave this host overnight, and then check the kernel log tomorrow.

@Jhancock.wm host seems good now - no resets reported since the reimage this time yesterday. Thanks for all your work on this!

[I'm now going to do more reimages of this host apropos T354872]