db1224 is unreachable
Open, High, Public

Description

db1224 was rebooted yesterday as part of T426633:

2026-05-27 16:48:08.975142 dbctl instance db1224 pool -p 100
2026-05-27 16:48:14.074742 dbctl config commit -b -m "Repooling after maintenance db1224 (T426633)"

It has now crashed, and getsel shows a number of errors (pasted below).

The server is now depooled, can you please take a look?

(As usual, please update any firmware if needed)

Event Timeline

getsel:

-------------------------------------------------------------------------------
Record:      57
Date/Time:   05/28/2026 15:06:07
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      58
Date/Time:   05/28/2026 15:06:07
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------
Record:      59
Date/Time:   05/28/2026 15:06:07
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      60
Date/Time:   05/28/2026 15:06:07
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      61
Date/Time:   05/28/2026 15:06:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      62
Date/Time:   05/28/2026 15:06:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      63
Date/Time:   05/28/2026 15:06:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      64
Date/Time:   05/28/2026 15:06:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      65
Date/Time:   05/28/2026 15:06:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      66
Date/Time:   05/28/2026 15:06:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      67
Date/Time:   05/28/2026 15:06:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      68
Date/Time:   05/28/2026 15:06:09
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      69
Date/Time:   05/28/2026 15:06:09
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      70
Date/Time:   05/28/2026 15:06:09
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      71
Date/Time:   05/28/2026 15:06:09
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      72
Date/Time:   05/28/2026 15:06:09
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      73
Date/Time:   05/28/2026 15:06:09
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      74
Date/Time:   05/28/2026 15:06:09
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      75
Date/Time:   05/28/2026 15:06:09
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      76
Date/Time:   05/28/2026 15:06:09
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      77
Date/Time:   05/28/2026 15:06:09
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      78
Date/Time:   05/28/2026 15:06:10
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      79
Date/Time:   05/28/2026 15:06:10
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      80
Date/Time:   05/28/2026 15:06:10
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      81
Date/Time:   05/28/2026 15:06:10
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      82
Date/Time:   05/28/2026 15:06:10
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      83
Date/Time:   05/28/2026 15:06:10
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      84
Date/Time:   05/28/2026 15:06:10
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      85
Date/Time:   05/28/2026 15:06:11
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      86
Date/Time:   05/28/2026 15:06:11
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      87
Date/Time:   05/28/2026 15:06:11
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      88
Date/Time:   05/28/2026 15:06:11
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      89
Date/Time:   05/28/2026 15:06:11
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      90
Date/Time:   05/28/2026 15:06:11
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      91
Date/Time:   05/28/2026 15:06:11
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      92
Date/Time:   05/28/2026 15:06:11
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      93
Date/Time:   05/28/2026 15:06:12
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      94
Date/Time:   05/28/2026 15:06:12
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      95
Date/Time:   05/28/2026 15:06:12
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      96
Date/Time:   05/28/2026 15:06:12
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      97
Date/Time:   05/28/2026 15:06:12
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      98
Date/Time:   05/28/2026 15:06:12
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
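Most of the dump above is the same "An OEM diagnostic event occurred." record repeated, which makes the single Critical entry easy to miss. Below is a minimal sketch (not an existing tool) that collapses getsel output into counts per severity and description. It assumes the output was saved as plain text in exactly the layout shown (dashed separator lines, "Key: value" fields); the helper names parse_sel and summarize are hypothetical.

```python
import re
from collections import Counter

# Excerpt of the getsel output above (records 57-60), in the same layout.
SAMPLE = """\
-------------------------------------------------------------------------------
Record:      57
Date/Time:   05/28/2026 15:06:07
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      58
Date/Time:   05/28/2026 15:06:07
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------
Record:      59
Date/Time:   05/28/2026 15:06:07
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      60
Date/Time:   05/28/2026 15:06:07
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
"""


def parse_sel(text):
    """Split getsel-style text into one dict per record (hypothetical helper)."""
    records = []
    # Records are separated by lines made only of dashes.
    for block in re.split(r"^-{10,}$", text, flags=re.MULTILINE):
        fields = {}
        for line in block.strip().splitlines():
            # Split on the first colon only, so "15:06:07" stays intact.
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "Record" in fields:
            records.append(fields)
    return records


def summarize(records):
    """Count records per (severity, description) pair."""
    return Counter((r["Severity"], r["Description"]) for r in records)


if __name__ == "__main__":
    records = parse_sel(SAMPLE)
    for (severity, description), count in sorted(summarize(records).items()):
        print(f"{count:4d}  {severity:<8}  {description}")
    critical = [r["Record"] for r in records if r["Severity"] == "Critical"]
    print("Critical records:", ", ".join(critical))
```

Applied to the full dump above (records 57 to 98), this reduces 42 records to three distinct entries: one boot-problem notice, the Critical CPU 2 machine check (record 58), and 40 OEM diagnostic events.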

Mentioned in SAL (#wikimedia-operations) [2026-05-28T16:22:13Z] <fceratto@cumin1003> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 99 days, 0:00:00 on db1224.eqiad.wmnet with reason: unreachable T427535

@FCeratto-WMF This unit should be reachable again. iDRAC scans the hardware during boot; if it detects an error that Dell classifies as uncorrectable, it will not rescan the device until the next boot cycle, even if the error is likely a false positive. Would you be able to check it? It's reading all green at the moment.

@FCeratto-WMF If you'd like, I can also update the firmware before it's fully handed back. Either way, let me know.

@VRiley-WMF the host is not responding over SSH and is not generating metrics, so maybe it did not power up. Please update the firmware, and tomorrow I'll try to power-cycle it.

Understood, I'll continue to look into this

From the available firmware choices, it seems to be up to date; I know the BIOS is. I was able to log in to the machine and saw that it was stuck at a certain point; it needed a reboot to clear the error. It should be good to go now.

I'm seeing the following errors in the logs that look a bit suspicious, specifically "transition to Non-recoverable ; CPU 2". Could this be a hardware issue?

May 29 02:55:24 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:07, System Firmware Additional Info, N/A, OEM Event Offset = 02h ; OEM Event Data2 code = 02h ; OEM Event Data3 code = 00h
May 29 02:55:24 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:07, System Firmware CPU Machine Chk, N/A, transition to Non-recoverable ; CPU 2 ; APIC ID 0
May 29 02:55:24 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:07, System Firmware MSR Info Log, N/A, OEM Diagnostic Data Event ; Register Offset = D04h ; Register Value = 00h
May 29 02:55:24 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:07, System Firmware MSR Info Log, N/A, OEM Diagnostic Data Event ; Register Offset = 0h ; Register Value = 04h
May 29 02:55:24 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:08, System Firmware MSR Info Log, N/A, OEM Diagnostic Data Event ; Register Offset = 0h ; Register Value = 00h
May 29 02:55:24 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:08, System Firmware MSR Info Log, N/A, OEM Diagnostic Data Event ; Register Offset = 0h ; Register Value = BEh
May 29 02:55:24 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:08, System Firmware MSR Info Log, N/A, OEM Diagnostic Data Event ; Register Offset = E04h ; Register Value = 00h
May 29 02:55:24 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:08, System Firmware MSR Info Log, N/A, OEM Diagnostic Data Event ; Register Offset = 74h ; Register Value = 29h
May 29 02:55:24 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:08, System Firmware MSR Info Log, N/A, OEM Diagnostic Data Event ; Register Offset = FB3h ; Register Value = FFh
May 29 02:55:24 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:08, System Firmware MSR Info Log, N/A, OEM Diagnostic Data Event ; Register Offset = FFFh ; Register Value = FFh
May 29 02:55:24 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:08, System Firmware MSR Info Log, N/A, OEM Diagnostic Data Event ; Register Offset = F04h ; Register Value = 00h
May 29 02:55:24 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:09, System Firmware MSR Info Log, N/A, OEM Diagnostic Data Event ; Register Offset = 74h ; Register Value = 29h
May 29 02:55:24 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:09, System Firmware MSR Info Log, N/A, OEM Diagnostic Data Event ; Register Offset = FB3h ; Register Value = FFh
May 29 02:55:24 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:09, System Firmware MSR Info Log, N/A, OEM Diagnostic Data Event ; Register Offset = FFFh ; Register Value = FFh
May 29 02:55:24 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:09, System Firmware Chipset Info, N/A, OEM Diagnostic Data Event ; Register Offset = 0h ; Register Value = 7Eh
May 29 02:55:24 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:09, System Firmware Err Reg Pointer, N/A, OEM Diagnostic Data Event ; Register Offset = A8h ; Register Value = 30h
May 29 02:55:24 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:09, System Firmware Err Reg Pointer, N/A, OEM Diagnostic Data Event ; Register Offset = A9h ; Register Value = 01h
May 29 02:55:24 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:09, System Firmware Err Reg Pointer, N/A, OEM Diagnostic Data Event ; Register Offset = AAh ; Register Value = 30h
May 29 02:55:25 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:09, System Firmware Err Reg Pointer, N/A, OEM Diagnostic Data Event ; Register Offset = ABh ; Register Value = 01h
May 29 02:55:25 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:09, System Firmware Chipset Info, N/A, OEM Diagnostic Data Event ; Register Offset = F2h ; Register Value = 7Fh
May 29 02:55:25 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:09, System Firmware Err Reg Pointer, N/A, OEM Diagnostic Data Event ; Register Offset = EEh ; Register Value = D8h
May 29 02:55:25 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:10, System Firmware Err Reg Pointer, N/A, OEM Diagnostic Data Event ; Register Offset = EFh ; Register Value = A0h
May 29 02:55:25 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:10, System Firmware Chipset Info, N/A, OEM Diagnostic Data Event ; Register Offset = 0h ; Register Value = FEh
May 29 02:55:25 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:10, System Firmware Err Reg Pointer, N/A, OEM Diagnostic Data Event ; Register Offset = A8h ; Register Value = 02h
May 29 02:55:25 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:10, System Firmware Err Reg Pointer, N/A, OEM Diagnostic Data Event ; Register Offset = A9h ; Register Value = 03h
May 29 02:55:25 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:10, System Firmware Err Reg Pointer, N/A, OEM Diagnostic Data Event ; Register Offset = AAh ; Register Value = 02h
May 29 02:55:25 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:10, System Firmware Err Reg Pointer, N/A, OEM Diagnostic Data Event ; Register Offset = ABh ; Register Value = 03h
May 29 02:55:25 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:10, System Firmware Chipset Info, N/A, OEM Diagnostic Data Event ; Register Offset = F2h ; Register Value = FFh
May 29 02:55:25 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:11, System Firmware Err Reg Pointer, N/A, OEM Diagnostic Data Event ; Register Offset = EEh ; Register Value = 18h
May 29 02:55:25 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:11, System Firmware Err Reg Pointer, N/A, OEM Diagnostic Data Event ; Register Offset = EFh ; Register Value = 14h
May 29 02:55:25 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:11, System Firmware MSR Info Log, N/A, OEM Diagnostic Data Event ; Register Offset = D04h ; Register Value = 00h
May 29 02:55:25 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:11, System Firmware MSR Info Log, N/A, OEM Diagnostic Data Event ; Register Offset = 0h ; Register Value = 04h
May 29 02:55:25 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:11, System Firmware MSR Info Log, N/A, OEM Diagnostic Data Event ; Register Offset = 0h ; Register Value = 00h
May 29 02:55:25 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:11, System Firmware MSR Info Log, N/A, OEM Diagnostic Data Event ; Register Offset = 0h ; Register Value = FEh
May 29 02:55:25 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:11, System Firmware MSR Info Log, N/A, OEM Diagnostic Data Event ; Register Offset = E04h ; Register Value = 00h
May 29 02:55:25 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:11, System Firmware MSR Info Log, N/A, OEM Diagnostic Data Event ; Register Offset = D8h ; Register Value = 91h
May 29 02:55:25 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:12, System Firmware MSR Info Log, N/A, OEM Diagnostic Data Event ; Register Offset = EB3h ; Register Value = FFh
May 29 02:55:25 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:12, System Firmware MSR Info Log, N/A, OEM Diagnostic Data Event ; Register Offset = FFFh ; Register Value = FFh
May 29 02:55:25 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:12, System Firmware MSR Info Log, N/A, OEM Diagnostic Data Event ; Register Offset = F04h ; Register Value = 00h
May 29 02:55:25 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:12, System Firmware MSR Info Log, N/A, OEM Diagnostic Data Event ; Register Offset = D8h ; Register Value = 91h
May 29 02:55:25 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:12, System Firmware MSR Info Log, N/A, OEM Diagnostic Data Event ; Register Offset = EB3h ; Register Value = FFh
May 29 02:55:25 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:12, System Firmware MSR Info Log, N/A, OEM Diagnostic Data Event ; Register Offset = FFFh ; Register Value = FFh
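In the journal, the CPU machine-check line is buried among the "MSR Info Log", "Chipset Info" and "Err Reg Pointer" register dumps. A minimal sketch, assuming the journal lines have exactly the comma-separated layout shown above (date, time, sensor, reading, details), that keeps only events which are not "OEM Diagnostic Data Event" register dumps. The regex and helper names are illustrative and not part of ipmiseld.

```python
import re

# Assumed layout, taken from the journal lines above:
# "... SEL System Event: <date>, <time>, <sensor>, <reading>, <details>"
SEL_LINE = re.compile(
    r"SEL System Event: (?P<date>[^,]+), (?P<time>[^,]+), "
    r"(?P<sensor>[^,]+), (?P<reading>[^,]+), (?P<detail>.*)$"
)

# First three journal lines from the paste above.
SAMPLE_LINES = [
    "May 29 02:55:24 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:07, System Firmware Additional Info, N/A, OEM Event Offset = 02h ; OEM Event Data2 code = 02h ; OEM Event Data3 code = 00h",
    "May 29 02:55:24 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:07, System Firmware CPU Machine Chk, N/A, transition to Non-recoverable ; CPU 2 ; APIC ID 0",
    "May 29 02:55:24 db1224 ipmiseld[1128]: SEL System Event: May-28-2026, 15:06:07, System Firmware MSR Info Log, N/A, OEM Diagnostic Data Event ; Register Offset = D04h ; Register Value = 00h",
]


def parse_ipmiseld(lines):
    """Return one dict per line that looks like an ipmiseld SEL event."""
    events = []
    for line in lines:
        match = SEL_LINE.search(line)
        if match:
            events.append(match.groupdict())
    return events


def non_diagnostic(events):
    """Drop the bulk register dumps and keep everything else."""
    return [e for e in events if "OEM Diagnostic Data Event" not in e["detail"]]


if __name__ == "__main__":
    for event in non_diagnostic(parse_ipmiseld(SAMPLE_LINES)):
        print(event["date"], event["time"], event["sensor"], "|", event["detail"])
```

Applied to all 42 journal lines above, only two events survive (the "Additional Info" entry and the CPU 2 machine check); the other 40 are register dumps, matching the 40 OEM diagnostic records in the getsel output.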

Change #1295397 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] db1224: disable notifications

https://gerrit.wikimedia.org/r/1295397

Change #1295397 merged by Federico Ceratto:

[operations/puppet@production] db1224: disable notifications

https://gerrit.wikimedia.org/r/1295397

I'll take a deeper look into this. It's okay to reboot, correct?

VRiley-WMF changed the task status from Open to In Progress. Mon, Jun 1, 11:13 AM

Updating BIOS

BIOS is now at 1.21.1 (previously 1.12.1). Moving on to iDRAC.

iDRAC update completed. Moving on to the non-expander storage backplane.

The firmware (BIOS, iDRAC and non-expander storage backplane) has been updated (I thought it was up to date before, but new information was pointed out to me). Through iDRAC I can see the login screen.

@FCeratto-WMF please test it out and let us know if it's working now.

Thank you for the update! Closing this. Please let us know if anything else happens!

Change #1295940 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] db1224: enable notifications

https://gerrit.wikimedia.org/r/1295940

Change #1295940 merged by Federico Ceratto:

[operations/puppet@production] db1224: enable notifications

https://gerrit.wikimedia.org/r/1295940

Starting pool of db1224 by fceratto@cumin1003: Pooling

Completed pooling of db1224 by fceratto@cumin1003: Pooling

Starting pool of db1224 by fceratto@cumin1003: Pooling

Completed pooling of db1224 by fceratto@cumin1003: Pooling

Starting pool of db1224 by fceratto@cumin1003: Pooling

Completed pooling of db1224 by fceratto@cumin1003: Pooling

The host crashed again.

FCeratto-WMF removed a subscriber: VRiley-WMF.

The HW logs are empty, but this host has crashed twice more today: once while I was reimaging it and again after it went back into production.
@VRiley-WMF is there something we can do to pass more information to Dell? This is definitely not normal, and it reminds me a bit of that other Dell host that crashed many times, where in the end we had to get a new chassis (db1246 - T387673 T359940 T361968 T363119 T374215).
Please advise

Change #1297699 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1224: Disable notifications

https://gerrit.wikimedia.org/r/1297699

Change #1297699 merged by Marostegui:

[operations/puppet@production] db1224: Disable notifications

https://gerrit.wikimedia.org/r/1297699

I will gather logs from this server and submit them to Dell. Getting them to replace the chassis takes a lot of convincing, so I'll see what I can do. There are other troubleshooting measures I'll attempt as well.

Yeah, I know. It took quite a lot of emails to convince them. They also replaced pretty much every replaceable part (memory, power supplies, even the mainboard) before accepting that the chassis was at fault.
I only mentioned it because the crashes are somewhat similar. For that other host, though, we did see things in the logs; right now we are more or less blind.

Thanks for the help. Let me know if you need me to collect anything else for you.

@Marostegui As it turns out, while trying to submit a ticket to Dell, I found that this server went out of warranty on February 1st of this year. I will continue to look into this server to see what we may be able to do.

Oh, just by a few months. Let me know if you need anything from me.
If we don't find the root cause for this crash and it keeps being unreliable, we may simply decommission it.

I'm going to look into CPU 2, as that's what some of the logs are pointing to.

@Marostegui I have swapped the CPUs in their sockets. I noticed this in the report

Record: 58
Date/Time: 05/28/2026 15:06:07
Source: system
Severity: Critical
Description: CPU 2 machine check error detected.

Would you be willing to try it again? I'm curious to see if the error follows. If so, then it's a bad CPU and I can check to see if we have any spares in the meantime.

db1224 should be back up; it is showing the login screen.

Great finding! I'll give it another go and will keep you posted
Thank you so much!

I've started MariaDB and replication, and at the same time I'm going to leave a CPU stress test running over the weekend to see how the host behaves.

Host crashed after a few minutes stressing its CPU:

-------------------------------------------------------------------------------
Record:      101
Date/Time:   06/05/2026 05:30:56
Source:      system
Severity:    Critical
Description: CPU 2 temperature is greater than the upper critical threshold.
-------------------------------------------------------------------------------
Record:      102
Date/Time:   06/05/2026 05:31:41
Source:      system
Severity:    Ok
Description: CPU 2 temperature is within range.
-------------------------------------------------------------------------------
Record:      103
Date/Time:   06/05/2026 05:32:41
Source:      system
Severity:    Critical
Description: CPU 2 temperature is greater than the upper critical threshold.
-------------------------------------------------------------------------------
Record:      104
Date/Time:   06/05/2026 05:33:16
Source:      system
Severity:    Ok
Description: CPU 2 temperature is within range.
-------------------------------------------------------------------------------
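Records 101 to 104 bracket two excursions of CPU 2 above its upper critical threshold. A small sketch that pairs each "greater than the upper critical threshold" record with the next "within range" record and reports how long each excursion lasted according to the SEL timestamps. It only measures the time between the two log entries (one-second resolution), not the actual temperature, and the helper name over_threshold_intervals is hypothetical.

```python
from datetime import datetime

# getsel Date/Time layout, e.g. "06/05/2026 05:30:56".
TIME_FORMAT = "%m/%d/%Y %H:%M:%S"

# (Date/Time, Description) pairs copied from SEL records 101-104 above.
EVENTS = [
    ("06/05/2026 05:30:56", "CPU 2 temperature is greater than the upper critical threshold."),
    ("06/05/2026 05:31:41", "CPU 2 temperature is within range."),
    ("06/05/2026 05:32:41", "CPU 2 temperature is greater than the upper critical threshold."),
    ("06/05/2026 05:33:16", "CPU 2 temperature is within range."),
]


def over_threshold_intervals(events):
    """Pair each over-threshold record with the next 'within range' record."""
    intervals = []
    start = None
    for stamp, description in events:
        ts = datetime.strptime(stamp, TIME_FORMAT)
        if "greater than the upper critical threshold" in description:
            start = ts
        elif "within range" in description and start is not None:
            intervals.append((start, ts - start))
            start = None
    return intervals


if __name__ == "__main__":
    for start, duration in over_threshold_intervals(EVENTS):
        print(f"{start:%H:%M:%S}  above critical for {duration.total_seconds():.0f} s")
```

For these four records the excursions last 45 s and 35 s, both within 2 minutes 20 seconds of the first alert.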

Hey @Marostegui, as it turns out, I am not able to find a compatible processor for this unit. Should we proceed with removing it?

@VRiley-WMF we are soon going to decommission the following hosts:
db1176
db1178
db1160
db1184
db1177
db1180
db1181

Do you have a way to check whether any of them has a compatible CPU?
If so, I can decommission one of them today or tomorrow.

Thanks!

@VRiley-WMF did you have some time to check if any of the above hosts could work to get some pieces to replace the ones failing on db1224?
Thanks!

Checking now; sorry, I was away on vacation.

@Marostegui

So, currently, the CPU in db1224 is an Intel Xeon Gold 5317 (the server is an R650xs). I checked the other units and they are all Dell R440s with Intel Xeon Gold 5217 CPUs, which use an incompatible socket type. Are there any other units we could check, or would you like us to look into procuring another processor?

I hope you had a nice vacation!
Those are the only hosts we are going to decommission soon, so that's bad luck.

I'll prepare the decommissioning of this host then
Thanks for your help!!