
ms-be1060 crashed, then went into an exception in the UEFI pre-boot environment
Closed, ResolvedPublic

Description

ms-be1060 has crashed - looking at the system log in the iDRAC,
"A fatal error was detected on a component at bus 24 device 0 function 0." at 20:11:00 on Sunday 27th, followed by
"A fatal error was detected on a component at bus 23 device 2 function 0." at 20:11:01

These repeat a number of times (alongside "An OEM diagnostic event occurred.") until "System BIOS has halted." at 23:32:53.
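For reference, an assumption on my part: the iDRAC SEL reports bus/device/function in decimal, while lspci and kernel logs use hex BDF notation, so a quick conversion helps match the failing component (likely the RAID controller here) against lspci output:

```python
def sel_to_bdf(bus: int, device: int, function: int) -> str:
    """Convert decimal bus/device/function from an iDRAC SEL entry
    to the hex bus:device.function form that lspci prints."""
    return f"{bus:02x}:{device:02x}.{function:x}"

# The two components from the log above:
print(sel_to_bdf(24, 0, 0))  # bus 24 decimal -> 18:00.0
print(sel_to_bdf(23, 2, 0))  # bus 23 decimal -> 17:02.0
```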

The system is currently stuck in a failed POST:

ms-be1060-sadness.png (1×1 px, 225 KB)

Could you take a look at this system, please? It's currently broken, so you can work on it at any time.

Event Timeline

Jclark-ctr claimed this task.
Jclark-ctr subscribed.

@MatthewVernon I reseated the PCI RAID card and updated the BIOS, and the iDRAC error seems to have cleared. If it returns, we might need to consider ordering a new RAID card, as this system is out of warranty.

Error came back; reopened ticket.

@wiki_willy @RobH looks like this RAID card has failed. Can we get a new one ordered, along with a new battery?

Out of curiosity: are we replacing this hardware anyway, since it's almost 5 years old?

Notes:

  • System warranty ended on October 27, 2023 (3 years after purchase)
  • 5 year life projection says this should be replaced in October 2025 (Q2 next fiscal)
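The lifecycle dates above can be sanity-checked with a small date calculation; the purchase date is an assumption inferred from the warranty end (3-year warranty):

```python
from datetime import date

# Assumed purchase date, inferred from warranty ending October 27, 2023.
purchase = date(2020, 10, 27)
warranty_end = purchase.replace(year=purchase.year + 3)  # 3-year warranty
refresh_due = purchase.replace(year=purchase.year + 5)   # 5-year life projection

print(warranty_end)  # 2023-10-27
print(refresh_due)   # 2025-10-27, i.e. October 2025
```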

Any quotation for a RAID card will include pricing, so I'll have to make a sub-task for that. I imagine folks won't want to choose between an early system replacement and a RAID card replacement without pricing each out, so the procurement sub-task can hold that info.

@wiki_willy @RobH looks like this RAID card has failed. Can we get a new one ordered, along with a new battery?

Do we have another card in a decom host we could slap in to cover this?

RobH mentioned this in Unknown Object (Task).Apr 29 2025, 4:54 PM
RobH added a subtask: Unknown Object (Task).

@Jclark-ctr - it looks like we refreshed ms-be105[1-9] towards the end of last year via T371389. Can you check with @MatthewVernon to see if any of those are close to being decommissioned, and see if we can pull the RAID card from one of those machines?

Sorry, never mind... it looks like they're HPs.

@Jclark-ctr - it looks like we refreshed ms-be105[1-9] towards the end of last year via T371389. Can you check with @MatthewVernon to see if any of those are close to being decommissioned, and see if we can pull the RAID card from one of those machines?

@RobH this is an R740xd2; we have not decommissioned any of these yet.

Removed the BBU from the RAID card. After letting the server sit for 10 minutes without the BBU, I reinstalled it. It seems to be working fine now. Previously, it would fail immediately after booting, so this is a promising sign. Will monitor and check again tomorrow.

Change #1140121 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] Swift: mark ms-be1060 as failed

https://gerrit.wikimedia.org/r/1140121

Change #1140121 merged by MVernon:

[operations/puppet@production] Swift: mark ms-be1060 as failed

https://gerrit.wikimedia.org/r/1140121

Hi,

it's crashed again, after about an hour as far as I can tell (23:13:14 UTC).

@wiki_willy this node is currently slated for replacement in Q2 as part of "Refresh of ms-be10[60-63]"; depending on costs/timelines of getting a replacement card in, could we pull that forward to Q1?

Alternatively, we are due to cycle out thanos-fe100[1-3] this quarter (hardware being ordered in T389837); those are R740xd rather than R740xd2. Is the RAID card likely to be compatible, @Jclark-ctr? But that kit hasn't yet been ordered, and it will take several weeks to drain those hosts from the thanos rings, at which point it might still be better to just pull the new hardware order to the front end of Q1 if possible? [Again, this depends a bit on how much a new RAID controller would cost.]

Mentioned in SAL (#wikimedia-operations) [2025-04-30T08:33:11Z] <Emperor> ms-be1060 T392796 /usr/local/bin/swift_ring_manager -o /var/cache/swift_rings --doit --skip-dispersion-check --skip-replication-check --immediate-only -v

Change #1140130 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: remove ms-be1060 entirely

https://gerrit.wikimedia.org/r/1140130

Hi @MatthewVernon - I still have some CapEx underrun, so we could bump up the refresh to this quarter instead. @RobH - can you create a Phabricator task and quote for Matthew to review?

@wiki_willy this node is currently slated for replacement in Q2 as part of "Refresh of ms-be10[60-63]"; depending on costs/timelines of getting a replacement card in, could we pull that forward to Q1?

RobH mentioned this in Unknown Object (Task).Apr 30 2025, 6:30 PM
RobH added a subtask: Unknown Object (Task).

Please note we have two open procurement requests for this host. Please do NOT discuss pricing on this public hardware failure task, instead keep pricing discussions to the below sub-tasks.

T392930 - pricing to replace the raid controller - quote from Dell still pending
T393046 - pricing to replace the entire server half a fiscal year early

Once we get the hardware raid controller pricing back, I imagine we'll be able to make a determination (with @MatthewVernon and @wiki_willy) on which way to go.

RobH closed subtask Unknown Object (Task) as Resolved.May 1 2025, 8:13 PM

@MatthewVernon thanos-fe100[1-3] are R440s, but no: the XD2 servers use a 730 Mini RAID card, and no other servers I have seen in eqiad use those cards.

@MatthewVernon,

Please note that we've ordered 4 new hosts to replace ms-be10[60-63], but those won't arrive for a couple of weeks. Can this service handle the permanent loss of ms-be1060 (and its data) and be replaced when the new host(s) arrive?

If that is acceptable, I'd suggest we create a decom task (can use this form: https://phabricator.wikimedia.org/maniphest/task/edit/form/52/ ) and resolve this task as declined.

Change #1140130 merged by MVernon:

[operations/puppet@production] swift: remove ms-be1060 entirely

https://gerrit.wikimedia.org/r/1140130

Change #1143118 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] hiera: remove ms-be1060 from swift storagehosts

https://gerrit.wikimedia.org/r/1143118

Change #1143118 merged by MVernon:

[operations/puppet@production] hiera: remove ms-be1060 from swift storagehosts

https://gerrit.wikimedia.org/r/1143118

cookbooks.sre.hosts.decommission executed by mvernon@cumin1002 for hosts: ms-be1060.eqiad.wmnet

  • ms-be1060.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

@RobH I think the above cookbook failure is expected, given this host is too broken to boot reliably, but it does mean the disks in it will still have data on them.

@RobH I think the above cookbook failure is expected, given this host is too broken to boot reliably, but it does mean the disks in it will still have data on them.

When we decom a system, its disks are removed and stored in our secure storage on-site until an annual disk shred, where a service comes on-site and our engineers supervise them shredding the physical media (in a heavy-duty industrial metal shredder; it is very impressive). So if this was just an attempt to clear data, no worries: they'll be destroyed!
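Since the cookbook couldn't reach the host, the signature-wipe step was skipped. As a rough sketch of what that step would normally invoke, here is a helper that builds the wipefs(8) command line; the device paths are illustrative, and the dry-run flag is shown so nothing is actually erased:

```python
def wipefs_cmd(disk: str, dry_run: bool = True) -> list[str]:
    """Build the wipefs invocation that erases filesystem, swraid, and
    partition-table signatures from a disk. With --no-act, wipefs only
    reports what it would erase without writing anything."""
    cmd = ["wipefs", "--all", disk]
    if dry_run:
        cmd.insert(1, "--no-act")
    return cmd

# Illustrative device paths; print the commands rather than running them.
for disk in ("/dev/sda", "/dev/sdb"):
    print(" ".join(wipefs_cmd(disk)))
```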

VRiley-WMF subscribed.

This unit has been decommed. We will ensure these disks are shredded.

Jclark-ctr closed subtask Unknown Object (Task) as Resolved.May 19 2025, 8:58 PM