
cloudvirt1038: PCIe error
Closed, Resolved (Public)

Description

I detected this when rebooting the server:

Enumerating Boot options...
Enumerating Boot options... Done

UEFI0067: A PCIe link training failure is observed in Slot1 and the link is
disabled.
Do one of the following: 1) Turn off the input power to the system and turn on
again. 2) Update the PCIe device firmware. If the issue persists, contact your
service provider.
 

Available Actions:
F1 to Continue and Retry Boot Order
F2 for System Setup (BIOS)
F10 for Lifecycle Controller
- Enable/Configure iDRAC
- Update or Backup/Restore Server Firmware
- Help Install an Operating System
F11 for Boot Manager

The server won't boot.

DC Ops troubleshooting

I've not seen this, but unless it happens twice it doesn't count! So, steps to fix:

cloudvirt1038 is located in D5:U14

  • onsite unplugs all power, fully depowering the system. Then plug back in after 15-30 seconds and attempt to boot the system. - did not fix error
  • update firmware: idrac, bios, network card (only item on pcie bus) - did not fix error
  • open task with Dell support, have them dispatch a tech (see note below)

tech note: This error will require some troubleshooting with the network card, the PCIe riser, and the mainboard. This can be done by our own on-sites, or we can leverage our warranty coverage to have Dell send out a technician. Trying to do this via self dispatch will be cumbersome, as it's unclear where to start. The options are to have an on-site self-dispatch parts and work with Dell support via email to determine which parts to replace and how, or to have an on-site call Dell support and schedule a Dell technician to come out and work on this in-warranty host to determine which part is faulty.

When the system is booting successfully again, please change the 'failed' state in netbox back to 'active'.

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2021-03-09T12:52:32Z] <arturo> cloudvirt1038 hard powerdown / powerup for T276922

Mentioned in SAL (#wikimedia-cloud) [2021-03-09T13:31:44Z] <arturo> hard-reboot tools-docker-registry-04 because issues related to T276922

Mentioned in SAL (#wikimedia-cloud) [2021-03-09T13:32:03Z] <arturo> hard-reboot deployment-db05 because issues related to T276922

Mentioned in SAL (#wikimedia-cloud) [2021-03-09T13:35:02Z] <arturo> icinga-downtime cloudvirt1038 for 30 days for T276922

aborrero triaged this task as High priority.
aborrero moved this task from Backlog to Hardware faults on the cloud-services-team (Hardware) board.
aborrero added projects: DC-Ops, ops-eqiad.
aborrero added a subscriber: RobH.

Hey @RobH does this sound like something we have seen before in our fleet?

RobH moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.

I've not seen this, but unless it happens twice it doesn't count! So, steps to fix:

cloudvirt1038 is located in D5:U14

  • onsite unplugs all power, fully depowering the system. Then plug back in after 15-30 seconds and attempt to boot the system.
  • If that works, resolve this task as it only happened once and thus isn't worth spending more time on.
  • If that does not work, you can reassign this task back to Rob and he'll flash firmware updates across the host remotely.
  • firmware update

When the system is booting successfully again, please change the 'failed' state in netbox back to 'active'.

Cmjohnson added a subscriber: Cmjohnson.

Robh: didn't work, try updating f/w please.

Ok, Updating the firmware in this order: idrac, network card (only pcie device installed), bios.

RobH updated the task description.
RobH removed a subscriber: RobH.

Updated firmwares, but error persists.

BIOS now 2.10.0
iDRAC now 4.40.00
NIC now 21.60.22.11 & 21.60.16

I've updated the checklist on the task description, along with suggested next steps:

  • open task with Dell support, have them dispatch a tech (see note below)

tech note: This error will require some troubleshooting with the network card, the PCIe riser, and the mainboard. This can be done by our own on-sites, or we can leverage our warranty coverage to have Dell send out a technician. Trying to do this via self dispatch will be cumbersome, as it's unclear where to start. The options are to have an on-site self-dispatch parts and work with Dell support via email to determine which parts to replace and how, or to have an on-site call Dell support and schedule a Dell technician to come out and work on this in-warranty host to determine which part is faulty.

I'm assigning this back to Chris since the next steps all require on-site coordination (direct troubleshooting or scheduling a Dell tech).

I left the host powered off.


I see this server is down and will need to go through the process of removing all the components and adding them back 1 by 1 until I can figure out what is causing the error. This will happen this week and I will update the task once I know more.

So far I have found nothing.
This is a PCI error. In the past this would have meant a blown capacitor, but this is the first time I've seen this error since we left Tampa. I removed and replaced the riser cards one at a time, but that did not work. I removed the NIC, since that is in a PCI slot, and that did not work either. My guess is the motherboard needs to be replaced. Dell's system log doesn't show an error. I need to submit a ticket to Dell and will update the task once I get a response.

Dell Ticket Created

You have successfully submitted request SR1055971142.

The Dell tech came out and replaced the motherboard, but that did not fix the issue. It turns out that there is a bad cable to the backplane. A new part has been ordered.

Another Dell tech arrived today with what was believed to be the replacement part. The part was replaced and the error persisted. Several reboots and TSR reports later, we still do not know what is going on. At one point reseating the CMOS battery worked, but then the PCI error returned on the next reboot. The Dell technician is still on-site, troubleshooting with Dell tech support. Once something has been decided I will update the task.

Dell has us on a wild goose chase. Responded to their questions with the following:

  1. Has this system ever had a PCI card in slot 1? If so, what card? No, the system has not had a PCI card in slot 1 or any slot for that matter.
  2. Has this system ever had a PERC / backplane drives / storage other than the BOSS card? No, no additional RAID controllers, backplane drives, or other storage have been added. We have multiples of the same server with the same configuration and this is the only server that has given this type of error.
  3. Did the fan issue and VGA issue start after the system board was replaced? I cannot answer that question because I am not the one who replaced the board. I know that the original technician replaced it and the error persisted; he was there for several hours and finally got the system to get through POST and boot to the hard drive. That was when it was determined to be a riser card. Technician 2 came to replace the card and the error returned. This is who you talked to the other day.

Dell sent an email with a list of things they want done. Considering that they've had 2 technicians out to fix the issue with zero resolution, I replied that they will need to send one of their technicians out to perform these tasks. I do not feel it is wise to start poking around myself in case we need to RMA this server.

Thanks for the update! I agree that we should leave this in Dell's hands as much as possible. I'm not impatient to have it back online so do whatever you need to to get them to take responsibility.

Dell is supposed to be here today to replace several more parts. We will see how it goes.

Mentioned in SAL (#wikimedia-operations) [2021-05-12T10:46:33Z] <aborrero@cumin1001> START - Cookbook sre.hosts.downtime for 180 days, 0:00:00 on cloudvirt1038.eqiad.wmnet with reason: T276922

Mentioned in SAL (#wikimedia-operations) [2021-05-12T10:46:36Z] <aborrero@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 180 days, 0:00:00 on cloudvirt1038.eqiad.wmnet with reason: T276922

The last Dell tech that came in identified the problem as a riser card. Oddly enough this was replaced already, but maybe the second time is the charm. Dell is sending the part directly to me and I will replace it; fingers crossed this works.

Received a new PCI card, and the error returned immediately. I disabled PCI-E slot 1 and the server boots fine. I do not see any need for that riser in the future. The server is back and ready for you, still marked failed in Netbox. I can continue to chase this, but so far the motherboard has been replaced twice, the backplane once, the PCI-E card twice, and 1 DIMM. Let me know how you want to proceed. Also pinging @wiki_willy to see if he wants to get involved.

Dell is asking for more logs; this will not be a quick process.

Dell wants to take the server down to a minimum-to-POST configuration. I asked that they send a Dell technician to go down this rabbit hole again.

No update yet; still looking at returning the server.

@Cmjohnson Are we still looking at returning this machine? Has an RMA been started?

Haven't had an update in a while from them, I just pinged them again.

Just got off the phone with Dell. It's escalated on their side, and they're going to sync up tomorrow in figuring out a solution for this, which could very well end up being a new replacement server. Since this server seems to work fine when the PCI card slot 1 is disabled though (and the PCI card doesn't seem to be used for anything), would WMCS be ok if we just left the bios settings that way? Thanks, Willy

@wiki_willy If a server replacement is likely, we can wait until Dell resolves the issue completely. We would prefer not to put a potentially failing server back into service. In other words, we can wait for a full resolution if it's easier for you and team.

Ok, thanks @nskaggs. They're currently processing a server replacement. Simultaneously, especially with the long lead times for new servers, there's one more suggestion that the Dell Support team has in troubleshooting for a perm fix. @Cmjohnson - they'll reach out to you early next week when you're back onsite, to try it out. Thanks, Willy

wiki_willy added a subscriber: Jclark-ctr.

Hi @Jclark-ctr - it looks like Chris is going to be out for a while. Dell has one last suggestion for resolving this ticket while the replacement server is being ordered. With the server lead time delays, it will be months before it arrives. So, when you're back next Tue/Wed, can you work with the Dell Support team (I'm asking them to reach out to you) on implementing their proposed solution? Thanks, Willy

I have not received any calls or emails from Dell yet. The only emails received were from the previous motherboard replacement with Chris.

Zoom meeting with Dell tech support; provided an updated TSR report. The system log shows errors since the last date of service but no current error. Running a hardware test (3-4 hours); will update Dell with a new TSR report after it finishes.

@nskaggs we have gone through the steps with Dell and performed a hardware test. The server looks to be operational now with no more errors. If you can put it back in service, I would like to leave the ticket open for a week in case anything comes back.

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1038.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108101958_andrew_2769.log.

I now have a canary VM running on this host but it is not actually in the scheduling pool yet. We'll see how it does!

Mentioned in SAL (#wikimedia-cloud) [2021-08-18T14:47:22Z] <andrewbogott> adding clouvirt1038 to the ceph aggregate, removing from the maintenance aggregate T276922

This host is now pooled and presumed fixed. Thanks all!

I just noticed that this server is still marked as 'failed' in netbox; shall I switch it back to 'active'?